Model serialization is a process that can save a large language model and its dataset. For serialization I used the pickle serializer. Pickling takes a “Python object hierarchy” and converts it into a byte stream, which ‘flattens’ the object. Alternative names for pickling are ‘serialization’, ‘marshalling’, or ‘flattening’.
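A minimal sketch of that round trip, using a small dict as a stand-in for a model/dataset object:

```python
import pickle

# A nested "Python object hierarchy" (stand-in for a model + dataset)
model_state = {"name": "demo-model", "weights": [0.1, 0.2, 0.3]}

data = pickle.dumps(model_state)   # 'flatten' the object into a byte stream
restored = pickle.loads(data)      # reconstruct the original object

assert isinstance(data, bytes)
assert restored == model_state
```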
Loading a previously made model requires that it was saved with the scope of the data we are working on. The serialization page in the BERTopic documentation provides additional information. The pickle method preserves the entire model, whereas the other serialization methods (safetensors and PyTorch) preserve only some of the information and are best suited to very small models.
When pulling pickled data, we must be sure the Python environment matches the one in which the pickle was created. Using a repository manager like Hugging Face (which integrates with GitHub) provides easy version control and can automatically generate other metadata we may need to correctly unpickle both the model and the data.
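One way to guard against an environment mismatch is to store interpreter and protocol metadata alongside the pickle, and check it before unpickling. This is a hedged sketch, not BERTopic functionality; the helper names and record keys (`save_with_env`, `"python"`, `"payload"`) are made up for the example:

```python
import pickle
import sys

def save_with_env(obj, path):
    """Pickle obj to path, wrapped with environment metadata."""
    record = {
        "python": tuple(sys.version_info[:3]),   # interpreter that wrote the pickle
        "protocol": pickle.DEFAULT_PROTOCOL,     # pickle protocol in use
        "payload": pickle.dumps(obj),
    }
    with open(path, "wb") as f:
        pickle.dump(record, f)

def load_with_env(path):
    """Unpickle only if the major.minor Python version matches."""
    with open(path, "rb") as f:
        record = pickle.load(f)
    if record["python"][:2] != tuple(sys.version_info[:2]):
        raise RuntimeError(f"pickle written under Python {record['python']}")
    return pickle.loads(record["payload"])

save_with_env({"topic": 3}, "model.pkl")
assert load_with_env("model.pkl") == {"topic": 3}
```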
The BERTopic documentation recommends pushing the model to a Hugging Face repository. Repositories can be public, private, or organizational, so we would be able to push a model trained on the VEC data while still remaining in copyright and ownership compliance. Even with the model pushed to Hugging Face, we retain the autonomy to keep the data private to just our team by using an organizational setting. The repository that is created is Git-based, and there are guides on how to sync Hugging Face and GitHub repositories. This functionality is something that will make the transition to Alpine much easier!
https://colab.research.google.com/drive/1e9ARzcRtExZKu8yxLt86Y0KUqkYA7ndm?usp=sharing
⬆️BERTopic Model Saving.ipynb
⬆️ Hugging Face Data/Model Repository with Github Integration
In the code below, I go through the steps to prototype and push a custom model to Hugging Face. I test two corpora: the short Leviathan text, and the prototype corpus of 20 political texts plus Hamlet. Both are pushed to a private Hugging Face repository, so they will not be accessible from this document. For both corpora, I was able to push and pull the model from the repository, and also store whichever dataframes I created in the process. I believe that later, when we transition to Alpine, we should only need to pull from the repository.
Table of Contents