https://colab.research.google.com/drive/12_uDyHFdxMfWe2-l5nNXkk3KpOyRmw_r?usp=sharing

← notebook link

This notebook was originally created to show how to set up a BERT model in a Jupyter Notebook environment. The corpus used is ‘Leviathan’, copied and pasted from Project Gutenberg. To access the text file used, you can save the text document from this link. It must be noted that this file is not saved in accordance with Gutenberg standards: the metadata provided by Gutenberg has been removed to make topic modeling more accurate on a shorter corpus.
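If you are preparing your own copy of the text, the Gutenberg front and back matter can also be stripped programmatically. Below is a minimal sketch of one way to do this with the regex library, assuming the downloaded file still contains the standard '*** START OF ...' and '*** END OF ...' marker lines; the file name leviathan.txt is a placeholder, not the notebook's own.

import regex as re

# Placeholder file name; replace with the path to your own download.
with open('leviathan.txt', 'r', encoding='utf-8') as f:
    raw = f.read()

# Project Gutenberg wraps the body of the text in '*** START OF ... ***'
# and '*** END OF ... ***' marker lines; keep only what lies between them.
match = re.search(r'\*\*\* START OF.*?\*\*\*(.*)\*\*\* END OF',
                  raw, flags=re.DOTALL)
text = match.group(1).strip() if match else raw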

The table of contents for this notebook is:

  1. Setting Up The Library
    1. Library Import
    2. Data Import
  2. Preprocessing Text
    1. Generating Book/Chapter and Individual Documents
    2. Fitting the Model to Leviathan
    3. Assessing the Topics Created by BERT
    4. Recomposing the Documents to Books
  3. Topic Probability Distribution
  4. BERTopic Visualizations
  5. Saving a BERTModel

I will proceed to explain the thought process and intentionality behind the code, following the structure above. The code snippets are already formatted in Python, so after setting up the libraries you can paste them into your own notebook if you would like to test it yourself.


1. Setting Up The Library

'''DOWNLOADING THE LIBRARIES'''
!pip install bertopic
!pip install nltk
!pip install ipywidgets

'''IMPORTING LIBRARIES'''

'''TOPIC EXTRACTION'''
from bertopic import BERTopic

''' MODEL TRAINING '''
from transformers import AutoModel
model = AutoModel.from_pretrained("emanjavacas/MacBERTh")

'''TEXT PREPROCESSING'''
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))
import regex as re

''' DATA ANALYSIS'''
import pandas as pd
import numpy as np
import ipywidgets as widgets
import plotly.graph_objects as go
from IPython.display import display
import plotly.io as pio
pio.renderers.default = 'colab'
from IPython.display import clear_output

''' PRETTY PRINTING '''
import os

Downloading the libraries sets up the environment to run programs that are not part of the native Python environment. BERTopic is installed because it is the topic model we are working with; NLTK is a natural language processing toolkit that helps with stop word removal and other common NLP tasks; and ipywidgets is installed to create ways of interfacing with the data without having to run additional code blocks.

When a library is imported, its functionality becomes available in the notebook. It provides objects that can be interacted with and reduces the amount of code that needs to be written (along the logic of: why reinvent the wheel!).

Library Import Explanation:

The first library imported is BERTopic, which is used to build the topic model. The text preprocessing tools I chose to import come primarily from NLTK; they can remove stop words and tokenize by word or sentence. Regex is used to clean text files when there are small irregularities, and I have also used it in a later notebook to extract file names from paths; essentially, it is like an advanced Ctrl+F. The model training section of the code pulls in the MacBERTh pretrained model, which is trained on historical English text from roughly 1450-1950. Many present-day language models are trained on Tweets, Reddit, or Wikipedia data, so while they are not inaccurate, they may not capture the sentence structure, spelling, or cultural meaning of an older text.
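For context, the AutoModel call above only loads MacBERTh into memory; for BERTopic to actually use it when embedding documents, the model has to be passed in when the topic model is created. Below is a minimal sketch of one way to do that, wrapping MacBERTh in a Hugging Face feature-extraction pipeline, which BERTopic accepts as an embedding_model; the variable names here are illustrative, not the notebook's own.

from transformers.pipelines import pipeline
from bertopic import BERTopic

# Wrap MacBERTh in a feature-extraction pipeline so BERTopic can use it
# to embed each document before clustering.
macberth_embedder = pipeline("feature-extraction", model="emanjavacas/MacBERTh")

# Hand the embedder to BERTopic; fitting happens later on the documents.
topic_model = BERTopic(embedding_model=macberth_embedder)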

The next libraries imported are used for data analysis. Pandas is akin to the Excel spreadsheet of Python, and supports structured data manipulation in an easy-to-understand format. It is used primarily in document (de/re)composition, since it can aggregate across rows to perform a repeated action. The next library, NumPy, supports arrays and other mathematical operations. NumPy, Pandas, and Regex are all imported using aliases: a shortened name for the library, so it can be referenced as pd.attribute instead of pandas.attribute. These aliases are standard in the Python community, and will not be an issue for interpretability if the code is shared. The remaining libraries support graphical interfacing with the code, which I will explain as they are used in practice.
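To make the aggregation idea concrete, here is a small sketch of how chapter-level documents might be recomposed into books with pandas; the column names (book, chapter, text) and the toy data are placeholders rather than the notebook's actual ones.

import pandas as pd

# Toy chapter-level documents; in the notebook each row would be one chapter.
df = pd.DataFrame({
    'book':    ['Book I', 'Book I', 'Book II'],
    'chapter': [1, 2, 1],
    'text':    ['Of Sense...', 'Of Imagination...', 'Of the Causes...'],
})

# Aggregate down the rows: join all chapter texts belonging to one book.
books = df.groupby('book')['text'].apply(' '.join).reset_index()
print(books)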

Data Import: