Note
Go to the end to download the full example code
Training a First Topic model¶
This example shows how to train a simple neural variational topic model on the widely used 20 Newsgroups Dataset.
Start with various imports
from tmnt.preprocess.vectorizer import TMNTVectorizer
import torch
Let’s fetch the 20 newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
data, y = fetch_20newsgroups(shuffle=True, random_state=1,
remove=('headers', 'footers', 'quotes'),
return_X_y=True)
Next step involves creating a vectorizer that maps text in the list of strings, data
,
to a document-term matrix, X
tf_vectorizer = TMNTVectorizer(vocab_size=2000, count_vectorizer_kwargs=dict(max_df=0.8, token_pattern=r'[A-Aa-z][A-Za-z][A-Za-z]+'))
X, _ = tf_vectorizer.fit_transform(data)
Setup logging which is good practice
import logging
from tmnt.utils.log_utils import logging_config
logging_config(folder='.', name='train_20news', level='info', console_level='info')
Fitting a model involves creating an instance of the tmnt.estimator.BowEstimator
class
We use the LogisticGaussian
latent distribution here with 25 latent dimensions or topics
The fit method applied to the term-document matrix will estimate the model parameters.
from tmnt.estimator import BowEstimator
from tmnt.distribution import LogisticGaussianDistribution, GaussianDistribution, VonMisesDistribution
device = torch.device('cpu')
#device = torch.device('cuda')
distribution = LogisticGaussianDistribution(100,20,dr=0.2,alpha=0.5, device=device)
estimator = BowEstimator(vocabulary=tf_vectorizer.get_vocab(), latent_distribution=distribution, device=device,
log_method='log', lr=0.0075, batch_size=400, embedding_source='random', embedding_size=200,
epochs=96, enc_hidden_dim=100, validate_each_epoch=False, quiet=False)
#estimator = BowEstimator.from_config(config='../data/configs/train_model/model.config', vocabulary=tf_vectorizer.get_vocab())
#tr_X, val_X = X[:1000], X[:1000] # in this case, use same data for training and validation
tr_X, val_X = X, X # in this case, use same data for training and validation
tr_y, val_y = None, None # dependent variables (labels) aren't used
_ = estimator.fit_with_validation(tr_X, tr_y, val_X, val_y)
An inference object is then created which enables the application of the model to raw text data and/or directly to document-term matrices
from tmnt.inference import BowVAEInferencer
inferencer = BowVAEInferencer(estimator, pre_vectorizer=tf_vectorizer)
encodings = inferencer.encode_texts(['Greater Armenia would stretch from Karabakh, to the Black Sea, to the Mediterranean, so if you use the term Greater Armenia use it with care.','I have two pairs of headphones I\'d like to sell. These are excellent, and both in great condition'])
The model can be saved to disk and reloaded for model deployment
inferencer.save(model_dir='_model_dir')
reloaded_inferencer = BowVAEInferencer.from_saved(model_dir='_model_dir')
We can visualize the topics and associated topic terms using PyLDAvis
import pyLDAvis
import funcy
full_model_dict = inferencer.get_pyldavis_details(X)
pylda_opts = funcy.merge(full_model_dict, {'mds': 'mmds'})
vis_data = pyLDAvis.prepare(**pylda_opts)
The topic model terms and topic-term proportions will be written
to the file m1.html
import numpy as np
pyLDAvis.save_html(vis_data, 'm1.html')
As we’ve already preprocessed the entire training set, we can use the method tmnt.inference.BowVAEInferencer.encode_data()
to derive encodings from the already pre-processed sparse matrix X
:
enc_list, _ = reloaded_inferencer.encode_data(X)
encodings = np.array(enc_list)
top_k_topics = inferencer.get_top_k_words_per_topic(10)
Now let’s visualize the encodings for the training set using UMAP:
#import umap
#import matplotlib.pyplot as plt
- We leverage UMAP to fit (another) embedding from the topic encodings appropriate
- for visualizing the data. See UMAP for more here
umap_model = umap.UMAP(n_neighbors=4, min_dist=0.5, metric=’euclidean’) embeddings = umap_model.fit_transform(encodings)
- We can plot the UMAP embeddings as a scatter plot.
- For this dataset, although we did not use the provided labels
y
to help fit the topic model, it can be helpful to use the labels to color-code the documents in order to see how documents with the same label are encoded.
plt.scatter(*embeddings.T, c=y, s=0.8, alpha=0.9, cmap=’coolwarm’) plt.show()