Quickstart¶
Preprocessing¶
For topic models that make use of bag-of-word (or bag-of-n-gram) representations, TMNT provides
some routines to help pre-process text data. These routines map text documents into (sparse) vectors
that represent the (possibly weighted) counts of terms that appear. This functionality is
contained in the class tmnt.preprocess.vectorizer.TMNTVectorizer.
The following code snippet shows how to process a simple set of strings/documents:
from tmnt.preprocess.vectorizer import TMNTVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TMNTVectorizer()
X, _ = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_vocab().token_to_idx)
The resulting document-term matrix X can be used for training a topic model.
TMNTVectorizer simply wraps (rather than extends) sklearn.feature_extraction.text.CountVectorizer
and provides some additional functionality useful for handling JSON list input representations.
When not using JSON input formats and/or working with purely unlabeled data, simply using
CountVectorizer makes sense. The above example is very similar:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())
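As an aside, the "(possibly weighted) counts" mentioned at the start can be produced directly with scikit-learn's TfidfVectorizer, a drop-in alternative to CountVectorizer that applies TF-IDF weighting. This is a small sketch independent of TMNT itself:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat', 'the dog sat', 'the cat ran']

# TfidfVectorizer produces TF-IDF weights instead of raw term counts
tfidf = TfidfVectorizer()
W = tfidf.fit_transform(docs)      # sparse matrix: documents x terms
print(W.shape)                     # (3, 5): 3 documents, 5 distinct terms
print(sorted(tfidf.vocabulary_))   # ['cat', 'dog', 'ran', 'sat', 'the']
```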
The CountVectorizer class provides a number of different keyword argument options to pre-process
text in various ways; these can be passed through TMNTVectorizer as a dictionary argument
count_vectorizer_kwargs:
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer2 = TMNTVectorizer(count_vectorizer_kwargs={'ngram_range':(2,2)})
X, _ = vectorizer2.fit_transform(corpus)
print(X.toarray())
print(vectorizer2.get_vocab().token_to_idx)
The above snippet uses the CountVectorizer argument ngram_range to specify that
bi-grams (pairs of adjacent words) should be used as features rather than single words.
Training a Topic Model¶
Topic models using a bag-of-words (or bag-of-n-grams) representation are estimated from a document-term matrix X.
The following example shows how to fit a topic model using the tmnt.estimator.BowEstimator
class. The first step is to get a sample corpus and vectorize it:
from sklearn.datasets import fetch_20newsgroups
from tmnt.preprocess.vectorizer import TMNTVectorizer
data, y = fetch_20newsgroups(shuffle=True, random_state=1,
remove=('headers', 'footers', 'quotes'),
return_X_y=True)
tf_vectorizer = TMNTVectorizer(vocab_size=2000)
X, _ = tf_vectorizer.fit_transform(data)
Next, a bag-of-words topic model estimator tmnt.estimator.BowEstimator is created. This class
has many options (see the API documentation), but the single required argument is the vocabulary associated
with the dataset:
from tmnt.estimator import BowEstimator
vocabulary = tf_vectorizer.get_vocab()
estimator = BowEstimator(vocabulary)
The model is then trained (fit) using the tmnt.estimator.BowEstimator.fit() method with the
document-term matrix X provided as an argument:
_ = estimator.fit(X)
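A fitted topic model associates each topic with a distribution over vocabulary terms. Purely as an illustration (using a toy topic-term matrix with hypothetical values rather than actual fitted weights), the top terms per topic can be read off by sorting each row:

```python
import numpy as np

# Toy topic-term matrix: 2 topics over a 4-word vocabulary (hypothetical
# values standing in for a fitted model's topic-term weights)
vocab = ['space', 'nasa', 'hockey', 'goal']
topic_term = np.array([[0.9, 0.8, 0.1, 0.2],
                       [0.1, 0.2, 0.9, 0.7]])

top_k = 2
for t, row in enumerate(topic_term):
    # indices of the largest weights, in descending order
    top_terms = [vocab[i] for i in np.argsort(row)[::-1][:top_k]]
    print(f'topic {t}:', top_terms)
# topic 0: ['space', 'nasa']
# topic 1: ['hockey', 'goal']
```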
Using the Model for Inference¶
Given an estimator that has been fit, we can instantiate the result for inference by
creating a tmnt.inferencer.BowVAEInferencer
object:
from tmnt.inferencer import BowVAEInferencer
inferencer = BowVAEInferencer(estimator, vectorizer=tf_vectorizer)
The BowVAEInferencer object encapsulates the trained model and the estimator used to fit it,
along with additional methods for applying the model to new data. It optionally contains
the TMNTVectorizer object that maps text data into the appropriate vector representation.
The snippet below uses the tmnt.inferencer.BowVAEInferencer.encode_texts() method to take raw text,
map each document string to a vector representation, and apply the trained encoder to obtain
one document encoding per input string:
encodings = \
inferencer.encode_texts(['Greater Armenia would stretch from Karabakh, to the Black Sea',
'I have two pairs of headphones I\'d like to sell. These are both excellent.'])
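Each returned encoding is a vector, so standard vector operations apply; documents on similar topics should have similar encodings. The sketch below uses hypothetical numpy vectors standing in for encode_texts output:

```python
import numpy as np

# Hypothetical document encodings standing in for encode_texts output
# (one vector per input document)
encodings = np.array([[0.9, 0.1, 0.0],
                      [0.8, 0.2, 0.1],
                      [0.0, 0.1, 0.9]])

def cosine(a, b):
    # cosine similarity: dot product normalized by vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar topic encodings give similarity near 1; unrelated ones near 0
print(cosine(encodings[0], encodings[1]))
print(cosine(encodings[0], encodings[2]))
```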
The BowVAEInferencer object can be saved to disk and reloaded for model deployment:
inferencer.save(model_dir='_model_dir')
reloaded_inferencer = BowVAEInferencer.from_saved(model_dir='_model_dir')
A more complete example, containing the code in this section along with some additional code, is available here.