tmnt.estimator

Estimator module to train/fit/estimate individual models with fixed hyperparameters. Estimators are used by trainers to manage training with specific datasets; in addition, the estimator API supports inference/encoding with fitted models.

Classes

BaseBowEstimator([n_labels, gamma, ...]) Bag of words variational autoencoder algorithm
BaseEstimator([vocabulary, log_method, ...]) Base class for all VAE-based estimators.
BowEstimator(*args, **kwargs)
BowMetricEstimator(*args[, ...])
CovariateBowEstimator(*args[, n_covars])
SeqBowEstimator(*args[, llm_model_name, ...])
SeqBowMetricEstimator(*args[, ...])
class BaseEstimator(vocabulary=None, log_method='log', quiet=False, coherence_coefficient=8.0, device='cpu', latent_distribution=None, lr=0.005, coherence_reg_penalty=0.0, redundancy_reg_penalty=0.0, batch_size=128, epochs=40, coherence_via_encoder=False, pretrained_param_file=None, warm_start=False, test_batch_size=0)[source]

Bases: object

Base class for all VAE-based estimators.

Parameters:
  • log_method (str) – Method for logging. ‘print’ | ‘log’, optional (default=’log’)
  • quiet (bool) – Flag for whether to force minimal logging/output. optional (default=False)
  • coherence_coefficient (float) – Weight to trade off influence of coherence vs perplexity in model selection objective (default = 8.0)
  • device (Optional[str]) – pytorch device
  • latent_distribution (Optional[BaseDistribution]) – Latent distribution of the variational autoencoder - defaults to LogisticGaussian with 20 dimensions
  • optimizer – optimizer (default = “adam”)
  • lr (float) – Learning rate of training. (default=0.005)
  • coherence_reg_penalty (float) – Regularization penalty for topic coherence. optional (default=0.0)
  • redundancy_reg_penalty (float) – Regularization penalty for topic redundancy. optional (default=0.0)
  • batch_size (int) – Batch training size. optional (default=128)
  • epochs (int) – Number of training epochs. optional (default=40)
  • coherence_via_encoder (bool) – Flag to use encoder to derive coherence scores (via gradient attribution)
  • pretrained_param_file (Optional[str]) – Path to pre-trained parameter file to initialize weights
  • warm_start (bool) – Subsequent calls to fit will use existing model weights rather than reinitializing
  • test_batch_size (int) –
fit(X, y)[source]

Fit VAE model according to the given training data X with optional co-variates y.

Parameters:
  • X (Tensor) – representing input data
  • y (Tensor) – representing covariate/labels associated with data elements
Return type:

NoReturn

fit_with_validation(X, y, val_X, val_Y)[source]

Fit VAE model according to the given training data X with optional co-variates y; validate (potentially each epoch) with validation data val_X and optional co-variates val_Y

Parameters:
  • X (Tensor) – representing training data
  • y (Tensor) – representing covariate/labels associated with data elements in training data
  • val_X (Tensor) – representing validation data
  • val_Y (Tensor) – representing covariate/labels associated with data elements in validation data
Return type:

NoReturn

class BaseBowEstimator(n_labels=0, gamma=1.0, multilabel=False, validate_each_epoch=False, enc_hidden_dim=150, embedding_source='random', embedding_size=128, fixed_embedding=False, num_enc_layers=1, enc_dr=0.1, classifier_dropout=0.1, *args, **kwargs)[source]

Bases: BaseEstimator

Bag of words variational autoencoder algorithm

Parameters:
  • n_labels (int) – Number of possible labels/classes when provided supervised data
  • gamma (float) – Coefficient that controls how supervised and unsupervised losses are weighted against each other
  • enc_hidden_dim (int) – Size of hidden encoder layers. optional (default=150)
  • embedding_source (str) – Word embedding source for vocabulary. ‘random’ | ‘glove’ | ‘fasttext’ | ‘word2vec’, optional (default=’random’)
  • embedding_size (int) – Word embedding size, ignored if embedding_source not ‘random’. optional (default=128)
  • fixed_embedding (bool) – Enable fixed embeddings. optional (default=False)
  • num_enc_layers (int) – Number of layers in encoder. optional (default=1)
  • enc_dr (float) – Dropout probability in encoder. optional (default=0.1)
  • coherence_via_encoder (bool) – Flag to use encoder to derive coherence scores (via gradient attribution)
  • validate_each_epoch (bool) – Perform validation of model against heldout validation data after each training epoch
  • multilabel (bool) – Assume labels are vectors denoting label sets associated with each document
  • classifier_dropout (float) –
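
With multilabel=True, each document's label is a vector denoting its label set. A minimal sketch of building such multi-hot vectors in plain Python is shown here. The helper name to_multi_hot is hypothetical and not part of TMNT, and the exact array type the estimator expects should be checked against the library (fit documents y as an ndarray).

```python
from typing import List, Sequence


def to_multi_hot(label_sets: Sequence[Sequence[int]], n_labels: int) -> List[List[int]]:
    """Hypothetical helper: turn per-document label-id sets into 0/1 vectors."""
    rows = []
    for labels in label_sets:
        row = [0] * n_labels
        for label in labels:
            if not 0 <= label < n_labels:
                raise ValueError(f"label {label} is outside [0, {n_labels})")
            row[label] = 1  # mark this label as present for the document
        rows.append(row)
    return rows


# Three documents: labels {0, 2}, label {1}, and no labels at all.
y = to_multi_hot([[0, 2], [1], []], n_labels=3)
```

The resulting nested list could be converted with numpy.asarray before being passed as y.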
classmethod from_saved(model_dir, device='cpu')[source]

Instantiate a BaseBowEstimator object from a saved model

Parameters:
  • model_dir (str) – String representing the path to the saved model directory
  • device (Optional[str]) –
Return type:

BaseBowEstimator

Returns:

BaseBowEstimator object

classmethod from_config(config, vocabulary, n_labels=0, coherence_coefficient=8.0, coherence_via_encoder=False, validate_each_epoch=False, pretrained_param_file=None, device='cpu')[source]

Create an estimator from a configuration file/object rather than by keyword arguments

Parameters:
  • config (Union[str, dict]) – Path to a json representation of a configuration or TMNT config dictionary
  • vocabulary (Union[str, Vocab]) – Path to a json representation of a vocabulary or vocabulary object
  • pretrained_param_file (Optional[str]) – Path to pretrained parameter file if using pretrained model
  • device (str) – PyTorch Device
  • n_labels (int) –
  • coherence_coefficient (float) –
  • coherence_via_encoder (bool) –
  • validate_each_epoch (bool) –
Return type:

BaseBowEstimator

Returns:

An estimator for training and evaluation of a single model
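
Because config may be either a path to a JSON file or a dictionary, callers can pass whichever they have. The sketch below illustrates one common way such a union argument is normalized. It is not TMNT's implementation; load_config and the key example_key are hypothetical names used only for illustration.

```python
import json
import os
import tempfile
from typing import Union


def load_config(config: Union[str, dict]) -> dict:
    """Hypothetical helper: return a dict whether given a dict or a JSON file path."""
    if isinstance(config, dict):
        return config
    with open(config, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Write a placeholder config to a temporary file, then load it both ways.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"example_key": 1}, fh)
    from_path = load_config(path)

from_dict = load_config({"example_key": 1})
```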

fit_with_validation(X, y, val_X, val_y, aux_X=None, opt_trial=None)[source]

Fit a model according to the options of this estimator and optionally evaluate on validation data

Parameters:
Return type:

Tuple[float, dict]

Returns:

sc_obj, v_res
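
The returned pair can be unpacked directly. As a rough sketch, a caller comparing several candidate estimators might keep the one with the best objective. This assumes that a higher sc_obj is better, which the text above does not state. pick_best and the stub class are hypothetical and exist only to make the example self-contained.

```python
from typing import Any, Iterable, Tuple


def pick_best(estimators: Iterable[Any], X, y, val_X, val_y) -> Tuple[Any, float, dict]:
    """Fit each candidate and keep the one with the highest objective (assumed better)."""
    best = None
    for est in estimators:
        sc_obj, v_res = est.fit_with_validation(X, y, val_X, val_y)
        if best is None or sc_obj > best[1]:
            best = (est, sc_obj, v_res)
    if best is None:
        raise ValueError("no estimators supplied")
    return best


class _StubValidationEstimator:
    """Stand-in used only to exercise pick_best; not part of TMNT."""

    def __init__(self, score: float):
        self.score = score

    def fit_with_validation(self, X, y, val_X, val_y, aux_X=None, opt_trial=None):
        return self.score, {"score": self.score}


candidates = [_StubValidationEstimator(0.2), _StubValidationEstimator(0.7)]
best_est, best_obj, best_res = pick_best(candidates, None, None, None, None)
```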

fit(X, y=None)[source]

Fit VAE model according to the given training data X with optional co-variates y.

Parameters:
  • X (csr_matrix) – representing input data
  • y (Optional[ndarray]) – representing covariate/labels associated with data elements
Return type:

BaseBowEstimator

Returns:

self

class BowEstimator(*args, **kwargs)[source]

Bases: BaseBowEstimator

classmethod from_config(*args, **kwargs)[source]

Create an estimator from a configuration file/object rather than by keyword arguments

Parameters:
  • config – Path to a json representation of a configuration or TMNT config dictionary
  • vocabulary – Path to a json representation of a vocabulary or vocabulary object
  • pretrained_param_file – Path to pretrained parameter file if using pretrained model
  • device – PyTorch Device
Returns:

An estimator for training and evaluation of a single model

classmethod from_saved(*args, **kwargs)[source]

Instantiate a BaseBowEstimator object from a saved model

Parameters: model_dir – String representing the path to the saved model directory
Returns: BaseBowEstimator object
perplexity(X)[source]

Calculate approximate perplexity for data X

Parameters: X (csr_matrix) – Document word matrix of shape [n_samples, vocab_size]
Return type: float
Returns: Perplexity score.
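
For orientation, perplexity is conventionally the exponential of the total negative log-likelihood divided by the total number of word tokens. The sketch below applies that standard definition to per-document values. It does not reproduce the estimator's internal computation, which for a VAE would rely on a bound on the likelihood rather than the exact value, and perplexity_from_nll is a hypothetical name.

```python
import math
from typing import Sequence


def perplexity_from_nll(neg_log_likelihoods: Sequence[float], doc_lengths: Sequence[int]) -> float:
    """exp(total negative log-likelihood / total token count)."""
    total_tokens = sum(doc_lengths)
    if total_tokens <= 0:
        raise ValueError("documents must contain at least one token")
    return math.exp(sum(neg_log_likelihoods) / total_tokens)


# Toy check: a model that gives every token probability 0.25
# should have a perplexity of 4, whatever the document lengths.
doc_lengths = [3, 5]
nll = [-n * math.log(0.25) for n in doc_lengths]
ppl = perplexity_from_nll(nll, doc_lengths)
```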
get_topic_vectors()[source]

Get topic vectors of the fitted model.

Returns: topic_distribution[i, j] represents word j in topic i. shape=(n_latent, vocab_size)
Return type: topic_distribution
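
Since row i of the returned matrix scores every vocabulary word for topic i, the highest-scoring columns give a readable summary of each topic. A minimal sketch follows, assuming an index-to-token list is available for the vocabulary. The helper top_words and the toy values are illustrative only.

```python
from typing import List, Sequence


def top_words(topic_vectors: Sequence[Sequence[float]], vocab: Sequence[str], k: int = 3) -> List[List[str]]:
    """Return the k highest-weighted vocabulary words for each topic row."""
    topics = []
    for row in topic_vectors:
        # Rank column indices by weight, largest first.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        topics.append([vocab[j] for j in ranked[:k]])
    return topics


# Toy matrix with n_latent=2 topics over a 4-word vocabulary.
vocab = ["goal", "match", "vote", "party"]
toy_topic_vectors = [
    [0.5, 0.4, 0.05, 0.05],
    [0.1, 0.0, 0.6, 0.3],
]
words = top_words(toy_topic_vectors, vocab, k=2)
```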
transform(X)[source]

Transform data X according to the fitted model.

Parameters: X (csr_matrix) – Document word matrix of shape (n_samples, n_features)
Returns: shape=(n_samples, n_latent) Document topic distribution for X
Return type: topic_distribution
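
Taken together, from_config, fit and transform suggest a simple create, fit, encode flow. The sketch below strings those three documented calls together. The estimator class is passed in as an argument so the example runs without TMNT installed; fit_and_encode and the stub class are hypothetical, and in real use one would presumably pass tmnt.estimator.BowEstimator with a csr_matrix input.

```python
def fit_and_encode(estimator_cls, config, vocabulary, X_train, X_new):
    """Build an estimator from a config, fit it, and encode new documents."""
    est = estimator_cls.from_config(config, vocabulary)
    est.fit(X_train)
    return est.transform(X_new)


class _StubBowEstimator:
    """Stand-in mimicking the documented method names; not part of TMNT."""

    @classmethod
    def from_config(cls, config, vocabulary):
        return cls()

    def fit(self, X, y=None):
        self.fitted = True
        return self

    def transform(self, X):
        # Pretend there is a single topic that receives all probability mass.
        return [[1.0] for _ in X]


encoded = fit_and_encode(_StubBowEstimator, {}, [], [[1, 0]], [[0, 1], [1, 1]])
```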
class BowMetricEstimator(*args, sdml_smoothing_factor=0.3, non_scoring_index=-1, **kwargs)[source]

Bases: BowEstimator

classmethod from_config(*args, **kwargs)[source]

Create an estimator from a configuration file/object rather than by keyword arguments

Parameters:
  • config – Path to a json representation of a configuration or TMNT config dictionary
  • vocabulary – Path to a json representation of a vocabulary or vocabulary object
  • pretrained_param_file – Path to pretrained parameter file if using pretrained model
  • device – PyTorch Device
Returns:

An estimator for training and evaluation of a single model

class CovariateBowEstimator(*args, n_covars=0, **kwargs)[source]

Bases: BaseBowEstimator

classmethod from_config(n_covars, *args, **kwargs)[source]

Create an estimator from a configuration file/object rather than by keyword arguments

Parameters:
  • config – Path to a json representation of a configuration or TMNT config dictionary
  • vocabulary – Path to a json representation of a vocabulary or vocabulary object
  • pretrained_param_file – Path to pretrained parameter file if using pretrained model
  • device – PyTorch Device
Returns:

An estimator for training and evaluation of a single model

get_topic_vectors()[source]

Get topic vectors of the fitted model.

Returns:
Topic word distribution. topic_distribution[i, j] represents word j in topic i.
shape=(n_latent, vocab_size)
Return type: topic_vectors
transform(X, y)[source]

Transform data X and y according to the fitted model.

Parameters:
  • X (csr_matrix) – Document word matrix of shape (n_samples, n_features)
  • y (ndarray) – Covariate matrix of shape (n_train_samples, n_covars)
Returns:

Document topic distribution for X and y of shape=(n_samples, n_latent)

class SeqBowEstimator(*args, llm_model_name='distilbert-base-uncased', n_labels=0, log_interval=5, warmup_ratio=0.1, gamma=1.0, multilabel=False, decoder_lr=0.01, checkpoint_dir=None, classifier_dropout=0.0, pure_classifier_objective=False, validate_each_epoch=False, entropy_loss_coef=0.0, pool_encoder=True, **kwargs)[source]

Bases: BaseEstimator

classmethod from_config(config, vocabulary, log_interval=1, pretrained_param_file=None, n_labels=None, device='cpu')[source]

Instantiate an object of this class using the provided config

Parameters:
  • config (Union[str, dict]) – Path to a configuration file (in json format) or an autogluon dictionary representing the config
  • log_interval (int) – Logging frequency (default = 1)
  • pretrained_param_file (Optional[str]) – Parameter file
  • device (str) – pytorch device
  • vocabulary (Vocab) –
  • n_labels (Optional[int]) –
Return type:

SeqBowEstimator

Returns:

An object of this class

write_model(model_dir, suffix='', vectorizer=None)[source]

Writes the model within this estimator to disk.

Parameters:
  • model_dir (str) – Output directory for model parameters, config and vocabulary
  • suffix (str) – Suffix to use for model (e.g. at different checkpoints)
Return type:

None

log_train(batch_id, batch_num, step_loss, rec_loss, red_loss, class_loss, log_interval, epoch_id, learning_rate)[source]

Generate and print out the log message for training.

fit_with_validation(train_data, dev_data, aux_data)[source]

Training function.

Parameters:
  • train_data (DataLoader) – Dataloader with training data.
  • dev_data (DataLoader) – Dataloader with dev/validation data.
  • aux_data (DataLoader) – Dataloader with auxiliary data.
class SeqBowMetricEstimator(*args, sdml_smoothing_factor=0.3, metric_loss_temp=0.1, use_sdml=False, non_scoring_index=-1, **kwargs)[source]

Bases: SeqBowEstimator

classmethod from_config(*args, **kwargs)[source]

Instantiate an object of this class using the provided config

Parameters:
  • config – Path to a configuration file (in json format) or an autogluon dictionary representing the config
  • log_interval – Logging frequency (default = 1)
  • pretrained_param_file – Parameter file
  • device – pytorch device
Returns:

An object of this class