tmnt.estimator

Estimator module to train/fit/estimate individual models with fixed hyperparameters. Estimators are used by trainers to manage training with specific datasets; in addition, the estimator API supports inference/encoding with fitted models.

Classes

`BaseBowEstimator`([n_labels, gamma, ...])	Bag of words variational autoencoder algorithm
`BaseEstimator`([vocabulary, log_method, ...])	Base class for all VAE-based estimators.
`BowEstimator`(*args, **kwargs)
`BowMetricEstimator`(*args[, ...])
`CovariateBowEstimator`(*args[, n_covars])
`SeqBowEstimator`(*args[, llm_model_name, ...])
`SeqBowMetricEstimator`(*args[, ...])

class BaseEstimator(vocabulary=None, log_method='log', quiet=False, coherence_coefficient=8.0, device='cpu', latent_distribution=None, lr=0.005, coherence_reg_penalty=0.0, redundancy_reg_penalty=0.0, batch_size=128, epochs=40, coherence_via_encoder=False, pretrained_param_file=None, warm_start=False, test_batch_size=0)

Bases: object

Base class for all VAE-based estimators.

Parameters:

log_method (str) – Method for logging. 'print' | 'log', optional (default='log')
quiet (bool) – Flag for whether to force minimal logging/output. optional (default=False)
coherence_coefficient (float) – Weight that trades off the influence of coherence against perplexity in the model selection objective. optional (default=8.0)
device (Optional[str]) – PyTorch device. optional (default='cpu')
latent_distribution (Optional[BaseDistribution]) – Latent distribution of the variational autoencoder - defaults to LogisticGaussian with 20 dimensions
optimizer – optimizer (default = “adam”)
lr (float) – Learning rate of training. (default=0.005)
coherence_reg_penalty (float) – Regularization penalty for topic coherence. optional (default=0.0)
redundancy_reg_penalty (float) – Regularization penalty for topic redundancy. optional (default=0.0)
batch_size (int) – Batch training size. optional (default=128)
epochs (int) – Number of training epochs. optional (default=40)
coherence_via_encoder (bool) – Flag to use encoder to derive coherence scores (via gradient attribution)
pretrained_param_file (Optional[str]) – Path to pre-trained parameter file to initialize weights
warm_start (bool) – Subsequent calls to fit will use existing model weights rather than reinitializing
test_batch_size (int) –
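
A toy illustration of the warm_start behaviour described above may help. The class below is a hypothetical stand-in, not TMNT code: ToyEstimator, its single scalar weight, and the update rule are all assumptions made for this sketch. It only shows the difference between reinitializing on every call to fit and continuing from the existing weights.

```python
import random


class ToyEstimator:
    """Hypothetical stand-in illustrating warm_start semantics (not TMNT code)."""

    def __init__(self, warm_start=False, lr=0.5, seed=0):
        self.warm_start = warm_start
        self.lr = lr
        self._rng = random.Random(seed)
        self.weight = None  # no model exists until fit is called

    def fit(self, targets):
        # Reinitialize unless warm_start is set and a model already exists.
        if self.weight is None or not self.warm_start:
            self.weight = self._rng.uniform(-1.0, 1.0)
        for t in targets:
            # One gradient step on the squared error (weight - t)^2 / 2.
            self.weight -= self.lr * (self.weight - t)
        return self


def two_fits(warm_start):
    """Call fit twice on the same data and report the weight after each call."""
    est = ToyEstimator(warm_start=warm_start)
    first = est.fit([1.0] * 3).weight
    second = est.fit([1.0] * 3).weight
    return first, second


cold_first, cold_second = two_fits(warm_start=False)
warm_first, warm_second = two_fits(warm_start=True)
```

With warm_start=True the second call starts from the weight left by the first, so the error keeps shrinking. With warm_start=False the second call starts again from a fresh random weight.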

fit(X, y)

Fit the VAE model to the training data X, with optional covariates y.

Parameters:	X (`Tensor`) – Input data
	y (`Tensor`) – Covariates/labels associated with the data elements
Return type:	`NoReturn`

fit_with_validation(X, y, val_X, val_Y)

Fit the VAE model to the training data X with optional covariates y; validate (potentially each epoch) with validation data val_X and optional covariates val_Y.

Parameters:	X (`Tensor`) – Training data
	y (`Tensor`) – Covariates/labels associated with the training data elements
	val_X (`Tensor`) – Validation data
	val_Y (`Tensor`) – Covariates/labels associated with the validation data elements
Return type:	`NoReturn`

class BaseBowEstimator(n_labels=0, gamma=1.0, multilabel=False, validate_each_epoch=False, enc_hidden_dim=150, embedding_source='random', embedding_size=128, fixed_embedding=False, num_enc_layers=1, enc_dr=0.1, classifier_dropout=0.1, *args, **kwargs)

Bases: BaseEstimator

Bag of words variational autoencoder algorithm

Parameters:

n_labels (int) – Number of possible labels/classes when provided supervised data
gamma (float) – Coefficient that controls how supervised and unsupervised losses are weighted against each other
enc_hidden_dim (int) – Size of hidden encoder layers. optional (default=150)
embedding_source (str) – Word embedding source for vocabulary. 'random' | 'glove' | 'fasttext' | 'word2vec', optional (default='random')
embedding_size (int) – Word embedding size, ignored if embedding_source is not 'random'. optional (default=128)
fixed_embedding (bool) – Enable fixed embeddings. optional (default=False)
num_enc_layers (int) – Number of layers in encoder. optional (default=1)
enc_dr (float) – Dropout probability in encoder. optional (default=0.1)
coherence_via_encoder (bool) – Flag to use encoder to derive coherence scores (via gradient attribution)
validate_each_epoch (bool) – Perform validation of model against heldout validation data after each training epoch
multilabel (bool) – Assume labels are vectors denoting label sets associated with each document
classifier_dropout (float) –
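
The multilabel description above says labels are vectors denoting label sets. One common reading is a multi-hot encoding, sketched below. The helper to_multi_hot is hypothetical and not part of TMNT, and the array type and dtype that the estimator actually expects are not stated in this reference, so treat the output format as an assumption.

```python
def to_multi_hot(label_sets, n_labels):
    """Encode each document's set of label ids as a 0/1 vector of length n_labels."""
    rows = []
    for labels in label_sets:
        row = [0.0] * n_labels
        for label in labels:
            if not 0 <= label < n_labels:
                raise ValueError(f"label {label} outside range [0, {n_labels})")
            row[label] = 1.0
        rows.append(row)
    return rows


# Three documents: labels {0, 2}, no labels, and label {1}.
y = to_multi_hot([{0, 2}, set(), {1}], n_labels=3)
```

Each row has one position per possible label, so n_labels fixes the vector length regardless of how many labels a document carries.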

classmethod from_saved(model_dir, device='cpu')

Instantiate a BaseBowEstimator object from a saved model.

Parameters:	model_dir (`str`) – Path to the saved model directory
	device (`Optional`[`str`]) – PyTorch device
Return type:	`BaseBowEstimator`
Returns:	BaseBowEstimator object

classmethod from_config(config, vocabulary, n_labels=0, coherence_coefficient=8.0, coherence_via_encoder=False, validate_each_epoch=False, pretrained_param_file=None, device='cpu')

Create an estimator from a configuration file/object rather than by keyword arguments.

Parameters:	config (`Union`[`str`, `dict`]) – Path to a JSON representation of a configuration, or a TMNT config dictionary
	vocabulary (`Union`[`str`, `Vocab`]) – Path to a JSON representation of a vocabulary, or a vocabulary object
	pretrained_param_file (`Optional`[`str`]) – Path to a pretrained parameter file, if using a pretrained model
	device (`str`) – PyTorch device
	n_labels (int) –
	coherence_coefficient (float) –
	coherence_via_encoder (bool) –
	validate_each_epoch (bool) –
Return type:	`BaseBowEstimator`
Returns:	An estimator for training and evaluation of a single model
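
Because config may be either a path to a JSON file or a dictionary, a caller-side helper that normalizes the two forms can be handy. The sketch below is an assumption about how such an argument might be resolved, not TMNT's own logic: resolve_config is a hypothetical helper, and the keys "epochs" and "lr" are placeholder names rather than documented TMNT config keys.

```python
import json
import os
import tempfile


def resolve_config(config):
    """Return a config dictionary from either a dict or a path to a JSON file."""
    if isinstance(config, dict):
        return config
    if isinstance(config, str):
        with open(config, "r", encoding="utf-8") as fh:
            return json.load(fh)
    raise TypeError(f"expected str or dict, got {type(config).__name__}")


# The same settings supplied both ways resolve to the same dictionary.
settings = {"epochs": 40, "lr": 0.005}  # placeholder keys, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(settings, fh)
    from_path = resolve_config(path)
from_dict = resolve_config(settings)
```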

fit_with_validation(X, y, val_X, val_y, aux_X=None, opt_trial=None)

Fit a model according to the options of this estimator and optionally evaluate on validation data.

Parameters:	X (`Union`[`Tensor`, `coo_matrix`, `csr_matrix`]) – Input training tensor
	y (`Union`[`Tensor`, `ndarray`]) – Input labels/covariates to use (optionally) for covariate models
	val_X (`Union`[`Tensor`, `coo_matrix`, `csr_matrix`, `None`]) – Validation input tensor
	val_y (`Union`[`Tensor`, `ndarray`, `None`]) – Validation covariates
	aux_X (`Union`[`Tensor`, `coo_matrix`, `csr_matrix`, `None`]) – Auxiliary unlabeled data for semi-supervised training
	opt_trial (Optional[Trial]) –
Return type:	`Tuple`[`float`, `dict`]
Returns:	sc_obj, v_res

fit(X, y=None)

Fit the VAE model to the training data X, with optional covariates y.

Parameters:	X (`csr_matrix`) – Input data
	y (`Optional`[`ndarray`]) – Covariates/labels associated with the data elements
Return type:	`BaseBowEstimator`
Returns:	self
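
Since fit takes a `csr_matrix`, the following sketch shows one way to build a document-word count matrix and pass it in. It rests on several assumptions: count_rows and fit_bow are hypothetical helpers, the vocabulary argument is assumed to be a vocabulary object in whatever form TMNT accepts, integer counts are assumed to be an acceptable dtype, and the keyword names come from the signatures listed on this page. The SciPy and TMNT imports are kept inside fit_bow so the counting helper runs on its own.

```python
from collections import Counter


def count_rows(docs, vocab_index):
    """Return (data, indices, indptr) in CSR layout for tokenized documents."""
    data, indices, indptr = [], [], [0]
    for tokens in docs:
        # Count in-vocabulary tokens by column id; unknown tokens are skipped.
        counts = Counter(vocab_index[t] for t in tokens if t in vocab_index)
        for col in sorted(counts):
            indices.append(col)
            data.append(counts[col])
        indptr.append(len(indices))
    return data, indices, indptr


def fit_bow(docs, vocab_index, vocabulary, epochs=5):
    """Sketch of fitting an estimator; requires SciPy and TMNT to be installed."""
    from scipy.sparse import csr_matrix
    from tmnt.estimator import BowEstimator

    data, indices, indptr = count_rows(docs, vocab_index)
    X = csr_matrix((data, indices, indptr), shape=(len(docs), len(vocab_index)))
    estimator = BowEstimator(vocabulary=vocabulary, epochs=epochs)
    return estimator.fit(X)  # fit returns the estimator itself


docs = [["a", "b", "a"], ["c", "unknown"]]
vocab_index = {"a": 0, "b": 1, "c": 2}
data, indices, indptr = count_rows(docs, vocab_index)
```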

class BowEstimator(*args, **kwargs)

Bases: BaseBowEstimator

classmethod from_config(*args, **kwargs)

Create an estimator from a configuration file/object rather than by keyword arguments.

Parameters:	config – Path to a JSON representation of a configuration, or a TMNT config dictionary
	vocabulary – Path to a JSON representation of a vocabulary, or a vocabulary object
	pretrained_param_file – Path to a pretrained parameter file, if using a pretrained model
	device – PyTorch device
Returns:	An estimator for training and evaluation of a single model

classmethod from_saved(*args, **kwargs)

Instantiate a BaseBowEstimator object from a saved model

Parameters:	model_dir – String representing the path to the saved model directory
Returns:	BaseBowEstimator object

perplexity(X)

Calculate approximate perplexity for data X.

Parameters:	X (`csr_matrix`) – Document word matrix of shape [n_samples, vocab_size]
Return type:	`float`
Returns:	Perplexity score.
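
For orientation, the textbook definition of perplexity is the exponential of the negative log-likelihood per token. The sketch below implements that definition only. How TMNT approximates the log-likelihood is not described here, so approx_perplexity and its inputs are assumptions for illustration, not the library's computation.

```python
import math


def approx_perplexity(log_likelihoods, doc_lengths):
    """exp(-total log-likelihood / total token count) over a set of documents."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("cannot compute perplexity over zero tokens")
    return math.exp(-sum(log_likelihoods) / total_tokens)


# Sanity check: a model that is uniform over a 1000-word vocabulary
# assigns log(1/1000) to every token, so its perplexity is 1000.
doc_lengths = [10, 25]
uniform_ll = [n * math.log(1.0 / 1000) for n in doc_lengths]
ppl = approx_perplexity(uniform_ll, doc_lengths)
```

Lower values indicate that the model assigns higher probability to the observed words.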

get_topic_vectors()

Get topic vectors of the fitted model.

Returns:	topic_distribution[i, j] represents word j in topic i. shape=(n_latent, vocab_size)
Return type:	topic_distribution
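
Given a matrix laid out as described, with one row per topic and one column per vocabulary word, the highest-weighted words of each topic can be read off row by row. The sketch below uses plain nested lists and a made-up four-word vocabulary; top_words and id_to_word are hypothetical names, and the numbers are invented for illustration.

```python
def top_words(topic_vectors, id_to_word, k=3):
    """Return the k highest-weighted words for each topic (row)."""
    tops = []
    for row in topic_vectors:
        # Rank column ids by weight, largest first.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        tops.append([id_to_word[j] for j in ranked[:k]])
    return tops


# Shape (n_latent=2, vocab_size=4); values are invented.
topic_vectors = [
    [0.50, 0.30, 0.10, 0.10],
    [0.05, 0.05, 0.60, 0.30],
]
id_to_word = ["goal", "team", "vote", "party"]
result = top_words(topic_vectors, id_to_word, k=2)
```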

transform(X)

Transform data X according to the fitted model.

Parameters:	X (`csr_matrix`) – Document word matrix of shape [n_samples, n_features]
Returns:	shape=(n_samples, n_latent) Document topic distribution for X
Return type:	topic_distribution
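
A typical follow-up to transform is to assign each document its highest-weighted topic. The sketch below assumes the output has been converted to nested lists of shape (n_samples, n_latent); dominant_topics is a hypothetical helper and the values are invented.

```python
def dominant_topics(doc_topic):
    """Return, for each document (row), the index of its largest topic weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


# Two documents over three topics; values are invented.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
assignments = dominant_topics(doc_topic)
```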

class BowMetricEstimator(*args, sdml_smoothing_factor=0.3, non_scoring_index=-1, **kwargs)

Bases: BowEstimator

classmethod from_config(*args, **kwargs)

Create an estimator from a configuration file/object rather than by keyword arguments.

Parameters:	config – Path to a JSON representation of a configuration, or a TMNT config dictionary
	vocabulary – Path to a JSON representation of a vocabulary, or a vocabulary object
	pretrained_param_file – Path to a pretrained parameter file, if using a pretrained model
	device – PyTorch device
Returns:	An estimator for training and evaluation of a single model

class CovariateBowEstimator(*args, n_covars=0, **kwargs)

Bases: BaseBowEstimator

classmethod from_config(n_covars, *args, **kwargs)

Create an estimator from a configuration file/object rather than by keyword arguments.

Parameters:	config – Path to a JSON representation of a configuration, or a TMNT config dictionary
	vocabulary – Path to a JSON representation of a vocabulary, or a vocabulary object
	pretrained_param_file – Path to a pretrained parameter file, if using a pretrained model
	device – PyTorch device
Returns:	An estimator for training and evaluation of a single model

get_topic_vectors()

Get topic vectors of the fitted model.

Returns:	Topic word distribution. topic_distribution[i, j] represents word j in topic i. shape=(n_latent, vocab_size)
Return type:	topic_vectors

transform(X, y)

Transform data X and y according to the fitted model.

Parameters:	X (`csr_matrix`) – Document word matrix of shape (n_samples, n_features)
	y (`ndarray`) – Covariate matrix of shape (n_train_samples, n_covars)
Returns:	Document topic distribution for X and y of shape=(n_samples, n_latent)

class SeqBowEstimator(*args, llm_model_name='distilbert-base-uncased', n_labels=0, log_interval=5, warmup_ratio=0.1, gamma=1.0, multilabel=False, decoder_lr=0.01, checkpoint_dir=None, classifier_dropout=0.0, pure_classifier_objective=False, validate_each_epoch=False, entropy_loss_coef=0.0, pool_encoder=True, **kwargs)

Bases: BaseEstimator

classmethod from_config(config, vocabulary, log_interval=1, pretrained_param_file=None, n_labels=None, device='cpu')

Instantiate an object of this class using the provided config.

Parameters:	config (`Union`[`str`, `dict`]) – Path to a configuration file (in JSON format) or an autogluon dictionary representing the config
	log_interval (`int`) – Logging frequency (default = 1)
	pretrained_param_file (`Optional`[`str`]) – Parameter file
	device (`str`) – PyTorch device
	vocabulary (Vocab) –
	n_labels (Optional[int]) –
Return type:	`SeqBowEstimator`
Returns:	An object of this class

write_model(model_dir, suffix='', vectorizer=None)

Writes the model within this estimator to disk.

Parameters:	model_dir (`str`) – Output directory for model parameters, config and vocabulary
	suffix (`str`) – Suffix to use for model (e.g. at different checkpoints)
Return type:	`None`

log_train(batch_id, batch_num, step_loss, rec_loss, red_loss, class_loss, log_interval, epoch_id, learning_rate)

Generate and print out the log message for training.

fit_with_validation(train_data, dev_data, aux_data)

Training function.

Parameters:	train_data (`DataLoader`) – Dataloader with training data.
	dev_data (`DataLoader`) – Dataloader with dev/validation data.
	aux_data (`DataLoader`) – Dataloader with auxiliary data.

class SeqBowMetricEstimator(*args, sdml_smoothing_factor=0.3, metric_loss_temp=0.1, use_sdml=False, non_scoring_index=-1, **kwargs)

Bases: SeqBowEstimator

classmethod from_config(*args, **kwargs)

Instantiate an object of this class using the provided config.

Parameters:	config – Path to a configuration file (in JSON format) or an autogluon dictionary representing the config
	log_interval – Logging frequency (default = 1)
	pretrained_param_file – Parameter file
	device – PyTorch device
Returns:	An object of this class