tmnt.estimator
Estimator module to train/fit/estimate individual models with fixed hyperparameters. Estimators are used by trainers to manage training with specific datasets; in addition, the estimator API supports inference/encoding with fitted models.
Classes
BaseBowEstimator([n_labels, gamma, ...]) | Bag of words variational autoencoder algorithm
BaseEstimator([vocabulary, log_method, ...]) | Base class for all VAE-based estimators.
BowEstimator(*args, **kwargs) |
BowMetricEstimator(*args[, ...]) |
CovariateBowEstimator(*args[, n_covars]) |
SeqBowEstimator(*args[, llm_model_name, ...]) |
SeqBowMetricEstimator(*args[, ...]) |
class BaseEstimator(vocabulary=None, log_method='log', quiet=False, coherence_coefficient=8.0, device='cpu', latent_distribution=None, lr=0.005, coherence_reg_penalty=0.0, redundancy_reg_penalty=0.0, batch_size=128, epochs=40, coherence_via_encoder=False, pretrained_param_file=None, warm_start=False, test_batch_size=0)
Bases: object
Base class for all VAE-based estimators.
Parameters:
- log_method (str) – Method for logging. 'print' | 'log', optional (default='log')
- quiet (bool) – Flag for whether to force minimal logging/output. optional (default=False)
- coherence_coefficient (float) – Weight to trade off the influence of coherence vs. perplexity in the model selection objective (default=8.0)
- device (Optional[str]) – PyTorch device
- latent_distribution (Optional[BaseDistribution]) – Latent distribution of the variational autoencoder; defaults to LogisticGaussian with 20 dimensions
- optimizer – Optimizer (default="adam")
- lr (float) – Learning rate of training. (default=0.005)
- coherence_reg_penalty (float) – Regularization penalty for topic coherence. optional (default=0.0)
- redundancy_reg_penalty (float) – Regularization penalty for topic redundancy. optional (default=0.0)
- batch_size (int) – Batch training size. optional (default=128)
- epochs (int) – Number of training epochs. optional (default=40)
- coherence_via_encoder (bool) – Flag to use encoder to derive coherence scores (via gradient attribution)
- pretrained_param_file (Optional[str]) – Path to pre-trained parameter file to initialize weights
- warm_start (bool) – Subsequent calls to fit will use existing model weights rather than reinitializing
- test_batch_size (int) –
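The effect of warm_start can be illustrated with a toy class. This is only a sketch of the described behaviour: `ToyEstimator`, its weight update, and `init_count` are hypothetical and are not part of tmnt.

```python
import random


class ToyEstimator:
    """Hypothetical stand-in used only to illustrate warm_start semantics."""

    def __init__(self, warm_start=False, seed=0):
        self.warm_start = warm_start
        self._rng = random.Random(seed)
        self.weights = None      # no model exists until the first fit
        self.init_count = 0      # how many times weights were (re)initialized

    def _initialize(self, n_features):
        self.weights = [self._rng.uniform(-0.1, 0.1) for _ in range(n_features)]
        self.init_count += 1

    def fit(self, X):
        n_features = len(X[0])
        # Reinitialize unless warm_start is set and weights already exist.
        if self.weights is None or not self.warm_start:
            self._initialize(n_features)
        # One toy update step: move each weight halfway toward the column mean.
        for j in range(n_features):
            col_mean = sum(row[j] for row in X) / len(X)
            self.weights[j] += 0.5 * (col_mean - self.weights[j])
        return self


X = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]]
cold = ToyEstimator(warm_start=False).fit(X).fit(X)   # second fit starts over
warm = ToyEstimator(warm_start=True).fit(X).fit(X)    # second fit continues
```

With warm_start=False the second call to fit discards the first result; with warm_start=True it continues from the weights left by the first call.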
fit(X, y)
Fit VAE model according to the given training data X with optional co-variates y.
Parameters:
- X (Tensor) – representing input data
- y (Tensor) – representing covariate/labels associated with data elements
fit_with_validation(X, y, val_X, val_Y)
Fit VAE model according to the given training data X with optional co-variates y; validate (potentially each epoch) with validation data val_X and optional co-variates val_Y.
Parameters:
- X (Tensor) – representing training data
- y (Tensor) – representing covariate/labels associated with data elements in training data
- val_X (Tensor) – representing validation data
- val_Y (Tensor) – representing covariate/labels associated with data elements in validation data
class BaseBowEstimator(n_labels=0, gamma=1.0, multilabel=False, validate_each_epoch=False, enc_hidden_dim=150, embedding_source='random', embedding_size=128, fixed_embedding=False, num_enc_layers=1, enc_dr=0.1, classifier_dropout=0.1, *args, **kwargs)
Bases: BaseEstimator
Bag of words variational autoencoder algorithm
Parameters:
- n_labels (int) – Number of possible labels/classes when provided supervised data
- gamma (float) – Coefficient that controls how supervised and unsupervised losses are weighted against each other
- enc_hidden_dim (int) – Size of hidden encoder layers. optional (default=150)
- embedding_source (str) – Word embedding source for vocabulary. 'random' | 'glove' | 'fasttext' | 'word2vec', optional (default='random')
- embedding_size (int) – Word embedding size, ignored if embedding_source not 'random'. optional (default=128)
- fixed_embedding (bool) – Enable fixed embeddings. optional (default=False)
- num_enc_layers (int) – Number of layers in encoder. optional (default=1)
- enc_dr (float) – Dropout probability in encoder. optional (default=0.1)
- coherence_via_encoder – Flag
- validate_each_epoch (bool) – Perform validation of model against heldout validation data after each training epoch
- multilabel (bool) – Assume labels are vectors denoting label sets associated with each document
- classifier_dropout (float) –
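For the multilabel option, one common way to express "vectors denoting label sets" is a multi-hot encoding with one 0/1 entry per label. The sketch below assumes that representation; the source does not specify the exact layout the estimator expects, and `to_multi_hot` is a hypothetical helper rather than a tmnt function.

```python
def to_multi_hot(label_sets, n_labels):
    """Encode each document's set of label ids as a 0/1 vector of length n_labels."""
    rows = []
    for labels in label_sets:
        row = [0] * n_labels
        for label in labels:
            if not 0 <= label < n_labels:
                raise ValueError(f"label {label} outside [0, {n_labels})")
            row[label] = 1
        rows.append(row)
    return rows


# Three documents, four possible labels; the second document has no labels.
y = to_multi_hot([{0, 2}, set(), {3}], n_labels=4)
```

Each row has length n_labels, so the matrix has one row per document and one column per label.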
classmethod from_saved(model_dir, device='cpu')
Instantiate a BaseBowEstimator object from a saved model
Parameters:
- model_dir – String representing the path to the saved model directory
Returns: BaseBowEstimator object
classmethod from_config(config, vocabulary, n_labels=0, coherence_coefficient=8.0, coherence_via_encoder=False, validate_each_epoch=False, pretrained_param_file=None, device='cpu')
Create an estimator from a configuration file/object rather than by keyword arguments
Parameters:
- config (Union[str, dict]) – Path to a json representation of a configuration or TMNT config dictionary
- vocabulary (Union[str, Vocab]) – Path to a json representation of a vocabulary or vocabulary object
- pretrained_param_file (Optional[str]) – Path to pretrained parameter file if using pretrained model
- device (str) – PyTorch device
- n_labels (int) –
- coherence_coefficient (float) –
- coherence_via_encoder (bool) –
- validate_each_epoch (bool) –
Returns: An estimator for training and evaluation of a single model
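The config argument is typed as either a path to a JSON file or a dictionary. A minimal sketch of how such a Union[str, dict] argument can be normalized is shown below. `load_config` is a hypothetical helper, not tmnt's implementation, and the keys in `example` are placeholders rather than a real TMNT configuration schema.

```python
import json
import os
import tempfile


def load_config(config):
    """Return a config dict from either a dict or a path to a JSON file."""
    if isinstance(config, dict):
        return dict(config)            # shallow copy so the caller's dict is untouched
    if isinstance(config, str):
        with open(config, encoding="utf-8") as fh:
            return json.load(fh)
    raise TypeError(f"config must be str or dict, got {type(config).__name__}")


# Placeholder keys for illustration only (not a real TMNT schema).
example = {"lr": 0.005, "epochs": 40, "batch_size": 128}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(example, fh)
    from_path = load_config(path)      # the str branch reads the JSON file

from_dict = load_config(example)       # the dict branch copies the mapping
```

Both call styles yield the same dictionary, which is why a single from_config entry point can accept either form.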
fit_with_validation(X, y, val_X, val_y, aux_X=None, opt_trial=None)
Fit a model according to the options of this estimator and optionally evaluate on validation data
Parameters:
- X (Union[Tensor, coo_matrix, csr_matrix]) – Input training tensor
- y (Union[Tensor, ndarray]) – Input labels/co-variates to use (optionally) for co-variate models
- val_X (Union[Tensor, coo_matrix, csr_matrix, None]) – Validation input tensor
- val_y (Union[Tensor, ndarray, None]) – Validation co-variates
- aux_X (Union[Tensor, coo_matrix, csr_matrix, None]) – Auxiliary unlabeled data for semi-supervised training
- opt_trial (Optional[Trial]) –
Returns: sc_obj, v_res
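fit_with_validation takes separate training and validation inputs, so the data has to be partitioned beforehand. The following sketch shows one simple way to hold out a validation portion using plain Python lists. `holdout_split` and its `val_fraction` parameter are hypothetical, and real inputs would be tensors or sparse matrices rather than lists.

```python
import random


def holdout_split(rows, labels, val_fraction=0.2, seed=0):
    """Shuffle row indices and split rows/labels into training and validation parts."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(rows) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(labels, train_idx),
            pick(rows, val_idx), pick(labels, val_idx))


rows = [[i, i + 1] for i in range(10)]   # toy bag-of-words counts
labels = list(range(10))
X, y, val_X, val_y = holdout_split(rows, labels)
```

The four returned pieces correspond to the X, y, val_X and val_y arguments in the signature above.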
fit(X, y=None)
Fit VAE model according to the given training data X with optional co-variates y.
Parameters:
- X (csr_matrix) – representing input data
- y (Optional[ndarray]) – representing covariate/labels associated with data elements
Returns: self
class BowEstimator(*args, **kwargs)
Bases: BaseBowEstimator
classmethod from_config(*args, **kwargs)
Create an estimator from a configuration file/object rather than by keyword arguments
Parameters:
- config – Path to a json representation of a configuration or TMNT config dictionary
- vocabulary – Path to a json representation of a vocabulary or vocabulary object
- pretrained_param_file – Path to pretrained parameter file if using pretrained model
- device – PyTorch device
Returns: An estimator for training and evaluation of a single model
classmethod from_saved(*args, **kwargs)
Instantiate a BaseBowEstimator object from a saved model
Parameters:
- model_dir – String representing the path to the saved model directory
Returns: BaseBowEstimator object
perplexity(X)
Calculate approximate perplexity for data X.
Parameters:
- X (csr_matrix) – Document word matrix of shape (n_samples, vocab_size)
Return type: float
Returns: Perplexity score.
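For orientation, the textbook definition of perplexity on bag-of-words data is exp(-total log-likelihood / total token count). The sketch below computes that quantity for a toy model. It is not the library's computation (which the source describes only as approximate), and the function and variable names are illustrative assumptions.

```python
import math


def perplexity(counts, doc_topic, topic_word):
    """exp(-total log-likelihood / total token count) for bag-of-words counts."""
    total_log_lik = 0.0
    total_tokens = 0
    for doc_counts, theta in zip(counts, doc_topic):
        for w, n in enumerate(doc_counts):
            if n == 0:
                continue
            # p(word w | document) = sum_k theta[k] * topic_word[k][w]
            p = sum(t * beta[w] for t, beta in zip(theta, topic_word))
            total_log_lik += n * math.log(p)
            total_tokens += n
    return math.exp(-total_log_lik / total_tokens)


# Two topics over a three-word vocabulary; each row sums to 1.
topic_word = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
doc_topic = [[1.0, 0.0], [0.0, 1.0]]
counts = [[2, 2, 0], [0, 1, 3]]
score = perplexity(counts, doc_topic, topic_word)
```

In this toy case every observed word has probability 0.5 under its document, so the perplexity is 2: the model is as uncertain as a fair choice between two words. Lower values indicate a better fit.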
get_topic_vectors()
Get topic vectors of the fitted model.
Returns: topic_distribution[i, j] represents word j in topic i. shape=(n_latent, vocab_size)
Return type: topic_distribution
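Given a matrix with the documented layout (row i is a topic, column j is a word), the most characteristic words of each topic can be read off by sorting each row. A minimal sketch follows, using nested lists in place of the returned array. `top_words`, the vocabulary, and the weights are made up for illustration.

```python
def top_words(topic_vectors, vocab, k=3):
    """Return the k highest-weighted vocabulary terms for each topic row."""
    result = []
    for row in topic_vectors:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:k]])
    return result


vocab = ["ball", "game", "vote", "law", "team"]
topic_vectors = [
    [0.9, 0.7, 0.1, 0.0, 0.8],   # topic 0
    [0.0, 0.1, 0.9, 0.8, 0.2],   # topic 1
]
words = top_words(topic_vectors, vocab, k=3)
```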
transform(X)
Transform data X according to the fitted model.
Parameters:
- X (csr_matrix) – Document word matrix of shape (n_samples, n_features)
Returns: Document topic distribution for X, shape=(n_samples, n_latent)
Return type: topic_distribution
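A document topic matrix of shape (n_samples, n_latent) is often summarized by taking the strongest topic per document. The sketch below assumes such a matrix is available as nested lists. `dominant_topics` is a hypothetical helper and the numbers are illustrative.

```python
def dominant_topics(doc_topic):
    """Return (topic index, weight) of the largest entry in each document row."""
    out = []
    for row in doc_topic:
        best = max(range(len(row)), key=lambda k: row[k])
        out.append((best, row[best]))
    return out


doc_topic = [
    [0.7, 0.2, 0.1],   # document 0 is mostly topic 0
    [0.1, 0.3, 0.6],   # document 1 is mostly topic 2
]
assignments = dominant_topics(doc_topic)
```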
class BowMetricEstimator(*args, sdml_smoothing_factor=0.3, non_scoring_index=-1, **kwargs)
Bases: BowEstimator
classmethod from_config(*args, **kwargs)
Create an estimator from a configuration file/object rather than by keyword arguments
Parameters:
- config – Path to a json representation of a configuration or TMNT config dictionary
- vocabulary – Path to a json representation of a vocabulary or vocabulary object
- pretrained_param_file – Path to pretrained parameter file if using pretrained model
- device – PyTorch device
Returns: An estimator for training and evaluation of a single model
class CovariateBowEstimator(*args, n_covars=0, **kwargs)
Bases: BaseBowEstimator
classmethod from_config(n_covars, *args, **kwargs)
Create an estimator from a configuration file/object rather than by keyword arguments
Parameters:
- config – Path to a json representation of a configuration or TMNT config dictionary
- vocabulary – Path to a json representation of a vocabulary or vocabulary object
- pretrained_param_file – Path to pretrained parameter file if using pretrained model
- device – PyTorch device
Returns: An estimator for training and evaluation of a single model
get_topic_vectors()
Get topic vectors of the fitted model.
Returns: Topic word distribution. topic_distribution[i, j] represents word j in topic i. shape=(n_latent, vocab_size)
Return type: topic_vectors
transform(X, y)
Transform data X and y according to the fitted model.
Parameters:
- X (csr_matrix) – Document word matrix of shape (n_samples, n_features)
- y (ndarray) – Covariate matrix of shape (n_train_samples, n_covars)
Returns: Document topic distribution for X and y of shape=(n_samples, n_latent)
class SeqBowEstimator(*args, llm_model_name='distilbert-base-uncased', n_labels=0, log_interval=5, warmup_ratio=0.1, gamma=1.0, multilabel=False, decoder_lr=0.01, checkpoint_dir=None, classifier_dropout=0.0, pure_classifier_objective=False, validate_each_epoch=False, entropy_loss_coef=0.0, pool_encoder=True, **kwargs)
Bases: BaseEstimator
classmethod from_config(config, vocabulary, log_interval=1, pretrained_param_file=None, n_labels=None, device='cpu')
Instantiate an object of this class using the provided config
Parameters:
- config (Union[str, dict]) – Path to a configuration file (in json format) or an autogluon dictionary representing the config
- log_interval (int) – Logging frequency (default=1)
- pretrained_param_file (Optional[str]) – Parameter file
- device (str) – PyTorch device
- vocabulary (Vocab) –
- n_labels (Optional[int]) –
Returns: An object of this class
write_model(model_dir, suffix='', vectorizer=None)
Writes the model within this estimator to disk.
class SeqBowMetricEstimator(*args, sdml_smoothing_factor=0.3, metric_loss_temp=0.1, use_sdml=False, non_scoring_index=-1, **kwargs)
Bases: SeqBowEstimator
classmethod from_config(*args, **kwargs)
Instantiate an object of this class using the provided config
Parameters:
- config – Path to a configuration file (in json format) or an autogluon dictionary representing the config
- log_interval – Logging frequency (default=1)
- pretrained_param_file – Parameter file
- device – PyTorch device
Returns: An object of this class