tmnt.data_loading

This module contains routines for loading text documents into sparse matrix representations for efficient training of neural variational models.

Functions

`get_llm`(model_name)
`get_llm_dataloader`(data, bow_vectorizer, ...)
`get_llm_model`(model_name)
`get_llm_paired_dataloader`(data_a, data_b, ...)
`get_llm_tokenizer`(model_name)
`get_unwrapped_llm_dataloader`(data, ...[, ...])
`load_vocab`(vocab_file[, encoding])	Load a pre-derived vocabulary; assumes a format with a single word on each line.
`sparse_batch_collate`(batch)	Collate function that transforms a scipy COO matrix into a PyTorch sparse tensor.
`sparse_coo_to_tensor`(coo)	Transform a scipy COO matrix into a PyTorch sparse tensor.
`to_label_matrix`(yvs[, num_labels])	Convert [(id1, id2, ...), (id1, id2, ...), ...] to a NumPy matrix with multi-labels.

Classes

`PairedDataLoader`(data_loader1, data_loader2)
`RoundRobinDataLoader`(data_loaders)
`SingletonWrapperLoader`(data_loader)
`SparseDataLoader`(X, y[, shuffle, drop_last, ...])
`SparseDataset`(data, targets)	Custom Dataset class for scipy sparse matrices.
`StratifiedDualBatchSampler`(y_a, y_b, ...[, ...])	Stratified batch sampling. Provides equal representation of target classes in each batch.
`StratifiedPairedLLMLoader`(data_a, data_b, ...)

to_label_matrix(yvs, num_labels=0)

Convert a list of label-id tuples, [(id1, id2, …), (id1, id2, …), …], to a NumPy matrix with multi-labels.
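The conversion described for `to_label_matrix` can be illustrated with a minimal sketch. This is not the library's code: `label_matrix_sketch` is a hypothetical name, and both the 0/1 encoding and the rule for inferring the width when `num_labels` is 0 are assumptions that may differ from what `to_label_matrix` actually returns.

```python
import numpy as np

def label_matrix_sketch(label_ids, num_labels=0):
    """Build a dense 0/1 matrix with one row per document (illustrative only)."""
    # Assumption: when num_labels is 0, infer the width from the largest id seen.
    if num_labels == 0:
        largest = max((max(ids) for ids in label_ids if len(ids) > 0), default=-1)
        num_labels = largest + 1
    matrix = np.zeros((len(label_ids), num_labels), dtype=int)
    for row, ids in enumerate(label_ids):
        for label_id in ids:
            # Mark every label id attached to this document.
            matrix[row, label_id] = 1
    return matrix

# Three documents: labels {0, 2}, label {1}, and no labels at all.
example = label_matrix_sketch([(0, 2), (1,), ()])
```

Here `example` has one row per document and one column per label id, so a document with several labels has several non-zero entries in its row.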

class SparseDataset(data, targets)

Custom Dataset class for scipy sparse matrices.

Parameters:
    data (Union[ndarray, coo_matrix, csr_matrix])
    targets (Optional[Union[ndarray, coo_matrix, csr_matrix]])

sparse_coo_to_tensor(coo)

Transform a scipy COO matrix into a PyTorch sparse tensor.

Parameters:
    coo (coo_matrix)
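A conversion of this kind might look like the sketch below. It is an illustration, not the module's implementation: `coo_to_torch_sketch` is a hypothetical name, and the only library calls used are `numpy.vstack`, `torch.from_numpy`, and `torch.sparse_coo_tensor`.

```python
import numpy as np
import scipy.sparse as sp
import torch

def coo_to_torch_sketch(coo):
    """Convert a scipy COO matrix to a torch sparse COO tensor (illustrative only)."""
    # Stack row and column indices into the 2 x nnz layout that torch expects.
    indices = torch.from_numpy(np.vstack((coo.row, coo.col))).long()
    values = torch.from_numpy(coo.data)
    return torch.sparse_coo_tensor(indices, values, coo.shape)

# A tiny 2 x 3 bag-of-words matrix with three non-zero counts.
bow = sp.coo_matrix(np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 3.0]], dtype=np.float32))
tensor = coo_to_torch_sketch(bow)
```

Only the non-zero entries and their coordinates are copied, which is the point of keeping bag-of-words data sparse until it reaches the model.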

sparse_batch_collate(batch)

Collate function that transforms a scipy COO matrix into a PyTorch sparse tensor.
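As a rough sketch of what a sparse collate step can involve, the example below stacks per-document sparse rows into one batch tensor. The batch layout, a list of `(sparse_row, label)` pairs, is an assumption made for illustration, and `collate_sparse_rows_sketch` is a hypothetical name rather than the function documented here.

```python
import numpy as np
import scipy.sparse as sp
import torch

def collate_sparse_rows_sketch(batch):
    """Stack (sparse_row, label) pairs into a sparse feature tensor and a label tensor."""
    rows, labels = zip(*batch)
    # Combine the 1 x V rows into a single B x V matrix in COO format.
    stacked = sp.vstack(rows).tocoo()
    indices = torch.from_numpy(np.vstack((stacked.row, stacked.col))).long()
    values = torch.from_numpy(stacked.data)
    features = torch.sparse_coo_tensor(indices, values, stacked.shape)
    return features, torch.tensor(labels)

# Assumed batch layout: one sparse row and one integer label per document.
batch = [
    (sp.coo_matrix(np.array([[1.0, 0.0, 2.0]], dtype=np.float32)), 0),
    (sp.coo_matrix(np.array([[0.0, 3.0, 0.0]], dtype=np.float32)), 1),
]
features, targets = collate_sparse_rows_sketch(batch)
```

A function with this shape could be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`, provided the dataset yields items in the assumed layout.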

class SparseDataLoader(X, y, shuffle=False, drop_last=False, batch_size=1024, device='cpu')

Bases: DataLoader

Parameters:
    X (Union[csr_matrix, coo_matrix])
    y (array)
    drop_last (bool)
    batch_size (Optional[int])

load_vocab(vocab_file, encoding='utf-8')

Load a pre-derived vocabulary; assumes a format with a single word on each line. Note: the implementation uses a counter, as something of a hack, so that vocabulary items are sorted in the order in which they appear in the file.
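The file format assumed by `load_vocab` (one word per line, order preserved) can be illustrated with a standard-library sketch. It does not reproduce the counter-based approach mentioned above, and it returns a plain dictionary, which is an assumption; the return type of `load_vocab` itself is not documented here. `read_vocab_in_file_order` is a hypothetical name.

```python
import os
import tempfile

def read_vocab_in_file_order(vocab_file, encoding="utf-8"):
    """Map each word to its index, numbering words in file order (illustrative only)."""
    word_to_index = {}
    with open(vocab_file, encoding=encoding) as handle:
        for line in handle:
            word = line.strip()
            # Assumption: blank lines and repeated words are skipped.
            if word and word not in word_to_index:
                word_to_index[word] = len(word_to_index)
    return word_to_index

# Write a tiny vocabulary file so the sketch can be run end to end.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("topic\nmodel\ncorpus\n")
vocab = read_vocab_in_file_order(tmp.name)
os.remove(tmp.name)
```

Keeping file order matters because the column positions of a bag-of-words matrix are only meaningful relative to a fixed word-to-index mapping.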

class StratifiedDualBatchSampler(y_a, y_b, batch_size, num_batches, shuffle=True)

Stratified batch sampling. Provides equal representation of target classes in each batch.
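The idea of equal class representation per batch can be sketched for a single label array. This is a simplified illustration, not the dual sampler documented above: `stratified_batches_sketch` is a hypothetical name, and splitting `batch_size` evenly across classes and sampling with replacement are both assumptions.

```python
import random
from collections import defaultdict

def stratified_batches_sketch(labels, batch_size, num_batches, seed=0):
    """Yield index batches that draw the same number of items from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Assumption: batch_size is split evenly across classes; any remainder is dropped.
    per_class = batch_size // len(by_class)
    for _ in range(num_batches):
        batch = []
        for indices in by_class.values():
            # Sample with replacement so small classes can still fill their share.
            batch.extend(rng.choices(indices, k=per_class))
        rng.shuffle(batch)
        yield batch

# Class 0 is three times as frequent as class 1 in the data.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
batches = list(stratified_batches_sketch(labels, batch_size=4, num_batches=3))
```

Even though class 1 makes up only a quarter of the data, every batch in this sketch contains the same number of indices from each class.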