tmnt.preprocess.vectorizer

Classes

TMNTVectorizer([text_key, label_key, ...]) Utility vectorizer that wraps sklearn.feature_extraction.text.CountVectorizer for use with TMNT dataset conventions.

class TMNTVectorizer(text_key='body', label_key=None, min_doc_size=1, label_remap=None, json_out_dir=None, vocab_size=2000, file_pat='*.json', encoding='utf-8', initial_vocabulary=None, additional_feature_keys=None, stop_word_file=None, split_char=',', max_ws_tokens=-1, count_vectorizer_kwargs={'max_df': 0.95, 'min_df': 0.0, 'stop_words': 'english'})

Bases: object

Utility vectorizer that wraps sklearn.feature_extraction.text.CountVectorizer for use with TMNT dataset conventions.

Parameters:

text_key (str) – JSON key for the text to use as document content
label_key (Optional[str]) – JSON key to use for the label/covariate
min_doc_size (int) – Minimum number of tokens a document needs for inclusion in the dataset
label_remap (Optional[Dict[str, str]]) – Dictionary mapping input label strings to an alternative label set
json_out_dir (Optional[str]) – Output directory for resulting JSON files when using inline JSON processing
vocab_size (int) – Number of vocabulary items (default = 2000)
file_pat (str) – File pattern for input JSON files (default = ‘*.json’)
encoding (str) – Character encoding (default = ‘utf-8’)
initial_vocabulary (Optional[Vocab]) – Use an existing vocabulary rather than deriving one from the data
additional_feature_keys (Optional[List[str]]) – List of JSON keys that correspond to additional features to use alongside the vocabulary
stop_word_file (Optional[str]) – Path to a file containing stop words (newline separated)
split_char (str) – Single-character string used to split a label string into multiple labels (for multilabel classification tasks)
max_ws_tokens (int) – Maximum number of (whitespace-delimited) tokens to use
count_vectorizer_kwargs (Dict[str, Any]) – Dictionary of parameter values to pass to sklearn.feature_extraction.text.CountVectorizer
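
To make the input-side parameters concrete, here is a minimal standard-library sketch of how text_key, label_key, min_doc_size, split_char and max_ws_tokens could apply to one-document-per-line JSON input. The helper read_documents is hypothetical and is not part of TMNT. It also rests on two assumptions: that a negative max_ws_tokens means "no truncation", and that document size is counted in whitespace tokens.

```python
import io
import json


def read_documents(lines, text_key="body", label_key=None,
                   min_doc_size=1, split_char=",", max_ws_tokens=-1):
    """Hypothetical helper: yield (text, labels) pairs from JSON-lines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        tokens = record[text_key].split()
        # Assumption: a negative max_ws_tokens disables truncation.
        if max_ws_tokens > 0:
            tokens = tokens[:max_ws_tokens]
        # Assumption: document size is measured in whitespace tokens.
        if len(tokens) < min_doc_size:
            continue
        labels = None
        if label_key is not None:
            # A multilabel string such as "ml,nlp" becomes ["ml", "nlp"].
            labels = [part.strip()
                      for part in str(record[label_key]).split(split_char)]
        yield " ".join(tokens), labels


sample = io.StringIO(
    '{"body": "neural topic models learn themes", "topic": "ml,nlp"}\n'
    '{"body": "short", "topic": "misc"}\n'
)
docs = list(read_documents(sample, label_key="topic",
                           min_doc_size=2, max_ws_tokens=4))
print(docs)
```

Here the first document is truncated to four tokens and keeps both labels, while the second is dropped because it falls below min_doc_size.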

classmethod from_vocab_file(vocab_file)

Class method that creates a TMNTVectorizer from a vocabulary file.

Parameters:	vocab_file (`str`) – Path to the vocabulary file.
Return type:	`TMNTVectorizer`
Returns:	TMNTVectorizer

get_vocab()

Returns the Torchtext vocabulary associated with the vectorizer.

Return type:	`Vocab`
Returns:	Torchtext vocabulary

write_to_vec_file(X, y, vec_file)

Write document-term matrix and optional label vector to file in svmlight format.

Parameters:

X (`csr_matrix`) – document-term (sparse) matrix
y (`Optional`[`ndarray`]) – optional label vector (or matrix for multilabel documents)
vec_file (`str`) – string denoting path to output vector file
Return type:	`None`
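
For readers unfamiliar with the svmlight format, the sketch below shows the general line layout: a label followed by index:value pairs for the non-zero entries. The helper to_svmlight_lines is hypothetical and only illustrates the text format; it is not how TMNT writes the file. Zero-based column indices and a placeholder label of 0 for unlabeled documents are assumptions made for this example.

```python
def to_svmlight_lines(rows, labels=None):
    """Hypothetical helper: format sparse rows as svmlight text lines.

    rows   -- list of {column_index: count} dicts, one per document
    labels -- optional list of numeric labels, one per document
    """
    lines = []
    for i, row in enumerate(rows):
        # Assumption: unlabeled documents get a placeholder label of 0.
        label = labels[i] if labels is not None else 0
        # Only non-zero entries are stored, in increasing index order.
        features = " ".join(
            f"{j}:{v}" for j, v in sorted(row.items()) if v != 0
        )
        lines.append(f"{label} {features}".rstrip())
    return lines


rows = [{0: 2, 3: 1}, {1: 4}]
lines = to_svmlight_lines(rows, labels=[1, 0])
print("\n".join(lines))
```

Storing only the non-zero counts keeps the file compact for document-term matrices, which are typically very sparse.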

write_vocab(vocab_file)

Write vocabulary to disk.

Parameters:	vocab_file (`str`) – Write out vocabulary to this file (one word per line)
Return type:	`None`
Returns:	None
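
The one-word-per-line layout described above can be illustrated with a small round trip using only the standard library. The functions write_vocab_file and read_vocab_file are hypothetical stand-ins, not TMNT functions, and the sketch assumes the file holds nothing but the word on each line.

```python
import os
import tempfile


def write_vocab_file(words, path, encoding="utf-8"):
    """Hypothetical helper: write one vocabulary word per line."""
    with open(path, "w", encoding=encoding) as fp:
        for word in words:
            fp.write(word + "\n")


def read_vocab_file(path, encoding="utf-8"):
    """Hypothetical helper: read the words back, skipping blank lines."""
    with open(path, encoding=encoding) as fp:
        return [line.rstrip("\n") for line in fp if line.strip()]


words = ["topic", "model", "neural"]
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    write_vocab_file(words, vocab_path)
    restored = read_vocab_file(vocab_path)
print(restored)
```

Because the line order is preserved, a word's position in the file can double as its column index when the vocabulary is reloaded.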

transform(str_list)

Transforms a list of strings into a sparse matrix.

Transforms a list of document strings into a tuple, the first element of which is a single sparse matrix X and the second element is always None.

Parameters:	str_list (`List`[`str`]) – List of document strings
Return type:	`Tuple`[`csr_matrix`, `None`]
Returns:	Tuple (X, None), where X is the sparse document-term matrix of the input; the second element is always None in this case

transform_json(json_file)

Transforms a single json list file into matrix/vector format(s).

Transforms a single json list file into a tuple, the first element being a single sparse matrix X and the second an (optional) label vector y.

Parameters:	json_file (`str`) – Input file containing one document per line in serialized json format
Return type:	`Tuple`[`csr_matrix`, `Optional`[`ndarray`]]
Returns:	Tuple containing sparse document-term matrix X and optional label vector y

transform_json_dir(json_dir)

Transforms the JSON list files in the specified directory into matrix formats.

Parameters:	json_dir (`str`) – A string denoting the path to a directory containing json list files to process
Return type:	`Tuple`[`csr_matrix`, `Optional`[`ndarray`]]
Returns:	Tuple containing sparse document-term matrix X and optional label vector y
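
As a rough illustration of directory-level processing, the sketch below gathers the records from every file in a directory that matches a glob pattern, in the spirit of the file_pat parameter. The helper collect_json_records is hypothetical; processing the files in sorted order is an assumption made so that the result is deterministic.

```python
import glob
import json
import os
import tempfile


def collect_json_records(json_dir, file_pat="*.json", encoding="utf-8"):
    """Hypothetical helper: load every JSON line from files matching file_pat."""
    records = []
    # Assumption: files are visited in sorted order for reproducibility.
    for path in sorted(glob.glob(os.path.join(json_dir, file_pat))):
        with open(path, encoding=encoding) as fp:
            records.extend(json.loads(line) for line in fp if line.strip())
    return records


with tempfile.TemporaryDirectory() as tmp:
    files = [("a.json", "first doc"),
             ("b.json", "second doc"),
             ("notes.txt", "ignored")]
    for name, text in files:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fp:
            fp.write(json.dumps({"body": text}) + "\n")
    records = collect_json_records(tmp)
print(records)
```

The notes.txt file is skipped because it does not match the pattern, even though its content is valid JSON.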

fit_transform(str_list)

Learns a vocabulary and transforms the input into matrix formats.

As a side-effect, this function induces a vocabulary of the inputs.

Parameters:	str_list (`List`[`str`]) – List of document strings
Return type:	`Tuple`[`csr_matrix`, `None`]
Returns:	Tuple (X, None), where X is the sparse document-term matrix; the second element is always None in this case
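
The difference between the fit_transform and transform families can be sketched in plain Python: fitting induces a vocabulary from the data, while transforming only counts terms against a vocabulary that already exists. The functions fit_vocabulary and count_rows are hypothetical. Capping the vocabulary at the vocab_size most frequent terms is an assumption about how the limit is applied, and the sketch ignores stop words and the max_df/min_df filters.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size=2000):
    """Hypothetical helper: keep the vocab_size most frequent terms."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    # Rank by descending frequency; break ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(term for term, _ in ranked[:vocab_size])
    return {term: idx for idx, term in enumerate(kept)}


def count_rows(docs, vocab):
    """Hypothetical helper: one {column_index: count} dict per document."""
    rows = []
    for doc in docs:
        # Terms outside the vocabulary are silently dropped.
        row = Counter(vocab[t] for t in doc.lower().split() if t in vocab)
        rows.append(dict(row))
    return rows


train = ["topic models find topics", "neural topic models"]
vocab = fit_vocabulary(train, vocab_size=3)      # the "fit" step
X_train = count_rows(train, vocab)               # the "transform" step
X_new = count_rows(["unseen words and topic models"], vocab)
print(vocab, X_train, X_new)
```

New documents passed through the transform step reuse the induced vocabulary, so words not seen during fitting contribute nothing to their rows.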

fit_transform_json(json_file)

Learns a vocabulary and transforms the input into matrix formats.

As a side-effect, this function induces a vocabulary of the inputs.

Parameters:	json_file (`str`) – Input file containing one document per line in serialized json format
Return type:	`Tuple`[`csr_matrix`, `Optional`[`ndarray`]]
Returns:	Tuple containing sparse document-term matrix X and optional label vector y

fit_transform_json_dir(json_dir)

Learns a vocabulary and transforms the input into matrix formats.

As a side-effect, this function induces a vocabulary of the inputs.

Parameters:	json_dir (`str`) – A string denoting the path to a directory containing json list files to process
Return type:	`Tuple`[`csr_matrix`, `Optional`[`ndarray`]]
Returns:	Tuple containing sparse document-term matrix X and optional label vector y