tmnt.preprocess.vectorizer

Copyright (c) 2019-2021 The MITRE Corporation.

Classes

TMNTVectorizer([text_key, label_key, ...]) Utility vectorizer that wraps sklearn.feature_extraction.text.CountVectorizer for use with TMNT dataset conventions.
class TMNTVectorizer(text_key='body', label_key=None, min_doc_size=1, label_remap=None, json_out_dir=None, vocab_size=2000, file_pat='*.json', encoding='utf-8', initial_vocabulary=None, additional_feature_keys=None, stop_word_file=None, split_char=',', max_ws_tokens=-1, count_vectorizer_kwargs={'max_df': 0.95, 'min_df': 0.0, 'stop_words': 'english'})

Bases: object

Utility vectorizer that wraps sklearn.feature_extraction.text.CountVectorizer for use with TMNT dataset conventions.

Parameters:
  • text_key (str) – JSON key for the text to use as document content
  • label_key (Optional[str]) – JSON key to use for the label/covariate
  • min_doc_size (int) – Minimum number of tokens for inclusion in the dataset
  • label_remap (Optional[Dict[str, str]]) – Dictionary mapping input label strings to alternative label set
  • json_out_dir (Optional[str]) – Output directory for resulting JSON files when using inline JSON processing
  • vocab_size (int) – Number of vocabulary items (default=2000)
  • file_pat (str) – File pattern for input json files (default = ‘*.json’)
  • encoding (str) – Character encoding (default = ‘utf-8’)
  • initial_vocabulary (Optional[Vocab]) – Use existing vocabulary rather than deriving one from the data
  • additional_feature_keys (Optional[List[str]]) – List of strings for json keys that correspond to additional features to use alongside vocabulary
  • stop_word_file (Optional[str]) – Path to a file containing stop words (newline separated)
  • split_char (str) – Single-character string used to split a label string into multiple labels (for multilabel classification tasks)
  • max_ws_tokens (int) – Maximum number of (whitespace-delimited) tokens to use
  • count_vectorizer_kwargs (Dict[str, Any]) – Dictionary of parameter values to pass to sklearn.feature_extraction.text.CountVectorizer
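
The record-level parameters above can be illustrated with a minimal standard-library sketch. It is not TMNT's implementation: `prepare_record` is a hypothetical helper, and the order of operations (truncate, check the size, split the label, then remap) is an assumption made only for illustration.

```python
from typing import Dict, List, Optional, Tuple


def prepare_record(record: dict,
                   text_key: str = "body",
                   label_key: Optional[str] = None,
                   label_remap: Optional[Dict[str, str]] = None,
                   split_char: str = ",",
                   max_ws_tokens: int = -1,
                   min_doc_size: int = 1) -> Optional[Tuple[str, List[str]]]:
    """Hypothetical helper: return (text, labels), or None if the document is too short."""
    tokens = record[text_key].split()
    if max_ws_tokens > 0:
        # Keep only the first max_ws_tokens whitespace-delimited tokens.
        tokens = tokens[:max_ws_tokens]
    if len(tokens) < min_doc_size:
        # Assumption: the size check is applied to the raw token count.
        return None
    labels: List[str] = []
    if label_key is not None:
        raw = str(record.get(label_key, ""))
        # Split a multilabel string such as "animals,nature" on split_char.
        labels = [part.strip() for part in raw.split(split_char) if part.strip()]
        if label_remap:
            # Map each label to its replacement; unmapped labels are kept as-is.
            labels = [label_remap.get(label, label) for label in labels]
    return " ".join(tokens), labels


doc = {"body": "the quick brown fox jumps", "topic": "animals,nature"}
result = prepare_record(doc, label_key="topic",
                        label_remap={"nature": "outdoors"}, max_ws_tokens=3)
print(result)
```

The field name `topic` and the remapping shown here are made-up example values.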
classmethod from_vocab_file(vocab_file)

Class method that creates a TMNTVectorizer from a vocab file

Parameters:vocab_file (str) – Path to the vocabulary file.
Return type:TMNTVectorizer
Returns:TMNTVectorizer
get_vocab()

Returns the Torchtext vocabulary associated with the vectorizer

Return type:Vocab
Returns:Torchtext vocabulary
write_to_vec_file(X, y, vec_file)

Write document-term matrix and optional label vector to file in svmlight format.

Parameters:
  • X (csr_matrix) – document-term (sparse) matrix
  • y (Optional[ndarray]) – optional label vector (or matrix for multilabel documents)
  • vec_file (str) – string denoting path to output vector file
Return type:None
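
For readers unfamiliar with the svmlight format, each line holds a label followed by `index:value` pairs for the non-zero entries of one row. The sketch below writes that layout with the standard library only. `write_svmlight` is a hypothetical helper, not the method above, and the zero-based feature indices and the `0` placeholder for unlabeled rows are assumptions.

```python
import os
import tempfile
from typing import Dict, Optional, Sequence


def write_svmlight(rows: Sequence[Dict[int, float]],
                   labels: Optional[Sequence[int]],
                   path: str) -> None:
    """Write sparse rows (feature index -> count) as '<label> <index>:<value> ...' lines."""
    with open(path, "w", encoding="utf-8") as fp:
        for i, row in enumerate(rows):
            label = labels[i] if labels is not None else 0  # placeholder when unlabeled
            feats = " ".join(f"{j}:{v:g}" for j, v in sorted(row.items()))
            fp.write(f"{label} {feats}".rstrip() + "\n")


rows = [{0: 2, 3: 1}, {1: 4}]   # two documents, non-zero term counts only
labels = [1, 0]

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "train.vec")
    write_svmlight(rows, labels, vec_path)
    with open(vec_path, encoding="utf-8") as fp:
        lines = fp.read().splitlines()

print(lines)
```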

write_vocab(vocab_file)

Write vocabulary to disk.

Parameters:vocab_file (str) – Write out vocabulary to this file (one word per line)
Return type:None
Returns:None
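
The one-word-per-line layout described above can be sketched as a simple round trip. The helpers `save_vocab` and `load_vocab` are hypothetical and use only the standard library; treating the line number as the word's index is an assumption, not documented behavior.

```python
import os
import tempfile
from typing import Dict, List


def save_vocab(words: List[str], path: str) -> None:
    """Write one vocabulary item per line."""
    with open(path, "w", encoding="utf-8") as fp:
        for word in words:
            fp.write(word + "\n")


def load_vocab(path: str) -> Dict[str, int]:
    """Read the file back, assigning each word its (zero-based) line number."""
    with open(path, encoding="utf-8") as fp:
        return {line.strip(): i for i, line in enumerate(fp) if line.strip()}


with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    save_vocab(["topic", "model", "corpus"], vocab_path)
    loaded = load_vocab(vocab_path)

print(loaded)
```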
transform(str_list)

Transforms a list of strings into a sparse matrix.

Transforms a list of document strings into a tuple, the first element of which is a single sparse matrix X and the second element is always None.

Parameters:str_list (List[str]) – List of document strings
Return type:Tuple[csr_matrix, None]
Returns:Tuple (X, None), where X is the sparse document-term matrix of the input; the second element is always None in this case
transform_json(json_file)

Transforms a single json list file into matrix/vector format(s).

Transforms a single json list file into a tuple, the first element being a single sparse matrix X and the second an (optional) label vector y.

Parameters:json_file (str) – Input file containing one document per line in serialized json format
Return type:Tuple[csr_matrix, Optional[ndarray]]
Returns:Tuple containing sparse document-term matrix X and optional label vector y
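
The input layout described above (one serialized JSON document per line) can be sketched with the standard library. `read_json_list` is a hypothetical reader, not part of TMNT, and the `label` field name is an example value; `body` matches the default `text_key`.

```python
import json
import os
import tempfile
from typing import List, Optional, Tuple


def read_json_list(path: str,
                   text_key: str = "body",
                   label_key: Optional[str] = None) -> Tuple[List[str], List[Optional[str]]]:
    """Read a JSON list file: one JSON object per line, blank lines skipped."""
    texts: List[str] = []
    labels: List[Optional[str]] = []
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            texts.append(record[text_key])
            labels.append(record.get(label_key) if label_key else None)
    return texts, labels


docs = [
    {"body": "first document text", "label": "a"},
    {"body": "second document text", "label": "b"},
]

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "train.json")
    with open(json_path, "w", encoding="utf-8") as fp:
        for d in docs:
            fp.write(json.dumps(d) + "\n")   # one document per line
    texts, labels = read_json_list(json_path, label_key="label")

print(texts, labels)
```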
transform_json_dir(json_dir)

Transforms the specified directory’s json list files into matrix formats.

Parameters:json_dir (str) – A string denoting the path to a directory containing json list files to process
Return type:Tuple[csr_matrix, Optional[ndarray]]
Returns:Tuple containing sparse document-term matrix X and optional label vector y
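
How a directory of input files might be selected with a pattern such as the default `file_pat='*.json'` can be sketched with `glob`. This is an illustration only: `list_json_files` is a hypothetical helper, and whether TMNT sorts the matched files is an assumption.

```python
import glob
import os
import tempfile
from typing import List


def list_json_files(json_dir: str, file_pat: str = "*.json") -> List[str]:
    """Return the files in json_dir whose names match file_pat, in sorted order."""
    return sorted(glob.glob(os.path.join(json_dir, file_pat)))


with tempfile.TemporaryDirectory() as tmp:
    # Create two matching files and one that the pattern should skip.
    for name in ("b.json", "a.json", "notes.txt"):
        open(os.path.join(tmp, name), "w", encoding="utf-8").close()
    matched = [os.path.basename(p) for p in list_json_files(tmp)]

print(matched)
```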
fit_transform(str_list)

Learns a vocabulary and transforms the input into matrix formats.

As a side-effect, this function induces a vocabulary of the inputs.

Parameters:str_list (List[str]) – List of document strings
Return type:Tuple[csr_matrix, None]
Returns:Tuple (X, None), where X is the sparse document-term matrix; the second element is always None in this case
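
The difference between the fit_transform and transform families can be shown with a small bag-of-words sketch. It is a conceptual stand-in written with the standard library, not the CountVectorizer-based implementation: `fit_vocab` and `to_counts` are hypothetical helpers, and lowercasing plus whitespace tokenization are simplifying assumptions.

```python
from collections import Counter
from typing import Dict, List


def fit_vocab(docs: List[str], vocab_size: int = 2000) -> Dict[str, int]:
    """Induce a vocabulary: the vocab_size most frequent tokens, mapped to indices."""
    counts = Counter(tok for d in docs for tok in d.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(vocab_size))}


def to_counts(docs: List[str], vocab: Dict[str, int]) -> List[Dict[int, int]]:
    """Map each document to sparse term counts; out-of-vocabulary tokens are dropped."""
    return [dict(Counter(vocab[t] for t in d.lower().split() if t in vocab))
            for d in docs]


train = ["topic models find topics", "topics describe documents"]

# "fit_transform": learn the vocabulary from the data, then vectorize that data.
vocab = fit_vocab(train, vocab_size=3)
X_train = to_counts(train, vocab)

# "transform": reuse the learned vocabulary on new text; unseen words are ignored.
X_new = to_counts(["topics topics unseen"], vocab)
print(X_new)
```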
fit_transform_json(json_file)

Learns a vocabulary and transforms the input into matrix formats.

As a side-effect, this function induces a vocabulary of the inputs.

Parameters:json_file (str) – Input file containing one document per line in serialized json format
Return type:Tuple[csr_matrix, Optional[ndarray]]
Returns:Tuple containing sparse document-term matrix X and optional label vector y
fit_transform_json_dir(json_dir)

Learns a vocabulary and transforms the input into matrix formats.

As a side-effect, this function induces a vocabulary of the inputs.

Parameters:json_dir (str) – A string denoting the path to a directory containing json list files to process
Return type:Tuple[csr_matrix, Optional[ndarray]]
Returns:Tuple containing sparse document-term matrix X and optional label vector y