tmnt.preprocess.vectorizer
Copyright (c) 2019-2021 The MITRE Corporation.

Classes

TMNTVectorizer([text_key, label_key, ...])
    Utility vectorizer that wraps sklearn.feature_extraction.text.CountVectorizer for use with TMNT dataset conventions.
class TMNTVectorizer(text_key='body', label_key=None, min_doc_size=1, label_remap=None, json_out_dir=None, vocab_size=2000, file_pat='*.json', encoding='utf-8', initial_vocabulary=None, additional_feature_keys=None, stop_word_file=None, split_char=',', max_ws_tokens=-1, count_vectorizer_kwargs={'max_df': 0.95, 'min_df': 0.0, 'stop_words': 'english'})[source]

    Bases: object

    Utility vectorizer that wraps sklearn.feature_extraction.text.CountVectorizer for use with TMNT dataset conventions.

    Parameters:
        - text_key (str) – JSON key for the text to use as document content
        - label_key (Optional[str]) – JSON key to use for the label/covariate
        - min_doc_size (int) – Minimum number of tokens for a document to be included in the dataset
        - label_remap (Optional[Dict[str, str]]) – Dictionary mapping input label strings to an alternative label set
        - json_out_dir (Optional[str]) – Output directory for resulting JSON files when using inline JSON processing
        - vocab_size (int) – Number of vocabulary items (default = 2000)
        - file_pat (str) – File pattern for input JSON files (default = '*.json')
        - encoding (str) – Character encoding (default = 'utf-8')
        - initial_vocabulary (Optional[Vocab]) – Use an existing vocabulary rather than deriving one from the data
        - additional_feature_keys (Optional[List[str]]) – JSON keys for additional features to use alongside the vocabulary
        - stop_word_file (Optional[str]) – Path to a file containing stop words (newline separated)
        - split_char (str) – Single-character string used to split a label string into multiple labels (for multilabel classification tasks)
        - max_ws_tokens (int) – Maximum number of (whitespace-delimited) tokens to use
        - count_vectorizer_kwargs (Dict[str, Any]) – Dictionary of parameter values to pass to sklearn.feature_extraction.text.CountVectorizer
classmethod from_vocab_file(vocab_file)[source]

    Class method that creates a TMNTVectorizer from a vocabulary file.

    Parameters:
        vocab_file (str) – Path to the vocabulary file.
    Return type: TMNTVectorizer
    Returns: A TMNTVectorizer with its vocabulary initialized from the file.
get_vocab()[source]

    Returns the Torchtext vocabulary associated with the vectorizer.

    Return type: Vocab
    Returns: Torchtext vocabulary
write_to_vec_file(X, y, vec_file)[source]

    Write a document-term matrix and optional label vector to a file in svmlight format.

    Parameters:
        - X (csr_matrix) – Document-term (sparse) matrix
        - y (Optional[ndarray]) – Optional label vector (or matrix for multilabel documents)
        - vec_file (str) – Path to the output vector file
    Return type: None
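The svmlight (libsvm) format referenced above stores one document per line as a label followed by sparse index:count pairs. A minimal sketch using scikit-learn's dump_svmlight_file, which writes the same format (whether TMNT's output matches it byte-for-byte is not guaranteed here):

```python
import io
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file

# A tiny document-term matrix (2 docs, 3 terms) and a label vector.
X = csr_matrix(np.array([[1, 0, 2],
                         [0, 3, 0]]))
y = np.array([0, 1])

buf = io.BytesIO()
dump_svmlight_file(X, y, buf)
text = buf.getvalue().decode()
# Each line: <label> <index>:<count> ... with zeros omitted.
print(text)
```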
write_vocab(vocab_file)[source]

    Write the vocabulary to disk.

    Parameters:
        vocab_file (str) – File to which the vocabulary is written (one word per line)
    Return type: None
    Returns: None
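The one-word-per-line file format written here is what from_vocab_file reads back. A sketch of the round trip using plain file I/O (not TMNT's own implementation):

```python
import os
import tempfile

vocab = ["topic", "model", "latent", "vocabulary"]

# Write one word per line, as write_vocab does.
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# from_vocab_file would rebuild a vocabulary from such a file; here we
# simply read the lines back to show the format survives a round trip.
with open(path, encoding="utf-8") as f:
    loaded = [line.strip() for line in f if line.strip()]
print(loaded == vocab)  # True
```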
transform(str_list)[source]

    Transforms a list of strings into a sparse matrix.

    Transforms a list of document strings into a tuple whose first element is a single sparse matrix X; the second element is always None.

    Parameters:
        str_list (List[str]) – List of document strings
    Return type: Tuple[csr_matrix, None]
    Returns: Tuple (X, None) – sparse matrix of the input; the second element is always None in this case
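A minimal sketch of the documented (X, None) contract, built on the wrapped scikit-learn class rather than TMNTVectorizer itself; the helper name transform_like is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["alpha beta gamma", "beta delta"])  # vocabulary of 4 terms

def transform_like(str_list):
    # Mirrors the documented contract: a sparse document-term matrix
    # paired with a None placeholder where a label vector would go.
    return cv.transform(str_list), None

X, y = transform_like(["alpha delta", "gamma gamma"])
print(X.shape, y)  # (2, 4) None
```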
transform_json(json_file)[source]

    Transforms a single JSON list file into matrix/vector format(s).

    Transforms a single JSON list file into a tuple, the first element being a single sparse matrix X and the second an (optional) label vector y.

    Parameters:
        json_file (str) – Input file containing one document per line in serialized JSON format
    Return type: Tuple[csr_matrix, Optional[ndarray]]
    Returns: Tuple containing the sparse document-term matrix X and an optional label vector y
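The input is a "JSON list" file: one serialized JSON object per line, with document text under text_key (default 'body') and, optionally, a label under label_key. A sketch of parsing such input with the stdlib json module; the 'label' key name is illustrative, not a TMNT default:

```python
import io
import json

# One JSON object per line, using the default text_key ('body') and a
# hypothetical 'label' key for the label/covariate.
raw = io.StringIO(
    '{"body": "topic models learn topics", "label": "ml"}\n'
    '{"body": "the quick brown fox", "label": "misc"}\n'
)

texts, labels = [], []
for line in raw:
    doc = json.loads(line)
    texts.append(doc["body"])   # what transform_json would vectorize
    labels.append(doc["label"])  # what would become the label vector y
print(texts, labels)
```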
transform_json_dir(json_dir)[source]

    Transforms the specified directory's JSON list files into matrix formats.

    Parameters:
        json_dir (str) – Path to a directory containing JSON list files to process
    Return type: Tuple[csr_matrix, Optional[ndarray]]
    Returns: Tuple containing the sparse document-term matrix X and an optional label vector y
fit_transform(str_list)[source]

    Learns a vocabulary and transforms the input into matrix formats.

    As a side effect, this function induces a vocabulary over the inputs.

    Parameters:
        str_list (List[str]) – List of document strings
    Return type: Tuple[csr_matrix, None]
    Returns: Tuple (X, None) – sparse document-term matrix X; the second element is always None in this case
fit_transform_json(json_file)[source]

    Learns a vocabulary and transforms the input into matrix formats.

    As a side effect, this function induces a vocabulary over the inputs.

    Parameters:
        json_file (str) – Input file containing one document per line in serialized JSON format
    Return type: Tuple[csr_matrix, Optional[ndarray]]
    Returns: Tuple containing the sparse document-term matrix X and an optional label vector y
fit_transform_json_dir(json_dir)[source]

    Learns a vocabulary and transforms the input into matrix formats.

    As a side effect, this function induces a vocabulary over the inputs.

    Parameters:
        json_dir (str) – Path to a directory containing JSON list files to process
    Return type: Tuple[csr_matrix, Optional[ndarray]]
    Returns: Tuple containing the sparse document-term matrix X and an optional label vector y
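The directory variants gather documents from every file matching file_pat (default '*.json') before inducing the vocabulary. A sketch of that gathering step with glob and the stdlib, collecting only the raw text (not TMNT's actual traversal logic):

```python
import glob
import json
import os
import tempfile

# Two JSON-list files matching the default file_pat '*.json'.
d = tempfile.mkdtemp()
for name, body in [("a.json", "alpha beta"), ("b.json", "gamma delta")]:
    with open(os.path.join(d, name), "w", encoding="utf-8") as f:
        f.write(json.dumps({"body": body}) + "\n")

# Collect document text across all matching files; fit_transform_json_dir
# would then induce a vocabulary over this combined collection.
docs = []
for path in sorted(glob.glob(os.path.join(d, "*.json"))):
    with open(path, encoding="utf-8") as f:
        docs.extend(json.loads(line)["body"] for line in f)
print(docs)  # ['alpha beta', 'gamma delta']
```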