tmnt.preprocess.vectorizer
Copyright (c) 2019-2021 The MITRE Corporation.
Classes
TMNTVectorizer([text_key, label_key, ...]) | Utility vectorizer that wraps sklearn.feature_extraction.text.CountVectorizer for use with TMNT dataset conventions.
-
class TMNTVectorizer(text_key='body', label_key=None, min_doc_size=1, label_remap=None, json_out_dir=None, vocab_size=2000, file_pat='*.json', encoding='utf-8', initial_vocabulary=None, additional_feature_keys=None, stop_word_file=None, split_char=',', max_ws_tokens=-1, count_vectorizer_kwargs={'max_df': 0.95, 'min_df': 0.0, 'stop_words': 'english'})
Bases: object
Utility vectorizer that wraps sklearn.feature_extraction.text.CountVectorizer for use with TMNT dataset conventions.
Parameters:
- text_key (str) – JSON key for the text to use as document content
- label_key (Optional[str]) – JSON key to use for the label/covariate
- min_doc_size (int) – Minimum number of tokens a document needs for inclusion in the dataset
- label_remap (Optional[Dict[str,str]]) – Dictionary mapping input label strings to an alternative label set
- json_out_dir (Optional[str]) – Output directory for the resulting JSON files when using inline JSON processing
- vocab_size (int) – Number of vocabulary items (default = 2000)
- file_pat (str) – File pattern for input JSON files (default = '*.json')
- encoding (str) – Character encoding (default = 'utf-8')
- initial_vocabulary (Optional[Vocab]) – Existing vocabulary to use rather than deriving one from the data
- additional_feature_keys (Optional[List[str]]) – List of JSON keys that correspond to additional features to use alongside the vocabulary
- stop_word_file (Optional[str]) – Path to a file containing stop words (newline separated)
- split_char (str) – Single-character string used to split a label string into multiple labels (for multilabel classification tasks)
- max_ws_tokens (int) – Maximum number of whitespace-delimited tokens to use
- count_vectorizer_kwargs (Dict[str,Any]) – Dictionary of parameter values to pass to sklearn.feature_extraction.text.CountVectorizer
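The split_char, label_remap and max_ws_tokens parameters describe simple string handling, which the following standard-library sketch illustrates. Both helper functions are hypothetical and are not part of TMNT. The order of splitting and remapping, the pass-through of labels missing from the mapping, and the reading of max_ws_tokens=-1 as "no limit" are all assumptions.

```python
from typing import Dict, List, Optional


def split_and_remap_labels(
    label_str: str,
    split_char: str = ",",
    label_remap: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Hypothetical helper: split a raw label string, then remap each label."""
    labels = [part.strip() for part in label_str.split(split_char) if part.strip()]
    if label_remap is not None:
        # Assumption: labels missing from the mapping are kept unchanged.
        labels = [label_remap.get(label, label) for label in labels]
    return labels


def truncate_ws_tokens(text: str, max_ws_tokens: int = -1) -> str:
    """Hypothetical helper: keep at most max_ws_tokens whitespace-delimited tokens."""
    if max_ws_tokens < 0:
        # Assumption: a negative value means no truncation.
        return text
    return " ".join(text.split()[:max_ws_tokens])


labels = split_and_remap_labels("sports, politics", label_remap={"politics": "news"})
short_text = truncate_ws_tokens("the quick brown fox jumps", max_ws_tokens=3)
print(labels, short_text)
```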
-
classmethod from_vocab_file(vocab_file)
Class method that creates a TMNTVectorizer from a vocabulary file.
Parameters: vocab_file (str) – Path to the vocabulary file
Return type: TMNTVectorizer
Returns: TMNTVectorizer
-
get_vocab()
Returns the Torchtext vocabulary associated with the vectorizer.
Return type: Vocab
Returns: Torchtext vocabulary
-
write_to_vec_file(X, y, vec_file)
Write document-term matrix and optional label vector to file in svmlight format.
Parameters:
- X (csr_matrix) – document-term (sparse) matrix
- y (Optional[ndarray]) – optional label vector (or matrix for multilabel documents)
- vec_file (str) – string denoting path to output vector file
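For readers unfamiliar with the svmlight format, the sketch below shows its general line layout: a label followed by index:value pairs for the non-zero entries of a row. It uses only the standard library and a hypothetical helper, to_svmlight_lines. It does not reproduce what write_to_vec_file actually emits; zero-based indices and a placeholder label of 0 for unlabeled rows are assumptions.

```python
from typing import Dict, List, Optional


def to_svmlight_lines(
    rows: List[Dict[int, float]],
    labels: Optional[List[int]] = None,
) -> List[str]:
    """Hypothetical helper: render sparse rows as svmlight-style text lines."""
    lines = []
    for i, row in enumerate(rows):
        # Assumption: unlabeled rows get a placeholder label of 0.
        label = labels[i] if labels is not None else 0
        feats = " ".join(f"{idx}:{val:g}" for idx, val in sorted(row.items()))
        lines.append(f"{label} {feats}".rstrip())
    return lines


# Each row maps a (zero-based, by assumption) term index to its count.
rows = [{0: 2.0, 3: 1.0}, {1: 1.0}]
svm_lines = to_svmlight_lines(rows, labels=[1, 0])
print("\n".join(svm_lines))
```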
-
write_vocab(vocab_file)
Write vocabulary to disk.
Parameters: vocab_file (str) – Write out vocabulary to this file (one word per line)
Return type: None
Returns: None
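The "one word per line" layout can be pictured with the standard-library round trip below. The two helpers are hypothetical stand-ins and not TMNT functions; whether the real vocabulary file carries anything beyond the word itself on each line is not stated here, so treat this as an illustration of the layout only.

```python
import os
import tempfile
from typing import List


def write_vocab_lines(words: List[str], path: str) -> None:
    """Hypothetical helper: write one vocabulary word per line."""
    with open(path, "w", encoding="utf-8") as fp:
        for word in words:
            fp.write(word + "\n")


def read_vocab_lines(path: str) -> List[str]:
    """Hypothetical helper: read the words back, skipping blank lines."""
    with open(path, encoding="utf-8") as fp:
        return [line.rstrip("\n") for line in fp if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    write_vocab_lines(["topic", "model", "neural"], vocab_path)
    round_trip = read_vocab_lines(vocab_path)

print(round_trip)
```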
-
transform(str_list)
Transforms a list of strings into a sparse matrix.
Transforms a list of document strings into a tuple, the first element of which is a single sparse matrix X and the second element is always None.
Parameters: str_list (List[str]) – List of document strings
Return type: Tuple[csr_matrix,None]
Returns: Tuple (X, None), where X is the sparse matrix of the input; the second element is always None in this case
-
transform_json(json_file)
Transforms a single json list file into matrix/vector format(s).
Transforms a single json list file into a tuple, the first element being a single sparse matrix X and the second an (optional) label vector y.
Parameters: json_file (str) – Input file containing one document per line in serialized json format
Return type: Tuple[csr_matrix,Optional[ndarray]]
Returns: Tuple containing sparse document-term matrix X and optional label vector y
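A "json list" input holds one serialized JSON document per line. The standard-library sketch below shows how such lines could be read into texts and optional labels using text_key and label_key. The helper read_json_list and the sample records are hypothetical, and the sketch stops before any vectorization, so it is not the library's implementation.

```python
import json
from typing import Iterable, List, Optional, Tuple


def read_json_list(
    lines: Iterable[str],
    text_key: str = "body",
    label_key: Optional[str] = None,
) -> Tuple[List[str], Optional[List[Optional[str]]]]:
    """Hypothetical helper: parse one JSON document per line."""
    texts: List[str] = []
    labels: List[Optional[str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        texts.append(record[text_key])
        if label_key is not None:
            labels.append(record.get(label_key))
    return texts, (labels if label_key is not None else None)


# Hypothetical sample records, one JSON object per line.
sample = [
    '{"body": "neural topic models", "label": "ml"}',
    '{"body": "sparse matrices", "label": "math"}',
]
texts, labels = read_json_list(sample, text_key="body", label_key="label")
print(texts, labels)
```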
-
transform_json_dir(json_dir)
Transforms the specified directory's json list files into matrix formats.
Parameters: json_dir (str) – A string denoting the path to a directory containing json list files to process
Return type: Tuple[csr_matrix,Optional[ndarray]]
Returns: Tuple containing sparse document-term matrix X and optional label vector y
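Directory processing depends on which files match the file_pat constructor argument (default '*.json'). A rough standard-library illustration of that kind of pattern matching follows. The helper list_json_files is hypothetical, and sorting the matches is an assumption made only to keep the output deterministic.

```python
import glob
import os
import tempfile
from typing import List


def list_json_files(json_dir: str, file_pat: str = "*.json") -> List[str]:
    """Hypothetical helper: list files in json_dir that match file_pat."""
    return sorted(glob.glob(os.path.join(json_dir, file_pat)))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two matching files and one that the pattern should skip.
    for name in ("a.json", "b.json", "notes.txt"):
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as fp:
            fp.write("{}\n")
    matched = [os.path.basename(path) for path in list_json_files(tmp_dir)]

print(matched)
```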
-
fit_transform(str_list)
Learns a vocabulary and transforms the input into matrix formats.
As a side effect, this function induces a vocabulary from the inputs.
Parameters: str_list (List[str]) – List of document strings
Return type: Tuple[csr_matrix,None]
Returns: Tuple containing sparse document-term matrix X and None as the second element
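To make "learns a vocabulary and transforms the input" concrete, here is a toy bag-of-words sketch using only the standard library. It is a stand-in for illustration and does not replicate sklearn's CountVectorizer tokenization, stop-word handling, or vocabulary ordering; toy_fit_transform and its tie-breaking rule are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def toy_fit_transform(
    docs: List[str], vocab_size: int = 2000
) -> Tuple[List[str], List[Dict[int, int]]]:
    """Toy stand-in: induce a vocabulary, then count terms per document."""
    # Simplification: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Keep the most frequent terms; ties are broken alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:vocab_size]]
    index = {word: i for i, word in enumerate(vocab)}
    # Each row maps a vocabulary index to that term's count in the document.
    rows = [dict(Counter(index[t] for t in toks if t in index)) for toks in tokenized]
    return vocab, rows


vocab, rows = toy_fit_transform(["topic models topic", "neural models"], vocab_size=3)
print(vocab, rows)
```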
-
fit_transform_json(json_file)
Learns a vocabulary and transforms the input into matrix formats.
As a side effect, this function induces a vocabulary from the inputs.
Parameters: json_file (str) – Input file containing one document per line in serialized json format
Return type: Tuple[csr_matrix,Optional[ndarray]]
Returns: Tuple containing sparse document-term matrix X and optional label vector y
-
fit_transform_json_dir(json_dir)
Learns a vocabulary and transforms the input into matrix formats.
As a side effect, this function induces a vocabulary from the inputs.
Parameters: json_dir (str) – A string denoting the path to a directory containing json list files to process
Return type: Tuple[csr_matrix,Optional[ndarray]]
Returns: Tuple containing sparse document-term matrix X and optional label vector y