matchzoo.preprocessors
Subpackages
- matchzoo.preprocessors.units
  - matchzoo.preprocessors.units.character_index
  - matchzoo.preprocessors.units.digit_removal
  - matchzoo.preprocessors.units.frequency_filter
  - matchzoo.preprocessors.units.lemmatization
  - matchzoo.preprocessors.units.lowercase
  - matchzoo.preprocessors.units.matching_histogram
  - matchzoo.preprocessors.units.ngram_letter
  - matchzoo.preprocessors.units.punc_removal
  - matchzoo.preprocessors.units.stateful_unit
  - matchzoo.preprocessors.units.stemming
  - matchzoo.preprocessors.units.stop_removal
  - matchzoo.preprocessors.units.tokenize
  - matchzoo.preprocessors.units.truncated_length
  - matchzoo.preprocessors.units.unit
  - matchzoo.preprocessors.units.vocabulary
  - matchzoo.preprocessors.units.word_exact_match
  - matchzoo.preprocessors.units.word_hashing
Submodules
Package Contents
class matchzoo.preprocessors.NaivePreprocessor
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Naive preprocessor.
Example
>>> import matchzoo as mz
>>> train_data = mz.datasets.toy.load_data()
>>> test_data = mz.datasets.toy.load_data(stage='test')
>>> preprocessor = mz.preprocessors.NaivePreprocessor()
>>> train_data_processed = preprocessor.fit_transform(train_data,
...                                                   verbose=0)
>>> type(train_data_processed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
>>> test_data_transformed = preprocessor.transform(test_data,
...                                                verbose=0)
>>> type(test_data_transformed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
Parameters:
- data_pack – data_pack to be preprocessed.
- verbose – Verbosity.
Returns: NaivePreprocessor instance.
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data, create truncated length representation.
Parameters:
- data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
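The fit/transform split documented above can be illustrated without MatchZoo itself. The following is a minimal sketch, not the library's implementation: `ToyPreprocessor`, the layout of its `context` dictionary, and the whitespace tokenizer are all hypothetical stand-ins chosen for illustration. It shows the general pattern in which `fit` learns state from training data and `transform` reuses that state on any data.

```python
from typing import Dict, List


class ToyPreprocessor:
    """Hypothetical stand-in for illustration; not a MatchZoo class."""

    def __init__(self) -> None:
        # State learned by fit() and reused by transform().
        self.context: Dict[str, Dict[str, int]] = {}

    def fit(self, texts: List[str]) -> "ToyPreprocessor":
        # Build a term -> index mapping; index 0 is reserved for unknown terms.
        term_index: Dict[str, int] = {}
        for text in texts:
            for token in text.lower().split():
                term_index.setdefault(token, len(term_index) + 1)
        self.context["term_index"] = term_index
        return self  # returning self allows fit(...).transform(...)

    def transform(self, texts: List[str]) -> List[List[int]]:
        # Map tokens to the indices learned in fit(); unseen tokens become 0.
        term_index = self.context["term_index"]
        return [[term_index.get(token, 0) for token in text.lower().split()]
                for text in texts]

    def fit_transform(self, texts: List[str]) -> List[List[int]]:
        return self.fit(texts).transform(texts)


train = ["how are you", "are you ok"]
test = ["you are late"]

pre = ToyPreprocessor()
train_ids = pre.fit_transform(train)
test_ids = pre.transform(test)  # reuses the vocabulary learned from train
print(train_ids)
print(test_ids)
```

The point of the split is that test data is transformed with the context fitted on training data, so terms seen only at test time (here, "late") fall back to the unknown index.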
class matchzoo.preprocessors.BasicPreprocessor(truncated_mode: str = 'pre', truncated_length_left: int = None, truncated_length_right: int = None, filter_mode: str = 'df', filter_low_freq: float = 1, filter_high_freq: float = float('inf'), remove_stop_words: bool = False, ngram_size: typing.Optional[int] = None)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Basic preprocessor helper.
Parameters:
- truncated_mode – String, mode used by TruncatedLength. Can be ‘pre’ or ‘post’.
- truncated_length_left – Integer, maximum length of left in the data_pack.
- truncated_length_right – Integer, maximum length of right in the data_pack.
- filter_mode – String, mode used by FrequenceFilterUnit. Can be ‘df’, ‘cf’, or ‘idf’.
- filter_low_freq – Float, lower bound value used by FrequenceFilterUnit.
- filter_high_freq – Float, upper bound value used by FrequenceFilterUnit.
- remove_stop_words – Bool, whether to use StopRemovalUnit.
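The difference between the two `truncated_mode` values can be sketched in plain Python. This is an illustration under a stated assumption, not MatchZoo's code: it assumes that ‘pre’ drops tokens from the front of the sequence and ‘post’ drops them from the end, which is the convention Keras uses for sequence truncation. The `truncate` helper is hypothetical.

```python
from typing import List


def truncate(tokens: List[str], length: int, mode: str = "pre") -> List[str]:
    """Cut `tokens` down to at most `length` items (hypothetical helper)."""
    if mode not in ("pre", "post"):
        raise ValueError(f"unknown mode: {mode!r}")
    if len(tokens) <= length:
        return list(tokens)  # already short enough; nothing to drop
    if mode == "pre":
        # Assumption: 'pre' removes tokens from the front, keeping the tail.
        return tokens[len(tokens) - length:]
    # Assumption: 'post' removes tokens from the back, keeping the head.
    return tokens[:length]


tokens = ["a", "b", "c", "d", "e"]
kept_pre = truncate(tokens, 3, mode="pre")
kept_post = truncate(tokens, 3, mode="post")
print(kept_pre)
print(kept_post)
```

Under this assumption, ‘pre’ suits texts whose most informative tokens come last, and ‘post’ suits texts whose most informative tokens come first.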
Example
>>> import matchzoo as mz
>>> train_data = mz.datasets.toy.load_data('train')
>>> test_data = mz.datasets.toy.load_data('test')
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     truncated_length_left=10,
...     truncated_length_right=20,
...     filter_mode='df',
...     filter_low_freq=2,
...     filter_high_freq=1000,
...     remove_stop_words=True
... )
>>> preprocessor = preprocessor.fit(train_data, verbose=0)
>>> preprocessor.context['vocab_size']
226
>>> processed_train_data = preprocessor.transform(train_data,
...                                               verbose=0)
>>> type(processed_train_data)
<class 'matchzoo.data_pack.data_pack.DataPack'>
>>> test_data_transformed = preprocessor.transform(test_data,
...                                                verbose=0)
>>> type(test_data_transformed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
Parameters:
- data_pack – data_pack to be preprocessed.
- verbose – Verbosity.
Returns: BasicPreprocessor instance.
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data, create truncated length representation.
Parameters:
- data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
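To make the `filter_mode='df'`, `filter_low_freq`, and `filter_high_freq` parameters of BasicPreprocessor more concrete, here is a minimal sketch of document-frequency filtering in plain Python. It is not the library's implementation. The helper names are hypothetical, and treating the bounds as a half-open interval (`low <= df < high`) is an assumption that the documentation above does not confirm.

```python
from collections import Counter
from typing import List


def document_frequency(docs: List[List[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a term counts once per document
    return df


def filter_by_df(docs: List[List[str]],
                 low: float = 1,
                 high: float = float("inf")) -> List[List[str]]:
    """Keep only terms whose document frequency lies in [low, high).

    Hypothetical helper; the half-open interval is an assumption.
    """
    df = document_frequency(docs)
    keep = {term for term, count in df.items() if low <= count < high}
    return [[term for term in doc if term in keep] for doc in docs]


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
df = document_frequency(docs)
no_rare = filter_by_df(docs, low=2)            # drop terms seen in one doc
mid_only = filter_by_df(docs, low=2, high=3)   # also drop the ubiquitous "the"
print(df)
print(no_rare)
print(mid_only)
```

Raising the lower bound removes rare terms that are likely noise, while lowering the upper bound removes terms so common that they carry little signal for matching.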
class matchzoo.preprocessors.BertPreprocessor(mode: str = 'bert-base-uncased')
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Basic preprocessor helper.
Parameters: mode – String; the supported modes are listed at https://huggingface.co/pytorch-transformers/pretrained_models.html.
fit(self, data_pack: DataPack, verbose: int = 1)
The tokenizer is all that BertPreprocessor needs.
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data.
Parameters:
- data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
matchzoo.preprocessors.list_available() → list