matchzoo.dataloader

Subpackages

Submodules

Package Contents
class matchzoo.dataloader.Dataset(data_pack: mz.DataPack, mode='point', num_dup: int = 1, num_neg: int = 1, callbacks: typing.List[BaseCallback] = None)

Bases: torch.utils.data.Dataset
Dataset that is built from a data pack.
Parameters:
- data_pack – DataPack to build the dataset.
- mode – One of “point”, “pair”, and “list”. (default: “point”)
- num_dup – Number of duplications per instance, only effective when mode is “pair”. (default: 1)
- num_neg – Number of negative samples per instance, only effective when mode is “pair”. (default: 1)
- callbacks – Callbacks. See matchzoo.data_generator.callbacks for more details.
Examples
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data(stage='train')
>>> preprocessor = mz.preprocessors.BasicPreprocessor()
>>> data_processed = preprocessor.fit_transform(data_pack)
>>> dataset_point = mz.dataloader.Dataset(data_processed, mode='point')
>>> len(dataset_point)
100
>>> dataset_pair = mz.dataloader.Dataset(
...     data_processed, mode='pair', num_neg=2)
>>> len(dataset_pair)
5
data_pack
    data_pack getter.

callbacks
    callbacks getter.

num_neg
    num_neg getter.

num_dup
    num_dup getter.

mode
    mode getter.

index_pool
    index_pool getter.
__len__(self)
    Get the total number of instances.

__getitem__(self, item: int)
    Get a set of instances from index item.
    Parameters: item – the index of the instance.
_handle_callbacks_on_batch_data_pack(self, batch_data_pack)

_handle_callbacks_on_batch_unpacked(self, x, y)
get_index_pool(self)
    Set the :attr:`_index_pool`. Here the _index_pool records the indices of all the instances.
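The index-pool idea can be illustrated without MatchZoo: in “point” mode every instance maps to its own index, while in “pair” mode each positive/negative group is kept together so a batch never splits a pair-wise training unit. The sketch below is a simplified stand-in; the function name, `group_size` parameter, and grouping logic are illustrative, not MatchZoo's actual internals.

```python
# Illustrative sketch of an index pool; names and structure are
# simplified, not MatchZoo's actual implementation.

def build_index_pool(labels, mode='point', group_size=3):
    """Return a list of index groups over the instances."""
    if mode == 'point':
        # One entry per instance.
        return [[i] for i in range(len(labels))]
    elif mode == 'pair':
        # Keep each (positive + negatives) group together so a batch
        # always draws whole pair-wise units.
        return [list(range(i, i + group_size))
                for i in range(0, len(labels), group_size)]
    raise ValueError(f"unknown mode: {mode}")

point_pool = build_index_pool([1, 0, 0, 1, 0, 0], mode='point')
pair_pool = build_index_pool([1, 0, 0, 1, 0, 0], mode='pair')
print(point_pool)  # [[0], [1], [2], [3], [4], [5]]
print(pair_pool)   # [[0, 1, 2], [3, 4, 5]]
```

Resampling, shuffling, and sorting can then operate on whole groups rather than individual rows.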
sample(self)
    Resample the instances from the data pack.

shuffle(self)
    Shuffle the instances.

sort(self)
    Sort the instances by length_right.

classmethod _reorganize_pair_wise(cls, relation: pd.DataFrame, num_dup: int = 1, num_neg: int = 1)
    Re-organize the data pack in pair-wise format.
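The pair-wise reorganization can be sketched in plain Python. This is a simplified stand-in: MatchZoo operates on a pandas relation DataFrame, whereas here plain dicts play the role of its rows, and the function name mirrors the method above only for readability.

```python
# Simplified sketch of pair-wise reorganization: each positive is
# duplicated num_dup times and grouped with num_neg negatives.

def reorganize_pair_wise(relation, num_dup=1, num_neg=1):
    """Group each positive with num_neg negatives, duplicated num_dup times."""
    # Group rows by query id.
    by_query = {}
    for row in relation:
        by_query.setdefault(row['id_left'], []).append(row)

    pairs = []
    for rows in by_query.values():
        positives = [r for r in rows if r['label'] > 0]
        negatives = [r for r in rows if r['label'] <= 0]
        for pos in positives:
            for _ in range(num_dup):
                # One training unit = one positive + num_neg negatives.
                pairs.append([pos] + negatives[:num_neg])
    return pairs

relation = [
    {'id_left': 'q1', 'id_right': 'd1', 'label': 1},
    {'id_left': 'q1', 'id_right': 'd2', 'label': 0},
    {'id_left': 'q1', 'id_right': 'd3', 'label': 0},
]
pairs = reorganize_pair_wise(relation, num_dup=2, num_neg=1)
print(len(pairs))  # 2: the single positive duplicated twice
```

This also explains the doctest above: 5 pair-wise units from 100 point-wise instances depends on how many positives each query has, not on the raw row count.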
class matchzoo.dataloader.DataLoader(dataset: data.Dataset, batch_size: int = 32, device: typing.Union[torch.device, int, list, None] = None, stage='train', resample: bool = True, shuffle: bool = False, sort: bool = True, callback: BaseCallback = None, pin_memory: bool = False, timeout: int = 0, num_workers: int = 0, worker_init_fn=None)

Bases: object
DataLoader that loads batches of data from a Dataset.
Parameters:
- dataset – The Dataset object to load data from.
- batch_size – Batch size. (default: 32)
- device – The desired device of the returned tensors. If None, use the current device; if torch.device or int, use the device specified by the user; if a list, the first item will be used. (default: None)
- stage – One of “train”, “dev”, and “test”. (default: “train”)
- resample – Whether to resample data between epochs; only effective when the dataset's mode is “pair”. (default: True)
- shuffle – Whether to shuffle data between epochs. (default: False)
- sort – Whether to sort data according to length_right. (default: True)
- callback – BaseCallback. See matchzoo.engine.base_callback.BaseCallback for more details.
- pin_memory – If set to True, tensors will be copied into pinned memory. (default: False)
- timeout – The timeout value for collecting a batch from workers. (default: 0)
- num_workers – The number of subprocesses to use for data loading. 0 means the data will be loaded in the main process. (default: 0)
- worker_init_fn – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
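The worker_init_fn contract can be illustrated independently of MatchZoo: a function that receives the worker id and gives each loader worker its own reproducible random state. The function name and base seed below are illustrative assumptions, not part of the library.

```python
import random

# Hypothetical worker_init_fn: called once in each worker subprocess
# with the worker id, before any data is loaded.
def seed_worker(worker_id, base_seed=42):
    random.seed(base_seed + worker_id)

# Simulating two workers: each gets a distinct, reproducible stream.
seed_worker(0)
first_from_worker0 = random.random()
seed_worker(1)
first_from_worker1 = random.random()

seed_worker(0)
assert random.random() == first_from_worker0   # reproducible per worker
assert first_from_worker0 != first_from_worker1  # distinct across workers
```

Without per-worker seeding, forked workers can inherit identical random state and produce duplicated augmentations or samples.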
Examples
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data(stage='train')
>>> preprocessor = mz.preprocessors.BasicPreprocessor()
>>> data_processed = preprocessor.fit_transform(data_pack)
>>> dataset = mz.dataloader.Dataset(data_processed, mode='point')
>>> padding_callback = mz.dataloader.callbacks.BasicPadding()
>>> dataloader = mz.dataloader.DataLoader(
...     dataset, stage='train', callback=padding_callback)
>>> len(dataloader)
4
id_left :np.ndarray
    id_left getter.

label :np.ndarray
    label getter.

__len__(self)
    Get the total number of batches.

init_epoch(self)
    Resample, shuffle, or sort the dataset for a new epoch.

__iter__(self)
    Iteration.

_handle_callbacks_on_batch_unpacked(self, x, y)
class matchzoo.dataloader.DataLoaderBuilder(**kwargs)

Bases: object

DataLoader builder. In essence, a wrapped partial function.
Example
>>> import matchzoo as mz
>>> padding_callback = mz.dataloader.callbacks.BasicPadding()
>>> builder = mz.dataloader.DataLoaderBuilder(
...     stage='train', callback=padding_callback
... )
>>> data_pack = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor()
>>> data_processed = preprocessor.fit_transform(data_pack)
>>> dataset = mz.dataloader.Dataset(data_processed, mode='point')
>>> dataloader = builder.build(dataset)
>>> type(dataloader)
<class 'matchzoo.dataloader.dataloader.DataLoader'>
build(self, dataset, **kwargs)
    Build a DataLoader.
    Parameters:
    - dataset – Dataset to build upon.
    - kwargs – Additional keyword arguments to override the keyword arguments passed in __init__.
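The “wrapped partial function” behavior described here — keyword arguments stored at construction, then merged with overrides at build time — can be sketched generically. The class and function names below are illustrative, not MatchZoo's.

```python
# Generic sketch of the builder pattern: __init__ kwargs are stored,
# and build() merges in overrides before calling the target.

class PartialBuilder:
    def __init__(self, target, **kwargs):
        self._target = target
        self._kwargs = kwargs

    def build(self, *args, **kwargs):
        # kwargs passed to build() win over the stored defaults.
        merged = {**self._kwargs, **kwargs}
        return self._target(*args, **merged)

def make_loader(dataset, batch_size=32, stage='train'):
    return {'dataset': dataset, 'batch_size': batch_size, 'stage': stage}

builder = PartialBuilder(make_loader, stage='train', batch_size=16)
loader = builder.build(['a', 'b'], batch_size=8)
print(loader['batch_size'])  # 8 — build-time kwargs override __init__'s
```

This lets one builder configuration stamp out many loaders (e.g. one per dataset split) while still allowing per-call tweaks.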
class
matchzoo.dataloader.
DatasetBuilder
(**kwargs)¶ Bases:
object
Dataset Bulider. In essense a wrapped partial function.
Example
>>> import matchzoo as mz
>>> builder = mz.dataloader.DatasetBuilder(
...     mode='point'
... )
>>> data = mz.datasets.toy.load_data()
>>> gen = builder.build(data)
>>> type(gen)
<class 'matchzoo.dataloader.dataset.Dataset'>
build(self, data_pack, **kwargs)
    Build a Dataset.
    Parameters:
    - data_pack – DataPack to build upon.
    - kwargs – Additional keyword arguments to override the keyword arguments passed in __init__.