matchzoo.data_pack
¶
Package Contents¶
-
class
matchzoo.data_pack.
DataPack
(relation: pd.DataFrame, left: pd.DataFrame, right: pd.DataFrame)¶ Bases:
object
MatchZoo
DataPack
data structure that stores data frames and context. DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.
Parameters: - relation – Stores the relation between left and right documents using their ids.
- left – Stores the content or features for id_left.
- right – Stores the content or features for id_right.
Example
>>> import pandas as pd
>>> from matchzoo.data_pack import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2
-
class
FrameView
(data_pack: DataPack)¶ Bases:
object
FrameView.
-
__getitem__
(self, index: typing.Union[int, slice, np.array])¶ Slicer.
-
__call__
(self)¶ Returns: A full copy. Equivalent to frame[:].
-
-
DATA_FILENAME
= data.dill¶
-
has_label
:bool¶ Returns True if the label column exists, False otherwise.
-
frame
:'DataPack.FrameView'¶ View the data pack as a pandas.DataFrame. The returned data frame is created by merging the left, right and relation data frames. Use [] to access an item or a slice of items.
Returns: A matchzoo.DataPack.FrameView instance.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
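The merge that frame describes can be approximated with plain pandas. The sketch below is an illustration under assumptions, not MatchZoo's implementation: it assumes id_left and id_right are ordinary columns of the left and right frames, and the helper name merge_frames is hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the three parts of a DataPack.
left = pd.DataFrame({"id_left": ["qid1", "qid2"],
                     "text_left": ["query 1", "query 2"]})
right = pd.DataFrame({"id_right": ["did1", "did2"],
                      "text_right": ["document 1", "document 2"]})
relation = pd.DataFrame({"id_left": ["qid1", "qid2"],
                         "id_right": ["did1", "did2"],
                         "label": [1, 1]})


def merge_frames(relation, left, right):
    """Join left and right content onto each relation row (illustrative only)."""
    merged = relation.merge(left, on="id_left", how="left")
    merged = merged.merge(right, on="id_right", how="left")
    # Order the columns the way the frame example above lists them.
    return merged[["id_left", "text_left", "id_right", "text_right", "label"]]


frame = merge_frames(relation, left, right)
print(frame)
```

Each row of relation yields exactly one row in the merged frame, which is why the length of the full frame matches the length of the data pack in the example above.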
-
relation
¶ Getter for the relation data frame.
-
__len__
(self)¶ Get the number of rows in the DataPack object.
-
unpack
(self)¶ Unpack the data for training.
The return value can be fed directly to model.fit or model.fit_generator.
Returns: A tuple of (X, y). y is None if self has no label.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>
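To make the shape of the (X, y) tuple concrete, here is a minimal sketch of how a merged frame could be split into features and labels. It is an assumption about the general idea, not MatchZoo's code; unpack_frame is a hypothetical helper.

```python
import pandas as pd


def unpack_frame(frame):
    """Split a merged frame into (X, y); y is None without a label column."""
    feature_columns = [col for col in frame.columns if col != "label"]
    # One array per feature column, keyed by column name.
    X = {col: frame[col].to_numpy() for col in feature_columns}
    y = frame["label"].to_numpy() if "label" in frame.columns else None
    return X, y


frame = pd.DataFrame({
    "id_left": ["qid1", "qid2"],
    "text_left": ["query 1", "query 2"],
    "id_right": ["did1", "did2"],
    "text_right": ["document 1", "document 2"],
    "label": [1, 0],
})

X, y = unpack_frame(frame)
X_unlabeled, y_unlabeled = unpack_frame(frame.drop(columns="label"))
```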
-
__getitem__
(self, index: typing.Union[int, slice, np.array])¶ Get specific item(s) as a new DataPack. The returned DataPack is a copy of the corresponding subset of the original DataPack.
Parameters: index – Index of the item(s) to get.
Returns: An instance of DataPack.
-
copy
(self)¶ Returns: A deep copy.
-
save
(self, dirpath: typing.Union[str, Path])¶ Save the DataPack object. A saved DataPack is represented as a directory containing a DataPack object (transformed user input as features and context), which is saved by pickle.
Parameters: dirpath – Directory path of the saved DataPack.
-
_optional_inplace
(func)¶ Decorator that adds an inplace keyword argument to a method.
Decorate any method that modifies its object in place to make the in-place change optional.
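The pattern this decorator describes can be sketched in pure Python. The code below is an illustration of the general technique, not MatchZoo's source: optional_inplace and the Basket class are hypothetical names, and copying with copy.deepcopy is an assumption.

```python
import copy
import functools


def optional_inplace(func):
    """Add an inplace keyword argument to a method that mutates self."""

    @functools.wraps(func)
    def wrapper(self, *args, inplace=False, **kwargs):
        # Work on the object itself, or on a deep copy when inplace is False.
        target = self if inplace else copy.deepcopy(self)
        func(target, *args, **kwargs)
        # Return the modified copy, or nothing when the change was in place.
        return None if inplace else target

    return wrapper


class Basket:
    """Hypothetical container used only to exercise the decorator."""

    def __init__(self, items):
        self.items = list(items)

    @optional_inplace
    def drop_empty(self):
        self.items = [item for item in self.items if item]


basket = Basket(["a", "", "b"])
cleaned = basket.drop_empty()              # returns a modified copy
untouched = list(basket.items)             # original still has the empty item
result = basket.drop_empty(inplace=True)   # modifies basket, returns None
```

Returning None for in-place calls matches the doctest examples on this page, where calls with inplace=True print no result.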
-
drop_empty
(self)¶ Process empty data by removing corresponding rows.
Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)
-
shuffle
(self)¶ Shuffle the data pack by shuffling the rows of the relation data frame.
Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True
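Shuffling only the relation rows keeps every (id_left, id_right, label) triple intact while changing their order. A minimal pandas sketch of that idea follows; it is an assumption about the approach, not MatchZoo's implementation, and the data is made up for illustration.

```python
import pandas as pd

# Hypothetical relation frame.
relation = pd.DataFrame({
    "id_left": ["q1", "q1", "q2", "q3"],
    "id_right": ["d1", "d2", "d2", "d3"],
    "label": [0, 1, 1, 0],
})

# Sample every row without replacement, then renumber the index.
shuffled = relation.sample(frac=1, random_state=0).reset_index(drop=True)
```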
-
drop_label
(self)¶ Remove label column from the data pack.
Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False
-
append_text_length
(self, verbose=1)¶ Append length_left and length_right columns.
Parameters: - inplace – True to modify inplace, False to return a modified copy. (default: False)
- verbose – Verbosity.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length(verbose=0)
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> 'length_left' in data_pack.frame[0].columns
True
-
apply_on_text
(self, func: typing.Callable, mode: str = 'both', rename: typing.Optional[str] = None, verbose: int = 1)¶ Apply func to text columns based on mode.
Parameters: - func – The function to apply.
- mode – One of “both”, “left” and “right”.
- rename – If set, use new names for results instead of replacing the original columns. To set rename in “both” mode, use a tuple of str, e.g. (“text_left_new_name”, “text_right_new_name”).
- inplace – True to modify inplace, False to return a modified copy. (default: False)
- verbose – Verbosity.
Examples
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame
To apply len on the left text and add the result as 'length_left':
>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']
To do the same to the right text:
>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']
To do the same to both texts at the same time:
>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']
To suppress outputs:
>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)
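The effect of the rename parameter can be illustrated on a single pandas column. This is a simplified sketch under assumptions, not the MatchZoo implementation: apply_on_column is a hypothetical helper that handles one column and always returns a copy.

```python
import pandas as pd


def apply_on_column(frame, func, column, rename=None):
    """Return a copy of frame with func applied to one text column."""
    result = frame.copy()
    # Write to a new column when rename is given, else overwrite the original.
    result[rename or column] = result[column].apply(func)
    return result


left = pd.DataFrame({"text_left": ["query one", "q2"]})

with_length = apply_on_column(left, len, "text_left", rename="length_left")
replaced = apply_on_column(left, str.upper, "text_left")
```

With rename set, the original text column survives next to the new one; without it, the results replace the text in place of the original column.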
-
_apply_on_text_right
(self, func, rename, verbose=1)¶
-
_apply_on_text_left
(self, func, rename, verbose=1)¶
-
_apply_on_text_both
(self, func, rename, verbose=1)¶
-
matchzoo.data_pack.
load_data_pack
(dirpath: typing.Union[str, Path]) → DataPack¶ Load a DataPack. The reverse function of save().
Parameters: dirpath – Directory path of the saved DataPack.
Returns: A DataPack instance.
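The save and load pair amounts to a round trip through a directory that holds one serialized file. The sketch below shows that layout with the standard library only. It is an assumption-laden illustration: save_object and load_object are hypothetical helpers, the file name is taken from DATA_FILENAME above, and the standard pickle module stands in for whatever serializer MatchZoo actually uses.

```python
import pickle
import tempfile
from pathlib import Path

DATA_FILENAME = "data.dill"  # file name taken from DataPack.DATA_FILENAME


def save_object(obj, dirpath):
    """Write obj to dirpath/DATA_FILENAME, creating the directory if needed."""
    dirpath = Path(dirpath)
    dirpath.mkdir(parents=True, exist_ok=True)
    with open(dirpath / DATA_FILENAME, "wb") as file:
        pickle.dump(obj, file)


def load_object(dirpath):
    """Read back the object stored by save_object."""
    with open(Path(dirpath) / DATA_FILENAME, "rb") as file:
        return pickle.load(file)


with tempfile.TemporaryDirectory() as tmp:
    # A plain dict stands in for a DataPack in this sketch.
    payload = {"relation": [["qid1", "did1", 1]]}
    target = Path(tmp) / "my_pack"
    save_object(payload, target)
    saved_files = sorted(path.name for path in target.iterdir())
    restored = load_object(target)
```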
-
matchzoo.data_pack.
pack
(df: pd.DataFrame, task: typing.Union[str, BaseTask] = 'ranking') → 'matchzoo.DataPack'¶ Pack a DataPack using df. The df must have text_left and text_right columns. Optionally, the df can have id_left and id_right columns to index text_left and text_right respectively; id_left and id_right are generated automatically if not specified.
Parameters: - df – Input pandas.DataFrame to use.
- task – Could be one of ranking, classification or a matchzoo.engine.BaseTask instance.
Examples
>>> import matchzoo as mz
>>> import pandas as pd
>>> df = pd.DataFrame(data={'text_left': list('AABC'),
...                         'text_right': list('abbc'),
...                         'label': [0, 1, 1, 0]})
>>> mz.pack(df, task='classification').frame()
  id_left text_left id_right text_right  label
0     L-0         A      R-0          a      0
1     L-0         A      R-1          b      1
2     L-1         B      R-1          b      1
3     L-2         C      R-2          c      0
>>> mz.pack(df, task='ranking').frame()
  id_left text_left id_right text_right  label
0     L-0         A      R-0          a    0.0
1     L-0         A      R-1          b    1.0
2     L-1         B      R-1          b    1.0
3     L-2         C      R-2          c    0.0
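In the output above, rows that share the same text also share an id (both 'A' rows get L-0). One way to assign ids consistently with that output is sketched below. It is a guess at the behavior based only on this example, not MatchZoo's code, and generate_ids is a hypothetical helper.

```python
def generate_ids(texts, prefix):
    """Give each distinct text an id '<prefix>-<n>' in first-seen order."""
    seen = {}
    ids = []
    for text in texts:
        if text not in seen:
            # The next free number is the count of distinct texts so far.
            seen[text] = f"{prefix}-{len(seen)}"
        ids.append(seen[text])
    return ids


left_ids = generate_ids(list("AABC"), "L")
right_ids = generate_ids(list("abbc"), "R")
print(left_ids)   # ['L-0', 'L-0', 'L-1', 'L-2']
print(right_ids)  # ['R-0', 'R-1', 'R-1', 'R-2']
```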