Does `Dataset.map(..., batched=True, batch_size=N)` save the original order?

Hi. I have a dataset:

Dataset({
    features: ['text', 'request_index'],
    num_rows: 1000
})
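
(For anyone who wants to reproduce this: a toy dataset with the same shape can be built with `Dataset.from_dict`; the values below are made up.)

from datasets import Dataset

# Made-up stand-in for the real data: 1000 texts, each tagged with
# the request it belongs to (100 hypothetical requests here).
dataset = Dataset.from_dict({
    'text': [f'text {i}' for i in range(1000)],
    'request_index': [i % 100 for i in range(1000)],
})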

The dataset contains 1000 rows spanning N distinct `request_index` values. I want to build embeddings using a batched `Dataset.map`:

def _get_embeddings(self, texts: t.List[str]) -> t.List[t.List[float]]:
    # Tokenize the whole batch; padding/truncation give tensors a uniform shape.
    encoded_input = self.tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        encoded_input = {k: v.to(self.device) for k, v in encoded_input.items()}
        model_output = self.model(**encoded_input)

    # One pooled embedding per input text, converted to plain Python lists.
    return model_output.pooler_output.tolist()

predictions = dataset.map(
    lambda x: {
        'embeddings': self._get_embeddings(x['text']),
        'request_index': x['request_index'],
    },
    batched=True,
    batch_size=4,
)

After that I have to group the embeddings by `request_index`:

{
    0: [ embedding1, embedding2, ... ],
    1: [ embedding3 ],
    ...
}
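
(For the record, the grouping itself only needs the two columns to line up row by row; a minimal sketch using `collections.defaultdict` and the `predictions` dataset from the `map` call above:)

from collections import defaultdict

# Collect all embeddings that share the same request_index.
grouped = defaultdict(list)
for request_index, embedding in zip(predictions['request_index'],
                                    predictions['embeddings']):
    grouped[request_index].append(embedding)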

The problem is that I couldn't find any information about the row order of the dataset after a batched `map`.

The documentation says the batched `map` method can call the callback on batches in parallel, so I'm not sure I will always get the same row order as in the original dataset.

Yes, `Dataset.map` always preserves the original row order, even with `num_proc > 1`. Batches are taken from the dataset in order, and when multiple processes are used the dataset is split into contiguous shards whose results are concatenated back in shard order, so row `i` of the output always corresponds to row `i` of the input.
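
A quick self-contained check, using toy data and a pass-through callback (`num_proc=2` just to exercise the multiprocessing path):

from datasets import Dataset

def passthrough(batch):
    # Copy the input column so output order can be compared to input order.
    return {'j': batch['i']}

ds = Dataset.from_dict({'i': list(range(1000))})
mapped = ds.map(passthrough, batched=True, batch_size=4, num_proc=2)

assert mapped['j'] == list(range(1000))  # row order is unchanged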