Using Datasets, DataCollators and DataLoaders to create an NLP data pipeline

It’s hard to have a good grasp of how various libraries and their components interact.

Here are my requirements.

  1. I want to pad my texts to the maximum length within each batch.
  2. I want to shuffle my data after each epoch.

Right now I use the datasets library's map function to pre-tokenize the data and convert it into torch format, like

ds.map(...).with_format("torch")
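
For reference, the map step looks roughly like this (a sketch; the checkpoint, dataset, and column names are placeholders):

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

    def tokenize(batch):
        # Padding here happens per map batch, not per DataLoader batch
        return tokenizer(batch["text"], truncation=True, padding=True)

    ds = load_dataset("imdb")  # placeholder dataset
    ds = ds.map(tokenize, batched=True, remove_columns=["text"]).with_format("torch")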

But the issue is that if I create the DataLoader like this with shuffle on (I am using PyTorch Lightning)

    def train_dataloader(self):
        """The training data loader."""
        return DataLoader(
            self.dataset["train"],  # type: ignore
            shuffle=True,
            collate_fn=self.data_collator, # Default data collator
            batch_size=self.batch_sizes["train"],
        )

then the padding applied during map is useless: after shuffling, a batch ends up with sequences of unequal length.

Maybe I can address this with a different data collator? What is the purpose of a data collator, and can I use it to do the padding? The Transformers library warns that with a fast tokenizer it is much faster to pad in the original tokenizer call than to tokenize and pad in separate steps.
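
If I understand the docs correctly, a collator is just the callable the DataLoader uses to merge a list of samples into one batch, so something like DataCollatorWithPadding might already be what I need (a rough sketch, the checkpoint is a placeholder):

    from transformers import AutoTokenizer, DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Two encodings of different lengths, as they would come out of the dataset
    features = [
        tokenizer("a short sentence"),
        tokenizer("a noticeably longer sentence than the first one"),
    ]
    batch = collator(features)
    print(batch["input_ids"].shape)  # both rows padded to the length of the longer one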

What is the most efficient way of building this pipeline, keeping DDP scenarios in mind? If someone has experience with PyTorch Lightning, I would appreciate an answer in that context, but even a general-purpose answer would be helpful.

These concepts are explained in the course and in the Transformers task pages/examples. The idea is to tokenize the samples without padding inside map, and then pass the task's data collator to the Trainer/DataLoader so that each batch is padded dynamically.
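
A minimal sketch of that pipeline, assuming a text-classification style setup; the checkpoint, dataset, and column names are placeholders:

    from datasets import load_dataset
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer, DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    dataset = load_dataset("imdb")                                  # placeholder dataset

    def tokenize(batch):
        # No padding here: keep variable-length sequences in the dataset
        return tokenizer(batch["text"], truncation=True)

    dataset = dataset.map(tokenize, batched=True, remove_columns=["text"]).with_format("torch")

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    train_loader = DataLoader(
        dataset["train"],
        shuffle=True,               # reshuffled every epoch
        collate_fn=data_collator,   # dynamic padding to the longest sequence in each batch
        batch_size=32,
    )

In Lightning you would return this DataLoader from train_dataloader; under DDP, Lightning replaces the sampler with a DistributedSampler by default, so per-epoch shuffling and sharding across ranks keep working with the collator untouched.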