Using Datasets, DataCollators and DataLoaders to create an NLP data pipeline

It’s hard to have a good grasp of how various libraries and their components interact.

Here are my requirements.

  1. I want to pad my texts to the maximum length within each batch.
  2. I want to shuffle my data after each epoch.

Right now I use the datasets library's map function to pre-tokenize the data and convert it into torch format, like

ds.map(...).with_format("torch")
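
For reference, the map step looks roughly like this (a sketch; the checkpoint, dataset, and column names are placeholders):

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

    def tokenize(batch):
        # Padding here happens per map batch, not per DataLoader batch
        return tokenizer(batch["text"], truncation=True, padding=True)

    ds = load_dataset("imdb")  # placeholder dataset
    ds = ds.map(tokenize, batched=True, remove_columns=["text"]).with_format("torch")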

But the issue is that if I create the DataLoader like this with shuffle on (I am using PyTorch Lightning)

    def train_dataloader(self):
        """The training data loader."""
        return DataLoader(
            self.dataset["train"],  # type: ignore
            shuffle=True,
            collate_fn=self.data_collator, # Default data collator
            batch_size=self.batch_sizes["train"],
        )

then the padding applied during map is useless: after shuffling, a batch ends up with sequences of unequal length.

Maybe I can address this with a different data collator? What is the purpose of a data collator, and can I use it to do the padding? The Transformers library warns that with a fast tokenizer it is much faster to pad in the original tokenizer call than to tokenize and pad in separate steps.
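
If I understand the docs correctly, a collator is just the callable the DataLoader uses to merge a list of samples into one batch, so something like DataCollatorWithPadding might already be what I need (a rough sketch, the checkpoint is a placeholder):

    from transformers import AutoTokenizer, DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Two encodings of different lengths, as they would come out of the dataset
    features = [
        tokenizer("a short sentence"),
        tokenizer("a noticeably longer sentence than the first one"),
    ]
    batch = collator(features)
    print(batch["input_ids"].shape)  # both rows padded to the length of the longer one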

What is the most efficient way of building this pipeline, keeping DDP scenarios in mind? If someone has experience with PyTorch Lightning, I would appreciate an answer in that context, but even a general-purpose answer would be helpful.

These concepts are explained in the course and in the Transformers task pages/examples. The idea is to tokenize the samples without padding inside map, and then pass the task's data collator to the Trainer/DataLoader so that each batch is padded dynamically.
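
A minimal sketch of that pipeline, assuming a text-classification style setup; the checkpoint, dataset, and column names are placeholders:

    from datasets import load_dataset
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer, DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    dataset = load_dataset("imdb")                                  # placeholder dataset

    def tokenize(batch):
        # No padding here: keep variable-length sequences in the dataset
        return tokenizer(batch["text"], truncation=True)

    dataset = dataset.map(tokenize, batched=True, remove_columns=["text"]).with_format("torch")

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    train_loader = DataLoader(
        dataset["train"],
        shuffle=True,               # reshuffled every epoch
        collate_fn=data_collator,   # dynamic padding to the longest sequence in each batch
        batch_size=32,
    )

In Lightning you would return this DataLoader from train_dataloader; under DDP, Lightning replaces the sampler with a DistributedSampler by default, so per-epoch shuffling and sharding across ranks keep working with the collator untouched.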