Fine-tuning - tokenize before or when doing a forward pass over batches

Hello everyone,

I have one question regarding performance when fine-tuning over a QA dataset like MSMARCO where you have ~8.8M passages.

My question comes after going through several fine-tuning tutorials, such as Fine-tuning with custom datasets.

Performance-wise, what is the best point at which to tokenize when fine-tuning?

  1. Tokenize the entire dataset up front (compute input_ids and attention_mask before iterating over DataLoader batches for the model forward pass). This is the approach used in the tutorial above (Question Answering with SQuAD 2.0).

  2. Tokenize and run the model forward pass while iterating over DataLoader batches.

The first approach has the advantage of not re-tokenizing at every epoch of the loop. However, it requires more disk space (since we need to store these input_ids) and more disk seeks and reads, which for a dataset the size of MSMARCO can hurt performance.
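To make sure I understand approach 1 correctly, here is a minimal sketch of it. `toy_tokenize` is just a stand-in I made up for a real tokenizer call (e.g. a Hugging Face tokenizer); the point is only that tokenization happens once, before the training loop:

```python
# Approach 1 sketch: tokenize every passage once, before training.
# `toy_tokenize` is a toy stand-in for a real tokenizer; with ~8.8M
# passages, the resulting input_ids must be stored somewhere (RAM or disk).

def toy_tokenize(text):
    # Hash each whitespace token to a fake id -- illustration only.
    return [hash(w) % 1000 for w in text.split()]

passages = ["first passage", "a second passage"]
encoded = [toy_tokenize(p) for p in passages]  # done once, reused every epoch

for epoch in range(2):
    for input_ids in encoded:   # no tokenization inside the loop
        pass                    # model forward pass would go here
```

The trade-off is exactly the one above: the loop body is cheap, but `encoded` grows with the dataset.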

Also, I am not sure whether you can still benefit from Hugging Face datasets, which helps with data-loading throughput (“loading a 18GB dataset like English Wikipedia allocate 9 MB in RAM and you can iterate over the dataset at 1-2 GBit/s in python.”), since you wrap it in a DataLoader, as we can observe in the datasets Quick tour:

import torch

# `dataset` is a Hugging Face datasets.Dataset whose tokenized columns
# (input_ids, attention_mask, ...) were produced earlier
dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
next(iter(dataloader))

The second approach has the disadvantage of re-tokenizing at every epoch, but since it only tokenizes the current batch it avoids those disk seeks and reads: the raw text and the per-batch tensors can presumably stay in RAM.
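A minimal sketch of approach 2, keeping raw text in the dataset and tokenizing inside `collate_fn` so tokenization happens per batch during iteration. `toy_tokenize` is again a made-up stand-in for a real call such as `tokenizer(batch, padding=True, return_tensors="pt")`:

```python
import torch
from torch.utils.data import DataLoader

def toy_tokenize(texts, max_len=8):
    # Toy stand-in for a real tokenizer: hash words to ids (0 is the pad id)
    # and pad every row to max_len, with a matching attention mask.
    batch_ids, batch_mask = [], []
    for t in texts:
        toks = t.split()[:max_len]
        ids = [hash(w) % 1000 + 1 for w in toks]
        pad = max_len - len(ids)
        batch_ids.append(ids + [0] * pad)
        batch_mask.append([1] * len(ids) + [0] * pad)
    return torch.tensor(batch_ids), torch.tensor(batch_mask)

def collate(batch_of_texts):
    # Tokenization happens here, once per batch, every epoch.
    input_ids, attention_mask = toy_tokenize(batch_of_texts)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

passages = ["a short passage", "another slightly longer passage", "one more"]
loader = DataLoader(passages, batch_size=2, collate_fn=collate)
batch = next(iter(loader))
# batch["input_ids"] is built just-in-time; nothing tokenized is ever on disk
```

Each epoch pays the tokenization cost again, but only one batch of tensors exists at a time.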

Honestly, I have not found many fine-tuning examples that use the second approach. What is your opinion on these two approaches for maximizing training performance?