Fine-tuning - tokenize before or when doing a forward pass over batches

Hello everyone,

I have a question about performance when fine-tuning on a QA dataset like MS MARCO, which has ~8.8M passages.

My question comes after going through several fine-tuning tutorials, such as Fine-tuning with custom datasets.

Performance-wise, what is the best approach for when to tokenize during fine-tuning?

  1. Tokenize the entire dataset up front (compute input_ids and attention_mask before iterating over DataLoader batches for the model forward pass). This is the approach used in the tutorial above (Question Answering with SQuAD 2.0).

  2. Tokenize on the fly while iterating over DataLoader batches, just before each model forward pass.

The first approach has the advantage of not repeating tokenization every epoch. However, it requires more disk space (since the input_ids have to be stored) and more disk seeks and reads, which for a dataset as large as MS MARCO can hurt performance.
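To make the first approach concrete, here is a rough sketch of what I mean (the toy dataset, the "query"/"passage" column names and the bert-base-uncased model are just placeholders, not the actual MS MARCO schema):

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModel

# Toy stand-in for the ~8.8M MS MARCO passages; column names are made up.
dataset = Dataset.from_dict({
    "query":   ["what is a transformer?", "capital of france"],
    "passage": ["A transformer is a neural network architecture.", "Paris is the capital of France."],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_fn(examples):
    # Fixed-length padding so the tensors can be materialized once, up front.
    return tokenizer(examples["query"], examples["passage"],
                     truncation=True, padding="max_length", max_length=64)

# map() runs once before training; Datasets writes the result to an Arrow
# cache file on disk, which is the extra disk allocation mentioned above.
tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=dataset.column_names)
tokenized.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask"])

model = AutoModel.from_pretrained("bert-base-uncased")
dataloader = torch.utils.data.DataLoader(tokenized, batch_size=2)
for batch in dataloader:   # every epoch reuses the cached tensors, no re-tokenization
    outputs = model(**batch)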

Also, I am not sure whether you can still take advantage of Hugging Face Datasets, which helps with data-loading throughput (“loading a 18GB dataset like English Wikipedia allocate 9 MB in RAM and you can iterate over the dataset at 1-2 GBit/s in python.”), since you end up wrapping it in a DataLoader, as we can see in the Datasets Quick tour:

import torch
dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
next(iter(dataloader))

The second approach has the disadvantage of repeating tokenization every epoch, but since the tokens are computed per batch it avoids the extra disk seeks and reads: the raw text can stay in RAM and the token tensors only exist for the current batch.
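The second approach would look roughly like this (same placeholder data as above; the tokenization now lives in the DataLoader's collate_fn and runs once per batch, every epoch):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Raw text pairs stay in RAM (or in a memory-mapped dataset); nothing tokenized is written to disk.
pairs = [
    ("what is a transformer?", "A transformer is a neural network architecture."),
    ("capital of france", "Paris is the capital of France."),
]

def collate_fn(batch):
    queries, passages = zip(*batch)
    # Tokenization happens here, once per batch, on every epoch.
    return tokenizer(list(queries), list(passages),
                     truncation=True, padding=True, return_tensors="pt")

dataloader = torch.utils.data.DataLoader(pairs, batch_size=2, collate_fn=collate_fn)
for epoch in range(3):
    for batch in dataloader:   # tensors are only padded to the longest item in the batch
        outputs = model(**batch)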

Honestly, I have not found many fine-tuning examples that use the second approach. What is your opinion on these two approaches for maximizing training performance?


Hi, is there any answer about this yet?
I’d like to know about this issue too.
If anyone has an answer, please help.

I would love to see some clarification on this. In the context of image/text models, it’s difficult to know which approach will be optimal.

For now I tend to use the tokeniser (or processor, in my case) within the data collator, which pulls in a single batch as and when it is required. I do this so that the padding is correct for all items in a given batch.
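Roughly, it looks like this (a sketch with a plain text tokeniser and a made-up "text" column; in my case the tokeniser would be swapped for the model's processor):

import torch
from dataclasses import dataclass
from transformers import AutoTokenizer, PreTrainedTokenizerBase

@dataclass
class TokenisingCollator:
    tokenizer: PreTrainedTokenizerBase   # or a processor for image/text models

    def __call__(self, batch):
        texts = [example["text"] for example in batch]
        # Padding is decided per batch, so it is correct for exactly these items.
        return self.tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

collator = TokenisingCollator(AutoTokenizer.from_pretrained("bert-base-uncased"))
examples = [{"text": "a short passage"}, {"text": "a somewhat longer passage about transformers"}]
dataloader = torch.utils.data.DataLoader(examples, batch_size=2, collate_fn=collator)
print(next(iter(dataloader))["input_ids"].shape)   # padded only to the longest item in this batch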

Though when working with text only, it often does not take much time to just tokenise everything in one go.

I think, performance-wise, tokenising at the batch level is actually better because your inputs will only be as big as they need to be, whereas if you tokenise ahead of time then all of your inputs are padded to max_length. Though I am unsure how much this would impact performance.
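To make the size difference concrete, something like this shows it (the 512 max_length is just an example value):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["a short passage", "a slightly longer passage about transformers"]

ahead_of_time = tokenizer(texts, padding="max_length", max_length=512, return_tensors="pt")
per_batch = tokenizer(texts, padding=True, return_tensors="pt")

print(ahead_of_time["input_ids"].shape)   # torch.Size([2, 512]) -- every input padded to max_length
print(per_batch["input_ids"].shape)       # e.g. torch.Size([2, 9]) -- only as long as the longest item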

I may experiment with this soon to see what works best; if I get any solid stats, I will let you both know.