Why pass a tokenizer to the Trainer with already tokenized data?

Hi there!

I am getting started with implementing the SageMaker integration using this notebook. I noticed that the notebook preprocesses the train and test sets, so that they are already tokenized when the estimator is called later in the notebook. However, looking at the training script it invokes, the tokenizer is also passed to the Trainer. Can somebody explain to me why this is necessary?

This is both a question out of curiosity and one for this concrete application: a) I am worried about tokenizing twice, and b) I have several different tokenizers, so letting the Trainer handle tokenization directly would be less tedious than tokenizing the data first and then providing the tokenizer to the Trainer again.

I assume you are referring to this script. The reason the tokenizer is provided to the Trainer there is that the text isn't tokenized yet: one only provides the train_dataset and eval_dataset, and the Trainer internally uses the tokenizer to tokenize the text.

That said, a tokenizer is often passed to the Trainer even when the dataset is already tokenized, so that the Trainer saves/pushes the tokenizer's config files locally/to the hub along with the model weights.
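
For illustration, here is a minimal sketch of that behaviour; the checkpoint name and the tiny inline dataset are placeholders, not the ones from the notebook:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint; substitute whatever the notebook actually uses.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A tiny, already-tokenized dataset standing in for the preprocessed train set.
encodings = tokenizer(["a positive example", "a negative example"])
train_dataset = Dataset.from_dict(dict(encodings)).add_column("labels", [1, 0])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # not needed for tokenizing here...
)

# ...but because it was passed, save_model() (and push_to_hub()) also write the
# tokenizer files (tokenizer_config.json, vocab, etc.) next to the model weights.
trainer.save_model("out")
```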

Ok, thank you for the quick clarification. I think in this case the data is already tokenized (see the 'Preprocessing' section of the linked notebook, where the tokenizer is mapped over the datasets, which are then pushed to the hub). But if I provide both tokenized data and a tokenizer, does the Trainer still work fine?
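
In case it helps, the preprocessing step there does roughly this (the dataset name and checkpoint below are just placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset and checkpoint.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # No padding here; padding is left to the data collator at training time.
    return tokenizer(batch["text"], truncation=True)

# Map the tokenizer over the splits; the tokenized datasets are then uploaded for training.
tokenized_dataset = dataset.map(tokenize, batched=True)
```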

The docs say that if you pass a tokenizer to the Trainer, then DataCollatorWithPadding is used as the default data collator. I checked the source code, and all it does is call tokenizer.pad to make sure all PyTorch tensors in a batch have the same length. So you're good :slight_smile:
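
A quick sketch of what that padding step does with already-tokenized examples (the checkpoint name is a placeholder):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # placeholder checkpoint

# Two pre-tokenized examples of different lengths, as the Trainer would see them in a batch.
features = [
    dict(tokenizer("a short sentence")),
    dict(tokenizer("a somewhat longer sentence with quite a few more tokens")),
]

collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = collator(features)

# tokenizer.pad brought both examples to the same length, so the batch tensors are rectangular.
print(batch["input_ids"].shape)       # e.g. torch.Size([2, 12])
print(batch["attention_mask"].shape)  # padding positions are masked out
```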

