Why pass a tokenizer to the Trainer with already tokenized data?

Hi there!

I am getting started with implementing the SageMaker integration using this notebook. I noticed that the notebook preprocesses the train and test sets, so that they are already tokenized when the estimator is called later in the notebook. However, looking at the training script it invokes, the tokenizer is also passed to the Trainer. Can somebody explain to me why this is necessary?

This is both a question out of curiosity and one for this concrete application: a) I am worried about tokenizing twice, and b) I have several different tokenizers, so letting the Trainer handle tokenization directly would be less tedious than tokenizing the data first and then providing the tokenizer to the Trainer again.

I assume you are referring to this script. The reason the tokenizer is provided to the Trainer there is that the text isn't tokenized yet: one only provides the train_dataset and eval_dataset, and the Trainer internally uses the tokenizer to tokenize the text.

That said, a tokenizer is often passed to the Trainer even when the dataset is already tokenized, so that the Trainer saves/pushes the tokenizer's config files locally/to the hub along with the model weights.
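
For illustration, here is a minimal sketch of that behaviour; the checkpoint name and the tiny inline dataset are placeholders, not the ones from the notebook:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint; substitute whatever the notebook actually uses.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A tiny, already-tokenized dataset standing in for the preprocessed train set.
encodings = tokenizer(["a positive example", "a negative example"])
train_dataset = Dataset.from_dict(dict(encodings)).add_column("labels", [1, 0])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # not needed for tokenizing here...
)

# ...but because it was passed, save_model() (and push_to_hub()) also write the
# tokenizer files (tokenizer_config.json, vocab, etc.) next to the model weights.
trainer.save_model("out")
```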

Ok, thank you for the quick clarification. I think in this case the data is already tokenized (see the 'Preprocessing' section of the linked notebook, where the tokenizer is mapped over the datasets, which are then pushed to the hub). But if I provide both tokenized data and a tokenizer, does the Trainer still work fine?
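
In case it helps, the preprocessing step there does roughly this (the dataset name and checkpoint below are just placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset and checkpoint.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # No padding here; padding is left to the data collator at training time.
    return tokenizer(batch["text"], truncation=True)

# Map the tokenizer over the splits; the tokenized datasets are then uploaded for training.
tokenized_dataset = dataset.map(tokenize, batched=True)
```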

The docs say that if you pass a tokenizer to the Trainer, then DataCollatorWithPadding is used as the default data collator. I checked the source code, and all it does is call tokenizer.pad to make sure all PyTorch tensors in a batch have the same length. So you're good :slight_smile:
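
A quick sketch of what that padding step does with already-tokenized examples (the checkpoint name is a placeholder):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # placeholder checkpoint

# Two pre-tokenized examples of different lengths, as the Trainer would see them in a batch.
features = [
    dict(tokenizer("a short sentence")),
    dict(tokenizer("a somewhat longer sentence with quite a few more tokens")),
]

collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = collator(features)

# tokenizer.pad brought both examples to the same length, so the batch tensors are rectangular.
print(batch["input_ids"].shape)       # e.g. torch.Size([2, 12])
print(batch["attention_mask"].shape)  # padding positions are masked out
```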

