hello,
I am finetuning CLIP on 1.1 TB of image–text pairs using PyTorch Lightning Fabric and MosaicML streaming datasets to load the data from multiple shards. When loading, I have the option to apply the tokenizer and processor (for text and images respectively) in __getitem__,
or to apply them in collate_fn batch-wise and stack the results afterwards.
My question is: which is more recommended, applying the transformations/tokenizers in collate_fn
or in __getitem__
? I have seen very few examples online of people applying tokenizers/processors in collate_fn
(theoretically it should be faster than in __getitem__, since the tokenizer can process the whole batch in one call).
I looked at the ViT example, and it seems the transformations are done on the fly in __getitem__, with collate_fn only stacking the samples. Should I follow the same approach to gain training speed and reduce the memory footprint?
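For context, here is a minimal sketch of the batch-wise option I mean. The `fake_tokenize` helper is a hypothetical stand-in for a real batch tokenizer/processor call (e.g. a HuggingFace CLIPProcessor); images are kept as plain lists instead of tensors so the sketch is self-contained:

```python
def fake_tokenize(texts, max_len=8):
    """Hypothetical stand-in for a real batch tokenizer:
    maps words to integer ids and pads each sequence to max_len."""
    vocab = {}
    batch = []
    for t in texts:
        ids = [vocab.setdefault(w, len(vocab) + 1) for w in t.split()][:max_len]
        ids += [0] * (max_len - len(ids))  # pad with 0 up to max_len
        batch.append(ids)
    return batch

def collate_fn(samples):
    """samples: list of (image, caption) pairs straight from __getitem__.
    The tokenizer runs once per batch, amortizing its per-call overhead,
    instead of once per sample inside __getitem__."""
    images, captions = zip(*samples)
    return {
        "pixel_values": list(images),  # in real code: torch.stack(images)
        "input_ids": fake_tokenize(captions),
    }

batch = collate_fn([
    ([0.1, 0.2], "a photo of a cat"),
    ([0.3, 0.4], "a photo of a dog"),
])
```

The same logic applies with a real processor: move the batched call into collate_fn and leave __getitem__ returning raw PIL images and caption strings.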
Thank you in advance.
best,