hello,
I am finetuning CLIP on 1.1 TB of image–text pairs using PyTorch Lightning Fabric and MosaicML streaming datasets to load the data from multiple shards. When loading, I have the option to apply the tokenizer and processor (for text and images respectively) in __getitem__,
or to apply them in collate_fn batch-wise and stack the results afterwards.
My question is: which is more recommended, applying the transformations/tokenizers in collate_fn
or in __getitem__
? I have seen very few examples online of people applying tokenizers/processors in collate_fn
(theoretically it should be faster than in __getitem__, since the tokenizer can process the whole batch in one call).
I looked at the ViT example, and it seems the transformations are done on the fly in __getitem__, with collate_fn only stacking the samples. Should I follow the same approach to gain training speed and reduce the memory footprint?
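For context, here is a minimal sketch of the batch-wise option I mean. The `fake_tokenize` helper is a hypothetical stand-in for a real batch tokenizer/processor call (e.g. a HuggingFace CLIPProcessor); images are kept as plain lists instead of tensors so the sketch is self-contained:

```python
def fake_tokenize(texts, max_len=8):
    """Hypothetical stand-in for a real batch tokenizer:
    maps words to integer ids and pads each sequence to max_len."""
    vocab = {}
    batch = []
    for t in texts:
        ids = [vocab.setdefault(w, len(vocab) + 1) for w in t.split()][:max_len]
        ids += [0] * (max_len - len(ids))  # pad with 0 up to max_len
        batch.append(ids)
    return batch

def collate_fn(samples):
    """samples: list of (image, caption) pairs straight from __getitem__.
    The tokenizer runs once per batch, amortizing its per-call overhead,
    instead of once per sample inside __getitem__."""
    images, captions = zip(*samples)
    return {
        "pixel_values": list(images),  # in real code: torch.stack(images)
        "input_ids": fake_tokenize(captions),
    }

batch = collate_fn([
    ([0.1, 0.2], "a photo of a cat"),
    ([0.3, 0.4], "a photo of a dog"),
])
```

The same logic applies with a real processor: move the batched call into collate_fn and leave __getitem__ returning raw PIL images and caption strings.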
Thank you in advance.
best,