Custom training - tokenization via collate_fn or __getitem__?

hello,

I am fine-tuning CLIP on 1.1 TB of image-text pairs using PyTorch Lightning Fabric and MosaicML streaming datasets to load the data from multiple shards. When loading, I have the option to apply the tokenizer and processor (for text and images respectively) either in __getitem__, or in collate_fn to process batch-wise and stack the results afterwards.

My question is: which is more recommended, applying transformations/tokenizers in collate_fn or in __getitem__? I have seen very few examples online of people applying tokenizers/processors via collate_fn, even though in theory it should be faster than doing it per sample in __getitem__.
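To make the batch-wise option concrete, here is a minimal sketch of what I mean, assuming a Hugging Face CLIPProcessor and a plain map-style dataset instead of the MosaicML StreamingDataset (PairDataset, collate_tokenize and my_samples are just illustrative names, not my real code):

```python
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for the real shards: a few (PIL image, caption) pairs.
my_samples = [(Image.new("RGB", (224, 224)), f"a toy caption {i}") for i in range(8)]


class PairDataset(Dataset):
    """Returns raw (image, text) pairs; no per-sample processing here."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, text = self.samples[idx]
        return image, text


def collate_tokenize(batch):
    """Run the processor once per batch: tokenizes and pads all captions
    together and stacks the image tensors in a single call."""
    images, texts = zip(*batch)
    return processor(
        text=list(texts),
        images=list(images),
        return_tensors="pt",
        padding=True,      # dynamic padding: only to the longest caption in the batch
        truncation=True,
    )


loader = DataLoader(
    PairDataset(my_samples),
    batch_size=4,
    num_workers=2,
    collate_fn=collate_tokenize,  # runs inside the worker processes, not the training loop
)

if __name__ == "__main__":
    batch = next(iter(loader))
    print(batch["input_ids"].shape, batch["pixel_values"].shape)
```

The appeal of this version, as I understand it, is that padding is dynamic per batch rather than every caption being padded to max_length, and the processor is called once per batch instead of once per sample.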

I tried looking at the ViT example, and it seems the transformations are done on the fly in __getitem__ while collate_fn just stacks the results. Should I follow the same approach to gain training speed and reduce the memory footprint?
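For contrast, this is the per-sample variant I am comparing against, roughly the pattern I understood from the ViT example (again just a sketch with the same made-up names; note that padding has to be fixed-length here so the default collate can stack the tensors):

```python
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
my_samples = [(Image.new("RGB", (224, 224)), f"a toy caption {i}") for i in range(8)]


class ProcessedPairDataset(Dataset):
    """Applies the processor per sample; the default collate just stacks tensors."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, text = self.samples[idx]
        out = processor(
            text=text,
            images=image,
            return_tensors="pt",
            padding="max_length",  # fixed-length padding so every sample has the same shape
            truncation=True,
        )
        # The processor adds a batch dimension of 1 to each field; drop it so
        # the default collate can stack samples into a clean batch.
        return {k: v.squeeze(0) for k, v in out.items()}


loader = DataLoader(ProcessedPairDataset(my_samples), batch_size=4, num_workers=2)

if __name__ == "__main__":
    batch = next(iter(loader))
    print(batch["input_ids"].shape, batch["pixel_values"].shape)
```

The downside I see here is that every caption is padded to the full max_length and the processor runs once per sample, which is why the collate_fn version looks preferable to me on paper.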

thank you in advance

best,