Batching vs. Sharding a Large Dataset

Hi,
I have a very large dataset that fails in the preprocessing stage because of its size.
I’ve looked into two options: sharding the dataset vs. using batched=True on the map(…) function.
What are the advantages and use cases of each?

Also, can anyone point me to an example of each approach?
Shards - The official docs lack an example of training with shards (e.g., how do you run a full training with shards? Do you train for one epoch on one shard and then switch to the next?). See the first sketch below.

Batched - I’m not sure what this error means:
Provided function which is applied to all elements of table returns a dict of types [<class 'list'>, <class 'list'>, <class 'str'>]. When using batched=True, make sure provided function returns a dict of types like (<class 'list'>, <class 'numpy.ndarray'>)
Any clear example showing batched=True would help! See the second sketch below.
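
Here is a minimal sketch of shard-by-shard training, preprocessing and training on one shard at a time; the dataset name, preprocess, and train_on are hypothetical placeholders:

```python
from datasets import load_dataset

def preprocess(example):
    # placeholder preprocessing; replace with your real feature extraction
    return example

def train_on(dataset):
    # placeholder for one pass of your training loop over `dataset`
    pass

raw = load_dataset("my_dataset", split="train")  # "my_dataset" is a placeholder

num_shards = 10  # pick this so one preprocessed shard fits on disk / in memory
for epoch in range(3):  # the number of epochs is also a placeholder
    for index in range(num_shards):
        # shard() selects every num_shards-th example starting at `index`,
        # so each shard covers roughly 1/num_shards of the dataset
        shard = raw.shard(num_shards=num_shards, index=index)
        train_on(shard.map(preprocess))
```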
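
And a minimal, self-contained sketch of batched=True: with batching, map passes your function a dict mapping each column name to a list of values, and the function must return a dict of lists (or numpy arrays), never a bare str for a column; the toy "text" column here is hypothetical:

```python
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["hello", "hi there", "hey"]})

def add_length(batch):
    # batch is a dict of lists, e.g. {"text": ["hello", "hi there", "hey"]};
    # return a dict of lists (or numpy arrays), one value per example.
    # Returning a plain str for a column triggers the error quoted above.
    return {"text_length": [len(t) for t in batch["text"]]}

dataset = dataset.map(add_length, batched=True, batch_size=1000)
```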

Hi! Do you need to shard your dataset because you don’t have enough disk space?
Have you considered doing the preprocessing on the fly during training, to avoid filling up your disk?

Hi!
I have 3 TB of disk space and 128 GB of memory. The custom dataset I’m using is audio + transcript (text) data, about 100 GB in total.
By preprocessing on the fly, are you referring to set_transform?

@lhoestq
I was able to resolve the issue with set_transform along with remove_unused_columns=False on the Trainer!
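
For anyone landing here later, roughly what that looks like; the dataset name and feature_extractor are hypothetical, and the audio column is assumed to be decoded into arrays by the Audio feature:

```python
from datasets import load_dataset

dataset = load_dataset("my_audio_dataset", split="train")  # placeholder name

def preprocess(batch):
    # the transform receives a dict of lists and must return a dict of lists;
    # feature_extractor is a placeholder for your real audio feature extractor
    batch["input_values"] = [feature_extractor(a["array"]) for a in batch["audio"]]
    return batch

# The transform runs lazily whenever examples are accessed, so the
# preprocessed features are never written to disk
dataset.set_transform(preprocess)
```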

Hi! I’m glad you managed to resolve your issue 🙂

Indeed, to use the Trainer with a dataset that uses a transform, you must set remove_unused_columns to False.
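
A minimal sketch of the corresponding Trainer setup, assuming model and the transformed dataset from the post above already exist:

```python
from transformers import Trainer, TrainingArguments

# remove_unused_columns defaults to True, which drops the raw columns
# (e.g. "audio", "text") that the model's forward() doesn't accept;
# the transform still needs those columns, so keep them around
args = TrainingArguments(output_dir="out", remove_unused_columns=False)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```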