Batching vs. Sharding a Large Dataset

Hi,
I have a very large dataset that fails in the preprocessing stage because of its size.
I’ve looked into two options: sharding the dataset vs. using batched=True on the map(…) function.
What are the advantages and use cases of each?

Also, can anyone point me to an example of each approach?
Shards - The official docs lack an example of training with shards (e.g., how do you run a full training with shards? Do you train for one epoch on one shard and then switch to the next?). See the first sketch below.

Batched - I’m not sure what this error means:
Provided function which is applied to all elements of table returns a dict of types [<class 'list'>, <class 'list'>, <class 'str'>]. When using batched=True, make sure provided function returns a dict of types like (<class 'list'>, <class 'numpy.ndarray'>)
Any clear example showing batched=True would help! See the second sketch below.
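
Here is a minimal sketch of shard-by-shard training, preprocessing and training on one shard at a time; the dataset name, preprocess, and train_on are hypothetical placeholders:

```python
from datasets import load_dataset

def preprocess(example):
    # placeholder preprocessing; replace with your real feature extraction
    return example

def train_on(dataset):
    # placeholder for one pass of your training loop over `dataset`
    pass

raw = load_dataset("my_dataset", split="train")  # "my_dataset" is a placeholder

num_shards = 10  # pick this so one preprocessed shard fits on disk / in memory
for epoch in range(3):  # the number of epochs is also a placeholder
    for index in range(num_shards):
        # shard() selects every num_shards-th example starting at `index`,
        # so each shard covers roughly 1/num_shards of the dataset
        shard = raw.shard(num_shards=num_shards, index=index)
        train_on(shard.map(preprocess))
```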
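
And a minimal, self-contained sketch of batched=True: with batching, map passes your function a dict mapping each column name to a list of values, and the function must return a dict of lists (or numpy arrays), never a bare str for a column; the toy "text" column here is hypothetical:

```python
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["hello", "hi there", "hey"]})

def add_length(batch):
    # batch is a dict of lists, e.g. {"text": ["hello", "hi there", "hey"]};
    # return a dict of lists (or numpy arrays), one value per example.
    # Returning a plain str for a column triggers the error quoted above.
    return {"text_length": [len(t) for t in batch["text"]]}

dataset = dataset.map(add_length, batched=True, batch_size=1000)
```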

Hi! Do you need to shard your dataset because you don’t have enough disk space?
Have you considered doing the preprocessing on the fly during training, to avoid filling up your disk?

Hi!
I have 3 TB of disk space and 128 GB of memory. The custom dataset I’m using is audio + transcript (text) data, about 100 GB in total.
By preprocessing on the fly, are you referring to set_transform?

@lhoestq
I was able to resolve the issue with set_transform along with remove_unused_columns=False on the Trainer!
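
For anyone landing here later, roughly what that looks like; the dataset name and feature_extractor are hypothetical, and the audio column is assumed to be decoded into arrays by the Audio feature:

```python
from datasets import load_dataset

dataset = load_dataset("my_audio_dataset", split="train")  # placeholder name

def preprocess(batch):
    # the transform receives a dict of lists and must return a dict of lists;
    # feature_extractor is a placeholder for your real audio feature extractor
    batch["input_values"] = [feature_extractor(a["array"]) for a in batch["audio"]]
    return batch

# The transform runs lazily whenever examples are accessed, so the
# preprocessed features are never written to disk
dataset.set_transform(preprocess)
```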

Hi! I’m glad you managed to resolve your issue 🙂

Indeed, to use the Trainer with a dataset that uses a transform, you must set remove_unused_columns to False.
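
A minimal sketch of the corresponding Trainer setup, assuming model and the transformed dataset from the post above already exist:

```python
from transformers import Trainer, TrainingArguments

# remove_unused_columns defaults to True, which drops the raw columns
# (e.g. "audio", "text") that the model's forward() doesn't accept;
# the transform still needs those columns, so keep them around
args = TrainingArguments(output_dir="out", remove_unused_columns=False)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```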