Limitations of iterable datasets

Hi!

_ The map function of iterable datasets doesn't seem to accept the num_proc argument. I wonder whether this will create a bottleneck in my code, or if dataloader_num_workers will allow the iterable dataset to run with fast multi-processing?

Adding support for multiple workers (num_workers > 1) to IterableDataset is a work in progress and will (most likely) be available in the next release of datasets. But in your case, for maximum performance, it's better to use the standard arrow-backed Dataset. Thanks to memory mapping, this version also doesn't load everything into memory (only the requested rows/columns).

You can create a dataset from parquet files (the arrow-backed version) as follows:

from datasets import load_dataset
dataset = load_dataset("parquet", data_files=[<list of paths to parquet files>])
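Once the dataset is arrow-backed, map does accept num_proc, so the preprocessing itself can run in parallel. A minimal sketch (the "text" column name and num_proc=4 are just illustrative assumptions, not part of your setup):

def add_length(example):
    # hypothetical preprocessing step: store the length of the "text" column
    example["length"] = len(example["text"])
    return example

# num_proc=4 runs the map in 4 worker processes (arrow-backed Dataset only)
dataset = dataset.map(add_length, num_proc=4)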

_ When working with run_mlm.py, the Trainer, and an iterable dataset, what changes should I make for parallel processing, please?
I read this Process page, but I am not sure if it applies here.

You can use the training_args.main_process_first context manager for that (for the arrow-backed dataset). You can find an example here.
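For reference, a minimal sketch along the lines of what run_mlm.py does; tokenize_function, data_args and column_names are names defined in that script, not shown here:

with training_args.main_process_first(desc="dataset map tokenization"):
    # the main process runs (and caches) the map first; the other processes
    # then reload the cached result instead of redoing the work
    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        desc="Running tokenizer on dataset",
    )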

_ My datasets are stored as .parquet files containing input sequences as well as labels/metadata. One column I would like to add is a sampling probability, in order to over-sample certain training examples.
Is there any way to do this inside an iterable dataset, or should I consider duplicating training examples as a pre-processing step?

I’m not sure I understand this question. Could you clarify it a bit more?