Limitations of iterable datasets

Hi!

_ The map function of iterable datasets doesn't seem to accept the num_proc argument. I wonder whether this will create a bottleneck in my code, or if dataloader_num_workers will allow the iterable dataset to run with fast multi-processing?

Adding support for multiple workers (num_workers > 1) to IterableDataset is a work in progress and will (most likely) be available in the next release of datasets. But in your case, for maximum performance, it's better to use the standard arrow-backed Dataset. Thanks to memory mapping, this version also doesn't load everything into memory (only the requested rows/columns).

You can create a dataset from parquet files (the arrow-backed version) as follows:

from datasets import load_dataset
dataset = load_dataset("parquet", data_files=[<list of paths to parquet files>])
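Once the dataset is arrow-backed, map does accept num_proc, so the preprocessing itself can run in parallel. A minimal sketch (the "text" column name and num_proc=4 are just illustrative assumptions, not part of your setup):

def add_length(example):
    # hypothetical preprocessing step: store the length of the "text" column
    example["length"] = len(example["text"])
    return example

# num_proc=4 runs the map in 4 worker processes (arrow-backed Dataset only)
dataset = dataset.map(add_length, num_proc=4)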

_ When working with run_mlm.py, the Trainer, and an iterable dataset, what changes should I make for parallel processing, please?
I read this Process page, but I am not sure if it applies here.

You can use the training_args.main_process_first context manager for that (for the arrow-backed dataset). You can find an example here.
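For reference, a minimal sketch along the lines of what run_mlm.py does; tokenize_function, data_args and column_names are names defined in that script, not shown here:

with training_args.main_process_first(desc="dataset map tokenization"):
    # the main process runs (and caches) the map first; the other processes
    # then reload the cached result instead of redoing the work
    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        desc="Running tokenizer on dataset",
    )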

_ My datasets are stored as .parquet files containing input sequences as well as labels/metadata. One column I would like to add is a sampling probability, in order to over-sample certain training examples.
Is there any way to do this inside an iterable dataset, or should I consider duplicating training examples as a pre-processing step?

I’m not sure I understand this question. Could you clarify it a bit more?