Hi!
_ The `map` function of iterable datasets doesn't seem to accept the `num_proc` argument. I wonder whether this will create a bottleneck in my code, or whether `dataloader_num_workers` will allow the iterable dataset to operate with fast multi-processing?
Adding support for multiple workers (`num_workers > 1`) to `IterableDataset` is a work in progress and will (most likely) be available in the next release of `datasets`. But in your case, for maximum performance, it's better to use the standard Arrow-backed `Dataset`. Thanks to memory mapping, this version also doesn't load everything into memory (only the requested rows/columns).
You can create a dataset from parquet files (the Arrow-backed version) as follows:

```python
from datasets import load_dataset

dataset = load_dataset("parquet", data_files=[<list of paths to parquet files>])
```
_ When working in run_mlm.py with the Trainer and an iterable dataset, what changes are needed for parallel processing, please? I read the Process guide but I am not sure whether it applies here.
You can use the `training_args.main_process_first` context manager for that (for the Arrow-backed dataset). You can find an example here.
_ My datasets are stored as .parquet files containing input sequences as well as labels/meta-data. One column I would like to make use of is a sampling probability, in order to over-sample certain training examples.
Is there any way to do this inside an iterable dataset, or should I consider duplicating training examples as a pre-processing step?
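For illustration, the kind of weighted drawing I mean, in plain Python (the `sample_prob` column name and the toy rows are hypothetical):

```python
import random

# Toy rows standing in for the parquet contents; "sample_prob" is the
# hypothetical per-example sampling-probability column.
rows = [
    {"text": "a", "sample_prob": 0.1},
    {"text": "b", "sample_prob": 0.6},
    {"text": "c", "sample_prob": 0.3},
]

weights = [row["sample_prob"] for row in rows]

# Draw a training stream of 10 examples, repeats allowed, in proportion
# to the per-row probabilities.
random.seed(0)
stream = random.choices(rows, weights=weights, k=10)
```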
I’m not sure I understand this question. Could you clarify it a bit more?