I have started to set up my research project based on RoBERTa and your run_mlm.py example with the Trainer. For that purpose I worked only on a subset of my dataset, which I load in memory, and benchmarked parallel-processing speed. I am satisfied with the results and will move on to the next steps.
For context, I launch my scripts as follows:
OMP_NUM_THREADS=12 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --standalone --nnodes=1 --nproc_per_node=8 run_mlm.py --dataloader_num_workers 64 --sharded_ddp zero_dp_2 …
I want to work with streaming datasets, and I wonder about their limitations and whether I should instead default to loading everything in memory. Here are my questions; thanks in advance for your advice.
_ the map function of iterable datasets doesn't seem to accept the num_proc argument. I wonder whether this will create a bottleneck in my code, or whether dataloader_num_workers will allow the iterable dataset to operate with fast multi-processing?
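For reference, my understanding (a conceptual sketch in plain Python, not the actual datasets internals) is that map() on a streaming dataset is applied lazily, one example at a time, in whichever process consumes the stream; that is why there is no num_proc argument, and why DataLoader workers each run the transform on their own slice of the stream:

```python
# Conceptual sketch: an iterable dataset applies map() lazily, so the
# transform runs at iteration time in the consuming process rather than
# as an up-front multi-process pass.

def stream_examples():
    # stand-in for a streaming parquet source
    for i in range(5):
        yield {"text": f"example {i}"}

def tokenize(example):
    # stand-in for a real tokenizer call
    example["tokens"] = example["text"].split()
    return example

# "map" over a stream == a generator that transforms on the fly
mapped = (tokenize(ex) for ex in stream_examples())

first = next(mapped)
print(first["tokens"])  # the transform only ran for this one example
```

If that mental model is right, each of the 64 dataloader workers would execute the map function independently, so the per-example cost of the transform is what determines whether it becomes a bottleneck.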
_ when working in run_mlm.py with the Trainer and an iterable dataset, what changes are needed for parallel processing, please?
I read the Process page but I am not sure whether it applies to iterable datasets.
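My main worry with the 8-process launch above is duplication: if every DDP rank iterates the same stream, each example is seen 8 times per epoch. A minimal sketch of the sharding I assume is needed (I believe datasets.distributed.split_dataset_by_node does this for real IterableDatasets, but this mimics the idea in plain Python):

```python
# Conceptual sketch: shard one stream across DDP ranks so that each
# example is consumed by exactly one process.

def shard_for_rank(stream, rank, world_size):
    # keep every world_size-th example, offset by this process's rank
    for i, example in enumerate(stream):
        if i % world_size == rank:
            yield example

examples = range(16)
rank0 = list(shard_for_rank(iter(examples), rank=0, world_size=8))
print(rank0)  # rank 0 sees examples 0 and 8
```

Is this sharding something I must wire up myself, or does the Trainer handle it when it detects an iterable dataset?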
_ my datasets are stored as .parquet files containing input sequences as well as labels/meta-data. One column I would like to add is a sampling probability, in order to over-sample certain training examples.
Is there any way to do this inside an iterable dataset, or should I duplicate training examples as a pre-processing step?
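To make the question concrete, here is the kind of behaviour I am after, as a hedged plain-Python sketch (the "weight" column name is my own; I don't know whether datasets offers this natively for streams). Each example carries a weight >= 1 and is emitted floor(weight) times, plus once more with probability equal to the fractional part:

```python
import random

# Conceptual sketch: over-sample a stream on the fly using a
# per-example "weight" column, instead of duplicating rows on disk.

def oversample(stream, rng):
    for example in stream:
        w = example["weight"]
        # integer part = guaranteed repeats; fractional part = chance
        # of one extra emission
        repeats = int(w) + (1 if rng.random() < w - int(w) else 0)
        for _ in range(repeats):
            yield example

rng = random.Random(0)
data = [{"id": 0, "weight": 1.0}, {"id": 1, "weight": 2.0}]
out = [ex["id"] for ex in oversample(iter(data), rng)]
print(out)  # example 1 is emitted twice
```

If something like this is viable, I assume a shuffle buffer downstream would be needed so the repeated copies don't arrive back-to-back; otherwise, duplicating examples during pre-processing may be simpler.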