Torchrun, Trainer, dataset setup

Hi,

I have a piece of Python code that loads a dataset from a file, splits it into train and test sets, and then sets up a Trainer with the two splits as the train and validation sets. I am training a DistilBERT model (I don't think that detail matters).
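
For concreteness, the shape of the script is roughly this (a minimal sketch only; the file name, column names, and hyperparameters below are placeholders, not my actual code):

```python
# Minimal sketch of the setup described above; file name, column names and
# hyperparameters are placeholders, not the real script.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

raw = load_dataset("csv", data_files="data.csv")["train"]
splits = raw.train_test_split(test_size=0.2)  # the split happens here, inside each process

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

splits = splits.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    tokenizer=tokenizer,
)
trainer.train()
```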

It works great. Even better, I can run torchrun --nproc_per_node=4, and that seems to run fine too: it spawns 4 copies of the script and trains on the 4 GPUs.
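
For what it's worth, torchrun sets rank environment variables for each process it launches, so a quick debug print makes the "4 copies" behaviour visible:

```python
# torchrun --nproc_per_node=4 train.py launches 4 processes; each one gets
# its own LOCAL_RANK / RANK / WORLD_SIZE environment variables.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"Hello from local rank {local_rank} of {world_size}")
```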

However, I just got suspicious: given that I'm doing a train/test split every time the Python code spins up (once per GPU - I can see it by printing out some debug), is the Trainer basically training on just the first 25% of each process's training set (and ditto for validation)?

I'd assumed initially that it was smarter than this and somehow only used one of the train/test instances (presumably the first), but now I realize I cannot be certain of that. So, can anyone tell me whether I need to "pre-partition" the dataset into 4 shards and then load them keyed by LOCAL_RANK?
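
By "pre-partition" I mean something roughly like the following (an untested sketch using the shard method from the datasets library; I don't know whether it's actually necessary, which is exactly my question):

```python
# Untested sketch of "pre-partitioning keyed by LOCAL_RANK";
# the file name and split ratio are placeholders.
import os
from datasets import load_dataset

rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

raw = load_dataset("csv", data_files="data.csv")["train"]
splits = raw.train_test_split(test_size=0.2, seed=42)  # same seed on every rank

# Each process keeps only its own 1/world_size slice of the training split.
train_shard = splits["train"].shard(num_shards=world_size, index=rank)
```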

Thanks in advance,
W


Also, less importantly, the train_test_split may be non-deterministic, so even if there were some kind of hash to tell that each worker had the same train and validation inputs and to just shard them, the fact that I've potentially got a different train/test split for each invocation of the Python code MIGHT confuse things.
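
The non-determinism part, at least, looks easy to pin down by passing a fixed seed to the split (a sketch; it doesn't answer the sharding question, it just makes every process produce the same split):

```python
from datasets import load_dataset

raw = load_dataset("csv", data_files="data.csv")["train"]  # placeholder file name

# With a fixed seed, every spawned process gets the same train/test partition.
splits = raw.train_test_split(test_size=0.2, seed=42)
```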


So it looks like no one knows. I wish there were a way to get an actual Hugging Face employee to answer this question. Unfortunately, I don't think the PyTorch community will know.

https://pytorch.org/docs/stable/elastic/run.html


I think Trainer is part of the Transformers library, so you could check by opening an issue on the Transformers GitHub.


Here’s the open issue:
