Torchrun, Trainer, dataset setup

Hi,

I have a piece of Python code that loads a dataset from a file, splits it into train and test sets, and then sets up a Trainer with the two splits as the train and validation sets. I am training a DistilBERT model (I don't think that detail matters).
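
For concreteness, the shape of the script is roughly this (a minimal sketch only; the file name, column names, and hyperparameters below are placeholders, not my actual code):

```python
# Minimal sketch of the setup described above; file name, column names and
# hyperparameters are placeholders, not the real script.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

raw = load_dataset("csv", data_files="data.csv")["train"]
splits = raw.train_test_split(test_size=0.2)  # the split happens here, inside each process

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

splits = splits.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    tokenizer=tokenizer,
)
trainer.train()
```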

It works great. Even better, I can run torchrun --nproc_per_node=4, and that seems to run fine too: it spawns 4 copies of the script and trains on the 4 GPUs.
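
For what it's worth, torchrun sets rank environment variables for each process it launches, so a quick debug print makes the "4 copies" behaviour visible:

```python
# torchrun --nproc_per_node=4 train.py launches 4 processes; each one gets
# its own LOCAL_RANK / RANK / WORLD_SIZE environment variables.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"Hello from local rank {local_rank} of {world_size}")
```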

However, I just got suspicious: given that I'm doing a train/test split every time the Python code spins up (once per GPU - I can see it by printing out some debug), is the Trainer basically training on just the first 25% of each process's training set (and ditto for validation)?

I'd assumed initially that it was smarter than this and somehow only used one of the train/test instances (presumably the first), but now I realize I cannot be certain of that. So, can anyone tell me whether I need to "pre-partition" the dataset into 4 shards and then load them keyed by LOCAL_RANK?
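
By "pre-partition" I mean something roughly like the following (an untested sketch using the shard method from the datasets library; I don't know whether it's actually necessary, which is exactly my question):

```python
# Untested sketch of "pre-partitioning keyed by LOCAL_RANK";
# the file name and split ratio are placeholders.
import os
from datasets import load_dataset

rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

raw = load_dataset("csv", data_files="data.csv")["train"]
splits = raw.train_test_split(test_size=0.2, seed=42)  # same seed on every rank

# Each process keeps only its own 1/world_size slice of the training split.
train_shard = splits["train"].shard(num_shards=world_size, index=rank)
```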

Thanks in advance,
W


Also, less importantly, the train_test_split may be non-deterministic, so even if there were some kind of hash to tell that each worker had the same train and validation inputs and to just shard them, the fact that I've potentially got a different train/test split for each invocation of the Python code MIGHT confuse things.
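
The non-determinism part, at least, looks easy to pin down by passing a fixed seed to the split (a sketch; it doesn't answer the sharding question, it just makes every process produce the same split):

```python
from datasets import load_dataset

raw = load_dataset("csv", data_files="data.csv")["train"]  # placeholder file name

# With a fixed seed, every spawned process gets the same train/test partition.
splits = raw.train_test_split(test_size=0.2, seed=42)
```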


So it looks like no one knows. I wish there were a way to get an actual Hugging Face employee to answer this question. Unfortunately, I don't think the PyTorch community will know.

https://pytorch.org/docs/stable/elastic/run.html


I think Trainer is part of the Transformers library, so you could check by opening an issue on the Transformers GitHub.


Here’s the open issue:
