I have a piece of Python code that loads a dataset from a file, splits it into train and test sets, and then sets up a Trainer with the two splits as the training and validation sets. I am training a DistilBERT model (I don't think that detail matters).
It works great. Better still, I can launch it with torchrun --nproc_per_node=4 and that seems to work too: it appears to spawn 4 processes and train on the 4 GPUs.
However, I just got suspicious. The train/test split runs every time the Python script starts up, i.e. once per GPU process (I can see this from some debug prints). Is the Trainer therefore training on just the first 25% of each process's training set (and likewise for validation)?
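To make the worry concrete, here is a minimal pure-Python sketch of the situation as I understand it. It is not the Trainer's actual code: split_dataset and shard_for_rank are hypothetical helpers, and the rank-strided indexing is an assumption modelled on how torch.utils.data.DistributedSampler hands out indices. It simulates 4 processes that each repeat the same seeded split and then each take their own slice of the training set.

```python
import random

WORLD_SIZE = 4  # assumption: one process per GPU, as with --nproc_per_node=4


def split_dataset(examples, test_fraction=0.2, seed=42):
    """Hypothetical stand-in for the train/test split each process performs."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same order in every process
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def shard_for_rank(dataset, rank, world_size=WORLD_SIZE):
    """Rank-strided slice: rank r gets items r, r + world_size, r + 2 * world_size, ..."""
    return dataset[rank::world_size]


examples = list(range(100))  # toy dataset of 100 example ids

# Every simulated process repeats the split independently.
per_rank_train = [split_dataset(examples)[0] for _ in range(WORLD_SIZE)]
shards = [shard_for_rank(per_rank_train[r], r) for r in range(WORLD_SIZE)]

# With a fixed seed, all processes hold the same training set ...
same_split_everywhere = all(t == per_rank_train[0] for t in per_rank_train)
# ... so the four shards are disjoint and together cover it exactly once.
covered = sorted(x for shard in shards for x in shard)
full_coverage = covered == sorted(per_rank_train[0])
```

Under these assumptions each process would see a different quarter of the same training set. If the split were not seeded identically in every process, the processes would disagree about which examples belong to the training set, and the shards would no longer line up this neatly.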
I’d initially assumed it was smarter than this and somehow used only one of the train/test instances (presumably the first), but now I realize I cannot be certain of this. So, can anyone tell me whether I need to “pre-partition” the datasets into 4 shards and then load them keyed by LOCAL_RANK?
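For clarity, this is roughly what I mean by pre-partitioning. It is only a sketch: load_shard is a hypothetical helper, the toy list stands in for my real training set, and the code relies on the LOCAL_RANK and WORLD_SIZE environment variables that torchrun sets for each process (on a single node the local rank is also the global rank). The defaults let it run without torchrun.

```python
import os


def load_shard(examples, rank, world_size):
    """Hypothetical helper: keep a contiguous 1/world_size slice for this rank."""
    shard_size = len(examples) // world_size
    start = rank * shard_size
    # The last rank also takes any remainder so no example is dropped.
    end = len(examples) if rank == world_size - 1 else start + shard_size
    return examples[start:end]


# torchrun exports these per process; the defaults allow a plain `python` run.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

train_examples = list(range(10))  # toy stand-in for the real training set
my_shard = load_shard(train_examples, local_rank, world_size)
```

Whether this is needed at all, or whether the Trainer already does the equivalent internally, is exactly what I am unsure about.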
Thanks in advance,