Same seed across different gpus in multiple workers

princethewinner · March 8, 2024, 1:45pm

This is more of a discussion choice question because I have a workaround to suit my use case.

My understanding was if I am training my model on multiple GPUs (say n GPUs) in a DistributedDataParallel fashion, all GPUs will receive different batches, and the effective batch size would be n x batch_size_per_gpu. With this understanding, I created a simple script to see how I can control the randomness of the batches, as shown below. But to my surprise, different workers with the same id received the same seed and the same shard. This would mean, in effect, all the GPUs would be seeing the same data at the same time, thus this would not increase the batch size.

This behavior IMO defeats the purpose of doing parallel training because it does not increase the batch size. Please correct me if I there are any gaps in my understanding or if this is not how we are supposed to use DDP.

The print statement in the SeedTest.seedTest function prints the seed received by different workers across GPU (Refer to the output – I reordered it to highlight the point.).

Output:

rank=0 worker=0 seed=6909045637428952499 shards=[0]
rank=1 worker=0 seed=6909045637428952499 shards=[0]

rank=0 worker=1 seed=6909045637428952500 shards=[1]
rank=1 worker=1 seed=6909045637428952500 shards=[1]

rank=0 worker=2 seed=6909045637428952501 shards=[2]
rank=1 worker=2 seed=6909045637428952501 shards=[2]

rank=0 worker=3 seed=6909045637428952502 shards=[3]
rank=1 worker=3 seed=6909045637428952502 shards=[3]

Code

import typing as tp
import os
import torch
from datasets import IterableDataset  # huggingface - transformers datasets
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling, TrainingArguments, Trainer


class SeedTest:

    def __init__(self) -> None:
        pass

    def seedTest(self, shards: tp.List[tp.Any]) -> tp.Any:
        rank = int(os.environ.get("LOCAL_RANK", 0))
        worker_info: tp.Any = torch.utils.data.get_worker_info()

        print(f"rank={rank} worker={worker_info.id} seed={worker_info.seed}")

        yield {
            "seq": "ATGCATGC"
        }


if __name__ == "__main__":
    
    model_name ="InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"
    model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
    tokeniser = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    collator = DataCollatorForLanguageModeling(tokenizer=tokeniser, mlm=False)

    _d_shard = IterableDataset.from_generator(SeedTest().seedTest, gen_kwargs={"shards": list(range(4))})
    _d_shard = _d_shard.map(lambda x: tokeniser(x['seq']))
    _d_shard = _d_shard.select_columns(["input_ids", "attention_mask"])
    _d_shard = _d_shard.with_format("torch")

    trargs = TrainingArguments(output_dir="test", max_steps=1, dataloader_num_workers=4)
    trainer = Trainer(
        model,
        args=trargs,
        train_dataset=_d_shard,
        data_collator=collator
    )
    trainer.train()

Run command

torchrun --nproc_per_node 2 seed_test.py

P.S.: I was unsure if I should ask it in the Pytorch community or the Huggingface community. Please let me know if it is the correct place to ask the question.

Topic		Replies	Views
How does Accelerate ensure uniqueness of data samples across GPUs? 🤗Accelerate	2	387	June 21, 2023
Clarifying multi-GPU memory usage Beginners	1	1303	November 5, 2020
Distribute training 🤗Transformers	0	274	November 16, 2022
Skipped batches do not consider distributed training 🤗Transformers	0	504	June 19, 2023
Unique text generation per-GPU Beginners	0	324	June 9, 2022

Same seed across different gpus in multiple workers

Related Topics