Basics for Multi-GPU Training with Hugging Face Trainer

After reading the documentation about the trainer https://huggingface.co/docs/transformers/main_classes/trainer#pytorch-fully-sharded-data-parallel

and further on the documentation about DeepSpeed https://huggingface.co/docs/transformers/main/main_classes/deepspeed

I wonder what the actual Python implementation looks like. Both docs go into detail about how to set up the SLURM batch job and run the torch.distributed.run batch script, but I couldn't find any documentation on what my actual train.py should look like.

Given this example script, what do I need to modify to actually use it for ZeRO multi-GPU (and multi-node) training? (Using the DeepSpeed integration with the Trainer class, and ZeRO Stage 1.)

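from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")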
tokenizer = AutoTokenizer.from_pretrained("gpt2")
train_data = TextDataset(tokenizer=tokenizer, file_path="data.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=1000,
    save_total_limit=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,  # without this, trainer.train() has nothing to train on
    data_collator=data_collator
)

trainer.train()

# save_model() writes a directory (weights + config), not a single .pt file
trainer.save_model("models/trained_model")
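
From what I can piece together from the DeepSpeed integration docs, the change to train.py itself might be very small: a DeepSpeed config with ZeRO stage 1, handed to TrainingArguments through its deepspeed argument. Below is a minimal sketch of that idea, not a confirmed answer. The helper name build_zero1_config and the file name ds_config_zero1.json are hypothetical, and the "auto" values assume the Trainer fills them in from TrainingArguments, as the integration docs describe.

```python
import json


def build_zero1_config():
    """Return a minimal DeepSpeed config dict for ZeRO stage 1.

    The "auto" entries are placeholders that the Hugging Face Trainer
    integration is documented to resolve from TrainingArguments.
    """
    return {
        "zero_optimization": {"stage": 1},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
        "gradient_clipping": "auto",
        "fp16": {"enabled": "auto"},
    }


# Write the config to disk so the launcher and train.py can share it.
# The file name is an assumption made for this sketch.
ds_config = build_zero1_config()
with open("ds_config_zero1.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Assumed change in train.py (requires deepspeed to be installed):
#
#   training_args = TrainingArguments(
#       output_dir="output",
#       per_device_train_batch_size=16,
#       deepspeed="ds_config_zero1.json",  # a dict should work here as well
#   )
#
# Assumed single-node launch with 2 GPUs:
#
#   deepspeed --num_gpus=2 train.py
```

If this is right, the rest of the script (model, dataset, Trainer, trainer.train()) would stay unchanged, and the multi-GPU / multi-node part would be handled entirely by the launcher. I'd appreciate confirmation on whether that is really all there is to it.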