After reading the documentation about the Trainer https://huggingface.co/docs/transformers/main_classes/trainer#pytorch-fully-sharded-data-parallel
and the documentation about DeepSpeed https://huggingface.co/docs/transformers/main/main_classes/deepspeed
I wonder what the actual Python implementation looks like. Both documents go into detail on how to set up the SLURM batch script and how to launch training with torch.distributed.run, but I couldn't find any documentation on what my actual train.py
should look like.
Given this example script, what do I need to modify to actually use it for ZeRO multi-GPU (and multi-node) training? (Using the DeepSpeed integration with the Trainer class, and ZeRO Stage 1.)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
train_data = TextDataset(tokenizer=tokenizer, file_path="data.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=1000,
    save_total_limit=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,  # the original snippet never passed the dataset to the Trainer
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("models/trained_model")  # save_model expects an output directory, not a .pt file
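From what I understand from the DeepSpeed integration page, the only change needed inside train.py itself might be passing a ZeRO config to TrainingArguments via its deepspeed argument, with everything else handled by the launcher, but I'm not sure. Would a sketch like the following be roughly correct? (The ds_config dict here is just my guess at a minimal ZeRO Stage 1 config, not something I copied from the docs.)

# Minimal sketch, assuming the Trainer's DeepSpeed integration accepts a config
# dict via TrainingArguments(deepspeed=...); the ds_config contents are my own guess.
ds_config = {
    "zero_optimization": {"stage": 1},          # ZeRO Stage 1: shard optimizer states across GPUs
    "train_micro_batch_size_per_gpu": "auto",   # let the Trainer fill these in from its own args
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=1000,
    save_total_limit=2,
    deepspeed=ds_config,  # alternatively the path to a ds_config.json file
)

And then, if I understand it right, the rest of the script stays exactly as above, and the multi-GPU / multi-node part is purely a matter of launching train.py with the deepspeed launcher (or torch.distributed.run) from the SLURM batch script?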