Accelerate not performing distributed training

bennicholl · October 5, 2023, 2:33pm

I’m following this tutorial
https://huggingface.co/docs/transformers/accelerate
to run distributed training on various G5 SageMaker instances. I’ve refactored my code the same way the tutorial shows. No matter how big the G5 instance is, or how many instances I use for the SageMaker training job, it always fails when the maximum number of tokens is greater than 18. Below is an example code snippet:

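# Imports assumed for this snippet (AdamW taken from torch.optim here)
import torch
from accelerate import Accelerator
from datasets import Dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup

accelerator = Accelerator()
# Named nli_model to match how the model is referenced in the rest of the snippet
nli_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)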
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)#, torch_dtype="auto")
tokenizer.pad_token = tokenizer.eos_token
optimizer = AdamW(nli_model.parameters(),
                          lr = learning_rate, # previous 8e-6
                          eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                        )
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = 1)

toks = tokenizer(sentences, questions, padding='longest')
    
ds = Dataset.from_dict({
    "x": torch.tensor(toks['input_ids']),
    "mask": torch.tensor(toks['attention_mask']),
    "labels": torch.tensor([0 if i == 'no' else 1 for i in yes_or_no]),
}).with_format("torch")

dataloader = DataLoader(ds, batch_size=batch_count)

nli_model, optimizer, dataloader, scheduler = accelerator.prepare(nli_model, optimizer, dataloader, scheduler)
nli_model.train()
for batch in dataloader:
    x_batch = batch["x"]#.to(device)
    mask_batch = batch["mask"]#.to(device)
    labels_batch = batch["labels"]#.to(device)
    loss = nli_model(x_batch, attention_mask=mask_batch, labels = labels_batch)[0]
    accelerator.backward(loss)
    #loss.backward()
    #print(nli_model.model.device)
    #print(loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

Is there a reason multiple GPUs are never being used in these SageMaker training jobs?

muellerzr · October 5, 2023, 2:54pm

How are you calling your script? Just running python myscript.py will not work.
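
One way to check from inside the script whether it was started by a distributed launcher is to look at the environment variables that launchers such as torchrun export for each worker (WORLD_SIZE, RANK, LOCAL_RANK). This is only a minimal sketch under that assumption; the helper name describe_launch is hypothetical and is not part of Accelerate:

```python
import os


def describe_launch(env=None):
    """Summarize distributed-launch info from environment variables.

    Hypothetical helper, not part of Accelerate. Assumes the launcher
    exports WORLD_SIZE, RANK and LOCAL_RANK, as torchrun does.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        # More than one process means a launcher spawned several workers
        "distributed": world_size > 1,
    }


if __name__ == "__main__":
    print(describe_launch())
```

When the script is started with plain python, these variables are normally unset, so the helper reports a world size of 1 and only one process (and one GPU) takes part in training.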

bennicholl · October 5, 2023, 3:08pm

Here is how I’m starting the code:

pt_estimator = PyTorch(
    entry_point="ph_1_5_with_accelerator.py",
    source_dir='source_dir_phi_1_5',
    role=get_execution_role(),
    framework_version="1.10.2",
    py_version="py38",
    instance_count=1,
    instance_type="ml.g5.16xlarge",
    distribution={
        "pytorchddp": {
            "enabled": True   # I've also hashtagged distribution out
        }
    }
)

pt_estimator.fit()
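
For reference, the number of processes a data-parallel job can use is the instance count times the GPUs per instance. The sketch below shows that arithmetic. The GPU counts in the table are an assumption based on AWS's published G5 specifications and should be checked against the current instance documentation; the names G5_GPUS and expected_world_size are hypothetical:

```python
# GPUs per instance for the G5 family (assumed from AWS's published specs;
# verify against the current documentation before relying on it).
G5_GPUS = {
    "ml.g5.xlarge": 1,
    "ml.g5.2xlarge": 1,
    "ml.g5.4xlarge": 1,
    "ml.g5.8xlarge": 1,
    "ml.g5.16xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.24xlarge": 4,
    "ml.g5.48xlarge": 8,
}


def expected_world_size(instance_type, instance_count):
    """Total number of GPU processes a data-parallel job could use."""
    return G5_GPUS[instance_type] * instance_count


if __name__ == "__main__":
    # Configuration from the estimator above
    print(expected_world_size("ml.g5.16xlarge", 1))
```

Under these assumptions, the configuration above (one ml.g5.16xlarge) has a single GPU available, so a multi-GPU instance type or a higher instance_count would be needed before more than one GPU can be used.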
