I’m following this tutorial
https://huggingface.co/docs/transformers/accelerate
to run distributed training on various G5 SageMaker instances, and I’ve refactored my code in the same way the tutorial shows. No matter how large the G5 instance is, or how many instances I use for the SageMaker training job, it always fails when the maximum number of tokens is greater than 18. Below is an example code snippet:
import torch
from accelerate import Accelerator
from datasets import Dataset
from torch.optim import AdamW  # assumed source of AdamW; the original import is not shown
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup

# sentences, questions, yes_or_no, learning_rate and batch_count are defined elsewhere
accelerator = Accelerator()
nli_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)#, torch_dtype="auto")
tokenizer.pad_token = tokenizer.eos_token
optimizer = AdamW(nli_model.parameters(),
                  lr = learning_rate, # previous 8e-6
                  eps = 1e-8 # args.adam_epsilon - default is 1e-8.
                  )
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = 1)
toks = tokenizer(sentences, questions, padding='longest')
ds = Dataset.from_dict({
    "x": torch.tensor(toks['input_ids']),
    "mask": torch.tensor(toks['attention_mask']),
    "labels": torch.tensor([0 if i == 'no' else 1 for i in yes_or_no]),
}).with_format("torch")
dataloader = DataLoader(ds, batch_size=batch_count)
nli_model, optimizer, dataloader, scheduler = accelerator.prepare(nli_model, optimizer, dataloader, scheduler)
nli_model.train()
for batch in dataloader:
    x_batch = batch["x"]#.to(device)
    mask_batch = batch["mask"]#.to(device)
    labels_batch = batch["labels"]#.to(device)
    loss = nli_model(x_batch, attention_mask=mask_batch, labels = labels_batch)[0]
    accelerator.backward(loss)
    #loss.backward()
    #print(nli_model.model.device)
    #print(loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
Is there a reason multiple GPUs are never being used on these SageMaker training jobs?
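For anyone trying to reproduce this, here is a minimal sketch for checking whether the training script was actually started as several processes. It assumes the job is launched through something like torchrun or accelerate launch, which set the WORLD_SIZE, RANK and LOCAL_RANK environment variables. The helper name describe_launch is made up for illustration and is not part of any library.

```python
import os


def describe_launch(env=None):
    """Summarize the distributed-launch variables visible to this process.

    Hypothetical helper: it only reads environment variables and does not
    touch torch or accelerate.
    """
    env = os.environ if env is None else env
    # A plain `python script.py` run usually leaves WORLD_SIZE unset,
    # which is treated here as a single process.
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "visible_gpus": env.get("CUDA_VISIBLE_DEVICES", "<unset>"),
        "is_distributed": world_size > 1,
    }


if __name__ == "__main__":
    # Print once per process; a multi-GPU launch should show several ranks.
    print(describe_launch())
```

If this prints a single line with world_size equal to 1, the script is probably running as one process, and Accelerate would then only drive one GPU regardless of instance size.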