Struggling to fine-tune flan-t5-xxl with DeepSpeed

Hi,

I am a researcher at the Children's Hospital of Philadelphia. We are currently working on automated processing of electronic health records, and I am attempting to fine-tune the flan-t5-xxl model for a translation task. However, I keep running into CUDA out-of-memory errors. I am using 4×A100 GPUs, each with 80 GB, all on the same node. I also requested 800 GB of CPU memory, which is spread across multiple nodes.

I have successfully trained a flan-t5-xl model on the same 4 GPUs. When I attempt to train the xxl model, with CPU offload and ZeRO Stage 3 enabled, the same code crashes right before training starts. Can anyone help? I'd highly appreciate it!
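
For reference, my rough back-of-envelope for the model states (assuming ~11B parameters and the usual ZeRO accounting of bf16 weights, bf16 gradients, and fp32 Adam states; activation memory comes on top of this):

```python
# Rough model-state arithmetic for flan-t5-xxl (~11B parameters, Adam optimizer).
# ZeRO accounting: 2 B (bf16 weights) + 2 B (bf16 grads) + 12 B (fp32 master
# weights + momentum + variance) = ~16 bytes per parameter.
params = 11e9
gib = 2**30
print(f"bf16 weights:     {2 * params / gib:5.0f} GiB")
print(f"bf16 gradients:   {2 * params / gib:5.0f} GiB")
print(f"fp32 Adam states: {12 * params / gib:5.0f} GiB")
print(f"total states:     {16 * params / gib:5.0f} GiB")  # what ZeRO-3 shards/offloads
```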

Here is my transformers env:

- transformers version: 4.27.2
- Platform: Linux-5.14.0-162.6.1.el9_1.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.2
- Huggingface_hub version: 0.13.4
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:

My config file is as follows:

```json
{
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

```
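
To check whether this offload setup can fit the model states at all, DeepSpeed ships a memory estimator for ZeRO-3; a minimal sketch (it instantiates the full model on CPU first, so it needs enough host RAM):

```python
from transformers import T5ForConditionalGeneration
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Prints estimated per-GPU and per-node CPU memory for each offload combination
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```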
Here is my Python code:

```python
import os

import evaluate
import numpy as np
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, T5ForConditionalGeneration,
                          T5Tokenizer)

ROOT_DIR = "."
train_tsvfile = os.path.join(ROOT_DIR,
    "Clinical-Genetics-NLP-Training-Data/Physical-Exam/Partitioned/Assembled-Span-Annotated-Training-Data-Train.tsv")
dev_tsvfile = os.path.join(ROOT_DIR,
    "Clinical-Genetics-NLP-Training-Data/Physical-Exam/Partitioned/Assembled-Span-Annotated-Training-Data-Dev.tsv")
model_name = "google/flan-t5-xxl"

train_inputs, train_targets, id2name = translater_train_data(ROOT_DIR, train_tsvfile, 3)
dev_inputs, dev_targets = translater_dev_data(dev_tsvfile, id2name)
mydataset = create_my_dataset(train_inputs, train_targets, dev_inputs, dev_targets)

prefix = "translate EHR to HPO: "
tokenizer = T5Tokenizer.from_pretrained(model_name)

def preprocess_function(examples):
    inputs = [prefix + example["EHR"] for example in examples["translation"]]
    targets = [example["HPO"] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=512, padding="max_length", truncation=True)
    labels = tokenizer(targets, max_length=512, padding="max_length", truncation=True)
    # Replace label pad tokens with -100 so they are ignored by the loss
    labels["input_ids"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label]
        for label in labels["input_ids"]
    ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
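
# A possible memory saving (an aside, not required for correctness): padding every
# example to max_length=512 makes each batch maximally wide. DataCollatorForSeq2Seq
# already pads dynamically per batch and fills label padding with -100 itself, so
# tokenizing with truncation only (no padding) would also make the manual -100
# replacement above unnecessary:
#
#   model_inputs = tokenizer(inputs, max_length=512, truncation=True)
#   labels = tokenizer(targets, max_length=512, truncation=True)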

tokenized_encodings = mydataset.map(preprocess_function, batched=True)

model = T5ForConditionalGeneration.from_pretrained(model_name)
# Pass the model object, not the name string, so the collator can build decoder inputs
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Restore pad tokens in the labels so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

training_args = Seq2SeqTrainingArguments(
    output_dir="my_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=2,
    predict_with_generate=True,
    bf16=True,  # match the bf16 setting in the DeepSpeed config
    deepspeed="ds_config.json",  # the DeepSpeed config above (filename assumed)
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_encodings["train"],
    eval_dataset=tokenized_encodings["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
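
# Launch note: with the HF Trainer + DeepSpeed integration, the script is started
# through the DeepSpeed launcher so that all 4 GPUs are used, e.g. (script
# filename assumed):
#   deepspeed --num_gpus=4 train.py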

```
The error message is a CUDA out-of-memory error.

Reply from @pogpog:

In your Python code:

```python
training_args = Seq2SeqTrainingArguments(
    ...
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
```

It may help to reduce these values, or to add the auto_find_batch_size parameter, which I believe works in addition to the others, reducing the batch size until a good fit is found.
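
A minimal sketch of that suggestion (only the relevant arguments shown; auto_find_batch_size needs the accelerate package installed):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="my_model",
    per_device_train_batch_size=4,  # starting point; halved on each OOM retry
    auto_find_batch_size=True,      # requires the accelerate package
)
```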

@pogpog Thank you very much for your reply. I have since solved this problem, but I really appreciate your help.

