Finetune LLM with DeepSpeed

I want to fine-tune large LMs such as opt-13b and opt-30b using the Hugging Face Trainer and its DeepSpeed integration.

I always run out of memory as soon as training starts.

I have 2 Tesla V100-SXM2-32GB GPUs.
I ran the DeepSpeed memory estimator:

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 13000M total params.
  per CPU  |  per GPU |   Options
  290.57GB |  24.21GB | offload_optimizer=cpu 
   72.64GB | 242.14GB | offload_optimizer=none
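
For reference, this output came from the ZeRO-2 estimator that ships with DeepSpeed, roughly like the sketch below (the exact import path may differ between DeepSpeed versions, and facebook/opt-13b stands in for whichever checkpoint I actually load):

from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live

# Load the model (on CPU) just so the estimator can count its parameters,
# then print the ZeRO-2 memory estimates for this hardware (1 node, 1 GPU here).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-13b")
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)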

My question is:

  • Consider the table above: the per-GPU number is below 32 GB, so that part looks fine, but the per-CPU number is far more RAM than I have. Is it even possible in that case? Is it even possible to fine-tune a 13B model with my resources? Is there a solution when CPU RAM is also limited? (See my rough breakdown of the numbers below.)
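
If I'm reading the estimator right, the numbers roughly break down as follows (my own back-of-the-envelope check, not something taken from the DeepSpeed docs):

params = 13e9  # 13B parameters

# per GPU with offload_optimizer=cpu: only the fp16 weights stay on the GPU
print(params * 2 / 2**30)         # ~24.2 GiB -> the 24.21GB column

# per CPU with offload_optimizer=cpu: fp32 master weights + Adam momentum +
# Adam variance + fp32 gradients (~16 bytes/param), times what looks like a
# ~1.5x safety buffer in the estimator
print(params * 16 * 1.5 / 2**30)  # ~290.6 GiB -> the 290.57GB column

So the per-GPU figure only covers the raw fp16 weights; activations and temporary buffers come on top of that, which may be part of why I still hit OOM even though 24 GB < 32 GB.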

I saw this post in the forum, but the user there has 8 GPUs with 80 GB each:

This is my training script:

from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# tokenizer, opt_size, cache_dir, few_shot_dataset, dev_dataset and eval_rte
# are defined earlier in the script
training_args = TrainingArguments(
    output_dir=f'ft_rte_{opt_size}',
    logging_dir=f'ft_rte_{opt_size}',
    overwrite_output_dir=True,
    max_steps=len(few_shot_dataset),
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    evaluation_strategy='steps',
    logging_strategy='steps',
    eval_steps=1,
    logging_steps=1,
    logging_first_step=True,
    save_strategy='no',
    remove_unused_columns=True,
    seed=0,
    fp16=True,
    deepspeed='ds_config.json'
)

model = AutoModelForCausalLM.from_pretrained(opt_size, cache_dir=cache_dir)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=few_shot_dataset,
    eval_dataset=dev_dataset,
    data_collator=data_collator,
    compute_metrics=eval_rte
)

trainer.train()
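
(For context: to actually use both GPUs with the HF DeepSpeed integration, the script is started through a distributed launcher, e.g. deepspeed --num_gpus=2 train_rte.py, where train_rte.py is just a placeholder name for my script file.)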

And this is my DeepSpeed config (ds_config.json):

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
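
(The "auto" values above are filled in at runtime by the Hugging Face integration from the matching TrainingArguments, e.g. fp16, learning rate and per-device batch size, so the config stays consistent with the training script. With offload_optimizer set to cpu, this setup corresponds to the first row of the estimator table above.)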

Thanks in advance,
Shon

Hi @shon711. Did you solve this? I'm hitting the same issue. Could you share your ds_config and how you fixed it?

Hello! I don't have a solution for you, but I'm running the exact same setup since I don't have access to A100s yet. Could you please read this and see if I am on the right track? If I run into the error you are having, I can help you figure it out.

My post: DeepSpeed integration for HuggingFace Seq2SeqTrainingArguments