Issues with using DeepSpeed on multiple GPUs

Hello

I would like to fine-tune the EleutherAI/gpt-j-6B model on an HPC cluster. The cluster only has GPUs with at most 24 GB of memory, so I'm running out of memory when fine-tuning on them. That’s why I would like to use DeepSpeed and fine-tune on 2 or 4 GPUs at the same time. I have the following TrainingArguments:

training_args = TrainingArguments(
        disable_tqdm=True,
        output_dir='/cluster/scratch/myUser/cache/models/checkpoints',
        save_total_limit=10,
        logging_dir='/cluster/scratch/myUser/cache/models/logs',
        num_train_epochs=4,
        evaluation_strategy='epoch',
        save_strategy='steps',
        save_steps=30,
        logging_steps=10,
        overwrite_output_dir=True,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=4,
        eval_accumulation_steps=4,
        gradient_checkpointing=True,
        max_grad_norm=0.5,
        lr_scheduler_type="cosine",
        learning_rate=1e-4,
        warmup_ratio=0.05,
        weight_decay=0.1,
        fp16_full_eval=True,
        fp16=True,
        fp16_opt_level='O1',
        deepspeed="configs/ds_config.json",
        report_to=['tensorboard']
    )

In addition, I have the following DeepSpeed configuration file, ds_config.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "opt_level": "O3"
    },

    "zero_optimization": {
        "stage": 2,

        "offload_param": {
            "device": "none",
            "buffer_count": 4,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9,
            "pin_memory": false
        },

        "offload_optimizer": {
            "device": "none",
            "buffer_count": 4,
            "pin_memory": false,
            "pipeline_read": false,
            "pipeline_write": false,
            "fast_init": false
        },

        "allgather_partitions": true,
        "allgather_bucket_size":  2e8 ,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,

        "contiguous_gradients": true,
        "cpu_offload": false,
        "cpu_offload_params" : false,


        "sub_group_size": 1e7,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e7,
        "stage3_max_reuse_distance": 1e7,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "aio": {
        "block_size": 1048576,
        "queue_depth": 16,
        "single_submit": false,
        "overlap_events": true,
        "thread_count": 1
    },

  "activation_checkpointing": {
      "partitioned_activations":true,
      "number_checkpoints": 100,
      "contiguous_memory_optimization": true,
      "cpu_checkpointing": true,
      "profile": true,
      "synchronize_checkpoint_boundary": true
    },

    "flops_profiler": {
        "enabled": true,
        "profile_step": 1,
        "module_depth": -1,
        "top_modules": 3,
        "detailed": true
    },

    "tensorboard": {
      "enabled": true,
      "output_path": "./logs",
      "job_name": "finetune_gpt_j_6b"
    },

    "steps_per_print": 100,
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "memory_breakdown": false
}

I also have the following lines of code:

    import torch.distributed as dist
    from torch.optim import AdamW, lr_scheduler

    model.resize_token_embeddings(len(model.tokenizer))
    model.gradient_checkpointing_enable()
    model.to(device=dist.get_rank())
    optimizer = AdamW(model.parameters(), lr=1e-4)
    scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.66)
    

How can I make the DeepSpeed config file reflect the values of my TrainingArguments? In particular, how can I use a scheduler with an exponential learning-rate decay?
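For reference, without DeepSpeed I would wire the optimizer and scheduler into the Trainer roughly like this (just a sketch; train_dataset, eval_dataset and data_collator are placeholders for my actual objects, and I don’t know whether this combination is still valid once the DeepSpeed config is set):

from transformers import Trainer

# Sketch: hand the custom optimizer/scheduler pair to the Trainer directly.
# train_dataset, eval_dataset and data_collator are placeholders.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    optimizers=(optimizer, scheduler),
)
trainer.train()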

Second, I’m starting the script with the command python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py. In train.py I’m calling the train() method of the Trainer. I’m not sure whether DeepSpeed is actually running, because there is no DeepSpeed-related output on the command line. Should I start it in a different way? I’ve read that it can be started with the deepspeed command on the command line, but somehow this does not work (command not found). I have installed DeepSpeed via pip.

Does somebody have any hints? I’m a bit puzzled.

Yes, I have only been able to get DeepSpeed to work for this purpose by using the “deepspeed” executable script that comes with the package.

If you are using a virtual environment and DeepSpeed is properly installed, then the “deepspeed” executable will be in the virtual environment’s bin directory.
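For example, with the environment activated you can check that the launcher is on your PATH (assuming the virtual environment lives in ./venv):

source venv/bin/activate
which deepspeed   # should print something like /path/to/venv/bin/deepspeed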

When you are using DeepSpeed, you do not want to launch with torch.distributed.launch at all; the deepspeed launcher takes care of spawning the processes itself.

If your training script is called train.py, you would run

deepspeed train.py

And then pass all of your training arguments (starting with --deepspeed configs/ds_config.json) after the name of your script.
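For example (a sketch: --num_gpus=4 matches the four GPUs from your torch.distributed.launch command, and passing --deepspeed on the command line assumes train.py parses its arguments with something like HfArgumentParser):

deepspeed --num_gpus=4 train.py --deepspeed configs/ds_config.json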

I’m sure it’s possible to invoke DeepSpeed from inside a Python script, rather than the other way around, but I have never gotten it to work and ultimately decided that continued effort in that direction didn’t serve any useful purpose anyway.