Issues with using DeepSpeed on multiple GPUs

Hello

I would like to fine-tune the EleutherAI/gpt-j-6B model on an HPC cluster. The cluster only has GPUs with at most 24 GB of memory, so I'm running out of memory when fine-tuning on them. That’s why I would like to use DeepSpeed and fine-tune on 2 or 4 GPUs at the same time. I have the following TrainingArguments:

training_args = TrainingArguments(
        disable_tqdm=True,
        output_dir='/cluster/scratch/myUser/cache/models/checkpoints',
        save_total_limit=10,
        logging_dir='/cluster/scratch/myUser/cache/models/logs',
        num_train_epochs=4,
        evaluation_strategy='epoch',
        save_strategy='steps',
        save_steps=30,
        logging_steps=10,
        overwrite_output_dir=True,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=4,
        eval_accumulation_steps=4,
        gradient_checkpointing=True,
        max_grad_norm=0.5,
        lr_scheduler_type="cosine",
        learning_rate=1e-4,
        warmup_ratio=0.05,
        weight_decay=0.1,
        fp16_full_eval=True,
        fp16=True,
        fp16_opt_level='O1',
        deepspeed="configs/ds_config.json",
        report_to=['tensorboard']
    )

In addition, I have the following DeepSpeed configuration file, ds_config.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "opt_level": "O3"
    },

    "zero_optimization": {
        "stage": 2,

        "offload_param": {
            "device": "none",
            "buffer_count": 4,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9,
            "pin_memory": false
        },

        "offload_optimizer": {
            "device": "none",
            "buffer_count": 4,
            "pin_memory": false,
            "pipeline_read": false,
            "pipeline_write": false,
            "fast_init": false
        },

        "allgather_partitions": true,
        "allgather_bucket_size":  2e8 ,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,

        "contiguous_gradients": true,
        "cpu_offload": false,
        "cpu_offload_params" : false,


        "sub_group_size": 1e7,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e7,
        "stage3_max_reuse_distance": 1e7,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "aio": {
        "block_size": 1048576,
        "queue_depth": 16,
        "single_submit": false,
        "overlap_events": true,
        "thread_count": 1
    },

  "activation_checkpointing": {
      "partitioned_activations":true,
      "number_checkpoints": 100,
      "contiguous_memory_optimization": true,
      "cpu_checkpointing": true,
      "profile": true,
      "synchronize_checkpoint_boundary": true
    },

    "flops_profiler": {
        "enabled": true,
        "profile_step": 1,
        "module_depth": -1,
        "top_modules": 3,
        "detailed": true
    },

    "tensorboard": {
      "enabled": true,
      "output_path": "./logs",
      "job_name": "finetune_gpt_j_6b"
    },

    "steps_per_print": 100,
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "memory_breakdown": false
}

I also have the following lines of code:

    import torch.distributed as dist
    from torch.optim import AdamW, lr_scheduler

    model.resize_token_embeddings(len(model.tokenizer))
    model.gradient_checkpointing_enable()
    model.to(device=dist.get_rank())
    optimizer = AdamW(model.parameters(), lr=1e-4)
    scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.66)
    

How can I make the DeepSpeed config file reflect the values of my TrainingArguments? In particular, how can I use a scheduler with an exponential learning-rate decay?
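For reference, without DeepSpeed I would wire the optimizer and scheduler into the Trainer roughly like this (just a sketch; train_dataset, eval_dataset and data_collator are placeholders for my actual objects, and I don’t know whether this combination is still valid once the DeepSpeed config is set):

from transformers import Trainer

# Sketch: hand the custom optimizer/scheduler pair to the Trainer directly.
# train_dataset, eval_dataset and data_collator are placeholders.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    optimizers=(optimizer, scheduler),
)
trainer.train()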

Second, I’m starting the script with the command python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py. In train.py I’m calling the train() method of the Trainer. I’m not sure whether DeepSpeed is actually running, because there is no DeepSpeed-related output on the command line. Should I start it in a different way? I’ve read that it can be started with the deepspeed command on the command line, but somehow this does not work (command not found). I have installed DeepSpeed via pip.

Does somebody have any hints? I’m a bit puzzled.

Yes, I have only been able to get DeepSpeed to work for this purpose by using the “deepspeed” executable script that comes with the package.

If you are using a virtual environment and DeepSpeed is properly installed, then the “deepspeed” executable will be in the virtual environment’s bin directory.
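For example, with the environment activated you can check that the launcher is on your PATH (assuming the virtual environment lives in ./venv):

source venv/bin/activate
which deepspeed   # should print something like /path/to/venv/bin/deepspeed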

When you are using DeepSpeed, you do not want to launch with torch.distributed.launch at all; the deepspeed launcher takes care of spawning the processes itself.

If your training script is called train.py, you would run

deepspeed train.py

And then pass all of your training arguments (starting with --deepspeed configs/ds_config.json) after the name of your script.
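For example (a sketch: --num_gpus=4 matches the four GPUs from your torch.distributed.launch command, and passing --deepspeed on the command line assumes train.py parses its arguments with something like HfArgumentParser):

deepspeed --num_gpus=4 train.py --deepspeed configs/ds_config.json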

I’m sure it’s possible to invoke DeepSpeed from inside a Python script, rather than the other way around, but I have never gotten it to work and ultimately decided that continued effort in that direction didn’t serve any useful purpose anyway.