About the DeepSpeed category

Discussions related to the DeepSpeed integration in 🤗 Transformers

If you still have questions after reading through the integration documentation, please ask them in this forum.

If you discover a bug in the integration itself, please file a Transformers Issue; if the problem is with DeepSpeed itself, please file a DeepSpeed Issue instead.

HF Trainer shows different memory-efficiency capabilities with DeepSpeed vs. FairScale

Hi there, I would like to share a small experiment result with you and see whether it is expected. I am pre-training BERT with the MLM and NSP tasks. To test how large a model I can fit, I used DeepSpeed and FairScale. The only BERT parameter I increase is hidden_size; everything else, such as the maximum sequence length and the batch_size, is fixed.

On a single node with two GPUs, without using DeepSpeed or FairScale, the maximum hidden_size I can set is around 110x16 (1760). With 120x16 (1920), there is an OOM error. With DeepSpeed ZeRO stage 1, the maximum hidden_size is only around 80x16. Stage 2 and stage 3 do not help in increasing the model size.
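
For a rough sense of scale, here is a back-of-the-envelope estimate (my own approximation of the standard BERT layout; it ignores biases, position/type embeddings and activation memory, so the exact numbers will differ a bit):

# Rough estimate of BERT parameter count and fp32 training memory
# (24 layers, intermediate_size = 4 * hidden_size, vocab 30522).
def estimate_bert_memory(hidden_size, num_layers=24, vocab_size=30522):
    per_layer = 12 * hidden_size ** 2          # ~4*H^2 attention + ~8*H^2 feed-forward
    params = num_layers * per_layer + vocab_size * hidden_size
    weights_gib = params * 4 / 1024 ** 3       # fp32 weights
    # fp32 gradients plus the Adam exp_avg and exp_avg_sq states roughly triple that again
    total_gib = 4 * weights_gib
    return params / 1e6, total_gib

for h in (80 * 16, 110 * 16, 120 * 16):
    params_m, total_gib = estimate_bert_memory(h)
    print(f"hidden_size={h}: ~{params_m:.0f}M params, ~{total_gib:.1f} GiB for weights+grads+Adam")

For hidden_size = 1760 this already lands around 14 GiB per GPU for weights, gradients and Adam states alone (before activations), which is roughly consistent with hitting OOM near 110x16 on a 24 GiB card when nothing is sharded.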

With FairScale simple (sharded DDP), the maximum hidden_size is around 120x16, with around 70% of GPU memory occupied. I expected FairScale zero_dp_2 or zero_dp_3 to help further, but they do not.
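
For reference, FairScale sharded DDP is enabled through the Trainer's sharded_ddp option; here is a minimal sketch (the option can equally be passed on the command line as --sharded_ddp, and the FairScale runs are launched with python -m torch.distributed.launch rather than the deepspeed launcher):

# Minimal sketch of enabling FairScale sharded DDP via the HF Trainer
# (transformers 4.12.x); "simple", "zero_dp_2" and "zero_dp_3" correspond to the
# FairScale_simple / zero_dp_2 / zero_dp_3 runs mentioned above.
from transformers import TrainingArguments

fairscale_args = TrainingArguments(
    output_dir="./log",
    do_train=True,
    learning_rate=1e-4,
    num_train_epochs=10,
    per_device_train_batch_size=1,
    save_strategy="no",
    report_to="none",
    dataloader_pin_memory=False,
    sharded_ddp="zero_dp_2",  # or "simple" / "zero_dp_3"
)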

In a nutshell, although DeepSpeed and FairScale rely on the same technique behind the scenes, they show different capabilities. Is this normal, is it a bug, or did I not set the arguments correctly? Furthermore, I did not see the capacity increase with the ZeRO stages; I expected DeepSpeed stage 2 and stage 3 to do better, and the same for FairScale. Is my assumption right?

Thank you.

The script and the data are attached, so you can reproduce the experiments.

train_HF.py

import torch
from torch import nn
from torch.utils.data import Dataset
from transformers import (BertConfig, BertForPreTraining, Trainer,
                          TrainingArguments, HfArgumentParser, set_seed)
import numpy as np
import os
import sys

from dataclasses import dataclass

class DataNumpyDataset(Dataset):
    """Toy dataset that serves the same pre-tokenized example from data.npy on every call."""

    def __init__(self):
        self.dataset_size = 1024

    def __len__(self):
        return self.dataset_size

    def __getitem__(self, idx):
        if idx >= self.dataset_size:
            raise IndexError
        try:
            # data[0] holds the tokenizer output (input_ids, token_type_ids, attention_mask);
            # data[1] holds the next-sentence label.
            data = np.load("data.npy", allow_pickle=True)
            pt_data = dict()
            # Tensors are moved to the GPU here, hence --dataloader_pin_memory=False below.
            pt_data["input_ids"] = torch.tensor(data[0]["input_ids"], dtype=torch.long)[0, :].cuda()
            # For this memory test the MLM labels simply reuse the input ids.
            pt_data["labels"] = pt_data["input_ids"].detach().clone()
            pt_data["next_sentence_label"] = torch.tensor(data[1], dtype=torch.long)[0, :].cuda()
            pt_data["token_type_ids"] = torch.tensor(data[0]["token_type_ids"], dtype=torch.long)[0, :].cuda()
            pt_data["attention_mask"] = torch.tensor(data[0]["attention_mask"], dtype=torch.float)[0, :].cuda()
            return pt_data

        except EOFError:
            raise IOError("Reached the end of the data file")

@dataclass
class ModelArguments:
    hidden_size: int = 1024

if __name__ == '__main__':
    # args_hidden_size = 1920
    args_num_hidden_layers = 24
    args_num_attention_heads = 16
    args_vocab_size = 30522

    parser = HfArgumentParser((ModelArguments, TrainingArguments))
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        model_args, training_args = parser.parse_args_into_dataclasses()

    set_seed(training_args.seed)
    configuration = BertConfig(hidden_size=model_args.hidden_size, num_hidden_layers=args_num_hidden_layers,
                               num_attention_heads=args_num_attention_heads, intermediate_size=4*model_args.hidden_size,
                               vocab_size=args_vocab_size)
  
    model = BertForPreTraining(configuration)

    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)

    dataset = DataNumpyDataset()

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset
        )

    trainer.train()

Script to run it:

deepspeed --num_gpus=2 \
./train_HF.py \
--do_train \
--learning_rate 1e-4 \
--hidden_size=1760 \
--num_train_epochs 10 \
--per_device_train_batch_size 1 \
--output_dir "./log" \
--save_strategy="no" \
--report_to="none" \
--fp16=false \
--dataloader_pin_memory=False \
--deepspeed 'ds_config_zero1.json'

ds_config_zero1.json, ds_config_zero2.json, ds_config_zero3.json, and the data are attached on Google Drive:
link to ds_config files and data.
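
In case the drive link is not convenient, a minimal ZeRO stage-1 config in the style of the Transformers DeepSpeed integration docs can be generated like this (an illustrative sketch with "auto" placeholders filled in by the Trainer, not a copy of the attached ds_config_zero1.json):

# Illustrative sketch only: a minimal ZeRO stage-1 config with HF "auto" values.
# The attached ds_config_zero1.json used in the experiments may differ in detail.
import json

ds_config_zero1 = {
    "zero_optimization": {"stage": 1},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

with open("ds_config_zero1.json", "w") as f:
    json.dump(ds_config_zero1, f, indent=4)

Changing "stage" to 2 or 3 gives the zero2 / zero3 variants.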

Some other info:

  • GPU: GeForce RTX 3090, with around 24 GiB of memory
  • Driver version: 460.32.03, CUDA version: 11.2
  • deepspeed 0.5.4, installed from pip
  • transformers 4.12.2, installed from pip
  • fairscale 0.4.1, installed from pip
  • pytorch 1.8.1+cu111
  • python 3.7.7
  • fp16 optimization is disabled
  • gradient checkpointing is disabled
  • CPU offload is disabled