ALBERT pre-training with batch size 8 is throwing OOM

Environment info

  • transformers version: 4.15.0
  • Platform: Ubuntu 18.1
  • Python version: 3.7
  • PyTorch version (GPU?): 1.6 + CUDA 10.1 (GPU: yes)
  • Tensorflow version (GPU?): None
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

The tasks I am working on are:
  • Masked language modelling (using AlbertForMaskedLM)
  • Training an ALBERT model from scratch

python run_mlm.py \
  --model_type albert \
  --num_train_epochs 300 \
  --train_file /home/kushwanth/write_chunks/sample.txt \
  --validation_file /home/kushwanth/write_chunks/sample.txt \
  --tokenizer_name albert \
  --do_train=yes \
  --output_dir=/home/kushwanth/model \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --save_steps 3000 \
  --logging_steps 500 \
  --report_to tensorboard \
  --preprocessing_num_workers 10 \
  2>&1 | tee /home/kushwanth/log_1.txt
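
Since the script is launched with plain python (not torch.distributed.launch / torchrun), the Trainer runs in a single process and, as I understand it, wraps the model in torch.nn.DataParallel across the 8 GPUs, which matches the data_parallel.py frames in the traceback further down. A minimal sketch to confirm which parallel mode gets picked (an illustration, not part of the actual run):

# Sketch (illustration only): with a plain `python run_mlm.py ...` launch and
# 8 visible GPUs, local_rank stays -1 and n_gpu is 8, so the Trainer falls back
# to torch.nn.DataParallel rather than DistributedDataParallel.
from transformers import TrainingArguments

args = TrainingArguments(output_dir="/tmp/parallel_check")  # hypothetical path
print("local_rank:", args.local_rank)  # -1 -> no torch.distributed launcher
print("n_gpu:", args.n_gpu)            # 8 on this machine -> DataParallel path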

GPU specs:
  • NVIDIA V100
  • Number of GPUs: 8, with 16 GB memory each

With a per-device batch size of 8 we hit a CUDA out-of-memory error, while average GPU utilisation is only around 50%.
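
A minimal sketch of how the per-GPU memory can be checked while training runs (an illustration alongside nvidia-smi, not the exact tooling behind the 50% figure):

# Sketch (illustration only): print allocated/reserved memory per GPU to see
# how close each card is to its 16 GiB limit.
import torch

for i in range(torch.cuda.device_count()):
    total = torch.cuda.get_device_properties(i).total_memory
    allocated = torch.cuda.memory_allocated(i)
    reserved = torch.cuda.memory_reserved(i)
    print(f"GPU {i}: allocated {allocated / 2**30:.2f} GiB, "
          f"reserved {reserved / 2**30:.2f} GiB, total {total / 2**30:.2f} GiB")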

Model Config:

{
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.15.0",
  "type_vocab_size": 1,
  "vocab_size": 40000
}
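
For scale, the config above can be instantiated directly to count parameters; because ALBERT shares weights across its 12 layers the model itself is small, which makes the OOM surprising. A sketch (values copied from the config above, dropout/eps left at their defaults):

# Sketch (illustration only): rebuild the model from the config above and count
# its parameters.
from transformers import AlbertConfig, AlbertForMaskedLM

config = AlbertConfig(
    vocab_size=40000,
    embedding_size=128,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=1,
    pad_token_id=0,
    bos_token_id=2,
    eos_token_id=3,
)
model = AlbertForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")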

File "run_mlm.py", line 442, in main                                                                                      
    train_result = trainer.train(resume_from_checkpoint=checkpoint)                                                               
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1332, in train       
    tr_loss_step = self.training_step(model, inputs)                                                                              
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1891, in training_step         loss = self.compute_loss(model, inputs)                                                                                       
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1923, in compute_loss      
    outputs = model(**inputs)                                                                                                     
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl          result = self.forward(*input, **kwargs)                                                                                       
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward 
    return self.gather(outputs, self.output_device)                                                                               
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather  
    return gather(outputs, output_device, dim=self.dim)
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_m
ap
    for k in out))
  File "<string>", line 7, in __init__
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/transformers/file_utils.py", line 2294, in __post_init__
    for element in iterator:
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_m
ap
    return Gather.apply(target_device, dim, *outputs)
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/kushwanth/anaconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/comm.py", line 166, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA out of memory. Tried to allocate 4.88 GiB (GPU 0; 15.78 GiB total capacity; 7.63 GiB already allocated; 2.08 Gi
B free; 12.50 GiB reserved in total by PyTorch)
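
The 4.88 GiB allocation looks like the gathered MLM logits rather than the model itself: under DataParallel each replica returns prediction scores of shape (8, 512, 40000) in fp32, and all 8 replicas are concatenated onto GPU 0. A back-of-the-envelope check (assuming sequences are grouped to the full 512 tokens):

# Rough arithmetic for the gather step (assumes max_seq_length = 512, fp32 logits).
batch_per_gpu = 8
seq_len = 512           # max_position_embeddings
vocab_size = 40000
bytes_per_float = 4     # fp32
num_gpus = 8

per_replica = batch_per_gpu * seq_len * vocab_size * bytes_per_float
gathered = per_replica * num_gpus
print(f"per replica: {per_replica / 2**30:.2f} GiB")     # ~0.61 GiB
print(f"gathered on GPU 0: {gathered / 2**30:.2f} GiB")  # ~4.88 GiB, matching the error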