Colab RAM crash error - Fine-tuning RoBERTa in Colab

Hi,
I’m trying to fine-tune my first NLI model with Transformers on Colab. The model is ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli, and the dataset has around 276,000 hypothesis-premise pairs. I’m following the instructions from the docs here and here.

The issue is that I get a memory error when I run the code below on Colab. My Colab GPU seems to have around 12 GB of RAM. The error occurs during the training step, but I can see in Colab that roughly 7 GB of RAM is already occupied right after the encoding step. RAM usage then shoots up during training and Colab crashes.

I’m new to fine-tuning models. It would be great if someone could give some advice on how to reduce the RAM footprint in the code below.

What I’ve tried:

  • Used model.half() to reduce the memory footprint.
  • Changed per_device_train_batch_size and per_device_eval_batch_size from 32 to 8 to 2. (I’m not sure whether a lower number here reduces the memory requirement, or whether higher numbers are better for RAM.)
  • What else can/should be improved in the code below?

Thanks a lot for your help!

My code:

# ... some data preparation

###  load model and tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

max_length = 256
hg_model_hub_name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
tokenizer = AutoTokenizer.from_pretrained(hg_model_hub_name)
model = AutoModelForSequenceClassification.from_pretrained(hg_model_hub_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if device == "cuda":
    model = model.half()  # for half-precision training. reduces RAM requirement; decreases speed if on older GPU # https://huggingface.co/transformers/v1.1.0/examples.html
model.to(device)
model.train();

# ... some data preparation ... 

encodings_train = tokenizer(premise_train, hypothesis_train, return_tensors="pt", max_length=max_length,
                            return_token_type_ids=True, truncation=True, padding=True)
encodings_val = tokenizer(premise_val, hypothesis_val, return_tensors="pt", max_length=max_length,
                          return_token_type_ids=True, truncation=True, padding=True)
encodings_test = tokenizer(premise_test, hypothesis_test, return_tensors="pt", max_length=max_length,
                           return_token_type_ids=True, truncation=True, padding=True)

### create pytorch dataset object
import torch

class XDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

dataset_train = XDataset(encodings_train, label_train)
dataset_val = XDataset(encodings_val, label_val)
dataset_test = XDataset(encodings_test, label_test)

### training
from transformers import Trainer, TrainingArguments

# https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=2,  # batch size per device during training
    per_device_eval_batch_size=2,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_val             # evaluation dataset
)

trainer.train()

Hi, you could try reducing max_length.

For a bert-base model, I found that I needed to keep max_length × batch size below about 8192 (e.g. max_length 256 × batch size 32 = 8192). I think that limit would be even lower for a bert-large-sized model.

Do you need roberta-large, or would roberta-base be sufficient?
(Or even distilroberta-base)
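
If you want to try a smaller model, a minimal sketch might look like the following (a sketch only, assuming a 3-label NLI setup; note that distilroberta-base has no NLI head, so its classifier is freshly initialised and would be trained from scratch on your pairs):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch: smaller checkpoint plus shorter max_length to cut memory use.
small_model_name = "distilroberta-base"   # or "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(small_model_name)
model = AutoModelForSequenceClassification.from_pretrained(small_model_name, num_labels=3)

max_length = 128  # shorter sequences reduce activation memory roughly in proportion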

Another thought: are you freezing any of the RoBERTa layers?

(I don’t know whether that affects RAM or just training speed.)
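
For what it’s worth, freezing could look roughly like this (a sketch only, assuming the standard RobertaForSequenceClassification attribute names and roberta-large’s 24 encoder layers):

# Sketch: freeze the embeddings and the lower half of the encoder.
# Frozen parameters need no gradients or optimizer state, which should
# mainly save memory during training.
for param in model.roberta.embeddings.parameters():
    param.requires_grad = False
for layer in model.roberta.encoder.layer[:12]:   # first 12 of 24 layers
    for param in layer.parameters():
        param.requires_grad = False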

@rgwatwormhill thanks for your response. For some reason the issue disappeared on its own after a few hours. Since Colab assigns GPUs randomly, maybe I was unlucky with the amount of RAM on the GPUs I had been allocated (even after a factory reset).
I’ll try your suggestions if it reappears.