[Beginner] ClassificationModel Running out of Memory, long training Epochs

Hi guys,
I am new to deep learning and wanted to train a binary (sentiment) classifier using SimpleTransformers. As a dataset I took Sentiment140 (1.6 million tweets: 800k positive, 800k negative). The training itself works, but depending on the size of the dataset, Google Colab crashes. If I split the 1.6 million tweets into 1.28 million training and 0.32 million test examples, it crashes right after this step:

[2020-12-28 16:55:15,023] {classification_model.py:1147} INFO -  Converting to features started. Cache is not used.
1278719/1278719 [09:25<00:00, 2260.76it/s] 

(1) Is this normal?
If I reduce the data to 800k training and 160k test examples, Google Colab usually does not crash, but one epoch takes 4 hours. (Sometimes 800k training examples also crash as described above. And when it does reach training, I don’t know whether it completes, since I have never let a 4-hour epoch run to the end.)
I don't know how comparable this is, but in TensorFlow I trained a CNN + BiLSTM network on the entire dataset, and an epoch took only 5 minutes. (2) Does 4 hours make sense, or have I made a gross error?

[2020-12-28 17:45:10,844] {classification_model.py:1147} INFO -  Converting to features started. Cache is not used.
800000/800000 [05:44<00:00, 2638.77it/s]
Epoch 1 of 1: 0%
0/1 [00:00<?, ?it/s]
Epochs 0/1. Running Loss. 0.6640: 0% 375/100000 [01:04<3:50:03, 7.19it/s]

My code:

import torch
from simpletransformers.classification import ClassificationModel

model_type, model_name = 'roberta', 'roberta-base'

model_args = {
    'output_dir': 'outputs/',
    'cache_dir': 'cache/',

    'max_seq_length': 144,
    'num_train_epochs': 1,  # 50
    'learning_rate': 1e-3,
    'adam_epsilon': 1e-8,
    'early_stopping_delta': 1e-3,
    'early_stopping_patience': 5,
    'overwrite_output_dir': True,
    'manual_seed': True,
    'silent': SILENT,  # SILENT is a bool defined elsewhere in the notebook
}

model = ClassificationModel(model_type=model_type, model_name=model_name, args=model_args)

I also tried adding 'eval_accumulation_steps': 20 to my model_args, but it still crashed before training started.

Thanks in advance.

Hey there,

If the question is about SimpleTransformers, then IMO it would be better to ask it on their issue tracker or forum. We would be happy to help you here, but we are not familiar with SimpleTransformers :slight_smile:


Ah okay, sorry @valhalla.
I couldn’t find a SimpleTransformers forum, and I had heard that it was developed by Hugging Face, or at least is based on Hugging Face.

It is not developed by HF, but it uses HF’s transformers library under the hood. However, things like the training loop in your code are implemented by SimpleTransformers, not by HF.


The 4 hours per epoch sounds sensible to me.

RoBERTa has a huge number of parameters to train, almost certainly many more than your CNN with BiLSTM, so it will take a long time.
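As a rough sanity check, the numbers in the progress bar from the question are self-consistent. A minimal sketch of the arithmetic, assuming the 100000 steps shown correspond to the 800000 training examples in batches of 8 (the batch size is inferred from the log, not stated in the post; `estimate_epoch_hours` is just an illustrative helper):

```python
import math


def estimate_epoch_hours(num_examples: int, batch_size: int, steps_per_second: float) -> float:
    """Estimate wall-clock hours for one epoch from an observed step rate."""
    steps = math.ceil(num_examples / batch_size)
    return steps / steps_per_second / 3600


# Values read from the log in the question; batch_size=8 is inferred
# from 800000 examples / 100000 steps.
steps = math.ceil(800_000 / 8)
hours = estimate_epoch_hours(800_000, 8, 7.19)
print(f"{steps} steps, about {hours:.1f} hours per epoch")
```

So at 7.19 it/s a single pass over 800k tweets lands just under four hours. A larger batch size (if it fits in GPU memory) or a shorter max_seq_length would likely bring that down.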

Have you tried freezing most of the layers of RoBERTa, so that only the last layer’s parameters are actually trained?
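A minimal sketch of what freezing could look like in plain PyTorch terms, assuming the model exposes `named_parameters()` the way any `torch.nn.Module` does. The helper name `freeze_all_but` is hypothetical, and I have not checked how SimpleTransformers builds its optimizer, so treat this as an illustration rather than the library's recommended way:

```python
from typing import Iterable, Tuple


def freeze_all_but(model, trainable_prefixes: Iterable[str]) -> Tuple[int, int]:
    """Leave only parameters whose names start with one of the prefixes
    trainable; freeze everything else.

    Returns (trainable, frozen) counts of parameter tensors.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = frozen = 0
    for name, param in model.named_parameters():
        # requires_grad=False means no gradient is computed for this tensor.
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable += 1
        else:
            frozen += 1
    return trainable, frozen


# Hypothetical usage with the model from the question (not run here):
#   clf = ClassificationModel('roberta', 'roberta-base', args=model_args)
#   freeze_all_but(clf.model, ["classifier."])
# In Hugging Face's RobertaForSequenceClassification the head parameters are
# named "classifier.*" and the encoder layers "roberta.encoder.layer.<i>.*".
```

With only the classification head trainable, far fewer gradients need to be computed per step, which should shorten the epoch, possibly at some cost in accuracy.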

I would know how to do that with TensorFlow, but to be honest I don’t know how to do it with SimpleTransformers. Their tutorial on binary text classification didn’t change a single layer, so I thought I would be fine leaving it like that.

In addition, my loss is converging very fast to

I should probably use a small learning rate and train for just 1 epoch to get the best result, because BERT is pretrained, right? Is what I’m doing here some sort of transfer learning? :smiley:

I’ll try a learning rate around 1.0e-5. Testing on a small sample, this gave me the best result, even though it’s way below the BiLSTM.

Have you considered a smaller model, such as DistilBERT?

Hugging Face has a DistilBERT version for TensorFlow, TFDistilBertForSequenceClassification: https://huggingface.co/transformers/model_doc/distilbert.html
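If you would rather stay with SimpleTransformers, I believe swapping the model only needs a different type and name. A sketch, assuming 'distilbert' / 'distilbert-base-uncased' as the checkpoint and the 1e-5 learning rate mentioned above. The batch size is an arbitrary illustrative choice, and `build_model` is a hypothetical helper:

```python
# Hypothetical settings for a smaller model; the values are illustrative, not tuned.
model_type, model_name = "distilbert", "distilbert-base-uncased"

model_args = {
    "output_dir": "outputs/",
    "cache_dir": "cache/",
    "max_seq_length": 144,       # same as in the question
    "num_train_epochs": 1,
    "learning_rate": 1e-5,       # the smaller rate mentioned in this thread
    "train_batch_size": 32,      # assumption: lower it if the GPU runs out of memory
    "overwrite_output_dir": True,
}


def build_model():
    """Create the classifier; requires the simpletransformers package."""
    from simpletransformers.classification import ClassificationModel

    return ClassificationModel(model_type, model_name, args=model_args)
```

distilbert-base has roughly half the parameters of roberta-base (about 66M versus about 125M), so each training step should be noticeably cheaper.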

When you use a pretrained BERT-type model, yes, you are doing transfer learning.