[Beginner] ClassificationModel Running out of Memory, long training Epochs

unknownTransformer · December 28, 2020, 6:11pm

Hi guys,
I am new to deep learning and wanted to train a binary (sentiment) classification model using SimpleTransformers. As a dataset I took Sentiment140 (1.6 million tweets: 800k positive, 800k negative). The training itself works, but depending on the size of the dataset Google Colab crashes. If I split the 1.6 million tweets into 1.28 million training and 0.32 million test examples, the session crashes right after this step:

[2020-12-28 16:55:15,023] {classification_model.py:1147} INFO -  Converting to features started. Cache is not used.
100%
1278719/1278719 [09:25<00:00, 2260.76it/s]

(1) Is this normal?
If I reduce the data to 800k training and 160k test examples, Google Colab does not crash, but one epoch takes 4 hours. (This size usually works, but sometimes 800k training examples also crash as described above. I don't even know whether training runs to completion: since an epoch lasts 4 hours, I have never let it finish.)
I do not know how comparable the two setups are, but in TensorFlow I trained a CNN with BiLSTM network on the entire dataset, and there an epoch took only 5 minutes. (2) Does 4 hours make sense, or have I made a gross error?

[2020-12-28 17:45:10,844] {classification_model.py:1147} INFO -  Converting to features started. Cache is not used.
100%
800000/800000 [05:44<00:00, 2638.77it/s]
Epoch 1 of 1: 0%
0/1 [00:00<?, ?it/s]
Epochs 0/1. Running Loss. 0.6640: 0% 375/100000 [01:04<3:50:03, 7.19it/s]
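
As a rough sanity check on the 4-hour figure, the numbers in the log above can be plugged into a short calculation. This is only a sketch: it assumes the progress bar counts one batch per iteration and that the reported speed of 7.19 it/s stays roughly constant for the whole epoch.

```python
train_examples = 800_000   # from the "Converting to features" log line
total_batches = 100_000    # from the progress bar (375/100000)
its_per_sec = 7.19         # speed reported by the progress bar

# Implied batch size: examples divided by batches per epoch.
batch_size = train_examples / total_batches

# Estimated wall-clock time for one epoch at the reported speed.
epoch_hours = total_batches / its_per_sec / 3600

print(f"implied batch size: {batch_size:.0f}")        # implied batch size: 8
print(f"estimated epoch time: {epoch_hours:.2f} h")   # estimated epoch time: 3.86 h
```

So the 4-hour estimate follows directly from 100,000 batches of 8 examples at about 7 batches per second; it is a consequence of the model speed and batch size, not a sign that something is stuck.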

import torch
torch.cuda.is_available()

True

from simpletransformers.classification import ClassificationModel

model_type, model_name = 'roberta', 'roberta-base'

model_args = {
    'output_dir': 'outputs/',
    'cache_dir': 'cache/',

    'max_seq_length': 144,
    'num_train_epochs': 1,  # 50
    'learning_rate': 1e-3,
    'adam_epsilon': 1e-8,
    'early_stopping_delta': 1e-3,
    'early_stopping_patience': 5,  # 5
    'overwrite_output_dir': True,
    'manual_seed': True,
    'silent': SILENT,  # SILENT is defined elsewhere in the notebook
}

model = ClassificationModel(model_type=model_type, model_name=model_name, args=model_args,
                            use_cuda=True,
                            num_labels=2)

I also tried adding 'eval_accumulation_steps': 20 to my model_args, but it still crashed before training started.

Thanks in advance.

valhalla · December 29, 2020, 6:39am

Hey there,

If the question is about SimpleTransformers then IMO it would be better to ask it on their issue tracker or forum. We would be happy to help you here, but we are not familiar with SimpleTransformers.

unknownTransformer · December 29, 2020, 10:09am

Ah okay, sorry @valhalla.
I couldn't find a SimpleTransformers forum, and I had heard it was developed by Hugging Face, or at least based on Hugging Face.

BramVanroy · December 29, 2020, 10:34am

It is not developed by HF, but it uses HF's transformers library under the hood. However, things like training as implemented in your code are part of SimpleTransformers, not HF.

rgwatwormhill · December 30, 2020, 11:29pm

The 4 hours per epoch sounds sensible to me.

RoBERTa has a huge number of parameters to train, almost certainly far more than your CNN with BiLSTM, so it will take a long time.

Have you tried freezing most of the layers of RoBERTa, so that only the last layer's parameters are actually trained?
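
A minimal sketch of what freezing could look like in PyTorch terms: turn off `requires_grad` for every parameter except those whose names match a chosen prefix. The helper `freeze_except` and the stand-in classes are hypothetical and only illustrate the idea. The parameter names follow the naming used by Hugging Face's RoBERTa classification model, and the stand-ins exist so the snippet runs without downloading anything.

```python
def freeze_except(model, trainable_prefixes):
    """Disable gradients for every parameter whose name does not start
    with one of `trainable_prefixes`. Returns the names left trainable."""
    trainable = []
    for name, param in model.named_parameters():
        keep = name.startswith(tuple(trainable_prefixes))
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable


# Stand-ins (hypothetical) that mimic only the named_parameters() interface
# of a torch module, so the sketch runs without a real model.
class FakeParam:
    def __init__(self):
        self.requires_grad = True


class FakeModel:
    def __init__(self, names):
        self._params = {n: FakeParam() for n in names}

    def named_parameters(self):
        return self._params.items()


demo = FakeModel([
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.weight",
    "roberta.encoder.layer.11.output.dense.weight",
    "classifier.dense.weight",
    "classifier.out_proj.weight",
])

# Keep only the last encoder layer and the classification head trainable.
kept = freeze_except(demo, ["roberta.encoder.layer.11.", "classifier."])
print(kept)
```

With SimpleTransformers, the underlying PyTorch module should be reachable as `model.model`, so the call would presumably be `freeze_except(model.model, [...])` before training; treat that as an assumption to check against their documentation. Note that freezing mainly saves the backward pass, so the epoch time will drop but not in proportion to the number of frozen parameters.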

unknownTransformer · December 31, 2020, 4:58pm

I would know how to do that with TensorFlow, but to be honest I don't know how to do it with SimpleTransformers. In their tutorial on binary text classification they didn't change a single layer, so I thought I would be fine leaving it like that.

in addition my loss is very fast converging to

I should probably use a small learning rate and train for just 1 epoch to get the best result, because BERT is pretrained, right? Is what I'm doing here some sort of transfer learning?

I'll try a learning rate around 1.0e-5. In tests on a small sample this gave me the best result, even though it is still well below the BiLSTM.

rgwatwormhill · January 4, 2021, 3:49pm

Have you considered a smaller model, such as DistilBERT?

Hugging Face has a DistilBERT version for TensorFlow, TFDistilBertForSequenceClassification, https://huggingface.co/transformers/model_doc/distilbert.html
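
If staying with SimpleTransformers, switching to DistilBERT would mostly be a matter of changing the model type and name. The settings below are a hypothetical sketch, not something tested in this thread: they reuse the argument names from the original post, swap in `distilbert-base-uncased` (roughly 66M parameters versus roughly 125M for `roberta-base`), and use the smaller learning rate mentioned earlier. `train_df` is an assumed DataFrame with "text" and "labels" columns.

```python
# Hypothetical settings: DistilBERT instead of RoBERTa, smaller learning rate.
model_type, model_name = "distilbert", "distilbert-base-uncased"

model_args = {
    "output_dir": "outputs/",
    "cache_dir": "cache/",
    "max_seq_length": 144,
    "num_train_epochs": 1,
    "learning_rate": 1e-5,
    "overwrite_output_dir": True,
}

# Needs `pip install simpletransformers` and a GPU runtime, so the actual
# training calls are left commented out here.
# from simpletransformers.classification import ClassificationModel
# model = ClassificationModel(model_type, model_name, args=model_args,
#                             use_cuda=True, num_labels=2)
# model.train_model(train_df)  # train_df is assumed, see the note above
```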

And yes, when you use a pretrained BERT-type model, you are doing transfer learning.
