Running out of Memory with run_clm.py

Hi,

first of all, thanks for creating such a cool library :blush:

I have already successfully fine-tuned a GPT2 model, and I now want to fine-tune a GPT2-Large model on the same 1.4 GB training dataset, but I seem to be running out of memory.

When I run the run_clm.py script, I usually get "Killed" as the output. My parameters are the following:

python run_clm.py \
--use_fast_tokenizer \
--model_name_or_path gpt2-large \
--train_file "/home/mark/Downloads/adp5/train2.txt" \
--validation_file "/home/mark/Downloads/adp5/test2.txt" \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--eval_steps 200 \
--num_train_epochs 1 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 8

When viewing memory allocation, I can see that both system memory (64 GB) and swap (16 GB) have been completely allocated (GPU memory is not allocated).

I've tried using deepspeed as well, but end up with the same error.

Does anybody know what's wrong?

Cheers,
Mark

Hey @MarkStrong do you still get memory issues if you reduce the batch size?
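For reference, here is a rough sketch of how the per-device batch size could be lowered without changing the number of samples per optimizer step. The helper `effective_batch_size` is hypothetical (it is not part of run_clm.py), and a single GPU is assumed. Whether a smaller batch helps with system RAM, as opposed to GPU memory, is not established in this thread.

```python
# Hypothetical helper for illustration only; not part of run_clm.py.
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Samples that contribute to each optimizer step."""
    return per_device * accumulation_steps * num_devices


# Settings from the command above:
# --per_device_train_batch_size 8 --gradient_accumulation_steps 2
original = effective_batch_size(per_device=8, accumulation_steps=2)

# A smaller per-device batch with more accumulation steps (assumed values):
# --per_device_train_batch_size 2 --gradient_accumulation_steps 8
reduced = effective_batch_size(per_device=2, accumulation_steps=8)

print(original, reduced)
```

Both settings give the same effective batch size, so the second one trades speed for a smaller per-step memory footprint.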

@lewtun is there any scope for lazily loading the data from disk into RAM, i.e. so that only the part of the data being trained on at a given moment is held in RAM?
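To make the idea concrete, here is a minimal standard-library sketch of lazy loading: only one batch of lines is held in memory at a time. The function name `iter_line_batches` and the temporary demo file are assumptions for illustration; this is not how run_clm.py reads its input.

```python
import os
import tempfile
from itertools import islice
from typing import Iterator, List


def iter_line_batches(path: str, batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` lines, reading the file lazily."""
    with open(path, encoding="utf-8") as handle:
        while True:
            # islice pulls at most batch_size lines from the open file.
            batch = [line.rstrip("\n") for line in islice(handle, batch_size)]
            if not batch:
                return
            yield batch


# Small demo on a temporary file with five lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("a\nb\nc\nd\ne\n")
    demo_path = tmp.name

batches = list(iter_line_batches(demo_path, batch_size=2))
os.remove(demo_path)
print(batches)
```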

@lewtun
In answer to my own question on large datasets and lazy loading:
The DatasetDict format from the Hugging Face Datasets library and its map method, which applies any function such as tokenization or grouping, are designed to run in batches, so a large dataset is processed one batch at a time. So, to work with a dataset of any size, convert it to the DatasetDict format and do the processing with the map method.