Running out of Memory with run_clm.py

Hi,

First of all, thanks for creating such a cool library :blush:

I have already successfully fine-tuned a GPT2 model, and I now want to fine-tune a GPT2-Large model on the same 1.4 GB training dataset, but I seem to be running out of memory.

When I run the run_clm.py script, I usually get “Killed” as the output. My parameters are the following:

python run_clm.py \
--use_fast_tokenizer \
--model_name_or_path gpt2-large \
--train_file "/home/mark/Downloads/adp5/train2.txt" \
--validation_file "/home/mark/Downloads/adp5/test2.txt" \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--eval_steps 200 \
--num_train_epochs 1 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 8

When I watch memory usage, I can see that both system memory (64 GB) and swap (16 GB) are completely used up, while no GPU memory is allocated at all.
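
To pin down at which point the memory fills up, used RAM and swap can be logged while the script runs. This is only a minimal sketch, assuming a Linux machine that exposes /proc/meminfo; the function names parse_meminfo and memory_summary are made up for illustration and are not part of run_clm.py.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        # Keep only entries whose first field is a plain number.
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Return (used RAM, used swap) in GiB from a parsed meminfo dict."""
    kib_per_gib = 1024 ** 2
    ram_used = (info["MemTotal"] - info["MemAvailable"]) / kib_per_gib
    swap_used = (info["SwapTotal"] - info["SwapFree"]) / kib_per_gib
    return ram_used, swap_used


if __name__ == "__main__":
    # Assumption: Linux only; on other systems this file does not exist.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            ram, swap = memory_summary(parse_meminfo(f.read()))
        print(f"RAM used: {ram:.1f} GiB, swap used: {swap:.1f} GiB")
```

Running this in a loop in a second terminal while run_clm.py starts up should show whether memory fills during dataset loading and tokenization or only once training begins.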

I’ve tried using DeepSpeed as well, but I end up with the same error.

Does anybody know what’s wrong?

Cheers,
Mark

Hey @MarkStrong, do you still get memory issues if you reduce the batch size?
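
If you want to keep the same effective batch size while lowering the per-device one, you can raise --gradient_accumulation_steps to compensate. Here is a minimal sketch of the arithmetic, assuming a single GPU; the helper name effective_batch_size is just for illustration and is not a Transformers function.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences that contribute to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Settings from the command above: batch size 8, accumulation 2.
current = effective_batch_size(8, 2)

# Hypothetical alternative: batch size 1, accumulation 16.
# Fewer sequences are held in memory per forward pass, same effective size.
reduced = effective_batch_size(1, 16)

print(current, reduced)
```

Since it is system RAM and swap that fill up rather than GPU memory, a smaller batch size may not be the whole story, but it is a cheap thing to rule out first.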