Mismatch between memory estimate and Trainer API

Hey, I'm experimenting with a few language models for code generation and I keep running out of memory, no matter which model I use. I have 8 GB of VRAM, so I tried flax-community/gpt-neo-125M-code-clippy-dedup-2048, because accelerate estimate-memory estimated 1.89 GB of VRAM for training with Adam.
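For reference, this is roughly how I ran the estimate (typed from memory, so the exact flags may be slightly off):

accelerate estimate-memory flax-community/gpt-neo-125M-code-clippy-dedup-2048 --library_name transformers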

When I train, I get the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 246.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Of the allocated memory 20.73 GiB is allocated by PyTorch, and 1.35 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
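I have not tried the allocator setting from the error message yet. As far as I understand it, the option would have to go at the very top of the script, before anything touches the GPU, something like this:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # must be set before the first CUDA allocation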

Why does PyTorch allocate so much memory? This is what I have already tried:

  • I reduced the number of SQL files to 10
  • I reduced the batch size to 1 (a rough sketch of these experiments follows right after this list)
  • I tried reducing the max_length parameter as well, but it had no effect
  • I am using transformers 4.37.0 and torch 2.1.2+cu121
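This is roughly how I changed those settings (reconstructed from memory, not the exact runs; the full script further down is back to the defaults, and this sketch reuses its ds and tokenizer objects):

# batch size 1 via TrainingArguments instead of the default
training_args = TrainingArguments(
    "test-trainer",
    per_device_train_batch_size=1,
)

# shorter sequences instead of the model's full 2048 context window
tokenized_dataset = ds.map(
    lambda example: tokenizer(example["code"],
                              truncation=True,
                              padding="max_length",
                              max_length=512),
    batched=True,
)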

Here is my code for more context:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer
from datasets import Dataset, load_dataset, Value, Features
from pathlib import Path


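# Take the first 10 local .sql files and split them 80/20 into train and test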
DATA_PATH = Path("./data")
files = [str(p) for p in DATA_PATH.glob("*.sql")][:10]
train_files, test_files = files[:int(len(files) * 0.8)], files[int(len(files) * 0.8):]

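# Load each .sql file as one example (sample_by="document") with a single string column "code"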
features = Features({'code': Value('string')})
ds = load_dataset("text", data_files={"train": train_files, "test": test_files}, sample_by="document", features=features)

tokenizer = AutoTokenizer.from_pretrained(
    # "stabilityai/stable-code-3b",
    # "deepseek-ai/deepseek-coder-1.3b-instruct",
    "flax-community/gpt-neo-125M-code-clippy-dedup-2048",
    trust_remote_code=True)

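# The GPT-Neo tokenizer has no pad token by default, so add an explicit [PAD] token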
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = AutoModelForCausalLM.from_pretrained(
    # "stabilityai/stable-code-3b",
    # "deepseek-ai/deepseek-coder-1.3b-instruct",
    "flax-community/gpt-neo-125M-code-clippy-dedup-2048",
    trust_remote_code=True,
    torch_dtype="auto",
)

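# Use the model's full context window (max_position_embeddings, 2048 for this checkpoint) as the padding/truncation length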
max_length = model.config.max_position_embeddings

tokenized_dataset = (ds.map(
    lambda example: tokenizer(example["code"],
                              return_tensors="pt",
                              truncation=True,
                              padding="max_length",
                              max_length=max_length),
    batched=True,
    batch_size=1,
))
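# Causal-LM collator: with mlm=False the labels are a copy of input_ids (pad tokens are masked out with -100)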
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
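# Plain TrainingArguments with the defaults (per_device_train_batch_size is still 8 here)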
training_args = TrainingArguments("test-trainer")
model.cuda()


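# Standard Trainer setup; the OOM happens inside trainer.train()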
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

trainer.train()