Memory leak when training a new Gemma 2 or Phi 3 / 3.5 tokenizer

I have a problem when training a new tokenizer for the Gemma 2 2B or Phi 3 / 3.5 models using the following code:

from datasets import load_dataset
from transformers import AutoTokenizer


def corpus_gen(dataset, batch_size=300, n=300_000):
    # Yield batches of raw text from the streaming dataset,
    # stopping after n examples in total.
    current = []
    tot = 0
    for ex in dataset:
        current.append(ex['txt'])
        tot += 1
        if tot == n:
            break
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


def train_tokenizer():
    # Stream the JSONL corpora so they are not loaded into RAM all at once.
    dataset = load_dataset(
        "json",
        split="train",
        streaming=True,
        data_files=[
            "../serlama/tokenizer/paragraphs_tokenizer.jsonl",
            "../serlama/tokenizer/pdrs_tokenizer.jsonl",
            "../serlama/tokenizer/macocu_tokenizer.jsonl",
        ],
    )

    existing_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

    # Train a new tokenizer with the same algorithm and special tokens
    # as the base one, but on my own corpus.
    new_tokenizer = existing_tokenizer.train_new_from_iterator(
        corpus_gen(dataset),
        vocab_size=30000,
        min_frequency=3,
    )
    new_tokenizer.save_pretrained("sr_tokenizer")


train_tokenizer()

After roughly n = 100,000 examples, RAM usage starts to climb steadily in jumps of a few gigabytes and I cannot finish training the tokenizer.
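This is roughly how I watch the memory growth per batch (a minimal sketch; it assumes psutil is installed and reuses corpus_gen and dataset from the code above):

import os
import psutil

def log_memory(batches, every=100):
    # Wrap the batch generator and print resident memory (RSS)
    # every `every` batches so the growth is visible over time.
    process = psutil.Process(os.getpid())
    for i, batch in enumerate(batches):
        if i % every == 0:
            rss_gb = process.memory_info().rss / 1024**3
            print(f"batch {i}: RSS = {rss_gb:.2f} GiB")
        yield batch

Passing log_memory(corpus_gen(dataset)) instead of corpus_gen(dataset) to train_new_from_iterator shows the RSS increasing batch after batch.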

When I run the same code with the Llama 3.1 tokenizer, everything works fine and RAM usage does not grow. My transformers version is 4.44.0.
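For comparison, the only change in the Llama 3.1 run is the base checkpoint (assuming the meta-llama/Meta-Llama-3.1-8B tokenizer; everything else is identical):

# Same training code; only the base tokenizer differs.
existing_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
new_tokenizer = existing_tokenizer.train_new_from_iterator(
    corpus_gen(dataset),
    vocab_size=30000,
    min_frequency=3,
)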
Why is that?
What is the problem with the Gemma 2 2B and Phi 3 tokenizers? Do they have a memory leak?