Training tokenizer takes too much RAM

Hello, I want to train my own tokenizer on a multilingual corpus (115GB of OSCAR and mC4 data in 15 languages). My machine has only 16GB of RAM, so I wrote a generator for this task.
The problem is that it still uses all my RAM. Memory usage climbs steadily from 5GB to 16GB over about 3 hours, and then the kernel dies. What could be the problem? Thanks in advance.

My code:

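from pathlib import Path
from tqdm import tqdm
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase, NFKC, StripAccents
from tokenizers.pre_tokenizers import Metaspace, ByteLevel
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))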
tokenizer.normalizer = normalizers.Sequence([Lowercase(), NFKC(), StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Metaspace(), ByteLevel()])

def text_generator():
    files = [ str(x) for x in Path("data/text/files/").glob("*.txt") ]
    for file in tqdm(files): # one file is 10_000 sentences
        with open(file, "r") as f:
            lines = f.read().split('\n')
        yield lines # yields list of 10_000 strings 

trainer = BpeTrainer(
    vocab_size=200261, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

tokenizer.train_from_iterator(text_generator(), trainer=trainer)

Same problem! Isn't train_from_iterator supposed to load the data in batches and never hold the complete dataset in memory at once? Or is it the internal pair and vocab hashmaps getting bigger and bigger?
Does anyone know the cause?
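
To make the second hypothesis concrete, here is a minimal pure-Python sketch of why streaming the input in batches does not by itself bound memory. It is not the tokenizers library's actual code. It only assumes that a trainer keeps one count per distinct word it has seen, and the names `batches` and `count_words` are made up for illustration. The batches are synthetic and deliberately worst-case: almost every word is new.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def batches(n_batches: int, batch_size: int) -> Iterator[List[str]]:
    """Yield synthetic batches of sentences, one batch at a time.

    Each sentence holds one never-before-seen word plus the shared word "common".
    """
    for b in range(n_batches):
        yield [f"word{b}_{i} common" for i in range(batch_size)]


def count_words(stream: Iterable[List[str]]) -> Iterator[int]:
    """Consume batches lazily and yield the size of the count table after each one.

    Only one batch is alive at a time, but the table itself is never discarded
    (assumption: a trainer needs these counts until training finishes).
    """
    counts: Counter = Counter()
    for batch in stream:
        for sentence in batch:
            counts.update(sentence.split())
        yield len(counts)


sizes = list(count_words(batches(n_batches=5, batch_size=1000)))
print(sizes)  # [1001, 2001, 3001, 4001, 5001]
```

The table grows with the number of distinct words, not with the batch size. If something similar happens inside the trainer, a 15-language corpus with a large share of rare or noisy words could keep adding entries for the whole run, no matter how the generator is written.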