I have a problem when training a new tokenizer for the Gemma 2 2B and Phi 3 / Phi 3.5 models using the following code:
from datasets import load_dataset
from transformers import AutoTokenizer


def corpus_gen(dataset, batch_size=300, n=300_000):
    """Yield batches of raw text from the streaming dataset, up to n examples."""
    current = []
    tot = 0
    for ex in dataset:
        current.append(ex['txt'])
        tot += 1
        if tot == n:
            break
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


def train_tokenizer():
    # Stream the JSONL files so the corpus is never fully loaded into memory
    dataset = load_dataset(
        "json",
        split="train",
        streaming=True,
        data_files=[
            "../serlama/tokenizer/paragraphs_tokenizer.jsonl",
            "../serlama/tokenizer/pdrs_tokenizer.jsonl",
            "../serlama/tokenizer/macocu_tokenizer.jsonl",
        ])
    existing_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
    new_tokenizer = existing_tokenizer.train_new_from_iterator(
        corpus_gen(dataset),
        vocab_size=30000,
        min_frequency=3
    )
    new_tokenizer.save_pretrained("sr_tokenizer")


train_tokenizer()
After roughly n = 100,000 examples, my RAM usage steadily increases in jumps of a few gigabytes and I cannot finish training the tokenizer.
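For reference, this is roughly how I watch memory while the generator is consumed (a minimal sketch, assuming psutil is installed; memwatch is just an illustrative helper name):

import os
import psutil

def memwatch(batches):
    # wrap the batch generator and periodically print this process's resident set size
    proc = psutil.Process(os.getpid())
    for i, batch in enumerate(batches):
        if i % 100 == 0:
            print(f"batch {i}: rss = {proc.memory_info().rss / 1024**3:.2f} GiB")
        yield batch

I pass memwatch(corpus_gen(dataset)) instead of corpus_gen(dataset) to train_new_from_iterator to see where the growth happens.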
When I run the same code with the Llama 3.1 tokenizer, everything works fine and RAM usage does not increase.
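To be clear, the only change in that run is the checkpoint the base tokenizer is loaded from, e.g. (the exact repo id may differ on the Hub):

# same script, only the base tokenizer is swapped out
existing_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")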
My transformers version is 4.44.0.
Why is that? Is there a problem with the Gemma 2 2B and Phi 3 tokenizers, or do they have a memory leak?