Hello, I wanted to train my own tokenizer on a multilingual corpus (115 GB of OSCAR and mC4 data in 15 languages). My machine only has 16 GB of RAM, so I wrote a generator for this task.
The problem is that training still uses all of my RAM: memory usage grows steadily from about 5 GB to 16 GB over roughly three hours, and then the kernel dies. What could be the problem? Thanks in advance.
My code:
from pathlib import Path
from tqdm import tqdm
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase, NFKC, StripAccents
from tokenizers.pre_tokenizers import ByteLevel, Metaspace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([Lowercase(), NFKC(), StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Metaspace(), ByteLevel()])

def text_generator():
    files = [str(x) for x in Path("data/text/files/").glob("*.txt")]
    for file in tqdm(files):  # one file is 10_000 sentences
        with open(file, "r") as f:
            lines = f.read().split("\n")
            yield lines  # yields a list of 10_000 strings

trainer = BpeTrainer(
    vocab_size=200261, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

tokenizer.train_from_iterator(text_generator(), trainer=trainer)
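In case it matters, here is a minimal sketch of a line-by-line variant I could switch to: yielding one string at a time instead of a list of 10_000 strings per file (the docs say train_from_iterator accepts any iterator over strings or lists of strings, so I assume this is a valid input). I'm not sure whether it would change anything memory-wise, since each file is small compared to the total corpus. The line_generator name is just for illustration.

def line_generator():
    # Same files as above, but stream each file and yield one line at a time
    # instead of building a 10_000-element list per file.
    files = [str(x) for x in Path("data/text/files/").glob("*.txt")]
    for file in tqdm(files):
        with open(file, "r") as f:
            for line in f:  # lazy iteration instead of f.read().split("\n")
                yield line.rstrip("\n")

# tokenizer.train_from_iterator(line_generator(), trainer=trainer)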