Hello, I want to train my own tokenizer on a multilingual corpus (115GB of OSCAR and mC4 data in 15 languages). My machine has only 16GB of RAM, so I wrote a generator for this task.
The problem is that it still uses all my RAM: usage climbs progressively from 5GB to 16GB over roughly 3 hours, and then the kernel dies. What could be the problem? Thanks in advance.
```python
from pathlib import Path

from tqdm import tqdm
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase, NFKC, StripAccents
from tokenizers.pre_tokenizers import Metaspace, ByteLevel
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([Lowercase(), NFKC(), StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Metaspace(), ByteLevel()])

def text_generator():
    files = [str(x) for x in Path("data/text/files/").glob("*.txt")]
    for file in tqdm(files):  # one file is 10_000 sentences
        with open(file, "r") as f:
            lines = f.read().split('\n')
            yield lines  # yields a list of 10_000 strings

trainer = BpeTrainer(
    vocab_size=200261,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(text_generator(), trainer=trainer)
```