Hi everyone, I’m running train_new_from_iterator
on top of the microsoft/unixcoder-base
tokenizer (which is a RobertaTokenizer, i.e. a BPE tokenizer) to train a new tokenizer.
The problem is that the training is very slow.
The training takes ages to finish the "Count pairs"
step.
At first it counted about 1M of the 14M pairs in roughly an hour, but it slows down as it progresses: it has now been running for two days and has only counted 9M pairs, with 5M still remaining.
Is there any way to speed up the tokenizer training process?
Here is the code that I’m using for training the tokenizer:
from datasets import concatenate_datasets

def get_training_corpus(train_set):
    # yield the corpus in batches of 1000 examples
    for start_idx in range(0, len(train_set), 1000):
        samples = train_set[start_idx: start_idx + 1000]
        yield samples['text']

def train_tokenizer(tokenizer, datasets):
    trainset = concatenate_datasets([dataset['train'] for dataset in datasets]).shuffle()
    # remove non-text columns
    trainset = trainset.remove_columns([
        col for col in trainset.column_names if col != "text"
    ])
    training_corpus = get_training_corpus(train_set=trainset)
    tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=len(tokenizer))
    return tokenizer
It also looks like not all processor cores are being used during training, but I’m not sure what I can do to utilize all of them.
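One thing I have been experimenting with: the Rust backend of the tokenizers library parallelizes with a rayon thread pool, and as far as I understand it reads its configuration from environment variables at import time. The variable names below (TOKENIZERS_PARALLELISM, RAYON_RS_NUM_THREADS) are my understanding from the tokenizers source and may depend on your version, so treat this as a sketch rather than a confirmed fix:

```python
import os

# Set these BEFORE importing transformers/tokenizers -- the Rust backend
# reads them when the library is first loaded.
os.environ["TOKENIZERS_PARALLELISM"] = "true"             # make sure parallelism isn't disabled
os.environ["RAYON_RS_NUM_THREADS"] = str(os.cpu_count())  # size of the rayon thread pool

# ...then import and train as usual, e.g.:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
# tokenizer = train_tokenizer(tokenizer, my_datasets)
```

If anyone knows whether this actually affects the "Count pairs" phase, I’d appreciate confirmation.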
Any help is appreciated.