Running train_new_from_iterator to train a tokenizer is very slow

Hi everyone, I’m running train_new_from_iterator on top of the microsoft/unixcoder-base tokenizer (a RobertaTokenizer, i.e. a BPE tokenizer) to train a new tokenizer on my own corpus.

The problem is that training is very slow.
The process spends ages in the "Count pairs" phase.
At first it counted roughly 1M of the 14M pairs in about an hour, but it keeps slowing down: it has now been running for two days, has only reached 9M pairs, and still has 5M left.
Is there any way to speed up the tokenizer training process?

Here is the code that I’m using for training the tokenizer:

from datasets import concatenate_datasets

def get_training_corpus(train_set):
    # Yield batches of 1000 texts so the whole corpus never sits in memory at once
    for start_idx in range(0, len(train_set), 1000):
        samples = train_set[start_idx: start_idx + 1000]
        yield samples['text']

def train_tokenizer(tokenizer, datasets):
    trainset = concatenate_datasets([dataset['train'] for dataset in datasets]).shuffle()
    # Keep only the "text" column; drop everything else
    trainset = trainset.remove_columns([
        col for col in trainset.column_names if col != "text"
    ])
    training_corpus = get_training_corpus(train_set=trainset)
    # Train a new BPE tokenizer with the same vocabulary size as the original
    tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=len(tokenizer))
    return tokenizer
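
For context, this is roughly how I call it; the dataset below is just a stand-in for my actual corpus, which also exposes a "text" column:

from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset with a "text" column; my real corpus is loaded the same way
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")

old_tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
new_tokenizer = train_tokenizer(old_tokenizer, [wikitext])
new_tokenizer.save_pretrained("unixcoder-retrained-tokenizer")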

It looks like not all CPU cores are being used during training, but I’m not sure what I can do to make it use all of them.
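
One idea I’ve been considering, in case it helps with parallelism, is to skip train_new_from_iterator and train a fresh byte-level BPE tokenizer with the tokenizers library directly, since its Rust trainer is supposed to spread the counting work across threads. This is only a rough sketch; the vocab size and special tokens are my guesses at the unixcoder-base settings, not values I’ve verified:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE in the spirit of RoBERTa-style tokenizers (assumed setup)
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50000,  # assumption: roughly the size of the original vocabulary
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# training_corpus is the same generator of text batches defined above
bpe_tokenizer.train_from_iterator(training_corpus, trainer=trainer)
bpe_tokenizer.save("new-unixcoder-bpe.json")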

Any help is appreciated.
