Hi everyone, I’m running train_new_from_iterator
on top of the microsoft/unixcoder-base
tokenizer (which is a RobertaTokenizer, i.e. a BPE tokenizer) to train a new tokenizer.
The problem is that the training is very slow.
The training takes ages to finish the "Count pairs"
step.
At first it counted about 1M of the 14M pairs in roughly an hour, but it slows down as it progresses: it has now been running for two days and has only counted 9M pairs, with 5M still remaining.
Is there any way to speed up the tokenizer training process?
Here is the code that I’m using for training the tokenizer:
from datasets import concatenate_datasets

def get_training_corpus(train_set):
    # yield the corpus in batches of 1000 examples
    for start_idx in range(0, len(train_set), 1000):
        samples = train_set[start_idx: start_idx + 1000]
        yield samples['text']

def train_tokenizer(tokenizer, datasets):
    trainset = concatenate_datasets([dataset['train'] for dataset in datasets]).shuffle()
    # remove non-text columns
    trainset = trainset.remove_columns([
        col for col in trainset.column_names if col != "text"
    ])
    training_corpus = get_training_corpus(train_set=trainset)
    tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=len(tokenizer))
    return tokenizer
It also looks like not all processor cores are being used during training, but I’m not sure what I can do to utilize all of them.
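One thing I have been experimenting with: the Rust backend of the tokenizers library parallelizes with a rayon thread pool, and as far as I understand it reads its configuration from environment variables at import time. The variable names below (TOKENIZERS_PARALLELISM, RAYON_RS_NUM_THREADS) are my understanding from the tokenizers source and may depend on your version, so treat this as a sketch rather than a confirmed fix:

```python
import os

# Set these BEFORE importing transformers/tokenizers -- the Rust backend
# reads them when the library is first loaded.
os.environ["TOKENIZERS_PARALLELISM"] = "true"             # make sure parallelism isn't disabled
os.environ["RAYON_RS_NUM_THREADS"] = str(os.cpu_count())  # size of the rayon thread pool

# ...then import and train as usual, e.g.:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
# tokenizer = train_tokenizer(tokenizer, my_datasets)
```

If anyone knows whether this actually affects the "Count pairs" phase, I’d appreciate confirmation.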
Any help is appreciated.