Like the title says, the speed was extremely fast at first, about 80,000 examples/s, so the estimated time was around 5 minutes. In practice it took almost 50 minutes, because the speed kept dropping as progress approached 100%: around 30% it fell to 50,000 examples/s, at 60% to 30,000 examples/s, and so on. The worst drop was between 90% and 100%, where the speed plummeted from about 3,000 down to as low as 90 examples/s. As a result, roughly 30 to 40 of those 50 minutes were spent in the 90-100% range.
Additional information:
- I used `LlamaTokenizer` (not fast) for tokenization.
- My dataset consists of approximately 20 million records.
- The `datasets` version is 2.15.0.
- Due to this issue, I needed to set `multiprocess.set_start_method('spawn')` to benefit from multiprocessing (see the sketch below).
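A minimal sketch of how that start-method call is placed, assuming it runs in the main module before `map` is called (the `__main__` guard and surrounding script structure are illustrative, not my exact script):

```python
import multiprocess

if __name__ == '__main__':
    # With spawn, the start method must be set once in the main process,
    # before dataset.map launches its worker processes.
    multiprocess.set_start_method('spawn')
```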
Here is the tokenization code:

```python
from transformers import PreTrainedTokenizerBase


def _tokenize(batch: dict[str, list[str]], tokenizer: PreTrainedTokenizerBase):
    # Tokenize the whole batch at once, skipping empty strings.
    new_batch = tokenizer(
        [x for x in batch['text'] if x],
        add_special_tokens=False,
        return_token_type_ids=False,
        return_attention_mask=False,
        verbose=False,
    )
    # add_special_tokens is disabled above, so add BOS/EOS manually.
    for x in new_batch['input_ids']:
        x.insert(0, tokenizer.bos_token_id)
        x.append(tokenizer.eos_token_id)
    return new_batch


dataset = dataset.map(
    _tokenize,
    batched=True,
    remove_columns=dataset.column_names,  # remove_columns expects column names, not a bool
    fn_kwargs=dict(tokenizer=tokenizer),
    num_proc=96,
    desc='Tokenize',
)
```
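Not part of my setup, but for comparison, a minimal sketch of loading the Rust-backed fast tokenizer instead (assuming the checkpoint ships the files `LlamaTokenizerFast` needs; the path is a placeholder):

```python
from transformers import AutoTokenizer

# use_fast=True selects LlamaTokenizerFast, which tokenizes batches in Rust
# and is usually much faster than the Python/SentencePiece LlamaTokenizer.
tokenizer = AutoTokenizer.from_pretrained('path/to/llama-checkpoint', use_fast=True)
```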