Speed issues using tokenizer.train_new_from_iterator on ~50GB dataset

Hello, I wasn’t sure whether this belongs in the transformers, datasets, or tokenizers category, but I wanted to post some benchmark times for training a GPT-style tokenizer on a dataset of tens of GB of text, because they seem slower than my expectation (which could be totally off). The “Pre-processing sequences” step took ~3 hours on a modern 12-core AMD CPU.

Here is the script I used

import datasets
from transformers import AutoTokenizer


def batch_iterator(dataset, batch_size=1_000):
    # yield lists of raw strings from the "text" column, one batch at a time
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]


if __name__ == "__main__":

    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    clone_from_name = "gpt2"
    vocab_size = 32_768

    # reuse the GPT-2 tokenization pipeline, but learn a new vocab on this corpus
    clone_from_tokenizer = AutoTokenizer.from_pretrained(clone_from_name)
    ds_train = datasets.load_dataset(ds_id, split="train")

    tokenizer = clone_from_tokenizer.train_new_from_iterator(
        batch_iterator(ds_train),
        vocab_size=vocab_size,
    )

    tokenizer.save_pretrained("pubtator-gpt2-v32k-tokenizer")

and here is the output,

python train_tokenizer.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Using custom data configuration gabrielaltay--pubtator-central-bigbio-kb-2022-12-18-51c5a8a315ecf808
Found cached dataset parquet (/home/galtay/.cache/huggingface/datasets/gabrielaltay___parquet/gabrielaltay--pubtator-central-bigbio-kb-2022-12-18-51c5a8a315ecf808/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
[02:55:09] Pre-processing sequences                 █████████████████████████████ 0        /        0
[00:00:07] Tokenize words                           █████████████████████████████ 6828518  /  6828518
[00:00:13] Count pairs                              █████████████████████████████ 6828518  /  6828518
[00:00:48] Compute merges                           █████████████████████████████ 32511    /    32511

The train split of the dataset is ~100GB, but the text is duplicated in another column with markup, so I estimate about 50GB in the “text” column. I would expect this to be doable at “training a tokenizer on English Wikipedia” speeds, within a factor of 10 or so (I was thinking minutes, not hours). Can anyone see where I’m making a mistake, or are my time estimates just totally off?

I’m using,

datasets 2.8.0
transformers 4.25.1

and this is the dataset on the Hub: gabrielaltay/pubtator-central-bigbio-kb-2022-12-18

thanks,
-G

UPDATE: attempting to isolate dataset iteration speed with

import datasets
from tqdm import tqdm


def batch_iterator(dataset, batch_size=1_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]


if __name__ == "__main__":

    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    ds_train = datasets.load_dataset(ds_id, split="train")
    # just consume the batches to measure raw iteration speed
    for batch in tqdm(batch_iterator(ds_train)):
        pass

and getting,

700it [02:10,  5.18it/s]

leading me to believe the bottleneck is dataset iteration speed:

(33M samples) / (batch size 1,000) / (6 it/s) ≈ 5,500 s ≈ 90 minutes
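
To check where the bytes actually live, a quick sketch like the one below should work. I didn’t run this as part of the timings above, and it assumes (as I understand the datasets internals) that Dataset.data exposes the underlying memory-mapped Arrow table, so nothing needs to be loaded into RAM:

import datasets

ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
ds_train = datasets.load_dataset(ds_id, split="train")

# Dataset.data is assumed to expose the underlying (memory-mapped) Arrow table;
# nbytes reports buffer sizes per column without materializing the data
table = ds_train.data
for name in table.column_names:
    print(f"{name}: {table.column(name).nbytes / 1e9:.1f} GB")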

Problem Solved! (thanks to @lhoestq)

It turns out the slow iteration speed was because of all the extra columns in the dataset besides the “text” column. Running with just the text column in the dataset gave a ~40x speedup:

old
700it [02:10,  5.18it/s]

new
13435it [00:32, 228.80it/s]

and here is the updated training script,
import datasets
from transformers import AutoTokenizer


def batch_iterator(dataset, batch_size=1_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]


if __name__ == "__main__":

    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    clone_from_name = "gpt2"
    vocab_size = 32_768

    clone_from_tokenizer = AutoTokenizer.from_pretrained(clone_from_name)
    ds_train = datasets.load_dataset(ds_id, split="train")
    # keep only the "text" column so iteration doesn't decode the others
    ds_train = ds_train.remove_columns([
        col for col in ds_train.column_names if col != "text"
    ])

    tokenizer = clone_from_tokenizer.train_new_from_iterator(
        batch_iterator(ds_train),
        vocab_size=vocab_size,
    )

    tokenizer.save_pretrained("pubtator-gpt2-v32k-tokenizer")
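
As a side note that isn’t from the thread itself: on newer datasets releases (newer than the 2.8.0 used above), the same column filtering can be written with Dataset.select_columns, which should be equivalent to the remove_columns list comprehension:

# assumes a datasets version that provides Dataset.select_columns
ds_train = ds_train.select_columns(["text"])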

I also have an issue with slow tokenizer training speed, even on smaller datasets. Upon investigation, it became clear that the tokenizer only utilizes 1 CPU core, and batching vs. not batching doesn’t affect its speed. What do you think is the solution to this problem?
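
For what it’s worth, one avenue that might be worth trying (an assumption on my part, not something confirmed in this thread): when training goes through train_new_from_iterator, the data is fed one Python batch at a time, so the feeding step runs on a single core regardless of what the Rust backend does. Training directly with the tokenizers library on plain-text files keeps the reading and pre-tokenization on the Rust side. A minimal sketch, assuming a byte-level BPE similar to GPT-2, with placeholder file names and special token:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# byte-level BPE, similar in spirit to the GPT-2 tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_768,
    special_tokens=["<|endoftext|>"],  # placeholder special token
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# training from files keeps data loading in Rust rather than in a Python loop;
# the file names are placeholders for a local dump of the "text" column
tokenizer.train(["corpus_00.txt", "corpus_01.txt"], trainer=trainer)
tokenizer.save("bpe-32k-tokenizer.json")

Whether this uses more than one core end-to-end is something I haven’t verified, but it at least removes the Python iterator from the hot path.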