Speed issues using tokenizer.train_new_from_iterator on ~50GB dataset

Hello, I wasn’t sure whether to post this under transformers, datasets, or tokenizers, but I wanted to share some benchmark times for training a GPT-style tokenizer on a dataset of a few tens of GB, because they seem slower than I expected (my expectations could be totally off). The pre-processing sequences step alone took ~3 hours on a modern 12-core AMD CPU.

Here is the script I used:

import datasets
from transformers import AutoTokenizer

def batch_iterator(dataset, batch_size=1_000):
    # stream the "text" column in batches so the full dataset never sits in Python memory
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

if __name__ == "__main__":

    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    clone_from_name = "gpt2"
    vocab_size = 32_768

    clone_from_tokenizer = AutoTokenizer.from_pretrained(clone_from_name)
    ds_train = datasets.load_dataset(ds_id, split="train")

    tokenizer = clone_from_tokenizer.train_new_from_iterator(
        batch_iterator(ds_train),
        vocab_size=vocab_size,
    )

    tokenizer.save_pretrained("pubtator-gpt2-v32k-tokenizer")

and here is the output:

python train_tokenizer.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Using custom data configuration gabrielaltay--pubtator-central-bigbio-kb-2022-12-18-51c5a8a315ecf808
Found cached dataset parquet (/home/galtay/.cache/huggingface/datasets/gabrielaltay___parquet/gabrielaltay--pubtator-central-bigbio-kb-2022-12-18-51c5a8a315ecf808/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
[02:55:09] Pre-processing sequences                 █████████████████████████████ 0        /        0
[00:00:07] Tokenize words                           █████████████████████████████ 6828518  /  6828518
[00:00:13] Count pairs                              █████████████████████████████ 6828518  /  6828518
[00:00:48] Compute merges                           █████████████████████████████ 32511    /    32511

The train split of the dataset is ~100GB, but the text is duplicated in another column with markup, so I estimate about 50GB in the “text” column. I was expecting something within a factor of 10 or so of “training a tokenizer on English Wikipedia” speeds (i.e. minutes, not hours). Can anyone see where I’m making a mistake, or are my time estimates just totally off?

I’m using:

datasets 2.8.0
transformers 4.25.1

and this is the dataset on the Hub: gabrielaltay/pubtator-central-bigbio-kb-2022-12-18

thanks,
-G

UPDATE: I tried to isolate the dataset iteration speed with this script:

import datasets
from tqdm import tqdm

def batch_iterator(dataset, batch_size=1_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

if __name__ == "__main__":

    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    ds_train = datasets.load_dataset(ds_id, split="train")
    # iterate over batches without doing any work, to measure raw iteration speed
    for batch in tqdm(batch_iterator(ds_train)):
        x = 1

and I’m getting:

700it [02:10,  5.18it/s]

leading me to believe the bottleneck is dataset iteration speed:
(33M samples) / (batch size 1,000) / (~6 it/s) ≈ 5,500 s ≈ 90 minutes

Problem Solved! (thanks to @lhoestq)

Turns out the slow iteration speed was because of all the extra columns in the dataset besides the “text” column. Running with just the text column in the dataset gave a ~40x speedup:

old
700it [02:10,  5.18it/s]

new
13435it [00:32, 228.80it/s]

Here is the updated script:
import datasets
from transformers import AutoTokenizer

def batch_iterator(dataset, batch_size=1_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

if __name__ == "__main__":

    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    clone_from_name = "gpt2"
    vocab_size = 32_768

    clone_from_tokenizer = AutoTokenizer.from_pretrained(clone_from_name)
    ds_train = datasets.load_dataset(ds_id, split="train")
    # keep only the "text" column; the extra columns were what made iteration slow
    ds_train = ds_train.remove_columns([
        col for col in ds_train.column_names if col != "text"
    ])

    tokenizer = clone_from_tokenizer.train_new_from_iterator(
        batch_iterator(ds_train),
        vocab_size=vocab_size,
    )

    tokenizer.save_pretrained("pubtator-gpt2-v32k-tokenizer")
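
For anyone finding this later: I believe more recent versions of datasets also have a select_columns method that does the same thing in one call (I haven’t checked which release introduced it):

# equivalent to the remove_columns list comprehension above, on newer datasets releases
ds_train = ds_train.select_columns(["text"])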

I also have this issue of slow training speed with the tokenizer, on smaller datasets. Upon investigation, it became clear that the tokenizer training only utilizes one CPU core, and batching or not batching doesn’t affect its speed. What do you think the solution to this problem is?


I agree. The training doesn’t seem to be using all cores, and it’s still bottlenecked by the rate at which data can be read from the iterator.

I wonder if there is any way to improve that.
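
One thing that might help (a sketch, not something I have benchmarked) is to take the Python iterator out of the loop entirely: dump the text column to plain-text files and train with the tokenizers library directly, so the file reading happens on the Rust side. The settings below are a generic byte-level BPE, not an exact clone of GPT-2’s configuration, and the file names are just placeholders:

import datasets
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

ds = datasets.load_dataset(
    "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18", split="train"
)
# keep only the text column, as in the fix above, so iteration stays fast
ds = ds.remove_columns([c for c in ds.column_names if c != "text"])

# dump the text to plain-text shards (placeholder file names)
num_shards = 32
paths = []
for i in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=i)
    path = f"pubtator_text_{i:02d}.txt"
    with open(path, "w") as f:
        for batch in shard.iter(batch_size=1_000):
            for text in batch["text"]:
                f.write(text.replace("\n", " ") + "\n")
    paths.append(path)

# a generic byte-level BPE (not an exact clone of GPT-2's settings)
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32_768,
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# file reading and training now happen on the Rust side
tokenizer.train(paths, trainer=trainer)
tokenizer.save("pubtator-bpe-32k.json")

Whether this actually uses more cores may depend on the tokenizers version, but at least the reading and pre-processing are no longer gated on a single Python generator.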

Hi @gabrielaltay, I am facing the same issue… I am currently training a BPE tokenizer for the Panjabi language on a 50 GB text corpus, but I am hitting an out-of-memory (OOM) error even on an instance with 1 TB of RAM. Can you help me understand the reason behind this, and provide any references or suggestions for training this tokenizer more efficiently?

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset('kdcyberdude/Vichaar', num_proc=8, cache_dir='./gemma_data_cache')['train']
print(ds)
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat-3.5-0106-gemma")

def batch_iterator(batch_size=1000):
    # slice the dataset in steps of batch_size and yield the text column
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]["text"]

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32_000, length=len(ds))
new_tokenizer.save_pretrained("./gemma-32k-pa-tokenizer")

I have also tried this using a DataLoader; the pre-processing sequences step keeps iterating past len(ds) and memory keeps increasing. The iteration had reached about 7 × len(ds) when it hit OOM, and I’m not sure when it would stop. This looks the same as this issue and this issue.

import torch

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, ds, batch_size):
        self.batch_size = batch_size
        self.ds = ds

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        # each index returns the batch_size examples starting at idx,
        # so consecutive indices produce overlapping batches
        batch = self.ds[idx:idx + self.batch_size]['text']
        return batch

dataset = TextDataset(ds, batch_size=1024)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=None)

new_tokenizer = tokenizer.train_new_from_iterator(dataloader, vocab_size=32_000, length=len(ds))
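
A variant where __len__ counts batches and __getitem__ steps by batch_size should yield each example only once instead of overlapping slices (a minimal sketch; it may not change the trainer’s own memory use):

class TextBatchDataset(torch.utils.data.Dataset):
    # map-style dataset that yields non-overlapping batches of the text column
    def __init__(self, ds, batch_size):
        self.batch_size = batch_size
        self.ds = ds

    def __len__(self):
        # number of batches, not number of examples
        return (len(self.ds) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.ds[start : start + self.batch_size]["text"]

dataloader = torch.utils.data.DataLoader(TextBatchDataset(ds, batch_size=1024), batch_size=None)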

I also tried debugging the code to understand which part is consuming this much RAM, but I am not able to step into the train_from_iterator function called in tokenization_utils_fast.py. I am speculating that it ends up calling compiled binary code that runs in Rust.

Any help or pointers would be greatly appreciated!

That is indeed weird; I’ll investigate, as it should be using threads.

Fast encode by ArthurZucker · Pull Request #1560 · huggingface/tokenizers should help! There are issues with parallelization.