Speed issues using tokenizer.train_new_from_iterator on ~50GB dataset

Hello, I wasn’t sure whether to post this under transformers, datasets, or tokenizers, but I wanted to share some benchmark times for training a GPT-style tokenizer on a tens-of-GB text dataset, because they seem slower than I expected (my expectations could be totally off). The pre-processing sequences step alone took ~3 hours on a modern 12-core AMD CPU.

Here is the script I used

import datasets
from transformers import AutoTokenizer

def batch_iterator(dataset, batch_size=1_000):
    # yield lists of raw text, batch_size documents at a time
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

if __name__ == "__main__":

    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    clone_from_name = "gpt2"
    vocab_size = 32_768

    # reuse GPT-2's normalization and pre-tokenization, train new merges/vocab
    clone_from_tokenizer = AutoTokenizer.from_pretrained(clone_from_name)
    ds_train = datasets.load_dataset(ds_id, split="train")

    tokenizer = clone_from_tokenizer.train_new_from_iterator(
        batch_iterator(ds_train),
        vocab_size=vocab_size,
    )

    tokenizer.save_pretrained("pubtator-gpt2-v32k-tokenizer")

and here is the output,

python train_tokenizer.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Using custom data configuration gabrielaltay--pubtator-central-bigbio-kb-2022-12-18-51c5a8a315ecf808
Found cached dataset parquet (/home/galtay/.cache/huggingface/datasets/gabrielaltay___parquet/gabrielaltay--pubtator-central-bigbio-kb-2022-12-18-51c5a8a315ecf808/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
[02:55:09] Pre-processing sequences                 █████████████████████████████ 0        /        0
[00:00:07] Tokenize words                           █████████████████████████████ 6828518  /  6828518
[00:00:13] Count pairs                              █████████████████████████████ 6828518  /  6828518
[00:00:48] Compute merges                           █████████████████████████████ 32511    /    32511

The train split of the dataset is ~100GB, but the text is duplicated in another column with markup, so I estimate about 50GB in the “text” column. I think this should be doable at “training a tokenizer on English Wikipedia” speeds, within a factor of 10 or so (I was expecting minutes, not hours). Can anyone see where I’m making a mistake, or are my time estimates just totally off?
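
For reference, a rough way to sanity-check that ~50GB estimate is to sample some rows and extrapolate. A quick sketch (the sample size and the use of the first rows are arbitrary choices on my part):

import datasets

ds = datasets.load_dataset(
    "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18", split="train"
)
n_sample = 10_000
# the first rows may not be representative; shuffle first for a better estimate
sample = ds.select(range(n_sample))
bytes_sampled = sum(len(t.encode("utf-8")) for t in sample["text"])
est_gb = bytes_sampled / n_sample * len(ds) / 1e9
print(f"estimated size of the text column: ~{est_gb:.0f} GB")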

I’m using,

datasets 2.8.0
transformers 4.25.1

and this is the dataset on the Hub: gabrielaltay/pubtator-central-bigbio-kb-2022-12-18

thanks,
-G

UPDATE: attempting to isolate dataset iteration speed with

import datasets
from tqdm import tqdm

def batch_iterator(dataset, batch_size=1_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

if __name__ == "__main__":

    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    ds_train = datasets.load_dataset(ds_id, split="train")
    # consume the iterator and do nothing, to time iteration alone
    for batch in tqdm(batch_iterator(ds_train)):
        x = 1

and getting,

700it [02:10,  5.18it/s]

leading me to believe the bottleneck is dataset iteration speed:
(33M samples) / (batch size of 1,000) / (~6 it/s) ≈ 5,500 s ≈ 90 minutes
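
The same estimate in code, purely as a sanity check on the arithmetic:

n_samples = 33_000_000      # ~33M rows in the train split
batch_size = 1_000
batches_per_sec = 6         # observed iteration rate, rounded up from 5.18 it/s
seconds = n_samples / batch_size / batches_per_sec
print(seconds, seconds / 60)  # ~5500 s, ~90 minutes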

Problem Solved! (thanks to @lhoestq)

Turns out the slow iteration speed was because of all the extra columns in the dataset besides the “text” column. Running with just the text column in the dataset gave a ~40x speedup:

old:
700it [02:10,  5.18it/s]

new:
13435it [00:32, 228.80it/s]

and here is the updated script,

import datasets                                                                                      
from transformers import AutoTokenizer                                                               
                                                                                                     
def batch_iterator(dataset, batch_size=1_000):                                                       
    for batch in dataset.iter(batch_size=batch_size):                                                
        yield batch["text"]                                                                          
                                                                                                     
if __name__ == "__main__":                                                                           
                                                                                                     
    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"                                     
    clone_from_name = "gpt2"                                                                         
    vocab_size = 32_768                                                                              
                                                                                                     
    clone_from_tokenizer = AutoTokenizer.from_pretrained(clone_from_name)                            
    ds_train = datasets.load_dataset(ds_id, split="train")                                           
    # keep only the "text" column so iteration doesn't decode the other, larger columns
    ds_train = ds_train.remove_columns([
        col for col in ds_train.column_names if col != "text"
    ])
                                                                                                     
    tokenizer = clone_from_tokenizer.train_new_from_iterator(                                        
        batch_iterator(ds_train),                                                                    
        vocab_size=vocab_size,                                                                       
    )                                                                                                
                                                                                                     
    tokenizer.save_pretrained("pubtator-gpt2-v32k-tokenizer") 
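
As an aside, newer versions of datasets also have a select_columns method that does the same thing in one call; a minimal equivalent, assuming your datasets version includes it:

# equivalent to the remove_columns list comprehension above
ds_train = ds_train.select_columns(["text"])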

I also see slow tokenizer training speed on smaller datasets. Upon investigation, it became clear that the tokenizer only utilizes one CPU core, and batching or not batching doesn’t affect its speed. What do you think the solution to this problem is?
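
For what it’s worth, one thing to rule out is that parallelism got disabled via the environment: tokenizers respects the TOKENIZERS_PARALLELISM variable and also switches parallelism off by itself after detecting a fork. A quick check, just as a sketch:

import os

# tokenizers turns off its Rust-side parallelism when this is "false",
# which it also does on its own after a fork (e.g. with DataLoader workers)
print(os.environ.get("TOKENIZERS_PARALLELISM"))
os.environ["TOKENIZERS_PARALLELISM"] = "true"  # set before any parallel tokenizer work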


I agree. The training doesn’t seem to be using all cores, and it’s still bottlenecked by the rate at which data can be read from the iterator.

I wonder if there is any way to improve that.
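
One simple thing to try (unverified, just a guess) is a larger batch_size, so each call to Dataset.iter does more work per Python round trip:

def batch_iterator(dataset, batch_size=10_000):
    # larger batches mean fewer Python-level iterations per GB of text
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]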

Hi @gabrielaltay, I am facing the same issue… I am currently training a BPE tokenizer for the Panjabi language on a 50 GB text corpus. However, I am hitting an “Out of Memory” (OOM) error even on a 1TB RAM instance. Can you help me understand the reason behind this and point me to any references or suggestions for training this tokenizer more efficiently?

from datasets import load_from_disk, load_dataset
from transformers import AutoTokenizer

ds = load_dataset('kdcyberdude/Vichaar', num_proc=8, cache_dir='./gemma_data_cache')['train']
print(ds)
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat-3.5-0106-gemma")

def batch_iterator(batch_size=1000):
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]["text"]

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32_000, length=len(ds))
new_tokenizer.save_pretrained("./gemma-32k-pa-tokenizer")
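
Not a full answer, but the column trick from earlier in this thread might help here too: keep only the “text” column and iterate with Dataset.iter instead of slicing by index. A sketch, assuming the Vichaar dataset has columns other than “text” (if it doesn’t, the remove_columns call is a harmless no-op):

from datasets import load_dataset

ds = load_dataset('kdcyberdude/Vichaar', num_proc=8, cache_dir='./gemma_data_cache')['train']
# keep only the text column so each batch doesn't decode anything else
ds = ds.remove_columns([col for col in ds.column_names if col != "text"])

def batch_iterator(batch_size=1_000):
    # Dataset.iter streams record batches instead of building slices by hand
    for batch in ds.iter(batch_size=batch_size):
        yield batch["text"]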

I have also tried this using a DataLoader; the “Pre-processing sequences” step keeps iterating even after len(ds), and memory keeps increasing. The count reaches 7*len(ds) before it hits OOM, and I am not sure when it would stop. This looks the same as this issue and this issue.

import torch

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, ds, batch_size):
        self.batch_size = batch_size
        self.ds = ds

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        batch = self.ds[idx:idx + self.batch_size]['text']
        return batch

dataset = TextDataset(ds, batch_size=1024)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=None)

new_tokenizer = tokenizer.train_new_from_iterator(dataloader, vocab_size=32_000, length=len(ds))
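
One thing that stands out in the snippet above: __len__ returns len(ds) while every __getitem__ returns a window of batch_size rows starting at idx, so consecutive indices yield heavily overlapping batches and the trainer ends up seeing far more than len(ds) documents, which would match the ever-growing “Pre-processing sequences” count. A non-overlapping variant, just as a sketch:

import torch

class TextBatchDataset(torch.utils.data.Dataset):
    def __init__(self, ds, batch_size):
        self.ds = ds
        self.batch_size = batch_size

    def __len__(self):
        # number of batches, not number of rows
        return (len(self.ds) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.ds[start : start + self.batch_size]["text"]

dataset = TextBatchDataset(ds, batch_size=1024)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=None)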

I also tried debugging the code to understand which part is consuming this much RAM, but I am not able to step into the train_from_iterator call in tokenization_utils_fast.py. I suspect it is calling compiled code that runs in Rust.

Any help or pointers would be greatly appreciated!


That is indeed weird; I’ll investigate, as it should be using threads.

Fast encode by ArthurZucker · Pull Request #1560 · huggingface/tokenizers should help! There are issues with parallelization.

Hi, I encountered the same problem as @kdcyberdude did. I used a host with 1.5TB of memory and trained a 64k-vocab tokenizer on a 25GB text corpus using the HF tokenizer. It ran slower and slower and broke down during the merge step.
Could anyone tell me how to avoid this? :sob: