Building a Custom Tokenizer with a Hugging Face Dataset: Batch Iterator Best Practices

Dear Hugging Face Community,

I am currently working on building my own tokenizer using a Hugging Face dataset. The training of the tokenizer is initiated as follows:

tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)
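
For context, tokenizer here is an existing fast tokenizer whose pipeline gets reused (GPT-2 is just an example on my side), and the call returns the newly trained tokenizer rather than modifying the old one:

from transformers import AutoTokenizer

# Any fast tokenizer works as a starting point; GPT-2 is only an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)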

My main concern is how to build an efficient batch iterator. I am using the OSCAR dataset, which is structured as follows:

Dataset({
    features: ['id', 'text', 'meta'],
    num_rows: 25113265
})

Based on the tutorials, a batch iterator can be constructed like this:

def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

However, I discovered that there's a built-in iterator method available:

batch_iter = ds.iter(batch_size=256)

This leads me to a few questions:

  • How can I modify this built-in iterator to specifically select the 'text' subfield?
  • Given the varying sizes of the text entries, should I apply any specific preprocessing or handling techniques to manage this variability?
  • Also, it seems that ByteLevelBPETokenizer has no attribute train_new_from_iterator. What can I do if I really want to use BPE? Do I really have to write the text out to .txt files and then read them back in with tokenizer.train(files, trainer)? That sounds like a waste of disk space.

Any insights or suggestions on these points would be greatly appreciated.

Thank you in advance for your assistance!

Hi!

How can I modify this built-in iterator to specifically select the 'text' subfield?

def batch_iterator():
    for batch in ds.select_columns("text").iter(batch_size=256):
        yield batch["text"]
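
If you also want the progress bars to show meaningful totals, you can additionally pass the number of rows through the optional length argument (reusing your training call from above):

tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000, length=len(ds))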

Given the varying sizes of the text entries, should I apply any specific preprocessing or handling techniques to manage this variability?

No, this is not needed. The trainer consumes raw strings, so texts of varying lengths are handled internally without any padding or truncation.

Also, it seems that ByteLevelBPETokenizer has no attribute train_new_from_iterator. What can I do if I really want to use BPE? Do I really have to write the text out to .txt files and then read them back in with tokenizer.train(files, trainer)? That sounds like a waste of disk space.

You can use train_from_iterator to train it on an iterator. This method comes from the tokenizers library; in transformers, the equivalent is train_new_from_iterator.
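
For example, directly with the tokenizers classes it would look roughly like this (the vocab size and special tokens below are just placeholders to adapt):

from tokenizers import ByteLevelBPETokenizer

bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=25000,                                             # placeholder value
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],   # adjust to your model
)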


Thanks! The code then goes as follows:

from transformers import PreTrainedTokenizerFast
from tokenizers import ByteLevelBPETokenizer

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=ByteLevelBPETokenizer()
)

Even though it is a bit confusing to jump back and forth between transformers and tokenizers, I would expect all tokenizer-related functionality to live in the tokenizer classes; I guess there is some architectural reasoning behind it. :thinking:
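
As far as I understand it, an alternative would be to train entirely on the tokenizers side first and only wrap the result for transformers at the end (rough sketch, untested; the output directory is a placeholder):

from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast

bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(batch_iterator(), vocab_size=52_000)

# _tokenizer is the underlying tokenizers.Tokenizer object (a private attribute, but handy here)
wrapped = PreTrainedTokenizerFast(tokenizer_object=bpe._tokenizer)
wrapped.save_pretrained("bpe-wrapped")   # placeholder output directory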

Hi, I am back again.

from datasets import Dataset
from transformers import PreTrainedTokenizerFast
from tokenizers import ByteLevelBPETokenizer

#tokenizer = AutoTokenizer.from_pretrained("./gpt")
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=ByteLevelBPETokenizer()
)

ds = Dataset.load_from_disk("../dataset/dutch.hf")
ds = ds.with_format("torch")

def batch_iterator():
    for batch in ds.select_columns("text").iter(batch_size=64):
        yield batch["text"]
        
# train_new_from_iterator returns the newly trained tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(),
                                                  vocab_size=52_000)

new_tokenizer.save_pretrained("bpe-post")

I ran the code multiple times, and each time the kernel died:

[00:37:46] Pre-processing sequences                 ████████████████████ 0        /        0
[00:00:23] Tokenize words                           ████████████████████ 23818136 / 23818136
[00:02:43] Count pairs                              ███░░░░░░░░░░░░░░░░░
[00:10:26] Count pairs                              █████████████░░░░░░░
[00:20:58] Count pairs                              ███████████████████░ 23818100 / 23818136
Killed

Is this a hardware issue or a version issue?

PreTrainedTokenizerFast is a class in transformers that wraps the tokenizers library (the tokenizers are implemented in Rust to make tokenization in transformers faster). Regarding the "Killed" issue, maybe these issues can help: Issue search results · GitHub. Otherwise, I suggest opening a new issue in the tokenizers repo.
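
If it turns out to be an out-of-memory problem rather than a bug, one thing you could try (no guarantee it helps) is training on a random subset of the corpus first; you can also drop the with_format("torch") call, which is not needed just to yield strings:

from datasets import Dataset

ds = Dataset.load_from_disk("../dataset/dutch.hf")

# Train on a sampled subset to lower peak memory; the subset size here is only a guess.
sampled = ds.shuffle(seed=42).select(range(5_000_000))

def batch_iterator():
    for batch in sampled.select_columns("text").iter(batch_size=64):
        yield batch["text"]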