Building a Custom Tokenizer with a Hugging Face Dataset: Batch Iterator Best Practices

Dear Hugging Face Community,

I am currently working on building my own tokenizer using a Hugging Face dataset. The training of the tokenizer is initiated as follows:

tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)
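
For context, tokenizer here is an existing fast tokenizer whose pipeline gets reused (GPT-2 is just an example on my side), and the call returns the newly trained tokenizer rather than modifying the old one:

from transformers import AutoTokenizer

# Any fast tokenizer works as a starting point; GPT-2 is only an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)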

My main concern is how to build an efficient batch iterator. I am using the OSCAR dataset, which is structured as follows:

Dataset({
    features: ['id', 'text', 'meta'],
    num_rows: 25113265
})

Based on the tutorials, a batch iterator can be constructed like this:

def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

However, I discovered that there's a built-in iterator method available:

batch_iter = ds.iter(batch_size=256)

This leads me to a few questions:

  • How can I modify this built-in iterator to specifically select the 'text' subfield?
  • Given the varying sizes of the text entries, should I apply any specific preprocessing or handling techniques to manage this variability?
  • Also, it seems that ByteLevelBPETokenizer has no attribute train_new_from_iterator. What can I do if I really want to use BPE? Do I really have to write the text out to .txt files and then read them back in with tokenizer.train(files, trainer)? That sounds like a waste of disk space.

Any insights or suggestions on these points would be greatly appreciated.

Thank you in advance for your assistance!

Hi!

How can I modify this built-in iterator to specifically select the 'text' subfield?

def batch_iterator():
    for batch in ds.select_columns("text").iter(batch_size=256):
        yield batch["text"]
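
If you also want the progress bars to show meaningful totals, you can additionally pass the number of rows through the optional length argument (reusing your training call from above):

tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000, length=len(ds))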

Given the varying sizes of the text entries, should I apply any specific preprocessing or handling techniques to manage this variability?

No, this is not needed. The trainer consumes raw strings, so texts of varying lengths are handled internally without any padding or truncation.

Also, it seems that ByteLevelBPETokenizer has no attribute train_new_from_iterator. What can I do if I really want to use BPE? Do I really have to write the text out to .txt files and then read them back in with tokenizer.train(files, trainer)? That sounds like a waste of disk space.

You can use train_from_iterator to train it on an iterator. This method comes from the tokenizers library; in transformers, the equivalent is train_new_from_iterator.
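
For example, directly with the tokenizers classes it would look roughly like this (the vocab size and special tokens below are just placeholders to adapt):

from tokenizers import ByteLevelBPETokenizer

bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=25000,                                             # placeholder value
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],   # adjust to your model
)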


Thanks! The code then goes as follows:

from transformers import PreTrainedTokenizerFast
from tokenizers import ByteLevelBPETokenizer

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=ByteLevelBPETokenizer()
)

Even though it is a bit confusing to jump back and forth between transformers and tokenizers, I would expect all tokenizer-related functionality to live in the tokenizer classes; I guess there is some architectural reasoning behind it. :thinking:
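
As far as I understand it, an alternative would be to train entirely on the tokenizers side first and only wrap the result for transformers at the end (rough sketch, untested; the output directory is a placeholder):

from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast

bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(batch_iterator(), vocab_size=52_000)

# _tokenizer is the underlying tokenizers.Tokenizer object (a private attribute, but handy here)
wrapped = PreTrainedTokenizerFast(tokenizer_object=bpe._tokenizer)
wrapped.save_pretrained("bpe-wrapped")   # placeholder output directory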

Hi, I am back again.

from datasets import Dataset
from transformers import PreTrainedTokenizerFast
from tokenizers import ByteLevelBPETokenizer

#tokenizer = AutoTokenizer.from_pretrained("./gpt")
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=ByteLevelBPETokenizer()
)

ds = Dataset.load_from_disk("../dataset/dutch.hf")
ds = ds.with_format("torch")

def batch_iterator():
    for batch in ds.select_columns("text").iter(batch_size=64):
        yield batch["text"]
        
# train_new_from_iterator returns the newly trained tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(),
                                                  vocab_size=52_000)

new_tokenizer.save_pretrained("bpe-post")

I ran the code multiple times, and each time the kernel died:

[00:37:46] Pre-processing sequences                 ████████████████████ 0        /        0
[00:00:23] Tokenize words                           ████████████████████ 23818136 / 23818136
[00:02:43] Count pairs                              ███░░░░░░░░░░░░░░░░░
[00:10:26] Count pairs                              █████████████░░░░░░░
[00:20:58] Count pairs                              ███████████████████░ 23818100 / 23818136
Killed

Is this a hardware issue or a version issue?

PreTrainedTokenizerFast is a class in transformers that wraps the tokenizers library (the tokenizers are implemented in Rust to make tokenization in transformers faster). Regarding the "Killed" issue, maybe these issues can help: Issue search results · GitHub. Otherwise, I suggest opening a new issue in the tokenizers repo.
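
If it turns out to be an out-of-memory problem rather than a bug, one thing you could try (no guarantee it helps) is training on a random subset of the corpus first; you can also drop the with_format("torch") call, which is not needed just to yield strings:

from datasets import Dataset

ds = Dataset.load_from_disk("../dataset/dutch.hf")

# Train on a sampled subset to lower peak memory; the subset size here is only a guess.
sampled = ds.shuffle(seed=42).select(range(5_000_000))

def batch_iterator():
    for batch in sampled.select_columns("text").iter(batch_size=64):
        yield batch["text"]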