Dear Hugging Face Community,
I am currently working on building my own tokenizer using a Hugging Face dataset. Training of the tokenizer is initiated as follows:
tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)
My primary concern revolves around the creation of an efficient batch iterator. I am utilizing the OSCAR dataset, which is structured in this manner:
Dataset({
    features: ['id', 'text', 'meta'],
    num_rows: 25113265
})
Based on the tutorials, a batch iterator can be constructed like this:
def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]
However, I discovered that there's a built-in iterator method available:
batch_iter = ds.iter(batch_size=256)
This leads me to a few questions:
- How can I modify this built-in iterator to specifically select the "text" subfield?
- Given the varying sizes of the text entries, should I apply any specific preprocessing or handling techniques to manage this variability?
- Also, it seems that ByteLevelBPETokenizer has no attribute train_new_from_iterator. What can I do if I really want to use BPE? Do I really have to write the text out to .txt files and then read them back with tokenizer.train(files, trainer)? That sounds like a waste of disk space.
Any insights or suggestions on these points would be greatly appreciated.
Thank you in advance for your assistance!
Hi!
How can I modify this built-in iterator to specifically select the "text" subfield?
def batch_iterator():
    for batch in ds.select_columns("text").iter(batch_size=256):
        yield batch["text"]
Given the varying sizes of the text entries, should I apply any specific preprocessing or handling techniques to manage this variability?
No, this is not needed.
Also, it seems that ByteLevelBPETokenizer has no attribute train_new_from_iterator. What can I do if I really want to use BPE? Do I really have to write the text out to .txt files and then read them back with tokenizer.train(files, trainer)? That sounds like a waste of disk space.
You can use train_from_iterator to train it on an iterator. This method comes from the tokenizers library, and can be accessed via train_new_from_iterator in transformers.
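For example, here is a minimal sketch of training a byte-level BPE tokenizer directly with the tokenizers library, reusing the batch iterator pattern from above (the dataset ds and the output file name are assumptions for illustration):

from tokenizers import ByteLevelBPETokenizer

bpe_tokenizer = ByteLevelBPETokenizer()

def batch_iterator():
    # Stream only the "text" column in batches, as shown above
    for batch in ds.select_columns("text").iter(batch_size=256):
        yield batch["text"]

# train_from_iterator is the tokenizers-level counterpart of transformers'
# train_new_from_iterator, so no intermediate .txt files are needed
bpe_tokenizer.train_from_iterator(batch_iterator(), vocab_size=25_000)
bpe_tokenizer.save("bpe-tokenizer.json")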
Thanks! Here is the code I ended up with:
from transformers import PreTrainedTokenizerFast
from tokenizers import ByteLevelBPETokenizer
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=ByteLevelBPETokenizer()
)
Even though it is a bit confusing to jump back and forth between transformers and tokenizers, I would expect all tokenizer-related work to be handled within the tokenizer classes; I guess there is some architecture behind it.
Hi, I am back again.
from datasets import Dataset
from transformers import PreTrainedTokenizerFast
from tokenizers import ByteLevelBPETokenizer
#tokenizer = AutoTokenizer.from_pretrained("./gpt")
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=ByteLevelBPETokenizer()
)
ds = Dataset.load_from_disk("../dataset/dutch.hf")
ds = ds.with_format("torch")
def batch_iterator():
    for batch in ds.select_columns("text").iter(batch_size=64):
        yield batch["text"]

tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=52_000)
tokenizer.save_pretrained("bpe-post")
I ran the code multiple times, and the kernel died each time:
[00:37:46] Pre-processing sequences ████████████████████ 0 / 0
[00:00:23] Tokenize words ████████████████████ 23818136 / 23818136
[00:02:43] Count pairs ████████████████████
[00:10:26] Count pairs ████████████████████
[00:20:58] Count pairs ████████ 23818100 / 23818136
Killed
Is this a hardware issue or a version issue?
PreTrainedTokenizerFast is a class in transformers that wraps the tokenizers lib (they are implemented in Rust to make tokenization in transformers faster). Regarding the "Killed" issue, maybe these issues can help: Issue search results · GitHub. Otherwise, I suggest opening a new issue in the tokenizers repo.
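To make the wrapping concrete, here is a minimal sketch (assuming a tokenizer trained with the tokenizers library was previously saved to a file named bpe-tokenizer.json):

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load the Rust-backed tokenizer produced by the tokenizers library
raw_tokenizer = Tokenizer.from_file("bpe-tokenizer.json")

# Wrap it so it behaves like any other fast tokenizer in transformers
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=raw_tokenizer)
hf_tokenizer.save_pretrained("bpe-post")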