I have a dataset for which I want to use a tokenizer based on whitespace rather than any subword segmentation approach.
This snippet, which I got off GitHub, shows one way to construct and use a custom tokenizer
that splits on whitespace:
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit
# We build our custom tokenizer:
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')
# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
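For reference, after training I encode text roughly like this (the sentence is just a placeholder):
# Quick sanity check on a placeholder sentence:
output = tokenizer.encode("some example text from my dataset")
print(output.tokens)  # lowercased, whitespace-split tokens
print(output.ids)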
I want to use it for pre-training the BigBird attention model, but I am facing two issues:
- I can't seem to use the snippet below with the custom tokenizer above to convert tokenized sentences into model-friendly sequences:
from tokenizers.processors import BertProcessing
tokenizer._tokenizer.post_processor = BertProcessing(
("</s>", tokenizer.token_to_id("</s>")),
("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)
This raises an error, and without any post-processing the output does not contain the sequence start and end tokens (<s>, </s>) that I expect. I have put a small check of the encoded output below the list.
- The next problem arises when I save the tokenizer state to the specified folder: I am unable to use it via
tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)
since it yields an error saying that my directory does not "reference" the tokenizer files. This shouldn't be an issue, since loading the same folder with RobertaTokenizerFast does work, so I assume it has something to do with the tokenization post-processing phase. (How I save the tokenizer is sketched below.)
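For the first issue, this is roughly how I check the encoded output after training (the sentence is just a placeholder):
# The sequence markers I expect around the tokens are missing:
enc = tokenizer.encode("a placeholder sentence")
print(enc.tokens)  # I expect ['<s>', ..., '</s>'] here, but they are not present
# Checking whether the special tokens even exist in the trained vocabulary:
print(tokenizer.token_to_id("<s>"), tokenizer.token_to_id("</s>"))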
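For the second issue, I save the tokenizer state roughly like this before trying to reload it (just a sketch; the exact call I use may differ):
# Serialize the full tokenizer state to a single JSON file in ./tok:
tokenizer.save("./tok/tokenizer.json")
# Alternatively, save only the BPE model files (vocab.json and merges.txt):
# tokenizer.model.save("./tok")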
If it helps, I can create a reproducible Colab notebook to speed up debugging.
Thanks in advance,
N