Using a whitespace tokenizer for training models

I have a dataset for which I wanted to use a tokenizer based on whitespace rather than any subword segmentation approach.

This snippet, which I got from GitHub, shows a way to construct and use a custom tokenizer that operates on whitespace:

from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit

# We build our custom tokenizer:
tokenizer = Tokenizer(BPE()) 
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
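A note on the snippet above: BPE is itself a subword model, so the pre-tokenizer only keeps merges from crossing spaces, and unseen or rare words can still be split into pieces. If whole-word tokens are the goal, one possible sketch is to swap in the WordLevel model from the same library. This is an assumption about what is wanted, not the snippet's original approach. The in-memory corpus is hypothetical and stands in for /content/dataset.txt, and WhitespaceSplit replaces CharDelimiterSplit(' ') so that tabs and newlines also act as separators.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer

# Hypothetical in-memory corpus standing in for /content/dataset.txt.
corpus = ["Hello world", "hello tokenizers world"]

# WordLevel maps each whitespace-separated word to a single id;
# words not seen during training fall back to the [UNK] token.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = WhitespaceSplit()

trainer = WordLevelTrainer(special_tokens=["[UNK]"], show_progress=False)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("Hello unseen world")
print(encoding.tokens)
```

With a real dataset, tokenizer.train(files=[...], trainer=trainer) should work the same way as in the original snippet.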

I wanted to use it for pre-training the BigBird attention model, but I am facing two issues:

  1. I can’t seem to use the following snippet with the custom tokenizer above to convert tokenized sentences into model-friendly sequences:
from tokenizers.processors import BertProcessing

tokenizer._tokenizer.post_processor = tokenizers.processors.BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)

This raises an error, and without this post-processing step the output does not contain the sequence start and end tokens (<s> and </s>) as expected.
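Two things in the post-processing snippet look like probable causes of the error, though this is a guess without the traceback. First, the trainer only registers "[UNK]", so token_to_id("<s>") and token_to_id("</s>") would return None. Second, a plain Tokenizer object exposes post_processor directly, and the name tokenizers is never imported. A minimal sketch under those assumptions, using a hypothetical in-memory corpus and a WordLevel model to keep the output predictable:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.processors import BertProcessing
from tokenizers.trainers import WordLevelTrainer

# Hypothetical corpus; replace with the real training files.
corpus = ["hello world", "hello tokenizers world"]

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = WhitespaceSplit()

# Register <s> and </s> at training time so they receive vocabulary ids.
trainer = WordLevelTrainer(
    special_tokens=["[UNK]", "<s>", "</s>"], show_progress=False
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Set the post-processor on the Tokenizer itself (no ._tokenizer).
# BertProcessing takes the closing (sep) pair first, then the opening (cls) pair.
tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)

encoding = tokenizer.encode("Hello world")
print(encoding.tokens)
```

The same two changes (extra special tokens in the trainer, post_processor set directly) should carry over to the BPE setup from the original snippet.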

  2. The next problem arises when I save the tokenizer state to a folder; I am unable to load it via:
tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)

This fails with an error saying that my directory does not ‘reference’ the tokenizer files. That shouldn’t be an issue, since using RobertaTokenizerFast does work. I assume it has something to do with the tokenization post-processing phase.
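One plausible reading of the loading error, offered as an assumption since the save call is not shown: BigBirdTokenizerFast looks for its own file names (a SentencePiece model or a tokenizer.json), whereas a vocab.json plus merges.txt pair is what RobertaTokenizerFast expects. A model-agnostic route is to wrap the trained tokenizer in the generic PreTrainedTokenizerFast class from transformers. The sketch below uses a hypothetical corpus, a temporary directory in place of ./tok, and model_max_length as the current name for the older max_len argument.

```python
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Hypothetical corpus and special-token names, chosen for illustration only.
corpus = ["hello world", "hello tokenizers world"]
special_tokens = ["[UNK]", "<pad>", "<s>", "</s>", "[MASK]"]

raw = Tokenizer(WordLevel(unk_token="[UNK]"))
raw.pre_tokenizer = WhitespaceSplit()
raw.train_from_iterator(
    corpus,
    trainer=WordLevelTrainer(special_tokens=special_tokens, show_progress=False),
)

# Wrap the trained tokenizer so transformers knows the special tokens.
fast = PreTrainedTokenizerFast(
    tokenizer_object=raw,
    unk_token="[UNK]",
    pad_token="<pad>",
    bos_token="<s>",
    eos_token="</s>",
    mask_token="[MASK]",
    model_max_length=16000,
)

# save_pretrained writes tokenizer.json plus the config files that
# from_pretrained looks for; a temp dir stands in for "./tok".
save_dir = tempfile.mkdtemp()
fast.save_pretrained(save_dir)

reloaded = PreTrainedTokenizerFast.from_pretrained(save_dir)
ids = reloaded("hello world")["input_ids"]
print(reloaded.convert_ids_to_tokens(ids))
```

The reloaded object can then be passed wherever a fast tokenizer is expected, for example to a data collator for masked language modelling.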

If anyone wants, I can create a reproducible Colab notebook to help get the issue solved faster.

Thanks in advance,
N

I have created a fully reproducible Colab notebook, with commented problems and synthetic data. Please find it here. Thanks!