I have a dataset for which I wanted to use a tokenizer based on whitespace rather than any subword segmentation approach.
This snippet, which I found on GitHub, constructs and trains a custom tokenizer that splits on whitespace:
```python
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit

# We build our custom tokenizer:
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
```
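For context, this is roughly how I sanity-check the trained tokenizer (a minimal example; the sample sentence is just illustrative):

```python
# Encode a sample sentence; tokens should come out lowercased and
# split on whitespace by the pre-tokenizer above.
output = tokenizer.encode("The quick brown fox")
print(output.tokens)
print(output.ids)
```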
I want to use it for pre-training the BigBird attention model, but I am facing two issues:
- I can't seem to use the following snippet with the custom tokenizer above to convert tokenized sentences into model-friendly sequences:
```python
from tokenizers.processors import BertProcessing

# Wrap each sequence in <s> ... </s> and cap the length:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)
```
This returns an error, and without any post-processing the output does not contain the expected sequence start and end tokens (`<s>`, `</s>`).
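My current guess (unverified) is that `token_to_id` returns `None` for these tokens, because the trainer above only registers `[UNK]` as a special token. This is the check I would run to confirm it:

```python
# If <s> / </s> were never added during training, both of these print None,
# and passing None ids into BertProcessing would explain the error.
print(tokenizer.token_to_id("<s>"))
print(tokenizer.token_to_id("</s>"))
```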
- The next problem arises when I save the tokenizer state to the specified folder and then try to load it back (save sketch below).
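Roughly how I saved it (a sketch; I may be missing the canonical way to persist it for transformers):

```python
import os

os.makedirs("./tok", exist_ok=True)
# Save the trained BPE model files (vocab.json and merges.txt):
tokenizer.model.save("./tok")
# Also save the full tokenizer state as a single JSON file:
tokenizer.save("./tok/tokenizer.json")
```

I am then unable to load it via: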
```python
from transformers import BigBirdTokenizerFast

tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)
```
This yields the error that my directory does not 'reference' the tokenizer files, which shouldn't be an issue, since loading the same folder with RobertaTokenizerFast does work. I assume it has something to do with which tokenizer files BigBirdTokenizerFast expects to find.
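For what it's worth, I would expect the generic fast-tokenizer wrapper to serve as a fallback (a sketch, assuming the tokenizer.json saved above; the special-token arguments are my assumption), though I would prefer to use BigBirdTokenizerFast directly:

```python
from transformers import PreTrainedTokenizerFast

# Hypothetical fallback: wrap the raw tokenizers JSON file directly.
tok = PreTrainedTokenizerFast(
    tokenizer_file="./tok/tokenizer.json",
    model_max_length=16000,
    unk_token="[UNK]",
    bos_token="<s>",
    eos_token="</s>",
)
```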
If it helps, I can put together a reproducible Colab notebook to speed things up.
Thanks in advance,