I have a dataset for which I want to use a tokenizer based on whitespace rather than any subword segmentation approach.
This snippet, which I got off GitHub, shows one way to construct and use a custom tokenizer
that splits on whitespace:
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit
# We build our custom tokenizer:
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')
# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
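For reference, after training I encode text roughly like this (the sentence is just a placeholder):
# Quick sanity check on a placeholder sentence:
output = tokenizer.encode("some example text from my dataset")
print(output.tokens)  # lowercased, whitespace-split tokens
print(output.ids)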
I want to use it for pre-training the BigBird attention model, but I am facing two issues:
- I can't seem to use the snippet below with the custom tokenizer above to convert tokenized sentences into model-friendly sequences:
from tokenizers.processors import BertProcessing
tokenizer._tokenizer.post_processor = BertProcessing(
("</s>", tokenizer.token_to_id("</s>")),
("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)
This raises an error, and without any post-processing the output does not contain the sequence start and end tokens (<s>, </s>) that I expect. I have put a small check of the encoded output below the list.
- The next problem arises when I save the tokenizer state to the specified folder: I am unable to use it via
tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)
since it yields an error saying that my directory does not "reference" the tokenizer files. This shouldn't be an issue, since loading the same folder with RobertaTokenizerFast does work, so I assume it has something to do with the tokenization post-processing phase. (How I save the tokenizer is sketched below.)
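For the first issue, this is roughly how I check the encoded output after training (the sentence is just a placeholder):
# The sequence markers I expect around the tokens are missing:
enc = tokenizer.encode("a placeholder sentence")
print(enc.tokens)  # I expect ['<s>', ..., '</s>'] here, but they are not present
# Checking whether the special tokens even exist in the trained vocabulary:
print(tokenizer.token_to_id("<s>"), tokenizer.token_to_id("</s>"))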
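For the second issue, I save the tokenizer state roughly like this before trying to reload it (just a sketch; the exact call I use may differ):
# Serialize the full tokenizer state to a single JSON file in ./tok:
tokenizer.save("./tok/tokenizer.json")
# Alternatively, save only the BPE model files (vocab.json and merges.txt):
# tokenizer.model.save("./tok")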
If it helps, I can create a reproducible Colab notebook to speed up debugging.
Thanks in advance,
N