I am training a Hugging Face tokenizer on my own corpus, and I want to save it together with a preprocessing step. That is, when I pass text to the tokenizer, it should apply the preprocessing and then tokenize, instead of me preprocessing explicitly beforehand. A good example is BERTweet (VinAIResearch/BERTweet, EMNLP 2020) and its tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
(here normalization=True indicates that the input will be preprocessed by some function). I want the same behavior when I train a tokenizer with a custom preprocessing function. My code is:
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
def preprocess(text):
    return text
paths = [str(x) for x in Path('data').glob('*.txt')]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=50_000, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])
tokenizer.save_model('CustomBertTokenizer')
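One thing I have found so far: if the preprocessing can be expressed with the library's built-in normalizers, ByteLevelBPETokenizer appears to accept lowercase and unicode_normalizer options at construction time, which make the normalization part of the tokenizer's own pipeline. A minimal sketch of that (training from a tiny in-memory corpus instead of my data files, so it is self-contained):

```python
from tokenizers import ByteLevelBPETokenizer

# Built-in normalization options as a stand-in for a custom preprocess():
# lowercasing plus NFKC unicode normalization run before tokenization.
tokenizer = ByteLevelBPETokenizer(lowercase=True, unicode_normalizer="nfkc")

# Tiny in-memory corpus so the example runs without data files.
corpus = ["Hey there!", "hey hey hey", "Tokenizers are fun."]
tokenizer.train_from_iterator(corpus, vocab_size=1000, min_frequency=1,
                              special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Normalization now runs inside encode(): "HEY" and "hey" tokenize identically.
print(tokenizer.encode("HEY").tokens == tokenizer.encode("hey").tokens)
```

But this only covers normalizers the library ships with, not an arbitrary Python preprocess function.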
Now, when I load the tokenizer:
from transformers import RobertaTokenizerFast
sentence = 'Hey'
tokenizer = RobertaTokenizerFast.from_pretrained('CustomBertTokenizer')
tokenizer(sentence)
I want sentence to be preprocessed with the preprocess function and then tokenized. In other words, I want to pass something like an argument preprocessing=True, or have the preprocessing baked into the saved tokenizer. How can I do it?
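I also noticed that save_model only writes vocab.json and merges.txt, so I suspect any normalizer attached to the tokenizer would be lost on reload that way. Saving the full tokenizer.json instead and loading it with PreTrainedTokenizerFast seems to keep the normalization step inside __call__; a sketch under that assumption (again with built-in lowercasing standing in for my preprocess function):

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast

# Train with built-in lowercasing as a stand-in for a custom preprocess().
tok = ByteLevelBPETokenizer(lowercase=True)
tok.train_from_iterator(["Hey hey", "hello world"], vocab_size=500, min_frequency=1,
                        special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# save() writes the whole pipeline (normalizer included) as one JSON file,
# unlike save_model(), which only writes vocab.json and merges.txt.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tok.save(path)

# PreTrainedTokenizerFast picks the normalizer up again, so calling the
# tokenizer preprocesses and tokenizes in one step.
fast = PreTrainedTokenizerFast(tokenizer_file=path)
print(fast("HEY")["input_ids"] == fast("hey")["input_ids"])
```

This still does not answer my actual question, though: how do I plug in an arbitrary preprocess function rather than the built-in normalizers?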