When fitting a custom tokenizer, I preprocess the raw text first. I remove punctuation, but then nothing marks where a sentence starts or ends. If I keep the “.” instead, for example, would the tokenizer model understand it as a sentence boundary? Do you have any suggestions?
Hey bro, have you tried TemplateProcessing? You could add the following code:
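from tokenizers.processors import TemplateProcessing

# Wrap every encoded sequence in explicit start/end markers.
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $0 [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)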
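If it helps, here is a rough end-to-end sketch of how that could fit together with the Hugging Face `tokenizers` library. The toy corpus, the `WordLevel` model, and the `[UNK]` token are my own assumptions for illustration, not something from your setup, so swap in whatever model and data you actually use. It assumes you encode one sentence at a time, so the markers line up with sentence boundaries.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordLevelTrainer

# Hypothetical toy corpus: one punctuation-free sentence per entry.
corpus = ["the cat sat on the mat", "the dog barked"]

# Assumption: a simple word-level model; BPE or WordPiece would work the same way.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# The markers have to be in the vocabulary, otherwise token_to_id returns None.
trainer = WordLevelTrainer(special_tokens=["[UNK]", "[BOS]", "[EOS]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap every encoded sequence in explicit start/end markers.
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $0 [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)

encoding = tokenizer.encode("the cat sat")
print(encoding.tokens)  # ['[BOS]', 'the', 'cat', 'sat', '[EOS]']
```

With this in place the sentence boundaries come from the markers rather than from the “.”, so stripping punctuation in preprocessing no longer loses that information.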
Please note that you may need to add
Thanks a lot! Gonna try this out.