When fitting a custom tokenizer, I preprocess the raw text and remove punctuation, but then nothing marks the start or end of a sentence. If I kept the "." instead, for example, would the tokenizer model understand it as a sentence boundary? Do you have any suggestions?
Hey bro, have you tried TemplateProcessing? You can set it as the tokenizer's post-processor with the following code:
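from tokenizers.processors import TemplateProcessing

# Wrap every encoded sequence in [BOS] ... [EOS]; both tokens must already be in the vocabulary.
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $0 [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)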
Please note that you may also need a [PAD] token, which you can set up through tokenizer.enable_padding().
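In case it helps, here is a minimal end-to-end sketch of how the pieces above could fit together. The toy corpus, the choice of a WordLevel model, and the [UNK] token are assumptions made only for illustration; your own tokenizer and training data will differ. The key point is that [BOS], [EOS], and [PAD] are passed to the trainer as special tokens, so that token_to_id can find them afterwards.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordLevelTrainer

# Hypothetical toy corpus, already stripped of punctuation.
corpus = ["the cat sat on the mat", "the dog barked"]

# Assumption: a simple word-level model; any other model type is set up the same way.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Register the special tokens at training time so they end up in the vocabulary.
trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Mark the start and end of every encoded sequence.
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $0 [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)

# Pad shorter sequences in a batch up to the longest one.
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

single = tokenizer.encode("the cat sat")
batch = tokenizer.encode_batch(corpus)
print(single.tokens)
print([enc.tokens for enc in batch])
```

With this setup, each sentence you encode separately gets its own boundary markers, so the boundaries no longer depend on keeping the "." in the text.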
Hey bro,
Thanks a lot! Gonna try this out.