Preprocessing raw text

I am preprocessing raw text before fitting a custom tokenizer. I remove punctuation, but then nothing marks where a sentence starts or ends. If I keep the "." for example, would the tokenizer model understand it as a sentence boundary? Do you have any suggestions?

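Hey bro, have you tried TemplateProcessing? It is a post-processor that wraps every encoded sequence in special tokens, so you can mark the start and end explicitly. You could add the following code: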

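from tokenizers.processors import TemplateProcessing

# [BOS] and [EOS] must already be in the vocabulary (e.g. passed as
# special_tokens to the trainer); otherwise token_to_id() returns None.
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $0 [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)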

Please note that you may need to add [PAD] through tokenizer.enable_padding().
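
On the preprocessing side, one option is to split the raw text into sentences before stripping punctuation, so the boundary information is not lost. Below is a rough sketch using only the standard library. The helper name split_and_clean and the choice of ".", "!" and "?" as sentence endings are my own assumptions, and this naive split will break on abbreviations like "e.g." or on decimal numbers, so treat it as a starting point rather than a finished preprocessor.

```python
import re
import string
from typing import List

# Sentence-ending punctuation used as the split point (an assumption; adjust for your corpus).
_SENTENCE_END = re.compile(r"[.!?]+")
# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def split_and_clean(text: str) -> List[str]:
    """Split raw text into sentences, then drop the punctuation left inside each one."""
    sentences = []
    for chunk in _SENTENCE_END.split(text):
        # Remove remaining punctuation and collapse runs of whitespace.
        cleaned = " ".join(chunk.translate(_STRIP_PUNCT).split())
        if cleaned:  # skip empty pieces, e.g. after the final "."
            sentences.append(cleaned)
    return sentences


sentences = split_and_clean("Hello, world. How are you? Fine!")
print(sentences)
```

Each returned sentence can then be encoded on its own, so the post-processor above puts [BOS] and [EOS] around every sentence instead of relying on a "." token.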

Hey bro,
Thanks a lot! Gonna try this out.