Preprocessing raw text

antoine2323231 · October 17, 2022, 1:51pm

I am preprocessing raw text before fitting a custom tokenizer. If I remove all punctuation, nothing marks the start or end of a sentence. If I keep the “.”, for example, would the tokenizer model treat it as a sentence boundary? Do you have any suggestions?

lianghsun · October 25, 2022, 7:42pm

Hey bro, have you tried TemplateProcessing? You could add the following code:

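from tokenizers.processors import TemplateProcessing

# Wrap every encoded sequence in [BOS] ... [EOS]; $0 is the placeholder for the input sequence.
# Both special tokens must already be in the tokenizer's vocabulary.
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $0 [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)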

Please note that you may need to add [PAD] through tokenizer.enable_padding().
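
One more thing: the template adds [BOS] and [EOS] around each sequence you encode, so they only mark sentence boundaries if you feed the tokenizer one sentence at a time. Here is a rough sketch of a preprocessing step that splits on the end-of-sentence punctuation first and only then strips the rest. The helper name split_and_clean and the splitting rule (".", "!" or "?" followed by whitespace) are my own assumptions, not part of the tokenizers library, and the rule will mis-split abbreviations such as "e.g. this".

```python
import re

# Assumption: a sentence ends with ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Anything that is neither a word character nor whitespace counts as punctuation.
_PUNCT = re.compile(r"[^\w\s]")


def split_and_clean(text):
    """Split raw text into sentences, then strip punctuation from each one."""
    cleaned = []
    for sentence in _SENTENCE_END.split(text.strip()):
        # Replace punctuation with a space, then collapse repeated whitespace.
        sentence = " ".join(_PUNCT.sub(" ", sentence).split())
        if sentence:
            cleaned.append(sentence)
    return cleaned


sentences = split_and_clean("Hello, world! How are you? Fine.")
print(sentences)
```

Each item of the returned list can then be encoded on its own, so the post-processor puts [BOS] and [EOS] around every sentence.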

antoine2323231 · October 26, 2022, 3:21pm

Hey bro,
Thanks a lot! Gonna try this out.
