Hello. I have a question about my general understanding of how transformers work. The tokenizer feeds the model text, which is usually normalized, stripped of punctuation, and so on. But I assume the transformer itself is trained on a raw corpus (I came to this conclusion after seeing single characters in the vocabulary). Hence the question: should I normalize and otherwise preprocess the sentences for which I want to get embeddings? If the model was trained on a raw corpus, how correct would the preprocessing described above be for the text under study?
No, you should not preprocess the dataset. Depending on the tokenizer scheme, you may want to split the text into words beforehand (which will then still be tokenized into subword units). For something like SentencePiece that is not needed; in fact, it is recommended not to pretokenize with SentencePiece, because a space is a regular character there.
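To illustrate why you don't need to strip punctuation or rare words: subword tokenizers fall back to smaller and smaller pieces, down to single characters, so any input can be encoded. A toy WordPiece-style sketch (the vocabulary and the greedy longest-match rule here are illustrative, not taken from any particular model):

```python
# Toy vocabulary: a few whole words/suffixes plus every single character,
# in both word-initial and continuation ("##") form. Purely illustrative.
CHARS = "abcdefghijklmnopqrstuvwxyz!,.?"
VOCAB = {"token", "##izer", "##s"} | set(CHARS) | {"##" + c for c in CHARS}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:  # no piece matched; cannot happen with full character coverage
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece("tokenizers"))  # ['token', '##izer', '##s']
print(wordpiece("qzx!"))        # character fallback: ['q', '##z', '##x', '##!']
```

Because every single character (including punctuation) is in the vocabulary, nothing is ever out of vocabulary, which is exactly why you saw those single-character entries.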
It is best if the data is “clean” though, in the sense that it should not contain HTML/XML tags or other long sequences of strange characters. But the same should then also hold for the inference data.
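That kind of cleaning can be as simple as the following regex-based sketch (a real pipeline might prefer a proper HTML parser; the function name here is made up for illustration):

```python
import re
from html import unescape

def strip_markup(text):
    """Remove HTML/XML tags, decode entities, and collapse whitespace.
    A crude regex-based sketch, not a robust HTML sanitizer."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop anything that looks like a tag
    text = unescape(text)                 # &amp; -> &, &lt; -> <, etc.
    return re.sub(r"\s+", " ", text).strip()

print(strip_markup("<p>Hello &amp; welcome!</p>"))  # Hello & welcome!
```

Applying the same function to both training and inference text keeps the two distributions consistent, which is the point being made above.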
Should I remove punctuation?
As I said, no.