Should I normalize text or not

Roman · May 28, 2021, 4:33pm

Hello. I have a general question about how transformers work. The text a tokenizer feeds to the model is usually normalized, stripped of punctuation, and so on. But I assume the transformer itself was trained on a raw corpus; I concluded this after seeing single characters in the vocabulary. Hence the question: should I normalize or otherwise preprocess the sentences for which I want to get embeddings? If the model was trained on a raw corpus, how appropriate is the preprocessing described above for the text under study?

BramVanroy · May 28, 2021, 6:41pm

No, you should not preprocess the dataset. Depending on the tokenizer scheme, you may want to split the text into words beforehand (the words will then still be tokenised into subword units). For something like SentencePiece that is not needed (in fact, it is recommended not to pre-tokenise for SentencePiece, because it treats a space as a regular character).
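To make the "words first, then subword units" idea concrete, here is a toy sketch in plain Python. The vocabulary and the `pre_tokenize`, `wordpiece`, and `tokenize` helpers are made up for illustration; this is not the actual Hugging Face implementation, only the greedy longest-match idea that WordPiece-style tokenizers are built around. Note that punctuation stays in the input and gets its own token.

```python
import re

# Hypothetical toy vocabulary; real models ship tens of thousands of entries.
VOCAB = {"do", "not", "token", "##ize", "##s", "!", ",", "[UNK]"}

def pre_tokenize(text):
    # Split into words and single punctuation marks; punctuation is kept, not dropped.
    return re.findall(r"\w+|[^\w\s]", text)

def wordpiece(word, vocab=VOCAB):
    # Greedy longest-match-first split of one word into subword units.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched at this position: the whole word is unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text):
    return [p for w in pre_tokenize(text) for p in wordpiece(w)]

print(tokenize("do not tokenizes!"))
# ['do', 'not', 'token', '##ize', '##s', '!']
```

If the punctuation were stripped beforehand, the model would simply never see the `!` token, even though it was present in the data the model was trained on.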

It is best if the data is “clean” though, in the sense that it should not contain HTML/XML tags or other longer sequences of strange characters. But that should then also be true of inference data.
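As a rough sketch of what this kind of cleaning could look like, the snippet below removes markup but leaves punctuation and casing alone. The `light_clean` helper and its regex are assumptions for illustration (a simple pattern like this will not handle every HTML edge case), not a recommended library API.

```python
import html
import re

# Naive tag matcher; an assumption that is good enough for simple markup only.
TAG_RE = re.compile(r"<[^>]+>")

def light_clean(text):
    # Replace HTML/XML tags with a space so neighbouring words do not merge.
    text = TAG_RE.sub(" ", text)
    # Decode entities such as &#39; or &amp; back into plain characters.
    text = html.unescape(text)
    # Collapse runs of whitespace; punctuation and casing are left untouched.
    return re.sub(r"\s+", " ", text).strip()

print(light_clean("<p>Hello, world!</p><p>It&#39;s fine.</p>"))
# Hello, world! It's fine.
```

Whatever cleaning is used, the same function should be applied to the inference data as well, so that both sides look alike to the model.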

Roman · May 28, 2021, 8:23pm

Should I remove punctuation?

BramVanroy · May 29, 2021, 7:42am

As I said, no.

ImranzamanML · April 26, 2024, 10:31am

Yes, you are right. I tested LLMs for embeddings, and when I cleaned the text first, the similarity scores and the captured context actually got worse.
