Do I need to perform stop word removal before feeding text into the Hugging Face 'pipeline' or 'AutoModel' classes?

I’m new to transformer models. In traditional NLP workflows (e.g., for text classification), we often apply preprocessing steps like stemming, lemmatization, and stop word removal. Does the Hugging Face pipeline (or the AutoModel classes) include these steps by default—especially stop word removal? If not, is it recommended to handle such preprocessing on our own before providing input to the pipeline? Thanks!


In the Hugging Face pipeline (and the AutoModel classes), preprocessing steps like stemming, lemmatization, and stop word removal are not applied by default. Transformer models are designed to work with raw text and to capture context directly from it, so most models expect input without heavy preprocessing.
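For instance, here is a minimal sketch of calling a pipeline on completely raw text (the task name below loads the library's default English checkpoint; no cleaning step is assumed):

```python
from transformers import pipeline

# The pipeline only applies the model's own tokenizer; no stemming,
# lemmatization, or stop word removal happens behind the scenes.
classifier = pipeline("sentiment-analysis")  # loads a default English model

print(classifier("The plot was slow, but the acting was not bad at all."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}] - exact label/score depends on the checkpoint
```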

For stop word removal specifically, it's generally not recommended to strip stop words before feeding text to transformer models: these models learn to use stop words in context, and keeping them often improves performance. The attention mechanism lets the model weigh each token relative to the others, so stop words such as "not" or "no" still contribute to the meaning of the sentence.
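As a concrete illustration of why dropping stop words can backfire, "not" appears in many stop word lists, yet it flips the meaning of a sentence. A quick sketch (the labels depend on the default sentiment checkpoint, so treat the output comments as indicative):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

original = "The movie was not good."
# The same sentence after naive stop word removal ("the", "was", "not" dropped):
stripped = "movie good"

print(classifier(original))  # expected: NEGATIVE
print(classifier(stripped))  # likely flips to POSITIVE once "not" is gone
```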

However, if you have a specific reason to remove stop words or apply other preprocessing, you can handle that yourself before passing the input to the pipeline; it's usually not necessary for most tasks with transformer models.
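If you do decide to strip stop words for your own reasons, that step stays entirely on your side of the call. A hypothetical sketch using NLTK's English stop word list (assumes `nltk` is installed; the `remove_stop_words` helper is just an illustration):

```python
import nltk
from nltk.corpus import stopwords
from transformers import pipeline

nltk.download("stopwords", quiet=True)  # one-time download of the stop word corpus
stop_words = set(stopwords.words("english"))

def remove_stop_words(text: str) -> str:
    """Naive whitespace-based stop word removal; purely optional."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

classifier = pipeline("sentiment-analysis")
cleaned = remove_stop_words("The service at the hotel was surprisingly friendly.")
print(classifier(cleaned))
```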


The Hugging Face pipeline generally does not require manual preprocessing such as stemming, lemmatization, or stop word removal. Depending on the task and model, the pipeline automatically applies the preprocessing the model actually needs, namely the model's own tokenization (with truncation and padding as required).
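Concretely, you can drive that same preprocessing yourself with the Auto classes; tokenization, padding, and truncation are the only steps applied. A rough equivalent sketch (the checkpoint name is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenization, padding, and truncation are the only preprocessing steps applied.
inputs = tokenizer("I really did not enjoy the ending.",
                   return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])  # e.g. "NEGATIVE"
```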

