Do I need to perform stop word removal before feeding text into the Hugging Face 'pipeline' or 'AutoModel' classes?

I’m new to transformer models. In traditional NLP workflows (e.g., for text classification), we often apply preprocessing steps like stemming, lemmatization, and stop word removal. Does the Hugging Face pipeline (or the AutoModel classes) include these steps by default—especially stop word removal? If not, is it recommended to handle such preprocessing on our own before providing input to the pipeline? Thanks!


In the Hugging Face pipeline (and the AutoModel classes), preprocessing steps like stemming, lemmatization, and stop word removal are not applied by default. Transformer models are designed to work with raw text and to capture context directly from it, so most models expect input without heavy preprocessing.
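For instance, here is a minimal sketch of calling a pipeline on completely raw text (the task name below loads the library's default English checkpoint; no cleaning step is assumed):

```python
from transformers import pipeline

# The pipeline only applies the model's own tokenizer; no stemming,
# lemmatization, or stop word removal happens behind the scenes.
classifier = pipeline("sentiment-analysis")  # loads a default English model

print(classifier("The plot was slow, but the acting was not bad at all."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}] - exact label/score depends on the checkpoint
```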

For stop word removal specifically, it's generally not recommended to strip stop words before feeding text to transformer models: these models learn to use stop words in context, and keeping them often improves performance. The attention mechanism lets the model weigh each token relative to the others, so stop words such as "not" or "no" still contribute to the meaning of the sentence.
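As a concrete illustration of why dropping stop words can backfire, "not" appears in many stop word lists, yet it flips the meaning of a sentence. A quick sketch (the labels depend on the default sentiment checkpoint, so treat the output comments as indicative):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

original = "The movie was not good."
# The same sentence after naive stop word removal ("the", "was", "not" dropped):
stripped = "movie good"

print(classifier(original))  # expected: NEGATIVE
print(classifier(stripped))  # likely flips to POSITIVE once "not" is gone
```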

However, if you have a specific reason to remove stop words or apply other preprocessing, you can handle that yourself before passing the input to the pipeline; it's usually not necessary for most tasks with transformer models.
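If you do decide to strip stop words for your own reasons, that step stays entirely on your side of the call. A hypothetical sketch using NLTK's English stop word list (assumes `nltk` is installed; the `remove_stop_words` helper is just an illustration):

```python
import nltk
from nltk.corpus import stopwords
from transformers import pipeline

nltk.download("stopwords", quiet=True)  # one-time download of the stop word corpus
stop_words = set(stopwords.words("english"))

def remove_stop_words(text: str) -> str:
    """Naive whitespace-based stop word removal; purely optional."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

classifier = pipeline("sentiment-analysis")
cleaned = remove_stop_words("The service at the hotel was surprisingly friendly.")
print(classifier(cleaned))
```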


The Hugging Face pipeline generally does not require manual preprocessing such as stemming, lemmatization, or stop word removal. Depending on the task and model, the pipeline automatically applies the preprocessing the model actually needs, namely the model's own tokenization (with truncation and padding as required).
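Concretely, you can drive that same preprocessing yourself with the Auto classes; tokenization, padding, and truncation are the only steps applied. A rough equivalent sketch (the checkpoint name is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenization, padding, and truncation are the only preprocessing steps applied.
inputs = tokenizer("I really did not enjoy the ending.",
                   return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])  # e.g. "NEGATIVE"
```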

