How can state-of-the-art classifiers be so wrong?

Hello there! Sorry for the provocative question, but consider this simple example:

from transformers import pipeline

classifier = pipeline(task = 'sentiment-analysis')

classifier('this is so good!')
Out[4]: [{'label': 'POSITIVE', 'score': 0.9998476505279541}]

classifier('this is so gooood!')
Out[5]: [{'label': 'NEGATIVE', 'score': 0.9922532439231873}]

How can gooood be treated as negative with a very high confidence score? How can I fix this behavior?


The problem you are encountering has little to do with whether the language model is state-of-the-art. It comes from a mismatch between the language dialect(s) used to train the model and the dialects it sees at inference time.

Most likely you are using a model trained on standard English, such as the text that appears in Wikipedia, whereas your second query uses the slang term ‘gooood’ which is likely outside the vocabulary used during training. I suspect that you would encounter similar issues with ‘goooooood’, ‘soooo gooood’, vulgarities, acronyms (e.g.: lol, WTF) or words whose slang definition may differ from standard English (e.g.: bad, fly).

Dialect can be an important issue when dealing with casual speech, tweets, medical literature, clinical records, patents or other domains that use specialized language variants.
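One way to check this suspicion is to compare the label a classifier gives a baseline sentence against the labels it gives slang variants of the same sentence. The sketch below is only illustrative: `find_label_flips` and `toy_classifier` are hypothetical names, and the toy classifier is a stand-in that recognises only the exact word 'good', so the snippet runs without downloading a model. Any callable with the pipeline's output shape, such as the `classifier` from the question, could be passed instead.

```python
from typing import Callable, Dict, List

# Any callable with the pipeline's output shape: text -> [{"label": ..., "score": ...}]
Classifier = Callable[[str], List[Dict[str, object]]]


def find_label_flips(classifier: Classifier, baseline: str, variants: List[str]) -> List[str]:
    """Return the variants whose predicted label differs from the baseline's label."""
    expected = classifier(baseline)[0]["label"]
    return [v for v in variants if classifier(v)[0]["label"] != expected]


def toy_classifier(text: str) -> List[Dict[str, object]]:
    """Hypothetical stand-in for a real model: it only knows the exact word 'good'."""
    words = text.lower().replace("!", "").split()
    label = "POSITIVE" if "good" in words else "NEGATIVE"
    return [{"label": label, "score": 1.0}]


variants = ["this is so gooood!", "this is soooo gooood!", "this is so good lol"]
flips = find_label_flips(toy_classifier, "this is so good!", variants)
print(flips)  # ['this is so gooood!', 'this is soooo gooood!']
```

A long list of flips on text from your own domain is a hint that the training data and the inference data do not match.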

Possible approaches:

  • Select a pre-trained model trained on the language dialect of interest. For example, cardiffnlp/twitter-roberta-base-sentiment is a variant of RoBERTa trained on tweets. The huggingface model library also includes models trained on Reddit text.
  • Fine-tune an existing pre-trained model through supplementary training using examples of the language dialect of interest. Huggingface includes datasets related to Reddit and tweets that may be of use for this task.
  • Use an existing base language model trained on the dialect of interest (e.g.: Reddit), add a classification head, and train it on a standard sentiment analysis dataset.

thank you so much. This is very useful. Could you please tell me how I could fine-tune this model to make sure the prediction of “goood” is accurate? Is there a simple way to proceed?


Before going to the effort to train a model, my suggestion is that you try out some of the existing sentiment analysis models trained on either tweet, Yelp or Reddit data. Something like cardiffnlp/twitter-roberta-base-sentiment might be good enough to meet your needs. Information on existing models can be found at:

In the left column of the Models page under Datasets you can filter on the datasets used to build the model (e.g.: twitter).


Continued … due to forums embedded link limit per post

You can find more about the sentiment datasets at: Hugging Face sentiment datasets

As for training and fine-tuning, I suggest you start with Sebastian’s tutorials


thanks! yes I saw the tutorial but I hoped you had something else… I am using tensorflow.

Sorry, I switched from TF to PyTorch a couple years ago :slight_smile:

There is also a fine-tuning video using TF by Matt on the same page Fine-tuning a pretrained model

If you’re using a base model whose pre-training includes the dialect of interest, you may be able to use one of the canned scripts to create a sentiment model. I haven’t done this yet, so no additional insights here.


To learn how to fine-tune a model (either in PyTorch or TensorFlow), I would recommend following chapter 3 of the course.


thanks @sgugger, I have read this link. My main issue is that my data is stored in a Pandas dataframe, so I am not sure how to adapt your code (and in particular how to mimic the function). Is there an easy way to convert a pandas dataframe to a tensorflow dataset? Does that make sense?

hey @olaffson you can load your pandas dataframe as a datasets.Dataset as follows:

from datasets import Dataset

df = ... # your dataframe
dset = Dataset.from_pandas(df)

from here it should be straightforward to adapt the code in the course lesson to fine-tune a model on your data :slight_smile:

you can find more information on loading from pandas in the docs
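For the TensorFlow side of the question, the path from a dataframe to a `tf.data.Dataset` could look roughly like the sketch below. It is untested here and rests on several assumptions: the helper names, the column names `text` and `label`, and the `distilbert-base-uncased` checkpoint in the usage comment are all placeholders, and the `to_tf_dataset` call follows the pattern used in the TensorFlow version of the course chapter. The heavy libraries are imported inside the function so that the small label helper can be tried on its own.

```python
def encode_labels(labels):
    """Map string labels to consecutive integer ids (sorted, so the ids are reproducible)."""
    label2id = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [label2id[name] for name in labels], label2id


def dataframe_to_tf_dataset(df, checkpoint, text_col="text", label_col="label", batch_size=8):
    """Sketch: DataFrame with a text column and an integer label column -> tf.data.Dataset."""
    # Imported lazily so encode_labels works without these libraries installed.
    from datasets import Dataset
    from transformers import AutoTokenizer, DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    dset = Dataset.from_pandas(df[[text_col, label_col]], preserve_index=False)
    if label_col != "labels":
        dset = dset.rename_column(label_col, "labels")
    # Tokenize in batches and drop the raw text so only model inputs remain.
    dset = dset.map(
        lambda batch: tokenizer(batch[text_col], truncation=True),
        batched=True,
        remove_columns=[text_col],
    )
    # Pads each batch to its longest sequence and returns TensorFlow tensors.
    collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
    return dset.to_tf_dataset(
        columns=["input_ids", "attention_mask"],
        label_cols=["labels"],
        shuffle=True,
        batch_size=batch_size,
        collate_fn=collator,
    )


ids, label2id = encode_labels(["positive", "negative", "positive"])
print(ids, label2id)  # [1, 0, 1] {'negative': 0, 'positive': 1}

# Hypothetical usage with your own dataframe:
# df["label"], label2id = encode_labels(df["sentiment"].tolist())
# tf_dataset = dataframe_to_tf_dataset(df, "distilbert-base-uncased")
```

The resulting dataset should be usable with `model.fit` in the same way as the datasets built in the course lesson.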


amazing!!! thanks, for some reason I could not find this in the tutorial. This is great


Before finetuning, I’d echo @Williamsdoug’s suggestion and try an existing model. The training data really is a vital factor.

If we check out the model used by default for sentiment analysis:

c1 = pipeline(task="sentiment-analysis")
c1.modelcard  # None :-(
c1.model  # DistilBertForSequenceClassification

We can search on Huggingface’s hub for a model that matches “DistilBertForSequenceClassification”

I only see one result, and there’s no more obvious info there. But a good guess, by default, is that a model was trained on scraped web text, which might not have great performance for the domain you’re looking for.

If we test out this DistilBERT model a bit more:

for i in range(2, 25):
    s = f"this is so g{'o'*i}d!"
    print(s, c1(s))


this is so good! [{'label': 'POSITIVE', 'score': 0.9998476505279541}]
this is so goood! [{'label': 'NEGATIVE', 'score': 0.9995097517967224}]
this is so gooood! [{'label': 'NEGATIVE', 'score': 0.9922532439231873}]
this is so goooood! [{'label': 'NEGATIVE', 'score': 0.9786434173583984}]
this is so gooooood! [{'label': 'NEGATIVE', 'score': 0.9764403700828552}]
this is so goooooood! [{'label': 'NEGATIVE', 'score': 0.9486815333366394}]
this is so gooooooood! [{'label': 'NEGATIVE', 'score': 0.9464263319969177}]
this is so goooooooood! [{'label': 'NEGATIVE', 'score': 0.9400624632835388}]
this is so gooooooooood! [{'label': 'NEGATIVE', 'score': 0.9257988929748535}]
this is so goooooooooood! [{'label': 'NEGATIVE', 'score': 0.9494100213050842}]
this is so gooooooooooood! [{'label': 'NEGATIVE', 'score': 0.931290328502655}]
this is so goooooooooooood! [{'label': 'NEGATIVE', 'score': 0.947573721408844}]
this is so gooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.9173546433448792}]
this is so goooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.935723602771759}]
this is so gooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.8896605968475342}]
this is so goooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.9372522234916687}]
this is so gooooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.8814820647239685}]
this is so goooooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.9387907385826111}]
this is so gooooooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.8939523696899414}]
this is so goooooooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.9408086538314819}]
this is so gooooooooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.9059429168701172}]
this is so goooooooooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.9447947144508362}]
this is so gooooooooooooooooooooooood! [{'label': 'NEGATIVE', 'score': 0.9063981175422668}]


But try loading up a different one, like a sentiment model trained on Twitter (again, as @Williamsdoug suggests!)

c2 = pipeline(
    task="sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment"
)

The outputs for this model aren’t as clear:

c2("hello")  # [{'label': 'LABEL_1', 'score': 0.6034973859786987}]

… so we can follow the example on the model’s page and download a label mapping:

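import csv
import urllib.request

# get label mapping for c2
labels = []
mapping_link = f""  # the link is missing here; copy it from the example on the model's page
with urllib.request.urlopen(mapping_link) as f:
    # the mapping file is tab-separated text: one "<index>\t<label>" row per line
    html = f.read().decode("utf-8").split("\n")
    csvreader = csv.reader(html, delimiter="\t")
labels = [row[1] for row in csvreader if len(row) > 1]
# labels = ['negative', 'neutral', 'positive']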

Then, we can try the same test:

for i in range(2, 25):
    s = f"this is so g{'o'*i}d!"
    res = c2(s)
    # map LABEL_X -> pos./neut./neg.
    res[0]["label"] = labels[int(res[0]["label"][-1])]  
    print(s, res)


this is so good! [{'label': 'positive', 'score': 0.9917314648628235}]
this is so goood! [{'label': 'positive', 'score': 0.9901121854782104}]
this is so gooood! [{'label': 'positive', 'score': 0.989584743976593}]
this is so goooood! [{'label': 'positive', 'score': 0.9901219606399536}]
this is so gooooood! [{'label': 'positive', 'score': 0.9901283383369446}]
this is so goooooood! [{'label': 'positive', 'score': 0.9899706840515137}]
this is so gooooooood! [{'label': 'positive', 'score': 0.9893210530281067}]
this is so goooooooood! [{'label': 'positive', 'score': 0.9891370534896851}]
this is so gooooooooood! [{'label': 'positive', 'score': 0.9892650246620178}]
this is so goooooooooood! [{'label': 'positive', 'score': 0.9897728562355042}]
this is so gooooooooooood! [{'label': 'positive', 'score': 0.9892972111701965}]
this is so goooooooooooood! [{'label': 'positive', 'score': 0.9887627959251404}]
this is so gooooooooooooood! [{'label': 'positive', 'score': 0.9891249537467957}]
this is so goooooooooooooood! [{'label': 'positive', 'score': 0.9889877438545227}]
this is so gooooooooooooooood! [{'label': 'positive', 'score': 0.9879323840141296}]
this is so goooooooooooooooood! [{'label': 'positive', 'score': 0.987424910068512}]
this is so gooooooooooooooooood! [{'label': 'positive', 'score': 0.9873157143592834}]
this is so goooooooooooooooooood! [{'label': 'positive', 'score': 0.9900829195976257}]
this is so gooooooooooooooooooood! [{'label': 'positive', 'score': 0.9894198179244995}]
this is so goooooooooooooooooooood! [{'label': 'positive', 'score': 0.9894325137138367}]
this is so gooooooooooooooooooooood! [{'label': 'positive', 'score': 0.9897305369377136}]
this is so goooooooooooooooooooooood! [{'label': 'positive', 'score': 0.9894993901252747}]
this is so gooooooooooooooooooooooood! [{'label': 'positive', 'score': 0.9883342385292053}]
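The lookup in the loop above reads only the last character of `LABEL_X`, which is fine for three classes but would pick the wrong index for a model with ten or more labels. A slightly more defensive version might look like the sketch below; `rename_labels` is a hypothetical helper name, not part of transformers, and it returns new dicts instead of modifying the pipeline output in place.

```python
import re
from typing import Dict, List, Sequence


def rename_labels(results: List[Dict[str, object]], labels: Sequence[str]) -> List[Dict[str, object]]:
    """Return copies of pipeline results with 'LABEL_<n>' replaced by labels[n]."""
    renamed = []
    for item in results:
        match = re.fullmatch(r"LABEL_(\d+)", str(item["label"]))
        # Labels that are already human-readable are left untouched.
        name = labels[int(match.group(1))] if match else item["label"]
        renamed.append({**item, "label": name})
    return renamed


labels = ["negative", "neutral", "positive"]
print(rename_labels([{"label": "LABEL_1", "score": 0.6034973859786987}], labels))
# [{'label': 'neutral', 'score': 0.6034973859786987}]
```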

very interesting, thanks @mbforbes! I would be curious to see if we can fine-tune the first model by feeding it a couple of “goooood” sentences. :wink:
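If someone wants to try that experiment, the extra training sentences could be generated instead of written by hand. The following is a rough sketch under stated assumptions: `elongate` and `augment` are hypothetical helpers, the labels use 1 for positive and 0 for negative, and whether a handful of such examples is enough to change the model's behaviour is not tested here.

```python
import re
from typing import List, Sequence, Tuple


def elongate(sentence: str, extra: int) -> str:
    """Lengthen the first doubled vowel by `extra` letters, e.g. 'good' -> 'gooood' for extra=2."""
    return re.sub(r"([aeiou])\1", lambda m: m.group(1) * (2 + extra), sentence, count=1)


def augment(examples: List[Tuple[str, int]], extras: Sequence[int] = (1, 2, 3)) -> List[Tuple[str, int]]:
    """Add elongated copies of each (text, label) pair, keeping the original label."""
    out = list(examples)
    for text, label in examples:
        for extra in extras:
            stretched = elongate(text, extra)
            if stretched != text:  # skip sentences with no doubled vowel to stretch
                out.append((stretched, label))
    return out


train = augment([("this is so good!", 1), ("this is bad", 0)])
for text, label in train:
    print(label, text)
```

The augmented pairs could then be loaded into a dataframe and fine-tuned on in the usual way.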