TensorFlow Getting Started Demo model outputs the same label for all examples

Hi,

I’ve been following the SageMaker demo (available here) for training a binary sentiment classifier on the IMDB dataset.

I ran the notebook as provided, to get a feel for how to use Hugging Face with SageMaker, and the notebook successfully runs and the model trains. However, upon testing the resulting model I found that the output was the same, regardless of the input sentence. For example:

>>> sentiment_input = {"inputs": "This is the best movie I have ever watched. It is amazing!"}
>>> print(predictor.predict(sentiment_input))
[{'label': 'LABEL_0', 'score': 0.9999932050704956}]

>>> sentiment_input = {"inputs": "This is the worst movie I have ever watched. It is terrible!"}
>>> print(predictor.predict(sentiment_input))
[{'label': 'LABEL_0', 'score': 0.9999932050704956}]

I’ve looked through the code, but I can’t find what might cause this behaviour. My only guesses are that the dataset isn’t being loaded properly (e.g. the classifier is only training on one label) or that something is going wrong with the tokenization. However, I haven’t made any changes to the notebook or the training script provided, so this seems unlikely.

Any help would be much appreciated! Thanks in advance.

I believe I’ve found the problem. The dataset is never shuffled in the train.py script provided with the demo, so the model presumably sees the examples grouped by label during training. As a result, it was learning to assign LABEL_0 to any input. After adding a shuffle to the train set and the test set (the latter is not strictly necessary) here (after line 45):

    # Load dataset
    train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])
    train_dataset = train_dataset.shuffle()
    test_dataset = test_dataset.shuffle()

    # Preprocess train dataset

the model trains successfully and returns the following labels as expected:

>>> sentiment_input = {"inputs": "This is the best movie I have ever watched. It is amazing!"}
>>> print(predictor.predict(sentiment_input))
[{'label': 'LABEL_1', 'score': 0.995592474937439}]

>>> sentiment_input = {"inputs": "This is the worst movie I have ever watched. It is terrible!"}
>>> print(predictor.predict(sentiment_input))
[{'label': 'LABEL_0', 'score': 0.9919235110282898}]
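
In case it helps anyone hitting the same symptom, here is a rough sketch of how one might check whether a split is grouped by label before training. It uses a toy list as a stand-in for the real label column, and `count_label_runs` is a hypothetical helper of my own, not something from the demo or from any library:

```python
import random
from itertools import groupby


def count_label_runs(labels):
    """Count contiguous runs of identical labels (hypothetical helper)."""
    return sum(1 for _ in groupby(labels))


# Toy stand-in for an unshuffled binary split: all 0s, then all 1s.
# This is an assumption for illustration, not the actual IMDB data.
labels = [0] * 500 + [1] * 500
sorted_runs = count_label_runs(labels)

# Shuffle a copy with a fixed seed so the check is repeatable.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
shuffled_runs = count_label_runs(shuffled)

# A split grouped by label has about as many runs as it has classes;
# a well-mixed split has far more.
print(sorted_runs, shuffled_runs)
```

With the real dataset, something like `count_label_runs(train_dataset["label"])` should give the same kind of signal, assuming the label column is named `label`. A run count close to the number of classes would suggest the split needs shuffling.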

Hey @NickWilk37,

thanks for finding the error! I already fixed it in the repository.

No problem, thanks so much! I opened an issue on the GitHub repo in case you missed this post, so feel free to close that issue :)