IMDb score prediction

Hi everyone!
I have a corpus of dialogues from 600 movie scripts (about 33 MB) along with their IMDb scores and genres. I want to train a model that can predict a movie's score from its dialogue and genre.
Should I use a pre-trained model and fine-tune it on my data, or should I train one from scratch? Which option is more hardware demanding?

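It’s advised to start from a pre-trained model and fine-tune it on your custom dataset: 600 examples is far too few to train from scratch, but it is a typical size for fine-tuning. For reference, pre-training is done on terabytes of data, on clusters of GPUs, so fine-tuning is also by far the less hardware-demanding option.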

If you want to predict the score, you can use the AutoModelForSequenceClassification class and pass num_labels=1 and problem_type="regression" (as this is a regression problem with a single output value):

from transformers import AutoModelForSequenceClassification

# num_labels=1 gives one output (the predicted score); problem_type="regression" selects the MSE loss
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=1, problem_type="regression")

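This makes sure the mean-squared error (MSE) loss is used. Next, you can fine-tune it as shown in this tutorial, with some changes to adapt it from classification to regression: for example, the labels must be floats rather than class indices, and a classification metric such as accuracy should be replaced by a regression metric such as MSE.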
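To make the regression side of this concrete, here is a minimal sketch in plain Python of two pieces you would likely need: turning a (genre, dialogue, score) record into a text input with a float label, and computing MSE over predictions. The helper names build_example and mean_squared_error, and the "Genre: ..." prefix format, are assumptions made for illustration; they are not part of the transformers API or of the tutorial.

```python
def build_example(genre, dialogue, imdb_score):
    """Combine genre and dialogue into one input string (hypothetical format).

    The label is cast to float because a regression head with MSE loss
    expects real-valued targets, not integer class indices.
    """
    return {"text": f"Genre: {genre}\n{dialogue}", "label": float(imdb_score)}


def mean_squared_error(predictions, labels):
    """Average squared difference between predicted and true scores."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(squared_errors) / len(squared_errors)


# Toy record and toy predictions, made up purely for illustration
example = build_example("Drama", "A: Hello.\nB: Hi.", 8)
mse = mean_squared_error([7.0, 6.0], [8.0, 6.0])
print(example["label"], mse)
```

In a real run the "text" field would go through the tokenizer and the metric would be computed on the model's outputs; this only illustrates the shape of the data and the quantity being minimized.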
