I have trained my classifier, now how do I do predictions?

Abe · February 13, 2021, 12:09pm

Hi everybody, and thank you in advance to anyone who can help me out. I am not a total beginner when it comes to the Hugging Face libraries (I have already built a well-functioning sentiment analyzer); however, I have mostly followed tutorials and integrated their content without going into much detail about what each line of code does. Trying to learn more, I have put together a document classifier using a couple of tutorials I’ve found online.

I have built the trainer and the validator and they work just fine. I started with a dataset that assigns 6 different labels to a text, with each text having 0, 1 or more than 1 label. I trained the model and saved it. My problem is: now what? I can’t understand exactly how to do the prediction part. Here is where I am:

def validation():
    model = torch.load(destination_folder+'model.pt')
    model.eval()

    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            preds = model(ids, mask, token_type_ids)
            print(preds.argmax(1) + 1)

This is a snippet of the output of the print command:

tensor([1, 1, 1, 1])
tensor([6, 1, 1, 1])
tensor([1, 1, 1, 1])
tensor([1, 5, 2, 1])

I did this with the validation data by adapting the validation routine, whereas in reality I need to do it for a single line of text. Regardless of how the data is fed to the prediction function, how do I read the prediction data? How do I go from “This is the text of my document to be classified” to “This document is 75% label1, 15% label5, 2% label6”?

Again, thank you in advance for any help!

lewtun · February 13, 2021, 12:57pm

Hi @Abe, if I understand correctly you’d like to go from an input string like “I love this movie!” to a set of predicted labels and their confidence scores (i.e. probabilities).

The simplest way to achieve that would be to wrap your model and tokenizer in a TextClassificationPipeline with return_all_scores=True:

from transformers import TextClassificationPipeline

model = ...
tokenizer = ...
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
# outputs a list of dicts like [[{'label': 'NEGATIVE', 'score': 0.0001223755971295759},  {'label': 'POSITIVE', 'score': 0.9998776316642761}]]
pipe("I love this movie!")

The above also works for multiple inputs by feeding a list of examples instead of a single string:

pipe(["I love this movie!", "I hate this movie!"])
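To get from the nested list the pipeline returns to the kind of percentage summary asked about in the question, something like the following might work. It is a minimal sketch in plain Python: `summarize_scores` is a hypothetical helper (not part of transformers), and the sample data is simply the output shown in the comment of the pipeline example above.

```python
def summarize_scores(scores, min_score=0.0):
    """Format one example's label scores as a readable percentage string.

    `scores` is a list of {'label': str, 'score': float} dicts, i.e. one
    inner list of the pipeline output when return_all_scores=True.
    """
    # Highest score first, so the most likely label leads the sentence.
    ranked = sorted(scores, key=lambda item: item["score"], reverse=True)
    parts = [
        f"{item['score']:.2%} {item['label']}"
        for item in ranked
        if item["score"] >= min_score
    ]
    return "This document is " + ", ".join(parts)


# Sample output copied from the pipeline example above (one input string).
pipe_output = [[
    {"label": "NEGATIVE", "score": 0.0001223755971295759},
    {"label": "POSITIVE", "score": 0.9998776316642761},
]]

# The outer list has one entry per input text.
for example_scores in pipe_output:
    print(summarize_scores(example_scores))
```

Passing a `min_score` such as 0.01 would hide labels the model considers very unlikely.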

If you want to have human-readable labels like “positive” and “negative” you can configure the id2label and label2id attributes of your model’s config class: Change label names on inference API - #3 by lewtun
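As a rough illustration of what those two mappings look like, here is a plain-Python sketch. The label names are hypothetical placeholders for a 6-label classifier; the assignment to the model config is shown only as a comment, since it depends on having a transformers model loaded.

```python
# Hypothetical label names; replace with the real names from your dataset.
label_names = ["label1", "label2", "label3", "label4", "label5", "label6"]

# id2label maps the integer class index to a readable name,
# label2id is the inverse mapping.
id2label = {idx: name for idx, name in enumerate(label_names)}
label2id = {name: idx for idx, name in id2label.items()}

# With a transformers model these would be set on its config, e.g.:
#   model.config.id2label = id2label
#   model.config.label2id = label2id
print(id2label)
print(label2id)
```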

HTH!

Abe · February 13, 2021, 1:52pm

Thank you for the incredibly quick reply!

I now get

torch.nn.modules.module.ModuleAttributeError: 'BERTClass' object has no attribute 'config'

Which makes me think the class used for training is missing something that pipe needs. This is the class I used:

class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased', return_dict=False)
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 6)

    def forward(self, ids, mask, token_type_ids):
        _, output_1 = self.l1(ids, attention_mask = mask, token_type_ids = token_type_ids)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

The class never has a config attribute or method. Am I right in thinking I have to retrain using a different class?

lewtun · February 13, 2021, 2:15pm

Ah yes, if possible I think you’d be better off using the BertForSequenceClassification class together with BertConfig instead of the custom class you created.

For example, you might be able to make this work as follows:

config = ...
model = BertForSequenceClassification.from_pretrained(destination_folder+'model.pt', config=config)

and then pass the model and the tokenizer to the pipeline as before. If that doesn’t work, you’ll probably have to re-train the model or live with the default labels from the pipeline.
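Another route that may avoid retraining is to read the raw outputs of the custom class directly. The sketch below assumes the 6 outputs of BERTClass are logits trained with a per-label sigmoid loss (the thread does not say which loss_fn was used), so each label gets its own independent probability and the values need not sum to 100%. It is plain Python with made-up logits and hypothetical label names; on a real tensor the same step would be torch.sigmoid(preds).

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multilabel_probs(logits, label_names, threshold=0.5):
    """Turn one row of logits into per-label probabilities and a label set."""
    probs = {name: sigmoid(z) for name, z in zip(label_names, logits)}
    predicted = [name for name, p in probs.items() if p >= threshold]
    return probs, predicted


# Hypothetical label names and made-up logits for a single document.
label_names = ["label1", "label2", "label3", "label4", "label5", "label6"]
example_logits = [2.0, -1.5, -3.0, -2.2, 0.3, -4.0]

probs, predicted = multilabel_probs(example_logits, label_names)
for name, p in probs.items():
    print(f"{name}: {p:.1%}")
print("predicted labels:", predicted)
```

Note that argmax, as used in the validation loop, only ever reports the single highest-scoring label, which is why it cannot express "0, 1 or more than 1 label" per text.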

laurb · February 13, 2021, 5:00pm

Using 3.5.1
I am also trying to use the text classification pipeline. I trained my model using trainer and saved it to “path to saved model”. My issue is that when I try to use the pipeline to predict, the call to the tokenizer is not truncating the result to the “model_max_length” set in the configuration of my trained model/tokenizer. I initialize as below:

tokenizer = RobertaTokenizer.from_pretrained("path to saved model")
model = RobertaForSequenceClassification.from_pretrained("path to saved model")
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

Do I need to create my own pipeline subclassing the text classification one in order to force truncation?

Thanks

lewtun · February 13, 2021, 8:26pm

Hi @laurb, I think you can specify the truncation length by passing max_length as part of generate_kwargs (e.g. 50 tokens in my example):

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, generate_kwargs={"max_length":50})

As far as I know the Pipeline class (from which all other pipelines inherit) does not truncate the inputs by default: transformers/base.py at master · huggingface/transformers · GitHub

BramVanroy · February 13, 2021, 10:08pm

This would have been a lot easier if you had simply used BertForSequenceClassification, because then you could have easily used it with save_pretrained inside the trainer. Here is one example, but there are more on GitHub in the examples folder.

Abe · February 14, 2021, 1:09am

I tried this example but I don’t think this works with multiple labels since I get this:

1D target tensor expected, multi-target not supported

from this block of code

model.train()
for _,data in enumerate(training_loader, 0):
    ids = data['ids'].to(device, dtype = torch.long)
    mask = data['mask'].to(device, dtype = torch.long)
    token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
    targets = data['targets'].to(device, dtype = torch.float)

    outputs = model(ids, mask, token_type_ids)

    optimizer.zero_grad()
    #loss = loss_fn(outputs, targets)
    loss = F.cross_entropy(outputs.logits, targets)  # this line raises the error
...
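The error message suggests that F.cross_entropy here expects a single class index per example (a 1D target), whereas the targets in this loop are float vectors with one entry per label. For multi-label data the usual alternative is a per-label binary cross-entropy computed on the logits; in PyTorch that is torch.nn.BCEWithLogitsLoss (or F.binary_cross_entropy_with_logits). The plain-Python sketch below only illustrates what that loss computes for one example, using made-up numbers; it is not the code from the tutorial.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z):
        #   max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Made-up logits and multi-hot targets for a single example with 3 labels.
logits = [2.0, -1.5, 0.3]
targets = [1.0, 0.0, 1.0]

loss = bce_with_logits(logits, targets)
print(f"loss = {loss:.4f}")
```

Because each label is scored independently, a target row may contain zero, one, or several 1s, which matches the dataset described at the start of the thread.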
