I have trained my classifier, now how do I do predictions?

Hi everybody, and thank you in advance to anyone who can help me out. I am not a total beginner when it comes to Hugging Face libraries (I have already built a well-functioning sentiment analyzer); however, I have mostly followed tutorials and integrated their content without going too much into the details of what each line of code does. Trying to learn more, I have put together a document classifier using a couple of tutorials I’ve found online.

I have built the trainer and the validator and they work just fine. I started with a dataset that assigns 6 different labels to a text, with each text having 0, 1 or more than 1 label. I trained the model and saved it. My problem is: now what? I can’t understand exactly how to do the prediction part. Here is where I am:

def validation():
    model = torch.load(destination_folder+'model.pt')
    model.eval()

    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            preds = model(ids, mask, token_type_ids)
            print(preds.argmax(1) + 1)

This is a snippet of the output of the print command:

tensor([1, 1, 1, 1])
tensor([6, 1, 1, 1])
tensor([1, 1, 1, 1])
tensor([1, 5, 2, 1])

I did this with the validation data by adapting the validation routine, whereas in practice I would need to do it for a single line of text. Regardless of how the data is fed to the prediction function, how do I read the prediction data? How do I go from “This is the text of my document to be classified” to “This document is 75% label1, 15% label5, 2% label6”?

Again, thank you in advance for any help!

Hi @Abe, if I understand correctly you’d like to go from an input string like “I love this movie!” to a set of predicted labels and their confidence scores (i.e. probabilities).

The simplest way to achieve that would be to wrap your model and tokenizer in a TextClassificationPipeline with return_all_scores=True:

from transformers import TextClassificationPipeline

model = ...
tokenizer = ...
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
# outputs one list of dicts per input, like [[{'label': 'NEGATIVE', 'score': 0.0001223755971295759}, {'label': 'POSITIVE', 'score': 0.9998776316642761}]]
pipe("I love this movie!")

The above also works for multiple inputs by feeding a list of examples instead of a single string:

pipe(["I love this movie!", "I hate this movie!"])

If you want to have human-readable labels like “positive” and “negative” you can configure the id2label and label2id attributes of your model’s config class: Change label names on inference API - #3 by lewtun
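As a rough sketch of that configuration, assuming a six-label BERT classifier: the label names below are hypothetical placeholders, not names from any real dataset. BertConfig accepts id2label and label2id mappings, and a pipeline built on a model with this config reports those names instead of the default LABEL_0, LABEL_1, and so on.

```python
from transformers import BertConfig

# Hypothetical label names; replace them with the real ones for your dataset.
label_names = ["label1", "label2", "label3", "label4", "label5", "label6"]

id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in enumerate(label_names)}

# The mappings live on the config, so save_pretrained writes them out
# together with the model weights.
config = BertConfig(id2label=id2label, label2id=label2id)

print(config.num_labels)    # number of labels, derived from id2label
print(config.id2label[4])   # name reported for output index 4
```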

HTH!

Thank you for the incredibly quick reply!

I now get

torch.nn.modules.module.ModuleAttributeError: 'BERTClass' object has no attribute 'config'

Which makes me think the class used for training is missing something that pipe needs. This is the class I used

class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased', return_dict=False)
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 6)

    def forward(self, ids, mask, token_type_ids):
        _, output_1 = self.l1(ids, attention_mask = mask, token_type_ids = token_type_ids)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

It always has no config attribute/method. Am I right in thinking I have to retrain using a different class?

Ah yes, if possible I think you’d be better off using the BertForSequenceClassification class together with BertConfig instead of the custom class you created.

For example, you might be able to make this work as follows:

config = ...
model = BertForSequenceClassification.from_pretrained(destination_folder+'model.pt', config=config)

and then pass the model and the tokenizer to the pipeline as before. If that does not work, you’ll probably have to re-train the model or live with the default labels from the pipeline.
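If re-training is not an option, the raw outputs of the custom class can still be read by hand. The sketch below assumes the model returns one logit per label (shape [batch, 6]) and was trained with a sigmoid-based multi-label loss such as BCEWithLogitsLoss; the label names, the helper scores_from_logits, and the 0.5 threshold are illustrative assumptions. With sigmoid each label gets an independent probability, so the scores need not sum to 100%; for a single-label model, torch.softmax would be the counterpart.

```python
import torch

# Hypothetical label names, one per output unit of the final linear layer.
LABELS = ["label1", "label2", "label3", "label4", "label5", "label6"]

def scores_from_logits(logits, labels=LABELS, threshold=0.5):
    """Turn a [batch, num_labels] logit tensor into per-label probabilities."""
    probs = torch.sigmoid(logits)  # independent probability for each label
    results = []
    for row in probs:
        scores = {name: float(p) for name, p in zip(labels, row)}
        # Keep every label whose probability reaches the (assumed) threshold.
        predicted = [name for name, p in scores.items() if p >= threshold]
        results.append({"scores": scores, "predicted": predicted})
    return results

# Example with made-up logits for a single document.
example_logits = torch.tensor([[2.0, -3.0, -1.5, -4.0, 0.3, -2.0]])
for item in scores_from_logits(example_logits):
    print(item["predicted"])
    print({name: round(p, 3) for name, p in item["scores"].items()})
```

In the validation loop from the question, preds would take the place of example_logits.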

Using 3.5.1
I am also trying to use the text classification pipeline. I trained my model using trainer and saved it to “path to saved model”. My issue is that when I try to use the pipeline to predict, the call to the tokenizer is not truncating the result to the “model_max_length” set in the configuration of my trained model/tokenizer. I initialize as below:

from transformers import RobertaForSequenceClassification, RobertaTokenizer, pipeline

tokenizer = RobertaTokenizer.from_pretrained("path to saved model")
model = RobertaForSequenceClassification.from_pretrained("path to saved model")
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

Do I need to create my own pipeline subclassing the text classification one in order to force truncation?

Thanks

Hi @laurb, I think you can specify the truncation length by passing max_length as part of generate_kwargs (e.g. 50 tokens in my example):

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, generate_kwargs={"max_length":50})

As far as I know the Pipeline class (from which all other pipelines inherit) does not truncate the inputs by default: transformers/base.py at master · huggingface/transformers · GitHub
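Whether a pipeline forwards a length limit to the tokenizer depends on the transformers version, so one version-independent fallback is to call the tokenizer and model directly. This is a minimal sketch, assuming a tokenizer and a sequence classification model loaded with from_pretrained; the helper name classify_truncated and the default of 512 tokens are assumptions for illustration.

```python
import torch

def classify_truncated(text, tokenizer, model, max_length=512):
    """Tokenize with explicit truncation, then return class probabilities."""
    inputs = tokenizer(
        text,
        truncation=True,        # cut inputs that exceed max_length
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        # Index 0 holds the logits for both tuple and ModelOutput returns.
        logits = model(**inputs)[0]
    # Softmax assumes a single-label classifier such as a sentiment model.
    return torch.softmax(logits, dim=-1).squeeze(0).tolist()

# Usage, with tokenizer and model loaded via from_pretrained:
# probs = classify_truncated("some very long document ...", tokenizer, model)
```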

This would have been a lot easier if you had simply used BertForSequenceClassification because then you could have easily used this with save_pretrained inside the trainer. Here is one example but there are more on Github in the examples folder.

I tried this example but I don’t think this works with multiple labels since I get this:

1D target tensor expected, multi-target not supported

from this block of code

model.train()
for _,data in enumerate(training_loader, 0):
    ids = data['ids'].to(device, dtype = torch.long)
    mask = data['mask'].to(device, dtype = torch.long)
    token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
    targets = data['targets'].to(device, dtype = torch.float)

    outputs = model(ids, mask, token_type_ids)

    optimizer.zero_grad()
    #loss = loss_fn(outputs, targets)
    loss = F.cross_entropy(outputs.logits, targets)  # this line raises the error
...
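The error message suggests that F.cross_entropy in that PyTorch version expects a 1D tensor of class indices, one class per example, whereas the targets here are multi-hot float vectors. Below is a minimal sketch of the usual multi-label alternative, F.binary_cross_entropy_with_logits, using random stand-in tensors instead of real model outputs (the batch size of 4 and the tensor values are assumptions).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for outputs.logits and data['targets']: 4 examples, 6 labels.
logits = torch.randn(4, 6, requires_grad=True)
targets = torch.tensor([
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],   # an example with no labels is valid
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
], dtype=torch.float)

# Each label is scored as an independent binary decision.
loss = F.binary_cross_entropy_with_logits(logits, targets)
loss.backward()
print(loss.item())
```

In the training loop above, this call would replace F.cross_entropy, with outputs.logits as the first argument.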