T5: classification using text2text?

My friend who wishes to remain anonymous asked a good question about T5 that I couldn’t answer:

Say we have a model that predicts sentiment – answers are “positive/negative/neutral” – for something like RoBERTa we’d add a layer, slap on a softmax – and we get both argmax predictions, and some notion of probability amongst the three classes (as well as entropy).

For T5, we just get a text reply.

But of course, if we looked at the outputs of the softmax in T5, we’d see p(“positive”) etc., assuming the response is 1 token.

Has anyone tried to do this already or seen examples/notebooks like this?

Ideally, we want to ask our model several questions. Without worrying too much about the conditional logic, we’d like to be able to measure the probability of text outputs, including some rare categories (that are nonetheless present in our training set), and to look for low- and high-entropy predictions.

If nobody has done this, any code pointers where to look would be helpful.


I’m not sure I understand what the difference is between what you are describing and how the GLUE dataset is handled in the T5 paper.

I’m also not sure what the question means here. Are you asking whether someone has used T5 for classification? If so, then yes, I’ve fine-tuned it for both binary and multi-class classification here.

As for measuring the probability, this paper used T5 in a really interesting way for document ranking. First they train the model to predict true if the doc is related to the query and false if not. Then, for ranking, they apply a softmax only on the logits of the “true” and “false” tokens and rerank using the probabilities assigned to the “true” token.
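The reranking step described above can be sketched in a few lines. This is only an illustration in plain Python: the document names and logit values are made up, and `true_probability` and `rerank` are hypothetical helper names, not functions from the paper or from transformers.

```python
import math

def true_probability(true_logit: float, false_logit: float) -> float:
    """Softmax over just the two logits, returning p("true")."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)

def rerank(doc_logits):
    """Sort documents by p("true"), highest first."""
    scored = [(doc, true_probability(t, f)) for doc, (t, f) in doc_logits.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical (true_logit, false_logit) pairs for three candidate documents.
example = {"doc_a": (2.0, 0.5), "doc_b": (-1.0, 1.0), "doc_c": (0.3, 0.3)}
ranking = rerank(example)
```

Because every other vocabulary entry is ignored, the two probabilities always sum to 1, even if the model would actually have preferred some unrelated token.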



To answer the first part of your question: yes, I have tried T5 for multi-class classification. It generates tokens based on the class labels (a label may be a single token or multiple tokens, depending on how it is tokenized).

Second part of the question is not clear to me. Please explain more.

Got a more specific version of the question:

How should I fix this code to use t5 without finetuning for sentiment:

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_sent = [' sentiment review: I love chocolate', ' sentiment review: I love chocolate']
labels = ['positive ', 'negative ']
input_ids = tokenizer(input_sent, return_tensors='pt').to(torch_device).input_ids
target_ids = tokenizer(labels, return_tensors='pt').to(torch_device).input_ids
#target_ids = torch.tensor([target_ids])
outputs = model(input_ids=input_ids, labels=target_ids, use_cache=False, return_dict=True)

I got the sentiment review prefix from https://github.com/google-research/text-to-text-transfer-transformer/issues/109 @valhalla

T5 was trained on sst2 as part of its multi-task pre-training mixture, so to use T5 for sentiment without fine-tuning, use the prefix sst2 sentence: and pass it to the model. You can do it two ways:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "sst2 sentence: it confirms fincher ’s status as a film maker who artfully bends technical know-how to the service of psychological insight"

enc = tokenizer(text, return_tensors="pt")
tokens = model.generate(**enc)
# Decode the generated token ids back to text
tokenizer.batch_decode(tokens, skip_special_tokens=True)
# => ['positive']

or use the text2text-generation pipeline

from transformers import pipeline

t5_sentiment = pipeline("text2text-generation")
text = "sst2 sentence: it confirms fincher ’s status as a film maker who artfully bends technical know-how to the service of psychological insight"
t5_sentiment(text)
# => [{'generated_text': 'positive'}]

That’s cool suraj, thanks! Is it possible to do that with forward?

forward as in without using generate?

Because T5 is trained with a text-to-text objective, we need to produce the output as text, either by calling forward manually or by using generate. To treat this as a discriminative task instead, we could take the same approach as BART: feed the same text to both the encoder and the decoder, pool the hidden state of the final eos token, and pass that to a classification head. This is how BartForSequenceClassification works. I’m not sure how well this would work for T5; I haven’t tried it myself.

To answer the original question, you could use forward as shown below to generate the output

import torch
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "sst2 sentence: it confirms fincher ’s status as a film maker who artfully bends technical know-how to the service of psychological insight"
with torch.no_grad():
    enc = tokenizer(text, return_tensors="pt")
    # T5 uses the pad token as the decoder start token
    decoder_input_ids = torch.tensor([tokenizer.pad_token_id]).unsqueeze(0)
    # logits has shape [batch, 1, vocab_size]: one decoding step
    logits = model(**enc, decoder_input_ids=decoder_input_ids)[0]
    tokens = torch.argmax(logits, dim=2)
    sentiments = tokenizer.batch_decode(tokens)
    # => ['positive']

Now, if we wish to measure the probabilities, as I described in the earlier comment, we can take only the logits of the positive and negative tokens and apply a softmax to them. Thankfully, T5 encodes positive and negative as single tokens, so this is easy to do. The token id for positive is 1465 and for negative 2841.

logits = logits.squeeze(1)
# only take the logits of positive and negative
selected_logits = logits[:, [1465, 2841]] 

probs = F.softmax(selected_logits, dim=1)
#=> tensor([[0.9820, 0.0180]])
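
The original question also asked about spotting low- and high-entropy predictions. A minimal sketch of that step, written in plain Python so it does not depend on any checkpoint; `softmax` and `entropy` are hypothetical helper names and the logit values are made up:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

confident = softmax([4.0, 0.0])  # strongly favours the first class
uncertain = softmax([0.1, 0.0])  # close to a coin flip

# The distribution printed above, [0.9820, 0.0180], has low entropy
# (roughly 0.09 nats, against a maximum of log(2) for two classes).
h_example = entropy([0.9820, 0.0180])
```

With tensors, the same quantity could be computed from `probs` directly; the two-class maximum of log(2) is a convenient reference point for deciding what counts as "high".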

Hope this answers your question.

cc @sshleifer


Incredible answer, thanks a ton!


Sorry for the topic steal; I wasn’t getting much attention on my own T5 topic (“Yet another question about T5 prefixes: are they special?”, in the Models category of the Hugging Face Forums).

Has anyone here used T5 for regression? From the paper that @valhalla links, it seems that you could rebase your continuous labels to 0-1 and then use the output of the softmax for one of two options (e.g. true and false) in a MSELoss function. Or is that a nonsense suggestion? The caveat is that it is not always possible to rebase your values to 0-1.
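
One possible reading of that suggestion, as a sketch only: min-max rescale the targets into [0, 1], take p("true") from a two-way softmax, and compare the two with a mean squared error. The helper names, the 1-5 target scale, and the logit values below are all made up for illustration; in a real training loop the same arithmetic would be done on tensors so that gradients can flow.

```python
import math

def rescale(y, y_min, y_max):
    """Min-max rescale a target into [0, 1]; needs known, finite bounds."""
    return (y - y_min) / (y_max - y_min)

def p_true(true_logit, false_logit):
    """Two-way softmax, which reduces to a sigmoid of the logit difference."""
    return 1.0 / (1.0 + math.exp(false_logit - true_logit))

def mse(preds, targets):
    """Mean squared error between predictions and rescaled targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Hypothetical raw targets on a 1-5 scale and made-up (true, false) logits.
targets = [rescale(y, 1.0, 5.0) for y in (1.0, 3.0, 5.0)]
preds = [p_true(t, f) for t, f in [(-3.0, 3.0), (0.0, 0.0), (3.0, -3.0)]]
loss = mse(preds, targets)
```

The caveat from the post shows up in `rescale`: it only works when `y_min` and `y_max` are known in advance.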

Thanks! This helps a great deal. I wanted to know how to make use of this when the label is split into two or more tokens. For example, for entailment the token ids in the T5 model are [35, 5756, 297].

How can I fit this into selected_logits = logits[:, [1465, 2841]]?
I tried using an array of arrays. Am I missing a tweak?

@dipanjann I’ve wondered the same thing. My solution when the output text corresponds to multiple tokens is the following. I’m not sure it’s completely correct, but the basic idea is to use generate() to get the prediction logits for each step in the generated output sequence. Next, you roughly calculate the cross entropy loss for each of the possible output classes.

I’m really not sure if this is exactly mathematically correct. I’d love if someone can show a better way! In my tests, it produces reasonable results.

import typing as T

import pytorch_lightning
import torch
from more_itertools import chunked

class MyLitModule(pytorch_lightning.LightningModule):

    def predict_proba(self, text: T.Iterable[str], labels: T.Iterable[str]):
        """Predict the class probabilities"""
        # Compute the tokens corresponding to the text labels:
        class_ids = torch.LongTensor(self.tokenizer(list(labels), padding=True).input_ids)

        logits = []
        for chunk in chunked(text, 16):
            # Tokenize the input text:
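            # NOTE: the arguments of the two calls below were cut off in the post.
            # They are an assumed reconstruction, inferred from how
            # `output_sequences.scores` is used afterwards.
            encoding = self.tokenizer(
                chunk, padding=True, return_tensors="pt"
            ).to(self.device)

            output_sequences = self.model.generate(
                **encoding,
                return_dict_in_generate=True,  # assumed: needed to access `.scores`
                output_scores=True,  # assumed: needed to access `.scores`
                min_length=class_ids.shape[1] + 1, max_length=class_ids.shape[1] + 1
            )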

            # Generate the logits for each token in the generated output sequence.
            # `scores` has size [batch, seq_length, vocab_size]
            scores = torch.stack(output_sequences.scores, dim=1).to("cpu")
            # We don't care about the logits of special tokens:
            scores[:, :, self.tokenizer.all_special_ids] = torch.nan
            # Index the logits in `scores` based on the class token IDs.
            # For example, if class_ids[0, :] is [10, 30], then the prediction logits
            # are scores[:, 0, 10] and scores[:, 1, 30].
            # Finally, we average the logits, which is similar to how the cross entropy loss is calculated.
            logits.append(scores.gather(dim=2, index=class_ids.T.expand(len(chunk), -1, -1)))

        return torch.concat(logits, dim=0).nanmean(dim=1).softmax(1)
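
One alternative for multi-token labels, offered as a sketch rather than a definitive fix: score each candidate label by the sum of its per-token log-probabilities, where step t is conditioned on the label's own earlier tokens (teacher forcing) rather than on whatever `generate()` happened to produce. With transformers, those per-step distributions could be obtained by passing each label as `labels=` to the model and applying a log-softmax to the returned logits. The example below uses made-up numbers in plain Python; `sequence_log_prob` and `normalize` are hypothetical helpers, and the token id 7163 for the second label is invented.

```python
import math

def sequence_log_prob(step_log_probs, token_ids):
    """Sum log p(token_t | earlier label tokens) over the label's tokens.

    step_log_probs[t] maps token id -> log-probability at decoder step t,
    computed with the label's own earlier tokens fed to the decoder.
    """
    return sum(step_log_probs[t][tok] for t, tok in enumerate(token_ids))

def normalize(log_scores):
    """Softmax over per-label sequence log-probabilities."""
    m = max(log_scores.values())
    exps = {k: math.exp(v - m) for k, v in log_scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# "entailment" is [35, 5756, 297] per the post; the probabilities are made up.
entail_steps = [{35: math.log(0.6)}, {5756: math.log(0.9)}, {297: math.log(0.95)}]
other_steps = [{7163: math.log(0.3)}]  # hypothetical single-token label

scores = {
    "entailment": sequence_log_prob(entail_steps, [35, 5756, 297]),
    "other": sequence_log_prob(other_steps, [7163]),
}
probs = normalize(scores)
```

Summing log-probabilities penalizes longer labels, since every extra token can only lower the score. Averaging per token, as the snippet above does with `nanmean`, is one way to normalize for length; which of the two is preferable probably depends on the label set.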