RoBERTa for Sentence-pair classification

srishti-hf1110 · September 28, 2022, 8:02am

Hi,
I’m new to using HF and my current task involves sentence pair classification: the input is a pair of sentences and the output should be binary, 0 or 1.

I referred to the documentation and tried some code out.
I know from theory, and also confirmed in code, that some models like bert-base-uncased can take a pair of inputs because they use token_type_ids to differentiate sentence 1 from sentence 2, like so:

from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
bert_model = 'bert-base-uncased'
bert_layer = AutoModel.from_pretrained(bert_model)
tokenizer = AutoTokenizer.from_pretrained(bert_model) 
sent1 = 'how are you'
sent2 = 'all good'

encoded_pair = tokenizer(sent1, sent2,
                         padding='max_length',  # Pad to max_length
                         truncation=True,       # Truncate to max_length
                         max_length=50,
                         return_tensors='pt')
print(encoded_pair)

gives this:

{'input_ids': tensor([[ 101, 2129, 2024, 2017,  102, 2035, 2204,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]])}

While this is fine, there are other models, like ‘roberta-base’, whose tokenizers do not return token_type_ids.
Does this mean these models cannot be used for sentence pair classification?
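
To make the question concrete: as far as I understand, a RoBERTa-style tokenizer still accepts two sentences and joins them as `<s> A </s></s> B </s>`, so the boundary is marked by separator tokens rather than by a segment id. The sketch below only mimics that layout in plain Python. The helper name build_roberta_pair and the word ids are made up for illustration, and the special-token ids (0, 2, 1) are the ones I believe roberta-base uses, so treat them as assumptions.

```python
# Illustrative only: mimics the special-token layout of a RoBERTa-style pair encoding.
# Assumed special-token ids (believed to match roberta-base): <s>=0, </s>=2, <pad>=1.
BOS_ID, EOS_ID, PAD_ID = 0, 2, 1

def build_roberta_pair(ids_a, ids_b, max_length):
    """Hypothetical helper: lay out two id lists as <s> A </s></s> B </s>, then pad."""
    input_ids = [BOS_ID] + ids_a + [EOS_ID, EOS_ID] + ids_b + [EOS_ID]
    if len(input_ids) > max_length:
        raise ValueError("pair is longer than max_length; truncate the inputs first")
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    # Note: no 'token_type_ids' key is produced, only ids and the mask.
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

# Placeholder ids standing in for the tokens of 'how are you' and 'all good'.
sent1_ids = [1001, 1002, 1003]
sent2_ids = [2001, 2002]
encoded_pair = build_roberta_pair(sent1_ids, sent2_ids, max_length=12)
print(encoded_pair)
```

With the real tokenizer, the equivalent call would be the same tokenizer(sent1, sent2, ...) as in the BERT snippet, just loaded from 'roberta-base'.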

I eventually need to use ClimateBERT for my task, but it’s built on DistilRoBERTa, so I’m asking this in the context of all such models that do not use token_type_ids.

I have only studied BERT’s paper, so I’m not sure whether models like RoBERTa are meant to be used for sentence pair classification tasks. Please help me with answering this.

Thanks in advance for any help.

donggy · June 15, 2023, 7:51am

I also want to ask this question.

kinsvater · April 23, 2024, 1:56pm

Hello all,

Bringing this thread back to life.

I came across the same problem. I need to classify sentence pairs, which I can do successfully with the standard BERT architectures having the token_type_ids as part of the tokenized inputs.

Since RoBERTa = AutoModelForSequenceClassification.from_pretrained('roberta-base') does not accept token_type_ids, I wonder which of these is true:

a) token_type_ids are redundant for architectures like RoBERTa when dealing with sentence pairs

or

b) we should only use the subset of BERT-like architectures that utilize token_type_ids for sentence pairs.

Any thoughts on this?
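
One observation that seems relevant to option (a): if a RoBERTa-style pair is laid out as `<s> A </s></s> B </s>`, then the segment each token belongs to can be read off the separator positions, so a separate token_type_ids tensor would carry no extra information about the boundary. The sketch below is plain Python under that assumption. The helper name segment_ids_from_separators and the word ids are hypothetical, and the special-token ids (</s>=2, <pad>=1) are what I believe roberta-base uses.

```python
# Illustrative only: recover BERT-style segment ids from a RoBERTa-style pair layout.
# Assumed special-token ids (believed to match roberta-base): </s>=2, <pad>=1.
EOS_ID, PAD_ID = 2, 1

def segment_ids_from_separators(input_ids):
    """Hypothetical helper: 0 up to and including the first </s>, 1 after it, 0 for padding."""
    segment_ids = []
    seen_first_eos = False
    for token_id in input_ids:
        if token_id == PAD_ID:
            segment_ids.append(0)
        elif not seen_first_eos:
            segment_ids.append(0)
            if token_id == EOS_ID:
                seen_first_eos = True
        else:
            segment_ids.append(1)
    return segment_ids

# <s> A A A </s> </s> B B </s> <pad> <pad> <pad>, with made-up word ids.
pair_ids = [0, 1001, 1002, 1003, 2, 2, 2001, 2002, 2, 1, 1, 1]
print(segment_ids_from_separators(pair_ids))
```

The recovered ids are only meant for inspection. I would not pass them to a RoBERTa checkpoint, since as far as I know its token-type embedding has a single entry.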
