Use two sentences as inputs for sentence classification

For example, suppose we have a data source with three columns:
column_a: text data describing one feature
column_b: text data describing another feature
column_c: category/label

If I have to approach this kind of text classification problem with BERT, how can I pass column_a and column_b as inputs to the BERT model? Is there a way to concatenate two sentences using the separator token, or is there a way to achieve this using the encode_plus method?

Any help is appreciated!

Not an expert (and far from being one), but I’m interested in what you have tried so far.

If I didn’t get good results using simple methods (say, by just concatenating the two columns), I’d try an ensemble of two BERT models, where one receives column_a and the other receives column_b.
For example, suppose one column is an expert’s opinion about a patient’s medical condition, and the other is the patient’s own opinion of their condition.
Then to me it makes sense to have two models: one fine-tuned on the expert opinions and the other fine-tuned on the patients’ opinions.
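
A rough sketch of what I have in mind (just an illustration; the bert-base-uncased checkpoint and num_labels=3 are placeholders, not anything from your data):

import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class TwoColumnEnsemble(nn.Module):
    """Two BERT encoders, one per text column, with a shared classifier head."""
    def __init__(self, num_labels):
        super().__init__()
        self.encoder_a = BertModel.from_pretrained("bert-base-uncased")
        self.encoder_b = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder_a.config.hidden_size
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, inputs_a, inputs_b):
        # Use the pooled [CLS] representation from each column's encoder
        pooled_a = self.encoder_a(**inputs_a).pooler_output
        pooled_b = self.encoder_b(**inputs_b).pooler_output
        return self.classifier(torch.cat([pooled_a, pooled_b], dim=-1))

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TwoColumnEnsemble(num_labels=3)  # placeholder number of labels

inputs_a = tokenizer("text from column_a", padding=True, truncation=True, return_tensors="pt")
inputs_b = tokenizer("text from column_b", padding=True, truncation=True, return_tensors="pt")
logits = model(inputs_a, inputs_b)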

Hopefully I didn’t say too much nonsense :blush:


Thanks @Maimonator. Any ideas on passing two text columns into the encode_plus method?

Hi @saireddy,

BERT supports sentence pair classification out-of-the-box. There’s no need to ensemble two BERT models.

In BERT, two sentences are provided to the model as follows:
[CLS] sentence1 [SEP] sentence2 [SEP] [PAD] [PAD] [PAD] …

You can prepare them using BertTokenizer, simply by providing two sentences:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence_a = "this is a sentence"
sentence_b = "this is another sentence"

encoding = tokenizer(sentence_a, sentence_b, padding="max_length", truncation=True)

This encoding is a Python dictionary whose keys include input_ids, token_type_ids, and attention_mask. If you decode the input_ids back as follows:

tokenizer.decode(encoding["input_ids"])

Then this will print [CLS] this is a sentence [SEP] this is another sentence [SEP] [PAD] [PAD] [PAD] …
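
To actually classify the pair, you can feed this encoding (created with return_tensors="pt") into a sequence classification head. A minimal sketch, where num_labels=3 is just a placeholder for however many categories column_c has:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)  # placeholder

sentence_a = "this is a sentence"
sentence_b = "this is another sentence"
encoding = tokenizer(sentence_a, sentence_b, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # shape: (1, num_labels)
predicted_class = logits.argmax(dim=-1).item()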


Hi @nielsr, thanks for the clarification. I have used the encode_plus method to pass values from two columns, and it takes the two sentences and inserts a [SEP] token between them. As BERT has a limit of 512 tokens, I will try other variants of BERT that can take longer sequences. Any suggestions on models to start with for long text?

Actually, the encode_plus method is deprecated; it’s advised to just call the tokenizer, as shown above.

BERT indeed has a token limit of 512, so if you provide two sentences they should fit into this budget. If you want to try longer sequences, you can take a look at Longformer or BigBird.
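
For example, a minimal sketch using the allenai/longformer-base-4096 checkpoint (num_labels=3 is again a placeholder, and the classification head starts untrained):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForSequenceClassification.from_pretrained("allenai/longformer-base-4096", num_labels=3)

# This checkpoint handles sequences of up to 4096 tokens
encoding = tokenizer("long text from column_a", "long text from column_b",
                     padding=True, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**encoding)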


Do this @xap:

a1 = data["user"].tolist()
a2 = data["standard"].tolist()

Basically you can’t pass pandas Series objects to your tokenizer. You need to convert them to lists first.
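
Once they are plain lists, you can pass both of them to the tokenizer in a single call and each row is encoded as a sentence pair. A small sketch, assuming the tokenizer defined earlier in the thread:

# Each (user, standard) row becomes one [CLS] ... [SEP] ... [SEP] pair
encodings = tokenizer(a1, a2, padding=True, truncation=True, return_tensors="pt")
print(encodings["input_ids"].shape)  # (num_rows, longest_sequence_in_batch)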


@xap My guess is that each batch in your DataLoader is a dictionary (I might be wrong; I’d need to check your DataLoader creation code to confirm). Why don’t you print out one batch and check its structure? Is it a tuple of tensors or a dictionary of tensors?

You can check by doing this:

for batch in train_dataloader:
    break  # grab just the first batch

print(batch)

If it’s a dictionary, then follow the steps outlined here: A full training - Hugging Face Course

In particular:

outputs = model(**batch)

The problem with the following line is that it will pick up the keys of the dictionary rather than the values:

for batch_idx, (pair_token_ids, mask_ids, seg_ids, y) in enumerate(train_dataloader):

You can confirm this by adding some print statements inside the loop like so:

print(pair_token_ids)
print(mask_ids)
etc...
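
If it does turn out to be a dictionary, a minimal version of the loop (a sketch that assumes your model and train_dataloader, and that each batch has keys like input_ids, token_type_ids, attention_mask, and labels) would look like:

import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

for batch_idx, batch in enumerate(train_dataloader):
    batch = {k: v.to(device) for k, v in batch.items()}  # iterate over values, not keys
    outputs = model(**batch)  # transformers models return a loss when "labels" is present
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()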