For example, say we have a data source with three columns:
column_a: text data describing one feature
column_b: text data describing another feature
column_c: the category/label
If I have to approach this kind of text classification problem with BERT, how can we pass column_a and column_b as inputs to the BERT model? Is there a way to concatenate the two sentences using a separator token, or a way to achieve this using the encode_plus method?
Any help is appreciated!
Not an expert (and far from being one) but I’m interested in what you have tried so far.
If I didn’t get good results using simple methods (let’s say by just concatenating the two columns), I’d try having an ensemble of two BERT models, where one receives column_a and the other receives column_b.
Let’s say one column is an expert opinion about the medical condition of a patient, and the other is the patient’s opinion on his medical condition.
Then to me it makes sense to have two models: one fine-tuned on the expert opinions and the other fine-tuned on the patients’ opinions.
Hopefully I didn’t say too much nonsense
Thanks @Maimonator! Any ideas on passing two text columns into the encode_plus method?
BERT supports sentence pair classification out-of-the-box. There’s no need to ensemble two BERT models.
In BERT, two sentences are provided to the model as follows:
[CLS] sentence1 [SEP] sentence2 [SEP] [PAD] [PAD] [PAD] …
You can prepare them using BertTokenizer, simply by providing two sentences:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence_a = "this is a sentence"
sentence_b = "this is another sentence"
encoding = tokenizer(sentence_a, sentence_b, padding="max_length", truncation=True)
This encoding is a Python dictionary whose keys include input_ids, token_type_ids and attention_mask. If you decode the input IDs back:
tokenizer.decode(encoding["input_ids"])
then this will print [CLS] this is a sentence [SEP] this is another sentence [SEP] [PAD] [PAD] [PAD] …
Hi @nielsr, thanks for the clarification. I have used the encode_plus method to pass values from the two columns, and it takes the two sentences and inserts a [SEP] token between them. As BERT has a limit of 512 tokens, I will try other variants of BERT that can take longer sequences. Any suggestions on models to start with for long text?
The encode_plus method is deprecated; it’s advised to just call the tokenizer directly, as shown above.
BERT indeed has a token limit of 512, so the two sentences together have to fit into this budget. If you want to try longer sequences, you can take a look at Longformer or BigBird.
Do this @xap:
a1 = data["user"].tolist()
a2 = data["standard"].tolist()
Basically, you can’t pass pandas Series objects to your tokenizer. You need to convert them to lists first.
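For context, .tolist() turns a pandas Series into a plain Python list of strings, which is what the tokenizer expects. A minimal sketch (the column names "user" and "standard" come from the snippet above; the row contents are made up):

```python
import pandas as pd

# Toy frame standing in for the real data source
data = pd.DataFrame({
    "user": ["this is a sentence", "another row"],
    "standard": ["this is another sentence", "second row"],
})

# A pandas Series is not a plain list; convert before tokenizing
a1 = data["user"].tolist()
a2 = data["standard"].tolist()
print(type(a1).__name__)  # list
```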
@xap My guess is that each batch in your DataLoader is a dictionary (I might be wrong; I’d need to check your DataLoader creation code to confirm). Why don’t you print out one batch and check its structure? Is it a tuple of tensors or a dictionary of tensors?
You can check by doing this:
for batch in train_dataloader:
    print(batch)
    break
If it’s a dictionary, then follow the steps outlined here: A full training - Hugging Face Course, i.e. unpack the dictionary directly into the model:
outputs = model(**batch)
The problem with the following line is that it will pick up the keys of the dictionary rather than the values:
for batch_idx, (pair_token_ids, mask_ids, seg_ids, y) in enumerate(train_dataloader):
You can confirm this by adding some print statements inside the loop.
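To see why that unpacking goes wrong when the batch is a dictionary, here is a small pure-Python illustration (the key names mirror typical tokenizer output, but are an assumption about your DataLoader):

```python
# A batch that is a dictionary (lists stand in for tensors here)
batch = {
    "input_ids": [101, 2023, 102],
    "attention_mask": [1, 1, 1],
    "token_type_ids": [0, 0, 0],
    "labels": [1],
}

# Unpacking a dict iterates over its KEYS, not its values
pair_token_ids, mask_ids, seg_ids, y = batch
print(pair_token_ids, mask_ids, seg_ids, y)
# prints: input_ids attention_mask token_type_ids labels
```

So each loop variable ends up holding a key name (a string) instead of a tensor, which is why the later model call fails.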