For example, we have a data source with three columns:
column_a: text data which describes one feature
column_b: text data which describes another feature
column_c: category/label
If I have to approach this kind of text classification problem with BERT, how can we pass column_a and column_b as inputs to the BERT model? Is there a way to concatenate the two sentences using a separator token, or is there a way to achieve this using the encode_plus method?
Not an expert (and far from being one), but I’m interested in what you have tried so far.
If I didn’t get good results using simple methods (say, just concatenating the two columns), I’d try an ensemble of two BERT models, where one receives column_a and the other receives column_b.
Let’s say one column is an expert opinion about the medical condition of a patient, and the other is the patient’s opinion on their own medical condition.
Then to me it makes sense to have two models, one fine-tuned on the expert opinions and the other fine-tuned on the patients’ opinions. A sketch of what I mean is below.
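Roughly something like this (untested sketch; the class name TwoColumnEnsemble, num_labels=3 and the example texts are just made up for illustration):

import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class TwoColumnEnsemble(nn.Module):
    """Two separate BERT encoders, one per text column, feeding a shared classifier."""
    def __init__(self, num_labels):
        super().__init__()
        self.encoder_a = BertModel.from_pretrained("bert-base-uncased")
        self.encoder_b = BertModel.from_pretrained("bert-base-uncased")
        # concatenate the two pooled [CLS] representations before classifying
        self.classifier = nn.Linear(2 * self.encoder_a.config.hidden_size, num_labels)

    def forward(self, inputs_a, inputs_b):
        pooled_a = self.encoder_a(**inputs_a).pooler_output
        pooled_b = self.encoder_b(**inputs_b).pooler_output
        return self.classifier(torch.cat([pooled_a, pooled_b], dim=-1))

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TwoColumnEnsemble(num_labels=3)
inputs_a = tokenizer("expert opinion text", return_tensors="pt", padding=True, truncation=True)
inputs_b = tokenizer("patient opinion text", return_tensors="pt", padding=True, truncation=True)
logits = model(inputs_a, inputs_b)  # shape: (batch_size, num_labels)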
BERT supports sentence pair classification out-of-the-box. There’s no need to ensemble two BERT models.
In BERT, two sentences are provided to the model as follows:
[CLS] sentence1 [SEP] sentence2 [SEP] [PAD] [PAD] [PAD] …
You can prepare them using BertTokenizer, simply by providing two sentences:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence_a = "this is a sentence"
sentence_b = "this is another sentence"
encoding = tokenizer(sentence_a, sentence_b, padding="max_length", truncation=True)
This encoding is a Python dictionary whose keys include input_ids. If you decode them back as follows:
tokenizer.decode(encoding["input_ids"])
then you get back [CLS] this is a sentence [SEP] this is another sentence [SEP] [PAD] [PAD] [PAD] …
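If the goal is classification on column_c, the pair encoding can go straight into BertForSequenceClassification (a minimal sketch; num_labels=3 and the example texts/label are placeholders for your own data):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# num_labels should match the number of categories in column_c
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

encoding = tokenizer(
    "text from column_a",
    "text from column_b",
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
outputs = model(**encoding, labels=torch.tensor([1]))
print(outputs.loss, outputs.logits)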
Hi @nielsr, thanks for the clarification. I have used the encode_plus method to pass values from the two columns, and it takes the two sentences and inserts a [SEP] token between them. As BERT has a limit of 512 tokens, I will try other variants of BERT that can handle longer sequences. Any suggestions on models to start with for long text?
Actually, the encode_plus method is deprecated; it’s advised to just call the tokenizer, as shown above.
BERT indeed has a token limit of 512, so if you provide two sentences they should fit into this budget. If you want to try longer sequences, you can take a look at Longformer or BigBird.
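For example, with Longformer (a rough sketch; allenai/longformer-base-4096 accepts sequences up to 4096 tokens, and num_labels=3 is just a placeholder):

from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=3
)

# same pair-of-texts call as with BERT, just with a larger max_length
encoding = tokenizer(
    "long text from column_a",
    "long text from column_b",
    padding="max_length",
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
outputs = model(**encoding)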
@xap My guess is that each batch in your DataLoader is a dictionary (I might be wrong; I’d need to see your DataLoader creation code to confirm). Why don’t you print out one batch and check its structure? Is it a tuple of tensors or a dictionary of tensors?
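Something like this should show it (train_dataloader here is just a placeholder for your own DataLoader):

batch = next(iter(train_dataloader))
print(type(batch))

if isinstance(batch, dict):
    # dictionary of tensors, e.g. {"input_ids": ..., "attention_mask": ..., "labels": ...}
    for key, value in batch.items():
        print(key, value.shape)
else:
    # tuple or list of tensors
    for i, value in enumerate(batch):
        print(i, value.shape)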