I would like to use multiple texts as inputs to a model. Say I have a dataset with 10 columns, where each column holds a short text (a sentence or two). How can I feed all of these inputs to the model and do classification, for example?
I could simply concatenate all the texts into one, but it seems to me that I would then need a very large dataset to be able to achieve good accuracy.
Maybe I could run multiple models (BERT) in parallel, take the last hidden state of each, concatenate them, and classify? The problem is that there are so many inputs, on the order of 30 texts.
Any idea how to tackle this?
You should take the same approach as extractive text summarization: concatenate all your sentences, separated by a special token (`[CLS]`, for example), then use the `[CLS]` token representation to do classification.
From the Presumm paper
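To make the idea concrete, here is a rough, library-free sketch of how one row of text columns could be laid out with one `[CLS]` per text. The function name `build_multi_text_input`, the example row, and the whitespace tokenization are my own assumptions for illustration; a real setup would use the model's subword tokenizer and truncate to the maximum sequence length.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_multi_text_input(texts: List[str]) -> Tuple[List[str], List[int]]:
    """Lay out several texts as one sequence with a [CLS] in front of each.

    Returns the token list and the index of every [CLS] token, so the
    encoder outputs at those indices can be gathered for classification.
    Whitespace splitting is a stand-in for a real subword tokenizer.
    """
    tokens: List[str] = []
    cls_positions: List[int] = []
    for text in texts:
        cls_positions.append(len(tokens))  # where this text's [CLS] will sit
        tokens.append(CLS)
        tokens.extend(text.split())
        tokens.append(SEP)
    return tokens, cls_positions


# Hypothetical row with three text columns.
row = ["the battery died quickly", "support was helpful", "would buy again"]
tokens, cls_positions = build_multi_text_input(row)
print(tokens)
print(cls_positions)
```

After encoding, you would pick the hidden states at `cls_positions` (one vector per column) and pool or concatenate them before the classification layer. With many columns the combined sequence can exceed BERT's length limit, so each text may need to be truncated.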
Hi @colanim, thank you for your reply.
I understand your suggestion. The problem is that my inputs are not only texts; I also have some float values. Would converting these values to text be sufficient?
I have never encountered this case myself, but maybe you can feed the float values directly into the final classifier?
Since they are not text, there is no need for BERT to encode them (?)
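Roughly what I have in mind, as a plain-Python sketch so that the shapes are easy to follow. Everything here is an assumption for illustration: the dimensions, the random `cls_vector` standing in for the encoder's `[CLS]` output, the feature statistics, and the helper names `standardize` and `linear`. Standardizing the floats first is also my own addition, so that their scale is comparable to the text vector.

```python
import random
from typing import List


def standardize(values: List[float], mean: List[float], std: List[float]) -> List[float]:
    """Scale each float feature to zero mean and unit variance (training-set stats)."""
    return [(v - m) / s for v, m, s in zip(values, mean, std)]


def linear(inputs: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """A single dense layer: one weight row and one bias per output class."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


random.seed(0)
text_dim, num_floats, num_classes = 8, 3, 2  # tiny, made-up sizes

# Stand-in for the text encoder's [CLS] representation of one example.
cls_vector = [random.uniform(-1, 1) for _ in range(text_dim)]

# Hypothetical float columns of the same example, scaled with made-up statistics.
float_features = standardize(
    [72.5, 0.3, 1500.0],
    mean=[70.0, 0.5, 1000.0],
    std=[10.0, 0.2, 500.0],
)

# The floats bypass the encoder and are appended right before the classifier.
classifier_input = cls_vector + float_features

weights = [
    [random.uniform(-0.1, 0.1) for _ in range(text_dim + num_floats)]
    for _ in range(num_classes)
]
bias = [0.0] * num_classes
logits = linear(classifier_input, weights, bias)
print(len(classifier_input), logits)
```

In a real model the same thing would be a concatenation of the two tensors followed by a trainable linear layer, so the classifier's input size becomes the hidden size plus the number of float columns.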
I don’t know whether you’ve tried or considered the multimodal toolkit (blog post, GitHub). It takes in tabular data (text, numbers, categorical data) and can use them as inputs to develop models.
I haven’t tried it myself, but it looks quite promising.