Adding categorical and numerical values for BERT training

Hello, suppose we have a dataset with text, numerical, and categorical columns to be used for text classification. What options do we have for using these additional (numerical and categorical) columns in classification? Here are the options I can think of:

Option 1: concatenate the categorical values with the text using [SEP]
Option 2: concatenate the numerical/categorical features to the [CLS] embedding and pass the result to a linear layer
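A minimal sketch of option 2, assuming a PyTorch setup (all class, parameter, and dimension names here are illustrative, not from any library): take the [CLS] embedding produced by BERT, embed the categorical column, concatenate everything, and classify with a linear head.

```python
import torch
import torch.nn as nn

class TabularBertHead(nn.Module):
    """Classification head that concatenates the [CLS] embedding with
    extra numerical/categorical features (option 2). Sizes are examples."""
    def __init__(self, hidden_size=768, num_numerical=3,
                 num_categories=10, cat_emb_dim=8, num_labels=2):
        super().__init__()
        # Learnable embedding for a single categorical column.
        self.cat_embedding = nn.Embedding(num_categories, cat_emb_dim)
        self.classifier = nn.Linear(
            hidden_size + num_numerical + cat_emb_dim, num_labels)

    def forward(self, cls_embedding, numerical, categorical):
        # cls_embedding: (batch, hidden_size), e.g. last_hidden_state[:, 0]
        # numerical:     (batch, num_numerical) float features
        # categorical:   (batch,) integer category ids
        cat = self.cat_embedding(categorical)               # (batch, cat_emb_dim)
        combined = torch.cat([cls_embedding, numerical, cat], dim=-1)
        return self.classifier(combined)                    # (batch, num_labels)

# Smoke test with random tensors standing in for real BERT output.
head = TabularBertHead()
logits = head(torch.randn(4, 768), torch.randn(4, 3),
              torch.randint(0, 10, (4,)))
print(logits.shape)  # torch.Size([4, 2])
```

In a real model you would wrap this around a pretrained encoder (e.g. `AutoModel.from_pretrained(...)`) and feed `outputs.last_hidden_state[:, 0]` in as `cls_embedding`; embedding the categorical ids rather than passing raw integers lets the model learn a representation for each category.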

Any help or tutorials on this would be greatly appreciated.

Thanks


Great question! I have been looking for answers on the same topic for my academic project. I hope we will get some help here.


@lewtun @sgugger can you please share some suggestions/insights into this topic 🙂

An interesting toy project of mine is finding paragraph breaks in lines of text broken from a well-formatted original based on string length, like text extracted from PDFs. An approach based on domain knowledge (human grammar and punctuation) would be that a paragraph break generally only occurs after a sentence break, meaning a “.” or “!”. A paragraph break is especially likely if the final punctuation is followed by significant whitespace before the line end. This is the sort of pattern that can be captured nicely by a regular expression such as `[.!]\s*$`.
Matching this regexp against a string returns a boolean value, i.e. 0 or 1. A combination of (a) that matching test, expressed as a string function with a boolean return value, and (b) a pre-trained BERT model should be more accurate than either separately.
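That matching test can be sketched as a small feature function (the helper name is mine): `[.!]\s*$` matches a literal “.” or “!” followed only by optional whitespace up to the end of the line, which is the pattern described above.

```python
import re

# Sentence-ending "." or "!" followed only by optional
# whitespace before the end of the line.
SENTENCE_END = re.compile(r"[.!]\s*$")

def ends_like_sentence(line: str) -> int:
    """Return 1 if the line appears to end a sentence, else 0 --
    a boolean feature that can be fed to a model alongside BERT."""
    return 1 if SENTENCE_END.search(line) else 0

print(ends_like_sentence("It was the best of times.   "))  # 1
print(ends_like_sentence("It was the best of"))            # 0
```

The integer output can then be concatenated to the [CLS] embedding exactly like any other numerical column, which is one concrete way to wire a hand-written rule into the model.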
How best to accomplish this, given the rather rigid structure of BERT-type LLMs? A good solution can point the way to combining narrow human rule-based heuristics with the complexity of LLMs, which despite their rich context-based learning can have difficulty learning narrow rules like a regular expression match. Such hybrid models may excel at extracting rules and formulas from natural language text, for example, legal documents.