How to best deal with numbers?

bone · October 10, 2020, 10:27pm

Let's say I have a mix of words and numbers, where the numbers represent, for example, prices or sizes of objects. The relation/magnitude between the numbers is very important. Normal tokenization would split large or less common numbers into smaller pieces (a few digits each) and concatenate them. Isn't this very inefficient and suboptimal, since the transformer has to learn how digits and numbers work and would assign every digit or digit combination its own learned vector? Would it make sense to convert these numbers manually and pass them in as their own vector?

Example:
DATA (Laying carpet 150sqft | price 400$) -> LABEL (okay)
DATA (Carpet type 133 installation with an area of 70m2 | price 8000$) -> LABEL (not okay)

PereLluis13 · October 12, 2020, 11:00am

If your task is closely related to those numbers, and all examples follow the same pattern, perhaps you should just use the price and surface as raw features in your model, along with the text. You could concatenate the price and surface values to the Transformer embedding of the text at the last hidden layer, before classification. I am not sure if that is what you meant by

Maybe also normalize the surface values to a single unit, for example by converting everything given in sqft to m2.
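A minimal sketch of that unit normalization, assuming only the two units from the examples in the question (sqft and m2). The function name `to_square_metres` and the lookup table are hypothetical, chosen for illustration.

```python
# Exact by definition: 1 ft = 0.3048 m, so 1 sq ft = 0.3048 ** 2 m2.
SQFT_TO_M2 = 0.09290304

# Hypothetical lookup table; extend it with whatever units the data contains.
TO_M2 = {"m2": 1.0, "sqft": SQFT_TO_M2}


def to_square_metres(value, unit):
    """Convert an area given in a known unit to square metres."""
    try:
        factor = TO_M2[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown area unit: {unit!r}")
    return value * factor


area_a = to_square_metres(150, "sqft")  # first example in the question
area_b = to_square_metres(70, "m2")     # second example, already in m2
```

With both areas in one unit, a given value means the same thing in every example.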

I don’t know if there are better ways to deal with numeric values at the tokenization stage, but this would be my suggestion.
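The concatenation suggested above could look roughly like the sketch below. It is framework-free and uses plain lists to stand in for tensors. The names `build_classifier_input` and `linear_head` are hypothetical, the three-value `pooled` vector is a made-up stand-in for a real encoder output, and the log scaling is an added assumption rather than part of the suggestion.

```python
import math


def build_classifier_input(text_embedding, price, area_m2):
    """Append numeric features to a pooled text embedding."""
    # Assumption: log1p compresses large magnitudes so that a price such as
    # 8000 does not dwarf embedding values, which are typically of order 1.
    numeric = [math.log1p(price), math.log1p(area_m2)]
    return list(text_embedding) + numeric


def linear_head(features, weights, bias):
    """A single logit from a linear layer over the concatenated features."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Made-up stand-in for the encoder's pooled output; a real model would
# produce a much longer vector here.
pooled = [0.12, -0.40, 0.33]
x = build_classifier_input(pooled, price=8000.0, area_m2=70.0)
# x now has len(pooled) + 2 entries and is what the classification head sees.
```

In a real model the head's input size would be the encoder's hidden size plus the number of numeric features, and the head would be trained together with the rest of the network.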

rgwatwormhill · October 12, 2020, 3:37pm

I agree with PereLluis13.

Were you thinking of using a pre-trained transformer, or training from scratch?

Are you sure you need a transformer at all? What are you hoping it will do with the text? Have you tried regular expressions and a rule-based system?
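As a rough illustration of the regular-expression route, the sketch below pulls the area and price out of strings shaped like the two examples in the question and applies a single price-per-area rule. The patterns, the function names, and especially the threshold of 50 per m2 are assumptions made up for this example, not values from the thread.

```python
import re

SQFT_TO_M2 = 0.09290304  # 1 sq ft in m2 (0.3048 ** 2)

# Patterns assume the format of the two examples: "<number>sqft" or
# "<number>m2" for the area, and "price <number>$" for the price.
AREA_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(sqft|m2)", re.IGNORECASE)
PRICE_RE = re.compile(r"price\s*(\d+(?:\.\d+)?)\s*\$", re.IGNORECASE)


def extract(text):
    """Return area (in m2) and price, or None if either is missing."""
    area = AREA_RE.search(text)
    price = PRICE_RE.search(text)
    if area is None or price is None:
        return None
    value = float(area.group(1))
    if area.group(2).lower() == "sqft":
        value *= SQFT_TO_M2
    return {"area_m2": value, "price": float(price.group(1))}


def label(text, max_price_per_m2=50.0):
    """Hypothetical rule: flag anything above a price-per-m2 threshold."""
    parsed = extract(text)
    if parsed is None:
        return "unknown"
    per_m2 = parsed["price"] / parsed["area_m2"]
    return "okay" if per_m2 <= max_price_per_m2 else "not okay"


first = label("Laying carpet 150sqft | price 400$")
second = label("Carpet type 133 installation with an area of 70m2 | price 8000$")
```

Note that the "133" in the second example is skipped because it is not followed by a unit. Free-form text with more varied phrasing would need more patterns than these two.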
