Let's say I have a mix of words and numbers, which represent, for example, prices or sizes of objects. The relative magnitude of the numbers is very important. Normal tokenization would split these large/less common numbers into smaller pieces (a few digits each) and concatenate them. Isn't this inefficient and suboptimal, since the transformer has to learn how digits/numbers work and assigns every digit/combination its own learned vector? Would it make sense to manually convert these numbers and pass each one in as its own vector?
DATA (Laying carpet 150sqft | price 400$) -> LABEL (okay)
DATA (Carpet type 133 installation with an area of 70m2 | price 8000$) -> LABEL (not okay)
If your task depends closely on those numbers, and all examples follow the same pattern, perhaps you should just use the price and surface area as raw features in your model, along with the text. You could concatenate the price and surface values to the Transformer's text embedding at the last hidden layer, before classification. I am not sure if that is what you meant by passing each number in as its own vector.
Maybe also normalize the surface values to one standard unit.
I don't know if there are better ways to handle numeric values at tokenization time, but this would be my suggestion.
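The suggestion above could be sketched roughly as follows. This is a minimal illustration, not a tested recipe: the embedding is just a placeholder for whatever pooled vector your transformer produces, the scaling constants are made up, and the unit conversion uses 1 sqft ≈ 0.092903 m2.

```python
import numpy as np

SQFT_TO_M2 = 0.092903  # convert all surfaces to one standard unit (m2)

def build_features(text_embedding, price, surface, surface_unit):
    """Concatenate normalized numeric features onto the pooled text embedding."""
    if surface_unit == "sqft":
        surface = surface * SQFT_TO_M2
    # crude fixed scaling for illustration; in practice fit mean/std
    # (or min/max) on the training set instead
    numeric = np.array([price / 1000.0, surface / 100.0])
    return np.concatenate([text_embedding, numeric])

# stand-in for the transformer's pooled last-hidden-layer output
emb = np.zeros(768)
features = build_features(emb, price=400.0, surface=150.0, surface_unit="sqft")
print(features.shape)  # (770,)
```

The classification head would then take this 770-dimensional vector instead of the bare 768-dimensional text embedding.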
I agree with PereLluis13.
Were you thinking of using a pre-trained transformer, or training from scratch?
Are you sure you need a transformer at all? What are you hoping it will do with the text? Have you tried regular expressions and a rule-based system?
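For data as regular as the examples above, rule-based extraction can go a long way. A rough sketch (the pattern and conversion are illustrations for exactly this input format, not a tested rule set):

```python
import re

# match lines like "Laying carpet 150sqft | price 400$":
# a number followed by a surface unit, then a price ending in "$"
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(sqft|m2).*?price\s*(\d+(?:\.\d+)?)\$")

def extract(text):
    """Return (surface_in_m2, price) or None if the pattern is absent."""
    m = PATTERN.search(text)
    if m is None:
        return None
    surface, unit, price = float(m.group(1)), m.group(2), float(m.group(3))
    if unit == "sqft":
        surface *= 0.092903  # normalize to m2
    return surface, price

print(extract("Laying carpet 150sqft | price 400$"))
print(extract("Carpet type 133 installation with an area of 70m2 | price 8000$"))
```

With the numbers extracted like this, a simple threshold or a small classic model (logistic regression, gradient-boosted trees) on (surface, price) might already solve the task without any transformer.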