Looking for ways to extract custom tokens from text

Hello community,

I am working on a project that requires extraction of a specific value from a text. Here is an example:
“This job offers a salary of $60000 and additional benefits like equity, health insurance and a private apartment”.
I want to train a model that recognizes $60000 as the salary of the job and also picks up the related benefits, such as the equity and health insurance.

I have already solved this with a large collection of regular expressions and manual text extraction, but as you are aware, there is always that one example that breaks the system. Therefore, I am hoping that I can train a model to recognize these “tokens”.

So in my internal language, the “$60000” is a token of the type “salary_value” and “equity”, “health insurance” and “private apartment” are tokens of the type “benefits”. There are a couple of other token types, but for the example let’s stay with this.

I have a lot of training data where these are annotated: the text span that contains the token and the token type that is expected.
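Annotations like these are commonly converted into per-word BIO tags before training a token classification model. Below is a minimal sketch of that conversion, assuming each annotation is a `(start, end, label)` character span with an exclusive `end`. The helper names `find_span` and `spans_to_bio` and the whitespace tokenization are assumptions for illustration, not part of any library.

```python
import re

def find_span(text, phrase, label):
    """Hypothetical helper: locate the first occurrence of `phrase` as a labeled span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def spans_to_bio(text, spans):
    """Turn character-level span annotations into word-level BIO tags."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):  # naive whitespace tokenization
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # A word belongs to a span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                # The first word of a span gets B-, later words get I-.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

text = ("This job offers a salary of $60000 and additional benefits like "
        "equity, health insurance and a private apartment")
spans = [
    find_span(text, "$60000", "salary_value"),
    find_span(text, "equity", "benefits"),
    find_span(text, "health insurance", "benefits"),
    find_span(text, "private apartment", "benefits"),
]
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token:12s} {tag}")
```

In a real pipeline the word-level tags would still need to be aligned with the subword tokens produced by the model's tokenizer.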

Can I use any of the Hugging Face libraries to build something similar? I have looked at the existing models, but they focus a lot on NER labels like “location”, “name”, “company”, etc.
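For context, the token classification heads in the `transformers` library are not tied to a fixed label set: the labels are supplied when the model is loaded. Below is a minimal sketch of a label map for the two custom types from the example. The helper names `build_label_maps` and `load_model` are hypothetical, the checkpoint name is left to the caller, and the model call is wrapped in a function so the label-map part runs without `transformers` installed.

```python
LABEL_TYPES = ["salary_value", "benefits"]  # the two types from the example

def build_label_maps(label_types):
    """Build BIO label maps: one 'O' label plus B-/I- labels per custom type."""
    labels = ["O"]
    for label_type in label_types:
        labels += [f"B-{label_type}", f"I-{label_type}"]
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id

def load_model(checkpoint, id2label, label2id):
    """Load a pretrained encoder with a fresh token classification head.

    Imported lazily; calling this downloads the checkpoint.
    """
    from transformers import AutoModelForTokenClassification
    return AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )

id2label, label2id = build_label_maps(LABEL_TYPES)
print(id2label)
```

The classification head created this way is randomly initialized, so it only becomes useful after fine-tuning on the annotated data.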

I guess a good summary is that I am looking for some guidance on what to use best here.

Thanks!

Alex

This really looks like something called knowledge graph extraction. I remember seeing something similar there; LSTMs, convnets, and maybe transformers perform well on it:

You should really search for this.
Some other links:

Hi @m1active520 ,

Wanted to ask: did you end up figuring out how to train such a model? I’m asking because I’m looking to do something similar (but simpler). More specifically, I have a dataset that consists of two columns: “description” and “store_number”, and I want my model to be able to extract the store_number from any description it is given. For instance:

descriptions:
[“FIVE GUYS 2565 DIST 468-981-3409 AL”,
“McDonald’s K6148”]

store_numbers:
[2565,
K6148]

and so on. I’m having a hard time figuring out what model or type of model I should train in order to accomplish this. Initially, I also looked at named entity recognition and token classification, but that doesn’t seem to be the correct approach, since 1) BERT’s NER is limited to things like person, organization, etc., and 2) even if we could identify numbers with such an NER model, we aren’t interested in finding the category/type of an entity, but rather in correctly identifying the entity itself.
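One possible way to frame this data for span extraction is to locate each store number inside its description and keep the character offsets as the training target. Below is a minimal sketch, assuming the store number appears verbatim in the description; `locate_answer` and the `store_number` label are hypothetical names used only for illustration.

```python
def locate_answer(description, store_number):
    """Return (start, end) character offsets of store_number, or None if absent."""
    target = str(store_number)
    start = description.find(target)
    if start == -1:
        return None
    return start, start + len(target)

descriptions = ["FIVE GUYS 2565 DIST 468-981-3409 AL", "McDonald's K6148"]
store_numbers = ["2565", "K6148"]

examples = []
for description, number in zip(descriptions, store_numbers):
    span = locate_answer(description, number)
    if span is None:
        continue  # skip rows where the number is not found verbatim
    examples.append({
        "text": description,
        "start": span[0],
        "end": span[1],
        "label": "store_number",
    })

for example in examples:
    print(example["text"][example["start"]:example["end"]], example)
```

Examples in this shape can feed either a token classification setup with a single custom label or an extractive question answering setup, both of which predict where the value sits in the text instead of assigning it a category.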

Thanks!