Custom Entity Extraction from text

Hi! I’m trying to build a learning-based custom entity extraction model that is capable of extracting a specific value from a short piece of text. In other words, I have a dataset that consists of two columns: “description” and “store_number”, and I want my model to be able to extract the store_number from any description it is given. For instance:

descriptions:
[“FIVE GUYS 2565 DIST 468-981-3409 AL”,
“McDonald’s K6148”]

store_numbers:
[2565,
K6148]

and so on. I’m having a hard time figuring out what model or type of model I should train in order to accomplish this. Initially, I looked at Named Entity Recognition (NER) and token classification, but that doesn’t seem to be the correct approach, since 1) BERT’s NER is limited to things like person, organization, etc., and 2) even if we could identify numbers with such an NER model, we aren’t interested in finding the category/type of an entity, but rather in correctly identifying the entity itself.

Any help on this would be appreciated. Thanks!

Did you figure this out?

You need to define a custom NER tag set. Here is a general outline of what you need to do:

  1. Label every word in the training data, marking the store number with a dedicated tag. For example:

Text : “FIVE GUYS 2565 DIST 468-981-3409 AL”
Tags: O O B-STORE O O O

Here there is one tag per whitespace-separated word: B-STORE marks the store number, while O marks every other word, which the model should ignore.
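Since the dataset already has “description” and “store_number” columns, the tags can be generated automatically rather than by hand. The following is a minimal sketch, assuming the store number always appears as a single whitespace-separated word in the description; the function name `tag_words` is made up for illustration.

```python
from typing import List, Tuple


def tag_words(description: str, store_number) -> Tuple[List[str], List[str]]:
    """Split a description on whitespace and tag the store number as B-STORE.

    Assumption: the store number appears verbatim as one word. Only the
    first matching word is tagged; everything else gets the O tag.
    """
    words = description.split()
    tags = ["O"] * len(words)
    target = str(store_number)
    if target in words:
        tags[words.index(target)] = "B-STORE"
    return words, tags


words, tags = tag_words("FIVE GUYS 2565 DIST 468-981-3409 AL", 2565)
for word, tag in zip(words, tags):
    print(word, tag)
```

Rows where the store number cannot be found as a whole word end up with only O tags, so it is worth checking for those and handling them separately.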

  2. Tokenize and create a dataset of input_ids, attention_mask and labels. Use an encoder architecture such as BERT or XLM-RoBERTa to compute an embedding for each input token and feed it to a fully connected layer that predicts a tag per token. You might find the seqeval library helpful for evaluating the tagger here.

  3. Train for a few epochs and you will have your custom NER model ready.
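One detail the outline skips is that encoders like BERT split words into subword tokens, so the word-level tags have to be aligned to tokens before training, and the predicted tags have to be mapped back to a word afterwards. Below is a rough sketch of both steps in plain Python. It assumes you have a list of word indices per token, such as the one a Hugging Face fast tokenizer returns from `word_ids()` when called with `is_split_into_words=True`. The helper names and the example `word_ids` list are hypothetical; the real subword split depends on the tokenizer you pick.

```python
from typing import List, Optional

LABEL2ID = {"O": 0, "B-STORE": 1}
# PyTorch's CrossEntropyLoss ignores targets equal to -100 by default.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_tags: List[str]) -> List[int]:
    """Give each token a label id; only the first subword of a word is scored."""
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            labels.append(IGNORE_INDEX)  # special token such as [CLS] or [SEP]
        elif word_id == previous:
            labels.append(IGNORE_INDEX)  # later subword of the same word
        else:
            labels.append(LABEL2ID[word_tags[word_id]])
        previous = word_id
    return labels


def extract_store_number(words: List[str], predicted_tags: List[str]) -> Optional[str]:
    """Return the first word tagged B-STORE, or None if nothing was tagged."""
    for word, tag in zip(words, predicted_tags):
        if tag == "B-STORE":
            return word
    return None


words = ["McDonald’s", "K6148"]
word_tags = ["O", "B-STORE"]
# Hypothetical split: [CLS] + 3 subwords for word 0 + 3 subwords for word 1 + [SEP]
word_ids = [None, 0, 0, 0, 1, 1, 1, None]

print(align_labels(word_ids, word_tags))
print(extract_store_number(words, word_tags))
```

At inference time you would take the model's predicted tag for the first subword of each word and pass those word-level tags to `extract_store_number`, which addresses the original concern: the tag tells you where the store number is, and the word at that position is the value you want.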