NLP model for tag generation

Hi all,

I’ve recently started out in the NLP world, so this might seem like an obvious question, but are there models that generate tags from text? E.g. let’s say you have a corpus of text describing houses and you would like to retrieve the house type, size and color. The goal would be to get an output like this for every house description: {house_type: ‘apartment’, size: ‘1000sqft’, color: ‘grey’}

It seems that question answering models are able to extract that information (or at least answer one of the questions at a time), but I’m not sure if there’s a way to adapt them for the above task.

It seems to me that this could be pretty useful for training other models (e.g. vision) in a self-supervised way (e.g. predicting the house type, size and color from a house photo, using the output of the model above as the target variable).


Hi Eliot,

If this information is explicitly available in the text, then you can try to achieve this with a QA model by asking questions like “What is the house type?”, “What’s the color of the house?”, etc.
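A minimal sketch of the QA idea, asking one question per tag. The `extract_tags` helper, the `QUESTIONS` mapping and the `min_score` threshold are hypothetical names I made up for illustration. The `qa` argument is any callable that takes `question=` and `context=` and returns a dict with `answer` and `score` keys, which is the shape the `transformers` question-answering pipeline returns.

```python
# Hypothetical tag -> question mapping; adapt the wording to your own data.
QUESTIONS = {
    "house_type": "What is the house type?",
    "size": "What is the size of the house?",
    "color": "What is the color of the house?",
}


def extract_tags(text, qa, questions=QUESTIONS, min_score=0.3):
    """Ask one question per tag; keep an answer only if its score is high enough.

    `min_score` is an arbitrary cutoff (an assumption), tune it on your data.
    """
    tags = {}
    for tag, question in questions.items():
        result = qa(question=question, context=text)
        tags[tag] = result["answer"] if result["score"] >= min_score else None
    return tags


# Usage with a real model (assumes `transformers` is installed and that
# downloading the pipeline's default QA model is acceptable):
#
#   from transformers import pipeline
#   qa = pipeline("question-answering")
#   print(extract_tags("A grey apartment of 1000 sqft near the park.", qa))
```

Note that this runs one forward pass per tag and per description, so the cost grows with the number of tags.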

There are other methods for this type of semantic parsing task, but one way you can approach it is with a text2text approach using T5 (it’s a seq-to-seq model where you feed in some text and ask the model to output some text), i.e. given your text, you can train T5 to output structured text, something like
house_type: apartment <sep> color: grey <sep> house_size: 1000

This might be overkill, but I tried this as an experiment in my work and so far it’s doing really well.
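Whatever model produces it, the structured string above has to be built for the training targets and parsed back into a dict at inference time. A rough sketch of that round trip, assuming the `<sep>` delimiter and `key: value` layout from the example (`format_tags` and `parse_tags` are hypothetical helper names, not part of any library):

```python
def format_tags(tags, sep="<sep>"):
    """Build the target string a text2text model would be trained to output."""
    return f" {sep} ".join(f"{key}: {value}" for key, value in tags.items())


def parse_tags(output, sep="<sep>"):
    """Turn a generated string back into a dict, skipping malformed chunks."""
    tags = {}
    for chunk in output.split(sep):
        key, colon, value = chunk.partition(":")
        key = key.strip()
        if not colon or not key:
            continue  # the model produced something that is not "key: value"
        tags[key] = value.strip()
    return tags


target = format_tags({"house_type": "apartment", "color": "grey", "house_size": "1000"})
print(target)              # house_type: apartment <sep> color: grey <sep> house_size: 1000
print(parse_tags(target))  # {'house_type': 'apartment', 'color': 'grey', 'house_size': '1000'}
```

Skipping malformed chunks instead of raising is a design choice here: a generative model can emit output that doesn't follow the format, and you probably want to keep whatever tags did parse.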

One other approach would be to frame this as an entity extraction task, where your entities will be house_type, color and size. Something like spaCy could really help. If you are new to entity extraction, see this demo to get an idea.


Hey @valhalla thanks for your reply!

I’m afraid the QA model will end up being inefficient, and I didn’t know that the text2text approach might work for such a problem.

But entity extraction is definitely a great idea. I checked a demo and it seems like it could do the job. Will be my first option.

Thanks again!

Hi @elliotben,

Wanted to ask: did you end up figuring out how to train such a model? I’m asking because I’m looking to do something similar (but simpler). More specifically, I have a dataset that consists of two columns: “description” and “store_number”, and I want my model to be able to extract the store_number from any description it is given. For instance:

[“FIVE GUYS 2565 DIST 468-981-3409 AL”,
“McDonald’s K6148”]


and so on. I’m having a hard time figuring out what model or type of model I should train in order to accomplish this. Initially, I looked at Named Entity Recognition and token classification, but that doesn’t seem to be the correct approach, since 1) BERT’s NER is limited to things like person, organization, etc., and 2) even if we could identify numbers with such an NER model, we aren’t interested in finding the category/type of an entity, but rather in correctly identifying the entity itself.
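For what it's worth, token classification is not tied to the person/organization label set: the labels come from whatever data the model is fine-tuned on, so a single custom label is possible. A minimal sketch of turning (description, store_number) pairs into per-token BIO labels, assuming whitespace tokenization and that the store number appears verbatim in the description. `bio_labels` and the `STORE_NUMBER` label are hypothetical names, and the store numbers used below are my guesses for the two example descriptions, not values from the dataset.

```python
def bio_labels(description, store_number, label="STORE_NUMBER"):
    """Tag the first verbatim occurrence of store_number with B-/I- labels.

    Uses plain whitespace tokens (an assumption); a real fine-tuning setup
    would still need to align these labels with the model's subword tokens.
    """
    tokens = description.split()
    target = store_number.split()
    labels = ["O"] * len(tokens)
    if not target:
        return tokens, labels
    for start in range(len(tokens) - len(target) + 1):
        if tokens[start:start + len(target)] == target:
            labels[start] = f"B-{label}"
            for i in range(start + 1, start + len(target)):
                labels[i] = f"I-{label}"
            break  # only label the first match
    return tokens, labels


# Store numbers here are assumed for illustration only.
examples = [
    ("FIVE GUYS 2565 DIST 468-981-3409 AL", "2565"),
    ("McDonald’s K6148", "K6148"),
]
for description, store_number in examples:
    print(list(zip(*bio_labels(description, store_number))))
```

At prediction time, the tokens labeled `B-STORE_NUMBER`/`I-STORE_NUMBER` are the extracted entity itself, so identifying the span and identifying its type happen in the same step.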