NER: Treat whole sequence as one entity

Hey there,

I am facing a problem where each string, as a whole, either represents a named entity or not. But I really struggle to find a model that does this, or to come up with a way to force the usual NER models to classify the whole sequence rather than its substrings.

For some context, the strings can be in any language and I am mostly interested in deciding between person/corporation/group and non-NE. Being able to narrow the classes down to those would be a big plus. Any simple solutions? Until now I guess I have to build and train the NER top layer myself, but I might be missing something obvious.

Thanks a lot in advance!
Johannes

Hello :hugs:
Can you give an example input and output for the mentioned model?

Hello Merve, please excuse my late reply, I am right in the process of moving :upside_down_face:

Here are a few example input → output pairs:

  1. “Alexander von Schönburg” → “person”
  2. “b. braun melsungen” → “company”
  3. “What a beautiful day” → ------------

So it is known that the whole sequence represents a named entity (or not), and I would like to have a model which makes use of that information – otherwise there is a chance that the model classifies the substrings (1. “Alexander” → “person”, “Schönburg” → “location”; 2. “b. braun” → “person”, “melsungen” → “location”). It’s basically just a simple sequence classification task instead of classic token-level NER.
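For illustration, here is a minimal sketch of that framing with a generic multilingual encoder (the checkpoint name and the label set are just placeholders, and the classification head would of course still need fine-tuning on labeled strings):

```python
# Minimal sketch: classify the whole string instead of individual tokens.
# "xlm-roberta-base" and the label list are placeholders, not recommendations.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["person", "company", "group", "non-NE"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)  # the classification head is freshly initialised and still needs fine-tuning

inputs = tokenizer("Alexander von Schönburg", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])  # one label for the whole sequence
```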

Hope that makes it a bit clearer?

Hello :wave:

And will your sequences definitely all be entities? E.g., what if the below happens:
Alexander von Schönburg is working at b. braun melsungen.
How should it be classified?

The expected input is definitely just one entity. So ideally your example (and my example #3) would be classified as non-NE or some other class which makes it clear to me that I should manually check those samples since they are probably “dirt” in the data.

I feel like you could just model this as a text classification task, but I don’t think it will be very successful, given my intuition that NER models are a bit context-based or use look-ups. Can’t you just take a NER model like this one, get rid of the spans, and just return the outcome?
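Roughly something like this, as a sketch (the checkpoint here is only an example of a multilingual NER model, not necessarily the one linked above):

```python
# Sketch of that idea: run an off-the-shelf NER pipeline and drop the span
# information, keeping only the predicted entity labels. The checkpoint below
# is just one example of a multilingual NER model.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Davlan/xlm-roberta-base-ner-hrl",
    aggregation_strategy="simple",
)

preds = ner("b. braun melsungen")
print([p["entity_group"] for p in preds])  # e.g. ['ORG'], or several labels
```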

Yeah, that was my initial thought too, but then I gave tner (xlm-roberta) a shot for a fast and simple baseline and was baffled by how well it worked without any context. So I thought the obvious step to increase the performance was to make the task easier by using the knowledge about the input. Otherwise I need to figure out a way to handle the cases where multiple entities are detected.

Can you elaborate on the last part of your last sentence? By spans, do you mean the information on where the entities begin and end? Yeah, that’s pretty much what I do right now, and it works fine in many cases. But my problem is that there often are multiple entities identified in one sequence, and I have to end up with just a single label.
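For what it’s worth, the kind of collapsing I have in mind would be something like this (purely a sketch on my side, nothing I have settled on):

```python
# One possible heuristic (just an idea, not a settled solution) for collapsing
# several detected entities into a single label: pick the entity type that
# covers the most characters of the input string.
from collections import defaultdict

def collapse_to_single_label(ner_predictions, non_entity_label="non-NE"):
    """Map a list of aggregated NER pipeline predictions to one label."""
    if not ner_predictions:
        return non_entity_label
    coverage = defaultdict(int)
    for p in ner_predictions:
        coverage[p["entity_group"]] += p["end"] - p["start"]
    return max(coverage, key=coverage.get)

# e.g. collapse_to_single_label(ner("Alexander von Schönburg"))
# where `ner` is an aggregated token-classification pipeline as above
```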

Thank you for your effort and patience! :hugs:


One year later … but for anyone else with a similar use-case.

In my use-case the expected entity was only PER or ORG; I also found that NER worked quite well without any context. There were some cases where it didn’t recognise that the whole sequence did indeed correspond to one entity, so I was also hoping to find a way to specify entity boundaries, or something similar.

In the meantime, I have found that prefixing and suffixing the text does help in some instances where multiple entities were otherwise detected (a small sketch follows the examples below).

e.g.

  1. The entity’s name is Alexander von Schonburg.
  2. The entity’s name is b. braun melsungen.
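In code, the wrapping looks roughly like this (just a sketch; `ner_pipeline` stands for whatever NER pipeline you already use, and the offset filtering is my own assumption):

```python
# Sketch of the prefix/suffix trick: wrap the raw string in a short carrier
# sentence before running NER, then keep only entities that overlap the
# original string. Template and filtering are purely illustrative.
def wrap_and_tag(text, ner_pipeline, prefix="The entity's name is ", suffix="."):
    wrapped = f"{prefix}{text}{suffix}"
    preds = ner_pipeline(wrapped)  # expects aggregated predictions with start/end
    start, end = len(prefix), len(prefix) + len(text)
    return [p for p in preds if p["start"] < end and p["end"] > start]
```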

Well, my case is very similar: it is exactly the “Treat whole sequence as one entity” problem, but for LOCATIONS.

I want to get something like this:
"Athens, United States" -> "LOC"
But all the NER models I checked only do this:
"Athens, United States" -> "LOC" "LOC"

Here are some more details.

So, what solution did you find for this?

P.S.
@merve @EM-L-D @AJHoeh - what do you think?

If you want “Athens, United States” to be labeled as one entity, I’d just label them as follows:

Athens B-LOC
, I-LOC
United I-LOC
States I-LOC

whereas if you want the model to treat them as 2 separate entities, then labeling would be done as follows:

Athens B-LOC
, O
United B-LOC
States I-LOC

Here, “B-LOC” means “beginning of a location entity”, “O” stands for “outside” (basically, not an entity), and “I-LOC” stands for “inside a location entity”.

This follows the classic IOB labeling scheme often used for NER.
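If it helps, here is a small sketch of how that labeling can be aligned to subword tokens for training a token-classification model (the tokenizer checkpoint is only illustrative):

```python
# Sketch: encode the "whole phrase = one LOC entity" labeling above and align
# it to subword tokens for training. The tokenizer checkpoint is illustrative.
from transformers import AutoTokenizer

label2id = {"O": 0, "B-LOC": 1, "I-LOC": 2}
words = ["Athens", ",", "United", "States"]
word_labels = ["B-LOC", "I-LOC", "I-LOC", "I-LOC"]  # whole phrase as one entity

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoding = tokenizer(words, is_split_into_words=True)

aligned, previous = [], None
for word_id in encoding.word_ids():
    if word_id is None:
        aligned.append(-100)  # special tokens are ignored by the loss
    elif word_id != previous:
        aligned.append(label2id[word_labels[word_id]])  # first subword gets the label
    else:
        aligned.append(-100)  # remaining subwords are ignored
    previous = word_id
print(aligned)
```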

If you want “Athens, United States” to be labeled as one entity, I’d just label them as follows …

Well, I think everybody needs correct recognition of such locations, so some pre-trained model recognizing “Athens, United States” as one entity exists somewhere… what is the name of this model?