Token positions when using the Inference API

Hi,

I am looking to run some NER models via the Inference API, but I am running into some issues. My problem is that the Inference API does not seem to return token positions. Consider this request:

curl -X POST https://api-inference.huggingface.co/models/dslim/bert-base-NER \
-H "Authorization: Bearer <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d "Hello Sarah Jessia Parker who lives in New York."

It returns:

[
  {
    "entity_group": "PER",
    "score": 0.9959956109523773,
    "word": "Sarah Jessia Parker"
  },
  {
    "entity_group": "LOC",
    "score": 0.9994343519210815,
    "word": "New York"
  }
]

So it finds the right tokens (and, nicely, returns them grouped correctly). However, there is no indication of where the tokens start in the input text.

Confusingly, the model page (https://huggingface.co/dslim/bert-base-NER?text=Hello+Sarah+Jessia+Parker+who+lives+in+New+York.) highlights the right tokens, which would suggest you can get token positions (unless it’s doing something terribly hacky like looking for the first occurrence of a particular token).

Is there an option I’m missing?

Indeed, the page seems to just be highlighting the first occurrence of a token. Note how it (probably) picks up the wrong “Jessica” in this example: https://huggingface.co/dslim/bert-base-NER?text=Hello+Sarah+Jessica+Parker+who+Jessica+lives+in+New+York.
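To illustrate the suspected hack: locating an entity by its first occurrence gives the wrong span whenever the same string appears more than once. A minimal sketch (my own code, not what the model page actually runs):

```python
text = "Hello Sarah Jessica Parker who Jessica lives in New York."

def naive_span(text: str, word: str) -> tuple[int, int]:
    """Return the span of the FIRST occurrence of `word` in `text`."""
    start = text.find(word)
    return start, start + len(word)

# The second "Jessica" entity gets mapped to the first occurrence,
# which sits inside "Sarah Jessica Parker":
print(naive_span(text, "Jessica"))  # (12, 19)
```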

Is there a way to extract the token position from the Inference API?

Hi @fdb,

Thanks for the note. Your assumptions are exactly right.
In earlier versions of tokenizers/transformers, token offsets were not necessarily available (because the encode operation was destructive). Since the offsets were not available, the API could not send that information.

However, thanks to the tokenizers work by @anthony, and with the upcoming 4.0 release of transformers, the encode operation will no longer be destructive, so offsets will be available.
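For intuition, a non-destructive encode keeps character offsets alongside each token, so entity spans can be mapped back to the original string. A toy sketch with plain regex word-splitting (a conceptual illustration only, not the actual tokenizers implementation):

```python
import re

def encode_with_offsets(text: str) -> list[tuple[str, int, int]]:
    """Toy word-level tokenizer that records (token, start, end) for each token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

tokens = encode_with_offsets("Hello New York")
# Each token carries the span it came from, so text[start:end] == token.
print(tokens)  # [('Hello', 0, 5), ('New', 6, 9), ('York', 10, 14)]
```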

There is a PR for transformers covering this pipeline: https://github.com/huggingface/transformers/pull/8781
And the Inference API is already running the fix:

curl -X POST -d 'Hello Sarah Jessica Parker who Jessica lives in New York' https://api-inference.huggingface.co/models/dslim/bert-base-NER
[
  {
    "entity_group": "PER",
    "score": 0.9960219860076904,
    "word": "Sarah Jessica Parker",
    "start": 6,
    "end": 26
  },
  {
    "entity_group": "PER",
    "score": 0.9771094918251038,
    "word": "Jessica",
    "start": 31,
    "end": 38
  },
  {
    "entity_group": "LOC",
    "score": 0.9994266927242279,
    "word": "New York",
    "start": 48,
    "end": 56
  }
]
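With these fields, consumers can recover the exact spans by slicing the input string. A sketch using the response above (hardcoded here rather than fetched from the API):

```python
text = "Hello Sarah Jessica Parker who Jessica lives in New York"
entities = [
    {"entity_group": "PER", "word": "Sarah Jessica Parker", "start": 6, "end": 26},
    {"entity_group": "PER", "word": "Jessica", "start": 31, "end": 38},
    {"entity_group": "LOC", "word": "New York", "start": 48, "end": 56},
]

# start/end index directly into the original string:
for ent in entities:
    assert text[ent["start"]:ent["end"]] == ent["word"]
    print(f'{ent["entity_group"]}: {ent["word"]!r} at [{ent["start"]}, {ent["end"]})')
```

Note how the second “Jessica” now gets its own distinct span (31–38) instead of colliding with the first occurrence.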

https://huggingface.co/dslim/bert-base-NER?text=Hello+Sarah+Jessica+Parker+who+Jessica+lives+in+New+York. should soon follow.

Hi @Narsil,

Indeed, I now see start/end positions in my responses from the inference API. Thanks for adding those fields!

The hard work was not done by me :slight_smile: