Getting spans from tokenizer

ldavid · October 11, 2023, 9:07pm

I’m fine-tuning RoBERTa for token classification on a custom NER dataset, which is formatted using NER Annotator for spaCy. It assigns entity types based on character spans rather than token indices.

I figured it wouldn’t be difficult to go from one to the other, but I couldn’t find anything related to spans in RobertaTokenizer. Calling the tokenizer on a text gives me only input IDs and an attention mask, which is not enough to assign entity information to each token. Is there some way to get the character span of each token returned by RobertaTokenizer?

I’d like to achieve something like this:

for annotation in annotations:
    text = annotation[0]
    labels = annotation[1]["entities"]
    tokenized = tokenizer(text)
    input_ids = tokenized["input_ids"]

    # "spans" is the output I wish existed: one (start, end) character span per token
    spans = tokenized["spans"]
    for label in labels:
        start = label[1]
        end = label[2]
        for span in spans:
            # the token overlaps the entity's character range
            if span[0] < end and start < span[1]:
                pass  # assign token with entity ID

ldavid · October 11, 2023, 9:48pm

I just found token_to_chars on the object returned by tokenizer(text), and it seems to be what I need. I’ll take a deeper look.


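# token_to_chars needs a fast tokenizer, e.g. RobertaTokenizerFast
tokenized_text = tokenizer(text)
num_of_tokens = len(tokenized_text["input_ids"])
for i in range(num_of_tokens):
    charspan = tokenized_text.token_to_chars(i)
    if charspan is None:  # special tokens such as <s> and </s> have no span
        continue
    print(charspan.start, charspan.end)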
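
For the alignment step itself, here is a minimal sketch of how the character spans could be turned into one label per token. It assumes the token offsets are available as a list of (start, end) pairs, for example from a fast tokenizer called with return_offsets_mapping=True, and that each entity is a (start, end, label) tuple. The label_tokens helper, the "O" outside label, and the tuple layout are my own assumptions for illustration, not part of the library or of the dataset format above.

```python
def label_tokens(offsets, entities, outside="O"):
    """Assign one label per token from character-level entity spans.

    offsets  -- list of (start, end) character offsets, one per token
    entities -- list of (start, end, label) character spans (assumed layout)
    """
    labels = [outside] * len(offsets)
    for ent_start, ent_end, ent_label in entities:
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:
                continue  # special tokens carry an empty span such as (0, 0)
            # Half-open intervals overlap when each starts before the other ends.
            if tok_start < ent_end and ent_start < tok_end:
                labels[i] = ent_label
    return labels


# Hand-written offsets for "Ada lives in Paris", standing in for
# tokenizer(text, return_offsets_mapping=True)["offset_mapping"].
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (13, 18), (0, 0)]
entities = [(0, 3, "PERSON"), (13, 18, "GPE")]
print(label_tokens(offsets, entities))
# ['O', 'PERSON', 'O', 'O', 'GPE', 'O']
```

The overlap test labels every token that shares at least one character with an entity, so an entity split across several subword tokens gets the same label on each piece. Mapping the string labels to integer IDs, or to a B-/I- scheme, would be a separate step.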
