Getting spans from tokenizer

I’m fine-tuning RoBERTa for token classification on a custom NER dataset, formatted with the NER Annotator for SpaCy. It assigns entity types based on character spans rather than token indices.

I figured it wouldn’t be difficult to go from one to the other, but I couldn’t find anything span-related in the RobertaTokenizer output. Calling RobertaTokenizer(text) gives me only input IDs and an attention mask, which isn’t enough to assign entity information to each token. Is there some way to get the character span of each token returned by RobertaTokenizer?

I’d like to achieve something like this:

for annotation in annotations:
    text = annotation[0]
    labels = annotation[1]["entities"]
    tokenized = tokenizer(text)
    input_ids = tokenized["input_ids"]

    spans = tokenized["spans"]
    for label in labels:
        start = label[1]
        end = label[2]
        for span in spans:
            # half-open ranges overlap iff each starts before the other ends
            if span[0] < end and start < span[1]:
                # assign token with entity ID
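As far as I can tell, the fast tokenizers in transformers (e.g. RobertaTokenizerFast) accept return_offsets_mapping=True and return an "offset_mapping" entry: one (start, end) character span per token, which is exactly the "spans" key imagined above. The alignment step itself is plain Python; here is a minimal sketch with hard-coded offsets standing in for a real tokenizer's output (the function name and the outside=-1 convention are my own, and the entity tuples are assumed to be (start, end, type)):

```python
def assign_entity_ids(offsets, entities, outside=-1):
    """Give each token the index of any entity whose character range
    overlaps the token's character range; `outside` otherwise."""
    labels = [outside] * len(offsets)
    for ent_idx, (ent_start, ent_end, _ent_type) in enumerate(entities):
        for tok_idx, (tok_start, tok_end) in enumerate(offsets):
            # half-open ranges [start, end) overlap iff each starts before the other ends
            if tok_start < ent_end and ent_start < tok_end:
                labels[tok_idx] = ent_idx
    return labels

# "Alice visited Paris" -> token character spans (hand-written here;
# in practice this is tokenized["offset_mapping"])
offsets = [(0, 5), (6, 13), (14, 19)]
entities = [(0, 5, "PERSON"), (14, 19, "GPE")]
print(assign_entity_ids(offsets, entities))  # [0, -1, 1]
```

With a real fast tokenizer, special tokens like <s> and </s> get a (0, 0) offset, so they fall out as `outside` automatically under this overlap test.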

Update: I just found token_to_chars on the BatchEncoding returned by tokenizer(text), and it seems to be exactly what I need. I’ll take a deeper look.

tokenized = tokenizer(text)
num_tokens = len(tokenized["input_ids"])
for i in range(num_tokens):
    charspan = tokenized.token_to_chars(i)
    print(charspan.start, charspan.end)
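Note that, as far as I can tell, token_to_chars is only available when using a fast tokenizer (RobertaTokenizerFast), since it relies on the offsets tracked by the underlying tokenizers library. Once you have a CharSpan per token, turning the character-level entities into per-token BIO tags is plain Python; a minimal sketch (the token spans below are hard-coded stand-ins for token_to_chars output, and bio_tags is my own helper name):

```python
def bio_tags(token_spans, entities):
    """Label each token B-/I-<type> if it overlaps an entity's
    character range, O otherwise."""
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, ent_type in entities:
        first = True
        for i, (tok_start, tok_end) in enumerate(token_spans):
            if tok_start < ent_end and ent_start < tok_end:
                tags[i] = ("B-" if first else "I-") + ent_type
                first = False
    return tags

# "New York mayor" -> hypothetical token spans from token_to_chars
token_spans = [(0, 3), (4, 8), (9, 14)]
entities = [(0, 8, "GPE")]
print(bio_tags(token_spans, entities))  # ['B-GPE', 'I-GPE', 'O']
```

One detail to watch with RoBERTa specifically: its byte-level BPE often attaches the leading space to a token, so a token's span may start one character before the entity annotation does; the overlap test above tolerates that, whereas an exact start == comparison would not.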