I’m fine-tuning RoBERTa for token classification on a custom NER dataset, which is formatted using NER annotator for SpaCy. That format assigns entity types to character spans rather than to token indices.
I figured it wouldn’t be difficult to go from one to the other, but I couldn’t find anything related to spans when using RobertaTokenizer.
Calling the tokenizer on the text gives me only input IDs and an attention mask, which is not enough for me to assign entity information to each token. Is there some way to get the character span of each token returned by the tokenizer?
I’d like to achieve something like this:
    # Pseudocode: "spans" is the key I wish the tokenizer returned,
    # holding one (start, end) character span per token.
    for annotation in annotations:
        text = annotation[0]
        labels = annotation[1]["entities"]
        tokenized = tokenizer(text)
        input_ids = tokenized["input_ids"]
        spans = tokenized["spans"]
        for label in labels:
            start = label[0]
            end = label[1]
            for span in spans:
                if span[0] < start < span[1] or span[0] < end < span[1]:
                    # assign token with entity ID
                    pass
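To make the intended alignment concrete, here is a minimal sketch of the overlap step on its own. It assumes the per-token character offsets are already available as `(start, end)` pairs; the offsets below are hand-written stand-ins rather than real tokenizer output, and `align_labels` is a hypothetical helper name. It also uses a half-open interval overlap test instead of the strict `<` comparisons above, since those would miss a token that lines up exactly with an entity or sits entirely inside one.

```python
from typing import List, Sequence, Tuple


def align_labels(
    offsets: Sequence[Tuple[int, int]],
    entities: Sequence[Tuple[int, int, str]],
    outside: str = "O",
) -> List[str]:
    """Give each token the label of any entity whose character span it overlaps.

    offsets:  one (start, end) character span per token, end exclusive.
    entities: (start, end, label) character spans, end exclusive.
    """
    labels = [outside] * len(offsets)
    for ent_start, ent_end, ent_label in entities:
        for i, (tok_start, tok_end) in enumerate(offsets):
            # Assumption: zero-width spans mark special tokens; leave them outside.
            if tok_start == tok_end:
                continue
            # Two half-open intervals overlap iff each starts before the other ends.
            if tok_start < ent_end and ent_start < tok_end:
                labels[i] = ent_label
    return labels


# Hand-written stand-in offsets for "Alice visited Paris" (not real tokenizer
# output), with a zero-width placeholder at each end for special tokens.
text = "Alice visited Paris"
offsets = [(0, 0), (0, 5), (6, 13), (14, 19), (0, 0)]
entities = [(0, 5, "PERSON"), (14, 19, "GPE")]

labels = align_labels(offsets, entities)
print(labels)
```

In the `transformers` library, the fast tokenizers (for example `RobertaTokenizerFast`) return pairs of this shape under the `offset_mapping` key when called with `return_offsets_mapping=True`; whether those offsets match the annotation tool's character indexing exactly is an assumption worth checking on a few examples.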