Functionality for converting character-level spans to token-level spans?

seanswyi · July 3, 2021, 12:41pm

Hi. I’m trying to convert character-level spans to token-level spans and am wondering whether the library has functionality for this that I’m not taking advantage of.

The data I’m using consists of “proper” text (by “proper” I mean it’s written like a normal document, without extra whitespace inserted to make split operations easier) and annotated entities. The entities are annotated at the character level, but I would like to obtain their spans at the tokenized subword level.

My plan was to first convert character-level spans to word-level spans, then convert that to subword-level spans. A piece of code that I wrote looks like this:

new_text = []
for word in original_text.split():
    if (len(word) > 1) and (word[-1] in ['.', ',', ';', ':']):
        new_text.append(word[:-1] + ' ' + word[-1])
    else:
        new_text.append(word)

new_text = ' '.join(new_text).split()

word2char_span = {}
start_idx = 0
for idx, word in enumerate(new_text):
    char_start = start_idx
    char_end = char_start + len(word)
    word2char_span[idx] = (char_start, char_end)
    start_idx += len(word) + 1

This seems to work well but one edge case I didn’t think of is parentheses. To give a more concrete example, one paragraph-entity pair looks like this:

>>> original_text = "RDH12, a retinol dehydrogenase causing Leber's congenital \
amaurosis, is also involved in steroid metabolism. Three retinol dehydrogenases \
(RDHs) were tested for steroid converting abilities: human and murine RDH 12 and \
 human RDH13. RDH12 is involved in retinal degeneration in Leber's congenital \
 amaurosis (LCA). We show that murine Rdh12 and human RDH13 do not reveal activity \
 towards the checked steroids, but that human type 12 RDH reduces  \
dihydrotestosterone to androstanediol, and is thus also involved in steroid  \
metabolism. Furthermore, we analyzed both expression and subcellular localization \
of these enzymes."
>>> entity_span = [139, 143]
>>> print(original_text[139:143])
RDHs

This example actually raises a KeyError when I try to look up (139, 143), because the adjustment code I wrote takes (RDHs) as the entity rather than RDHs. I don’t want to hardcode parentheses handling either, because some entities do include the parentheses.
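One way around the exact-key lookup might be to compute word spans directly on the unmodified text and then select every word whose span overlaps the entity span. The sketch below is only an illustration: the regex and the helper names `word_char_spans` and `char_span_to_word_span` are my own assumptions, not library functions. Because the offsets come from the original string, they stay aligned with the character-level annotations even when punctuation is split off.

```python
import re

def word_char_spans(text):
    """Return (start, end) character offsets for each word or punctuation mark."""
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark.
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_word_span(spans, char_start, char_end):
    """Return the half-open range of word indices overlapping [char_start, char_end)."""
    hits = [i for i, (s, e) in enumerate(spans) if s < char_end and e > char_start]
    if not hits:
        raise ValueError("no word overlaps the given character span")
    return hits[0], hits[-1] + 1

text = "Three retinol dehydrogenases (RDHs) were tested."
spans = word_char_spans(text)
words = [text[s:e] for s, e in spans]

start = text.index("RDHs")

# Entity annotated without the parentheses.
first, last = char_span_to_word_span(spans, start, start + 4)
print(words[first:last])  # ['RDHs']

# Entity annotated with the parentheses included.
first, last = char_span_to_word_span(spans, start - 1, start + 5)
print(words[first:last])  # ['(', 'RDHs', ')']
```

Since parentheses become separate words, both kinds of annotation resolve without any special casing. The same regex would also split "Leber's" into three pieces, which may or may not be what is wanted.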

I feel like there should be a simpler approach to this issue and I’m overthinking things a bit. Any feedback on how I could achieve what I want is appreciated.
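For the second step of the plan (word-level to subword-level), a rough sketch could count how many subwords each word produces and use the running totals as span boundaries. `toy_subword_tokenize` below is a made-up stand-in that cuts words into fixed-size chunks; it is not a real tokenizer, and `word_span_to_subword_span` is likewise a hypothetical helper. Any function that maps one word to a list of pieces could be passed in its place.

```python
from itertools import accumulate

def toy_subword_tokenize(word, max_len=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    # Mark continuation pieces with "##", in the style of WordPiece output.
    return pieces[:1] + ["##" + p for p in pieces[1:]]

def word_span_to_subword_span(words, word_start, word_end, tokenize):
    """Map a half-open word span to a half-open subword span."""
    counts = [len(tokenize(w)) for w in words]
    # boundaries[i] is the index of the first subword of word i;
    # boundaries[len(words)] is the total number of subwords.
    boundaries = [0] + list(accumulate(counts))
    return boundaries[word_start], boundaries[word_end]

words = ["Three", "retinol", "dehydrogenases", "(", "RDHs", ")"]
subwords = [p for w in words for p in toy_subword_tokenize(w)]

# Word 4 is "RDHs"; it maps to a single subword here.
start, end = word_span_to_subword_span(words, 4, 5, toy_subword_tokenize)
print(subwords[start:end])  # ['RDHs']

# Word 2 is "dehydrogenases"; it maps to four subwords.
start, end = word_span_to_subword_span(words, 2, 3, toy_subword_tokenize)
print(subwords[start:end])  # ['dehy', '##drog', '##enas', '##es']
```

With a real tokenizer, any special tokens prepended to the sequence would shift these indices by the number of tokens added at the front.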
