I’m trying to train a BERT model on the SQuAD dataset. I’ve found several instances where the answer text doesn’t start or end at a token boundary. I’m using
bert-base-uncased, but I think my question applies to any tokenizer based on a subword algorithm such as WordPiece.
To find the indices start_position and end_position of the starting and ending tokens of the answer span, I use the offsets returned by the tokenizer. Here’s an example where everything works “as expected”:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer

question_text = "Which band recorded the album 'Thank You 4 Your Service'?"
context = "Thank You 4 Your Service is the sixth and final studio album by American hip hop group A Tribe Called Quest."
answer_text = "A Tribe Called Quest"

answer_start = context.index(answer_text)
answer_end = answer_start + len(answer_text)

x = tokenizer(context, return_offsets_mapping=True)
starts, ends = zip(*x.offset_mapping)
start_position = starts.index(answer_start)
end_position = ends.index(answer_end)

tokenizer.convert_ids_to_tokens(x.input_ids[start_position:end_position + 1])
# ['a', 'tribe', 'called', 'quest']
```
Things go awry, however, if we set
```python
question_text = "How many studio albums did A Tribe Called Quest release?"
answer_text = "six"
```
Now

```python
end_position = ends.index(answer_end)
```

raises a `ValueError`, because the answer “six” ends in the middle of the word “sixth”, which is itself a single token (id 4369), so no token’s character span ends exactly at `answer_end`.
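For what it’s worth, the fallback I’m currently considering is to pick the tokens whose character spans *contain* the answer’s start and end characters, instead of requiring an exact boundary match. Here’s a minimal sketch; the `offsets` list is hand-made to mimic the “six”/“sixth” case above, not real tokenizer output, and the helper name is my own:

```python
def char_span_to_token_span(offsets, answer_start, answer_end):
    """Return (start_token, end_token) indices whose character spans contain
    the half-open character range [answer_start, answer_end).

    Special tokens like [CLS]/[SEP] carry a (0, 0) offset and are skipped.
    """
    start_token = end_token = None
    for i, (s, e) in enumerate(offsets):
        if s == e:  # special token, no character span
            continue
        if s <= answer_start < e:
            start_token = i
        if s < answer_end <= e:
            end_token = i
    return start_token, end_token

# Hand-made offsets for "Thank You 4 Your Service is the sixth ...":
# [CLS], "thank", "you", "4", "your", "service", "is", "the", "sixth", [SEP]
offsets = [(0, 0), (0, 5), (6, 9), (10, 11), (12, 16), (17, 24),
           (25, 27), (28, 31), (32, 37), (0, 0)]

# "six" occupies characters 32..35, strictly inside the (32, 37) span of
# the token "sixth", so both endpoints resolve to that token:
print(char_span_to_token_span(offsets, 32, 35))  # (8, 8)
```

(I believe fast tokenizers also expose `x.char_to_token(char_index)` on the `BatchEncoding`, which does roughly this lookup, but I’m not sure it’s the canonical approach for SQuAD preprocessing.)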
What’s the “right” thing to do here? By “right” I mean: how do people who know what they’re doing resolve this? What did Devlin et al. do when fine-tuning BERT on SQuAD in the original BERT paper? I couldn’t find those details in the paper.