I’m trying to train a BERT model on the SQuAD dataset. I’ve found several instances where the answer text doesn’t start or end at a token boundary. I’m using
bert-base-uncased, but I think my question applies to any tokenizer based on a subword algorithm such as WordPiece.
To find the indices start_position and end_position of the starting and ending tokens of the answer span, I use the offsets returned by the tokenizer. Here’s an example where everything works “as expected”:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer

question_text = "Which band recorded the album 'Thank You 4 Your Service'?"
context = "Thank You 4 Your Service is the sixth and final studio album by American hip hop group A Tribe Called Quest."
answer_text = "A Tribe Called Quest"

answer_start = context.index(answer_text)
answer_end = answer_start + len(answer_text)

x = tokenizer(context, return_offsets_mapping=True)
starts, ends = zip(*x.offset_mapping)
start_position = starts.index(answer_start)
end_position = ends.index(answer_end)

tokenizer.convert_ids_to_tokens(x.input_ids[start_position:end_position + 1])
# ['a', 'tribe', 'called', 'quest']
```
Things go awry, however, if we set
```python
question_text = "How many studio albums did A Tribe Called Quest release?"
answer_text = "six"
```
Now

```python
end_position = ends.index(answer_end)
```

raises a `ValueError`, because the answer “six” ends in the middle of the word “sixth”, which is itself a single token (id 4369), so no token’s character span ends exactly at `answer_end`.
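For what it’s worth, the fallback I’m currently considering is to pick the tokens whose character spans *contain* the answer’s start and end characters, instead of requiring an exact boundary match. Here’s a minimal sketch; the `offsets` list is hand-made to mimic the “six”/“sixth” case above, not real tokenizer output, and the helper name is my own:

```python
def char_span_to_token_span(offsets, answer_start, answer_end):
    """Return (start_token, end_token) indices whose character spans contain
    the half-open character range [answer_start, answer_end).

    Special tokens like [CLS]/[SEP] carry a (0, 0) offset and are skipped.
    """
    start_token = end_token = None
    for i, (s, e) in enumerate(offsets):
        if s == e:  # special token, no character span
            continue
        if s <= answer_start < e:
            start_token = i
        if s < answer_end <= e:
            end_token = i
    return start_token, end_token

# Hand-made offsets for "Thank You 4 Your Service is the sixth ...":
# [CLS], "thank", "you", "4", "your", "service", "is", "the", "sixth", [SEP]
offsets = [(0, 0), (0, 5), (6, 9), (10, 11), (12, 16), (17, 24),
           (25, 27), (28, 31), (32, 37), (0, 0)]

# "six" occupies characters 32..35, strictly inside the (32, 37) span of
# the token "sixth", so both endpoints resolve to that token:
print(char_span_to_token_span(offsets, 32, 35))  # (8, 8)
```

(I believe fast tokenizers also expose `x.char_to_token(char_index)` on the `BatchEncoding`, which does roughly this lookup, but I’m not sure it’s the canonical approach for SQuAD preprocessing.)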
What’s the “right” thing to do here? By “right” I mean: how do people who know what they’re doing resolve this? What did Devlin et al. do when fine-tuning BERT on SQuAD in the original BERT paper? I couldn’t find those details in the paper.