In the Question answering section of the Main NLP tasks chapter, the function preprocess_training_examples should have
if offset[context_start][0] > end_char or offset[context_end][1] < start_char or offset[context_end][1] < end_char:
instead of
if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
If the tokenized context contains only part of the answer, offset[context_end][1] is smaller than end_char, but the current check still lets that feature through, which results in incorrect labels.
Try example 965 of the training dataset: it sets the end position on the [SEP] token.
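To see why the end position can land on [SEP], here is a minimal sketch with made-up offsets. The helper label_span, the token list, and the numbers are hypothetical; the two while loops are modeled on the labeling step of the course function as I understand it, so treat this as an illustration rather than the exact course code.

```python
def label_span(offsets, context_start, context_end, start_char, end_char, strict):
    """Return (start_token, end_token) labels for one feature, or (0, 0)."""
    if strict:
        # Stricter check: the answer must lie fully inside the context.
        outside = (
            offsets[context_start][0] > start_char
            or offsets[context_end][1] < end_char
        )
    else:
        # Check quoted in this thread: only rejects answers with no overlap at all.
        outside = (
            offsets[context_start][0] > end_char
            or offsets[context_end][1] < start_char
        )
    if outside:
        return (0, 0)

    # Walk forward to the last token that starts at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start = idx - 1

    # Walk backward to the first token that ends at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end = idx + 1
    return (start, end)


# Toy feature: two question tokens, then four context tokens covering chars 0..19.
tokens = ["[CLS]", "q0", "q1", "[SEP]", "c0", "c1", "c2", "c3", "[SEP]"]
offsets = [(0, 0), (0, 3), (4, 8), (0, 0), (0, 4), (5, 9), (10, 14), (15, 19), (0, 0)]
context_start, context_end = 4, 7

# The answer spans chars 10..25, so this chunk only holds its beginning.
loose = label_span(offsets, context_start, context_end, 10, 25, strict=False)
strict = label_span(offsets, context_start, context_end, 10, 25, strict=True)

print(loose, tokens[loose[1]])
print(strict)
```

With the looser check, the backward loop never runs because no context token ends at or after end_char, so the end label becomes context_end + 1, which is the closing [SEP].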
I think that instead of:
    if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
        start_positions.append(0)
        end_positions.append(0)
It should be:
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_positions.append(0)
        end_positions.append(0)
Am I correct? I ask because earlier in the section, the text says:
“We will also set those labels (0, 0) in the unfortunate case where the answer has been truncated so that we only have the start (or end) of it.”
The following diagram explains this:
Context 1 fully contains the answer. Context 2 STARTS AFTER the answer STARTS. Context 3 ENDS BEFORE the answer ENDS.
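As a rough sketch of these three cases, the two checks can be compared on plain character ranges instead of token offsets. The function names and all numbers below are made up for illustration; only the two comparisons come from the conditions quoted above.

```python
def rejected_by_current_check(ctx_start, ctx_end, start_char, end_char):
    # Current condition: true only when context and answer do not overlap at all.
    return ctx_start > end_char or ctx_end < start_char


def rejected_by_proposed_check(ctx_start, ctx_end, start_char, end_char):
    # Proposed condition: true unless the context fully contains the answer.
    return ctx_start > start_char or ctx_end < end_char


# Hypothetical answer occupying chars 50..60 of the original context.
start_char, end_char = 50, 60

contexts = {
    "Context 1 (fully contains the answer)": (0, 100),
    "Context 2 (starts after the answer starts)": (55, 150),
    "Context 3 (ends before the answer ends)": (0, 57),
}

results = {}
for name, (ctx_start, ctx_end) in contexts.items():
    current = rejected_by_current_check(ctx_start, ctx_end, start_char, end_char)
    proposed = rejected_by_proposed_check(ctx_start, ctx_end, start_char, end_char)
    results[name] = (current, proposed)
    print(f"{name}: current check rejects={current}, proposed check rejects={proposed}")
```

In this toy setup the current check rejects none of the three contexts, while the proposed one rejects Contexts 2 and 3, which is what the quoted sentence about truncated answers asks for.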
Yes, I think you are right.
Thanks for reporting this @Gozdi! It should now be fixed on the website.