I found a bug in your Course

Gozdi · March 10, 2022, 9:40pm

In the Question answering section of the Main NLP tasks chapter,
the function preprocess_training_examples
should have if offset[context_start][0] > end_char or offset[context_end][1] < start_char or offset[context_end][1] < end_char:
instead of if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
because if the tokenized context contains only part of the answer, offset[context_end][1] is smaller than end_char, which results in incorrect labels.
Try example 965 of the training dataset:
it sets the end position on the [SEP] token.
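
To make the reported case concrete, here is a minimal sketch that compares the original condition with the one proposed in this post, using made-up character positions. The function names and the numbers are hypothetical: the arguments stand in for offset[context_start][0], offset[context_end][1], start_char and end_char, and this is not the course code itself.

```python
def original_check(ctx_first_char, ctx_last_char, start_char, end_char):
    """Original condition: True means the feature is labeled (0, 0)."""
    return ctx_first_char > end_char or ctx_last_char < start_char


def proposed_check(ctx_first_char, ctx_last_char, start_char, end_char):
    """Condition proposed in this post: also rejects an answer cut off at its end."""
    return (
        ctx_first_char > end_char
        or ctx_last_char < start_char
        or ctx_last_char < end_char
    )


# Hypothetical numbers: the context chunk covers characters 0..50,
# while the answer spans characters 40..60, so its end is truncated.
truncated = (0, 50, 40, 60)

# The original condition does not flag the truncated answer...
missed_by_original = not original_check(*truncated)
# ...while the proposed condition does.
caught_by_proposed = proposed_check(*truncated)

print(missed_by_original, caught_by_proposed)
```

With the original condition the truncated answer is treated as if it were fully inside the context, which is consistent with the end position drifting onto a token outside the answer.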

sadhaklal · March 20, 2022, 1:30pm

I think that instead of:

if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
    start_positions.append(0)
    end_positions.append(0)

It should be:

if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
    start_positions.append(0)
    end_positions.append(0)

Am I correct? This is because earlier in the section, it’s written that:

“We will also set those labels (0, 0) in the unfortunate case where the answer has been truncated so that we only have the start (or end) of it.”

The following diagram explains this:

Context 1 fully contains the answer. Context 2 STARTS AFTER the answer STARTS. Context 3 ENDS BEFORE the answer ENDS.
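
The three cases can be sketched in code with hand-written offsets instead of a real tokenizer. The helper name label_answer_span and the toy offsets and sequence_ids are assumptions made for illustration; the logic follows the condition proposed above and is not guaranteed to match the course code line for line.

```python
def label_answer_span(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token), or (0, 0) if the answer is not fully in the context."""
    # Locate the first and last token of the context (sequence id 1).
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while idx < len(sequence_ids) and sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # Proposed condition: the context must cover the whole answer.
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return (0, 0)

    # Walk forward to the last token that starts at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return (start_token, end_token)


# Toy feature (hypothetical): [CLS] q q [SEP] c c c c [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 4), (5, 9), (0, 0),
           (10, 14), (15, 20), (21, 25), (26, 30), (0, 0)]

# Context 1: the context (chars 10..30) fully contains the answer (15..25).
print(label_answer_span(offsets, sequence_ids, 15, 25))  # (5, 6)
# Context 2: the context starts after the answer starts (answer 5..20).
print(label_answer_span(offsets, sequence_ids, 5, 20))   # (0, 0)
# Context 3: the context ends before the answer ends (answer 26..40).
print(label_answer_span(offsets, sequence_ids, 26, 40))  # (0, 0)
```

In this toy setup the original condition would not flag either truncated case, since 10 > 20 and 30 < 5 are both false for Context 2, and 10 > 40 and 30 < 26 are both false for Context 3.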

Gozdi · March 23, 2022, 5:58pm

Yes, I think you are right.

lewtun · March 23, 2022, 9:15pm

Thanks for reporting this @Gozdi! It should now be fixed on the website
