Preprocessing of dataset

I’m going through this notebook, and below is a gist of the relevant part:

from datasets import load_dataset
from transformers import AutoTokenizer

datasets = load_dataset("squad_v2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

max_length = 384  # max tokens per feature (question + context chunk); 384 in the notebook
doc_stride = 128  # overlap between consecutive context chunks; 128 in the notebook
i = 0             # index of the example to inspect; the discussion below assumes its context overflows max_length
example = datasets["train"][i]

tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
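
With return_overflowing_tokens=True, a long context is split into several features, each of which repeats the question. A quick way to see the shape of the output (a minimal sketch; the actual counts depend on the example and on max_length/doc_stride):

# Each entry of input_ids is one feature: the question plus one chunk of the context.
print(len(tokenized_example["input_ids"]))                    # e.g. 2 if the context overflows once
print([len(ids) for ids in tokenized_example["input_ids"]])   # per-feature lengths, each <= max_length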

sequence_ids = tokenized_example.sequence_ids()
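
Note that sequence_ids() takes a batch index and defaults to 0, so this is the mapping for the first feature only: None marks special tokens like [CLS] and [SEP], 0 marks question tokens, and 1 marks context tokens. For example:

print(tokenized_example.sequence_ids(0))  # e.g. [None, 0, 0, ..., 0, None, 1, 1, ..., 1, None]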

answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])
# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1
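
For context, the next step in the notebook narrows these two indices inward using the character offsets, so they end up on the answer’s first and last tokens. Roughly (a sketch following the notebook’s logic; start_position and end_position are the labels used for training):

offsets = tokenized_example["offset_mapping"][0]
# Proceed only if the answer lies inside this feature's chunk of the context.
if offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char:
    # Walk the two indices to the two ends of the answer.
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1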

So as you can see, return_overflowing_tokens is True, so the tokenized_example's input_ids are a list of lists, and tokenized_example["input_ids"][0] contains the question plus just the first chunk of the context for that sample.
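
You can confirm this by decoding each feature (a quick sketch; tokenizer.decode just turns the ids back into text):

for ids in tokenized_example["input_ids"]:
    # Every feature starts with the full question, followed by one chunk of the context.
    print(tokenizer.decode(ids)[:120], "...")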

So then, to get token_end_index, the code starts at len(tokenized_example["input_ids"][0]) - 1 and moves backwards until it finds a 1. But surely len(tokenized_example["input_ids"][0]) - 1 isn’t going to be the end of the context, because len(tokenized_example["input_ids"][0]) only gives you the length of the first feature, i.e. the question plus the first chunk of the context?
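
To make the worry concrete (a sketch; the exact numbers depend on the example):

first = tokenized_example["input_ids"][0]
print(len(first))                     # at most max_length: the length of the first feature only
print(tokenizer.decode(first[-15:]))  # the tail of the first chunk plus [SEP], not the end of the whole context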