I’m going through this notebook and below is a gist of some of it:
from datasets import load_dataset
from transformers import AutoTokenizer

datasets = load_dataset("squad_v2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# max_length, doc_stride and i are defined in earlier cells of the notebook
example = datasets["train"][i]
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride,
)
# None = special token, 0 = question token, 1 = context token
sequence_ids = tokenized_example.sequence_ids()

# Character positions of the answer in the original context
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1
So as you can see, return_overflowing_tokens is True, which means tokenized_example's "input_ids" is a list of lists, and tokenized_example["input_ids"][0] contains just the first segment of the context for that sample.
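
To double-check that reading, this is how I'd inspect the structure (reusing the tokenized_example from the snippet above; the prints are mine, not from the notebook):

# Number of spans the example was split into, and the token count of each span
print(len(tokenized_example["input_ids"]))
print([len(ids) for ids in tokenized_example["input_ids"]])

# sequence_ids() with no argument is sequence_ids(0), i.e. it only describes the first span
print(tokenized_example.sequence_ids() == tokenized_example.sequence_ids(0))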
So then, to get token_end_index, the code starts at len(tokenized_example["input_ids"][0]) - 1 and moves backwards through sequence_ids until it finds a 1. But surely len(tokenized_example["input_ids"][0]) - 1 isn't going to be the end of the context, because len(tokenized_example["input_ids"][0]) only gives you the length of the first segment of the context?
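
To make the question concrete, here's a minimal probe, again reusing the variables from the snippet above (offset_mapping should also have one entry per span), that would show which token the backwards scan stops on in the first span and whether that span even covers the answer:

# Character offsets for the first span only
offsets = tokenized_example["offset_mapping"][0]

# Token the backwards scan stops on: its index, character range, and text
print(token_end_index, offsets[token_end_index])
print(tokenizer.decode(tokenized_example["input_ids"][0][token_end_index]))

# Does the first span's context actually cover the answer's character range?
print(offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char)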