I’m going through this notebook and below is a gist of some of it:
from datasets import load_dataset
from transformers import AutoTokenizer

datasets = load_dataset("squad_v2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# max_length, doc_stride and i are defined in earlier cells of the notebook
example = datasets["train"][i]
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride,
)
# None = special token, 0 = question token, 1 = context token
sequence_ids = tokenized_example.sequence_ids()

# Character positions of the answer in the original context
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1
So as you can see, return_overflowing_tokens is True, which means tokenized_example's "input_ids" is a list of lists, and tokenized_example["input_ids"][0] contains just the first segment of the context for that sample.
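
To double-check that reading, this is how I'd inspect the structure (reusing the tokenized_example from the snippet above; the prints are mine, not from the notebook):

# Number of spans the example was split into, and the token count of each span
print(len(tokenized_example["input_ids"]))
print([len(ids) for ids in tokenized_example["input_ids"]])

# sequence_ids() with no argument is sequence_ids(0), i.e. it only describes the first span
print(tokenized_example.sequence_ids() == tokenized_example.sequence_ids(0))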
So then, to get token_end_index, the code starts at len(tokenized_example["input_ids"][0]) - 1 and moves backwards through sequence_ids until it finds a 1. But surely len(tokenized_example["input_ids"][0]) - 1 isn't going to be the end of the context, because len(tokenized_example["input_ids"][0]) only gives you the length of the first segment of the context?
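
To make the question concrete, here's a minimal probe, again reusing the variables from the snippet above (offset_mapping should also have one entry per span), that would show which token the backwards scan stops on in the first span and whether that span even covers the answer:

# Character offsets for the first span only
offsets = tokenized_example["offset_mapping"][0]

# Token the backwards scan stops on: its index, character range, and text
print(token_end_index, offsets[token_end_index])
print(tokenizer.decode(tokenized_example["input_ids"][0][token_end_index]))

# Does the first span's context actually cover the answer's character range?
print(offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char)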