Is it possible to filter/select a `Dataset` by a column's values?

Thanks for your time and interest! This is close, but I think scoring SQuAD is more complicated than this, judging from the HF guide: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=lkelfiHfy2Z6

The problem is getting a filtered data set.


# load ds
dataset = load_dataset("squad_v2").shuffle(seed=1)
# get train
train = dataset['train']
# tokenize the raw data like HF
tokenized_train = train.map(prepare_validation_features, batched=True, remove_columns=train.column_names)
# we shuffled our data during training, so let's say we trained these 4 examples
training_order = [99, 34, 2, 45]
# to score SQuAD, I need the original (i.e., non-tokenized) text data in the same order as the training order
# but the tokenized data and non-tokenized data are not the same length, so indices no longer line up
# the items that ARE shared across both datasets are the example ids
# let's find the example ids of the examples we trained on
filtered_df = tokenized_train.select(indices=training_order)  # contains 4 items now
# now what were the example ids of the trained examples?
example_ids = filtered_df['example_id']
# '572ec21cdfa6aa1500f8d34e',
# '5728ac583acd2414000dfcae',
# '5ace5a2b32bba1001ae4a387',
# '5a11c08c06e79900185c354b'
# ok, since this is my training order and thus my scoring order, I need to organize the scoring data in the same order
# STUCK: there is no obvious way to filter/order the dataset according to a column's values
# like those strings. I need `train` to contain only the 4 examples whose ids match the strings above, in that order
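# the closest I can come up with is a rough sketch like this (assuming Dataset.filter plus a
# plain Python reorder is the right idea; happy to hear if there's a better built-in):
wanted_ids = set(example_ids)
subset = train.filter(lambda ex: ex["id"] in wanted_ids)  # keeps the 4 rows, but in dataset order
# reorder on the Python side so the rows match the training/scoring order
rows_by_id = {row["id"]: row for row in subset}
ordered_rows = [rows_by_id[i] for i in example_ids]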

from datasets import load_dataset, load_metric
from transformers import BertTokenizerFast, BertForQuestionAnswering
# the tokenizer has to exist before pad_on_right / prepare_validation_features are used;
# the checkpoint name is just a placeholder for whichever BERT checkpoint is actually being trained
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
pad_on_right = tokenizer.padding_side == "right"
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping entries that are not part of the context, so it's easy to determine
        # whether a token position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
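
# Quick sanity check on the length mismatch described above (just a sketch): mapping with
# prepare_validation_features splits long contexts into several features, so the tokenized
# dataset ends up longer than the raw one and row indices stop lining up.
print(len(train), len(tokenized_train))      # different lengths once long contexts are split
print(tokenized_train["example_id"][:5])     # features from the same example share an example_id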