Is it possible to filter/select dataset class by a column's values?

I am wondering if it is possible to use the dataset indices to:

  1. get the values for a column
  2. use (#1) to select/filter the original dataset by the order of those values

The problem I have is this: I am using HF’s dataset class for SQuAD 2.0 data like so:

from datasets import load_dataset
dataset = load_dataset("squad_v2")

When I train, I collect the indices from the randomly shuffled batches and can use those indices to filter/select the dataset in the order it was trained, like so:

dataset['train'].select(indices=[list of indices here])
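(As far as I can tell, select returns the rows in exactly the order of the indices you pass, which is what makes this work. A tiny illustration with made-up indices, reusing the dataset loaded above:)

# made-up shuffled order; select() returns the rows in exactly this order
shuffled_indices = [7, 2, 11, 0]
reordered = dataset['train'].select(indices=shuffled_indices)
print(reordered['id'])  # the ids of rows 7, 2, 11, 0, in that order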

The problem arises when I also want to use HF’s scoring method for SQuAD 2.0. I need the original dataset and its tokenized “features” to be in the same order. When the training data is tokenized, it no longer has the same length: it gets larger, because long contexts are split into several overlapping features. For this reason, I cannot use the indices emitted during training to align my original training data properly; the sizes are not the same.

One way that I could align both data sets is to:

  1. collect the indices used during training
  2. use those indices to create a new training data set in the right order dataset['train'].select(indices=[list of indices here])
  3. then, from the output of step 2, get a list of all the strings found in the id column
  4. use the strings found in the id column to re-order the dataset by each unique string value.

I do not think this is possible judging from the docs here, but I am curious if anyone has a recommendation:

https://huggingface.co/docs/datasets/package_reference/main_classes.html

A painful way of doing what I want is through Elasticsearch (it's very slow), but I feel that there has to be a better way to filter/query a dataset.

from elasticsearch import Elasticsearch

# check that a local Elasticsearch instance is reachable
es = Elasticsearch([{'host': 'localhost'}])
assert es.ping()
# index the `id` column, then query it; search() returns (scores, row indices)
dataset['train'].add_elasticsearch_index(column='id', host='localhost')
out1 = dataset['train'].search(index_name='id', query='56be85543aeaaa14008c9063')
out1[1]
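(For completeness: I know Dataset.filter can also subset by a column value without the Elasticsearch setup, but as far as I can tell it keeps the rows in the dataset's original order rather than the order of the ids I query, so it does not solve the ordering problem. A rough sketch:)

wanted_ids = {'56be85543aeaaa14008c9063'}  # example id from above

# keeps only the matching rows, but in the dataset's original order
matches = dataset['train'].filter(lambda example: example['id'] in wanted_ids)
print(len(matches), matches['id'])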

From what I can understand, what you want is to somehow filter the dataset and then use that same dataset to compute metrics, is that right?

You should be able to do this as follows:

  • get your filtered dataset
  • create a dataloader
  • iterate over the batches and run prediction on each batch
  • compute the metrics

for batch in dataloader:
    model_inputs, targets = batch
    predictions = model(model_inputs)
    metric.add_batch(predictions=predictions, references=targets)

score = metric.compute()
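If you're using the squad_v2 metric, I believe add_batch expects post-processed predictions and references in roughly this shape (the id/text/offset values below are just dummies to show the format):

from datasets import load_metric

metric = load_metric("squad_v2")

# dummy values, only to illustrate the expected format
predictions = [{"id": "some-example-id",
                "prediction_text": "some predicted answer span",
                "no_answer_probability": 0.0}]
references = [{"id": "some-example-id",
               "answers": {"text": ["some gold answer"], "answer_start": [0]}}]

metric.add_batch(predictions=predictions, references=references)
print(metric.compute())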

Thanks for your time and interest! This is close, but I think scoring SQuAD is more complicated than this, judging from the HF guide: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=lkelfiHfy2Z6

The problem is getting a filtered data set.


# load ds
dataset = load_dataset("squad_v2").shuffle(seed=1)
# get train
train = dataset['train']
# tokenize the raw data like HF
tokenized_train = train.map(prepare_validation_features, batched=True, remove_columns=train.column_names)
# we shuffled our data during training, so let's say we trained these 4 examples
training_order = [99, 34, 2, 45]
# to score squad, I need the original dataset order (i.e., the non-tokenized text data) to match the training order
# but the tokenized data and non-tokenized data are not of the same length, so indices cannot match anymore
# items that ARE shared across datasets are example ids
# let's find the example ids from the dataset that we trained
filtered_df = tokenized_train.select(indices=training_order)  # contains 4 items now
# now what were the example ids of the trained examples?
example_ids = filtered_df['example_id']
# '572ec21cdfa6aa1500f8d34e',
# '5728ac583acd2414000dfcae',
# '5ace5a2b32bba1001ae4a387',
# '5a11c08c06e79900185c354b'
# OK, since this is my training order and thus my scoring order, I need to organize the scoring data in the same order
# STUCK: No obvious way to order the dataset according to a column's values
# like those strings. I need `train` to just contain 4 examples and only those examples that match the strings above
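# the closest thing I have come up with (not sure it is the intended way): build an
# id -> row index lookup in plain Python, then use select() to pull the rows of
# `train` in exactly the order of `example_ids`
id_to_row = {id_: row for row, id_ in enumerate(train['id'])}
rows_in_training_order = [id_to_row[ex_id] for ex_id in example_ids]
train_in_training_order = train.select(indices=rows_in_training_order)
# train_in_training_order now holds only those 4 examples, in the training/scoring order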

from datasets import load_dataset, load_metric
from transformers import BertTokenizerFast, BertForQuestionAnswering

# the checkpoint here is just an example; use whatever model you fine-tuned
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
pad_on_right = tokenizer.padding_side == "right"

def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set the offset_mapping entries that are not part of the context to None, so it's easy to determine
        # whether a token position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples



You could directly iterate over your filtered dataset and then compute the scores; I am not sure what the problem is here.

Hi! Maybe you can include the original text along with the tokenized data in prepare_train_features?

This way, when you're shuffling/selecting examples, each sample will still have both the text and the tokenized text.
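For example, a rough sketch reusing the names from prepare_validation_features above (the question_text/context_text column names are just placeholders): keeping the raw fields as extra columns means they survive the remove_columns=... in .map(), so the shuffled/selected tokenized dataset still carries the original text.

    # next to tokenized_examples["example_id"] = [], also initialize:
    tokenized_examples["question_text"] = []
    tokenized_examples["context_text"] = []

    # and inside the existing for-loop, right after sample_index is computed, append:
    tokenized_examples["question_text"].append(examples["question"][sample_index])
    tokenized_examples["context_text"].append(examples["context"][sample_index])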