Is it possible to filter/select dataset class by a column's values?

I am wondering if it is possible to use the dataset indices to:

  1. get the values for a column
  2. use (#1) to select/filter the original dataset by the order of those values

The problem I have is this: I am using HF’s dataset class for SQuAD 2.0 data like so:

from datasets import load_dataset
dataset = load_dataset("squad_v2")

When I train, I collect the indices from the randomly shuffled batches and can use those indices to filter/select the dataset in the order it was trained, like so:

dataset['train'].select(indices=[list of indices here])
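(As far as I can tell, select returns the rows in exactly the order of the indices you pass, which is what makes this work. A tiny illustration with made-up indices, reusing the dataset loaded above:)

# made-up shuffled order; select() returns the rows in exactly this order
shuffled_indices = [7, 2, 11, 0]
reordered = dataset['train'].select(indices=shuffled_indices)
print(reordered['id'])  # the ids of rows 7, 2, 11, 0, in that order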

The problem arises when I also want to use HF’s scoring method for SQuAD 2.0. I need the original dataset and its tokenized “features” to be in the same order. When the training data is tokenized, it no longer has the same length: it gets larger, because long contexts are split into several overlapping features. For this reason, I cannot use the indices emitted during training to align my original training data properly; the sizes are not the same.

One way that I could align both data sets is to:

  1. collect the indices used during training
  2. use those indices to create a new training data set in the right order dataset['train'].select(indices=[list of indices here])
  3. then, from the output of step 2, get a list of all the strings found in the id column
  4. use the strings found in the id column to re-order the dataset by each unique string value.

I do not think this is possible judging from the docs here, but I am curious if anyone has a recommendation:

https://huggingface.co/docs/datasets/package_reference/main_classes.html

A painful way of doing what I want is through Elasticsearch (it's very slow), but I feel that there has to be a better way to filter/query a dataset.

from elasticsearch import Elasticsearch

# check that a local Elasticsearch instance is reachable
es = Elasticsearch([{'host': 'localhost'}])
assert es.ping()
# index the `id` column, then query it; search() returns (scores, row indices)
dataset['train'].add_elasticsearch_index(column='id', host='localhost')
out1 = dataset['train'].search(index_name='id', query='56be85543aeaaa14008c9063')
out1[1]
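(For completeness: I know Dataset.filter can also subset by a column value without the Elasticsearch setup, but as far as I can tell it keeps the rows in the dataset's original order rather than the order of the ids I query, so it does not solve the ordering problem. A rough sketch:)

wanted_ids = {'56be85543aeaaa14008c9063'}  # example id from above

# keeps only the matching rows, but in the dataset's original order
matches = dataset['train'].filter(lambda example: example['id'] in wanted_ids)
print(len(matches), matches['id'])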

From what I can understand, what you want is to somehow filter the dataset and then use that same dataset to compute metrics, is that right?

You should be able to do this as follows:

  • get your filtered dataset
  • create a dataloader
  • iterate over the batches and run prediction on each batch
  • compute the metrics

for batch in dataloader:
    model_inputs, targets = batch
    predictions = model(model_inputs)
    metric.add_batch(predictions=predictions, references=targets)

score = metric.compute()
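If you're using the squad_v2 metric, I believe add_batch expects post-processed predictions and references in roughly this shape (the id/text/offset values below are just dummies to show the format):

from datasets import load_metric

metric = load_metric("squad_v2")

# dummy values, only to illustrate the expected format
predictions = [{"id": "some-example-id",
                "prediction_text": "some predicted answer span",
                "no_answer_probability": 0.0}]
references = [{"id": "some-example-id",
               "answers": {"text": ["some gold answer"], "answer_start": [0]}}]

metric.add_batch(predictions=predictions, references=references)
print(metric.compute())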

Thanks for your time and interest! This is close, but I think scoring SQuAD is more complicated than this, judging from the HF guide: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=lkelfiHfy2Z6

The problem is getting a filtered data set.


# load ds
dataset = load_dataset("squad_v2").shuffle(seed=1)
# get train
train = dataset['train']
# tokenize the raw data like HF
tokenized_train = train.map(prepare_validation_features, batched=True, remove_columns=train.column_names)
# we shuffled our data during training, so let's say we trained these 4 examples
training_order = [99, 34, 2, 45]
# to score squad, I need the original dataset order (i.e., the non-tokenized text data) to match the training order
# but the tokenized data and non-tokenized data are not of the same length, so indices cannot match anymore
# items that ARE shared across datasets are example ids
# let's find the example ids from the dataset that we trained
filtered_df = tokenized_train.select(indices=training_order)  # contains 4 items now
# now what were the example ids of the trained examples?
example_ids = filtered_df['example_id']
# '572ec21cdfa6aa1500f8d34e',
# '5728ac583acd2414000dfcae',
# '5ace5a2b32bba1001ae4a387',
# '5a11c08c06e79900185c354b'
# OK, since this is my training order and thus my scoring order, I need to organize the scoring data in the same order
# STUCK: No obvious way to order the dataset according to a column's values
# like those strings. I need `train` to just contain 4 examples and only those examples that match the strings above
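# the closest thing I have come up with (not sure it is the intended way): build an
# id -> row index lookup in plain Python, then use select() to pull the rows of
# `train` in exactly the order of `example_ids`
id_to_row = {id_: row for row, id_ in enumerate(train['id'])}
rows_in_training_order = [id_to_row[ex_id] for ex_id in example_ids]
train_in_training_order = train.select(indices=rows_in_training_order)
# train_in_training_order now holds only those 4 examples, in the training/scoring order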

from datasets import load_dataset, load_metric
from transformers import BertTokenizerFast, BertForQuestionAnswering

# the checkpoint here is just an example; use whatever model you fine-tuned
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
pad_on_right = tokenizer.padding_side == "right"

def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set the offset_mapping entries that are not part of the context to None, so it's easy to determine
        # whether a token position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples



You could directly iterate over your filtered dataset and then compute the scores; I am not sure what the problem is here.

Hi! Maybe you can include the original text along with the tokenized data in prepare_train_features?

This way, when you're shuffling/selecting examples, each sample will still have both the text and the tokenized text.
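For example, a rough sketch reusing the names from prepare_validation_features above (the question_text/context_text column names are just placeholders): keeping the raw fields as extra columns means they survive the remove_columns=... in .map(), so the shuffled/selected tokenized dataset still carries the original text.

    # next to tokenized_examples["example_id"] = [], also initialize:
    tokenized_examples["question_text"] = []
    tokenized_examples["context_text"] = []

    # and inside the existing for-loop, right after sample_index is computed, append:
    tokenized_examples["question_text"].append(examples["question"][sample_index])
    tokenized_examples["context_text"].append(examples["context"][sample_index])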