I am wondering if it is possible to use the dataset indices to:
- get the values for a column
- use (#1) to select/filter the original dataset by the order of those values
The problem I have is this: I am using HF’s dataset class for SQuAD 2.0 data like so:
from datasets import load_dataset
dataset = load_dataset("squad_v2")
When I train, I collect the indices of each (randomly shuffled) batch, and I can use those indices to select/filter the dataset in the order it was actually trained on, like so:
dataset['train'].select(indices=[list of indices here])
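For context, this is roughly how I collect those indices (a minimal sketch assuming a PyTorch DataLoader; the batch size and the loop body are placeholders):
from torch.utils.data import DataLoader
# iterate over row positions so the shuffled order can be recorded batch by batch
loader = DataLoader(range(len(dataset['train'])), batch_size=16, shuffle=True)
train_order = []
for batch_indices in loader:
    train_order.extend(batch_indices.tolist())
    # ... forward/backward pass on this batch would go here ...
# the same list can then be fed to .select() as above
ordered_train = dataset['train'].select(indices=train_order)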
The problem is when I also want to use HF’s scoring method for SQuAD 2.0. I need the original dataset and its tokenized “features” to be in the same order. When the training data is tokenized, the result does not have the same length as the original; it gets larger, because contexts of varying length can be split into more than one tokenized feature. For this reason, I cannot use the indices emitted during training to align my original training data properly; the sizes are not the same.
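To make the size mismatch concrete, here is a minimal sketch of where the extra rows come from, using the dataset loaded above (the tokenizer checkpoint, max_length, and stride values are just assumptions; the point is that one example can become several features):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
examples = dataset['train'][:100]  # small slice for illustration
features = tokenizer(
    examples['question'],
    examples['context'],
    truncation='only_second',
    max_length=384,  # assumed value
    stride=128,      # assumed value
    return_overflowing_tokens=True,
)
# long contexts get split into several features, so these two lengths differ
print(len(examples['id']), len(features['input_ids']))
# overflow_to_sample_mapping[i] tells which example feature i came from
print(features['overflow_to_sample_mapping'][:20])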
One way that I could align both datasets is to:
- collect the indices used during training
- use those indices to create a new training dataset in the right order:
dataset['train'].select(indices=[list of indices here])
- then, from the output of step 2, get a list of all the strings found in the id column
- use the strings found in the id column to re-order the dataset by each unique string value (a sketch of what I mean follows this list)
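For that last re-ordering step, something like this is what I have in mind (other_dataset is a placeholder for whichever dataset needs to be put into that order, and ordered_ids is the list of id strings from step 3):
from collections import defaultdict
# map each id string to the row positions in other_dataset that share it
id_to_rows = defaultdict(list)
for row_idx, id_string in enumerate(other_dataset['id']):
    id_to_rows[id_string].append(row_idx)
# walk the ids in training order and collect the matching row positions
reorder_indices = [row for id_string in ordered_ids for row in id_to_rows[id_string]]
other_dataset_in_training_order = other_dataset.select(indices=reorder_indices)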
I do not think this is possible judging from the docs here, but I am curious if anyone has a recommendation:
https://huggingface.co/docs/datasets/package_reference/main_classes.html
A painful way of doing what I want is through Elasticsearch (it’s very slow), but I feel that there has to be a better way to filter/query a dataset:
from elasticsearch import Elasticsearch
# make sure a local Elasticsearch instance is reachable
es = Elasticsearch([{'host': 'localhost'}])
assert es.ping()
# index the 'id' column so rows can be looked up by their id string
dataset['train'].add_elasticsearch_index(column='id', es_client=es)
# search() returns a (scores, indices) named tuple, so out1[1] holds the matching row indices
out1 = dataset['train'].search(index_name='id', query='56be85543aeaaa14008c9063')
out1[1]