I am wondering if it is possible to use the dataset indices to:

1. get the values for a column
2. use (#1) to select/filter the original dataset by the order of those values
The problem I have is this: I am using HF’s dataset class for SQuAD 2.0 data like so:
from datasets import load_dataset
dataset = load_dataset("squad_v2")
When I train, I collect the batch indices, and I can use those indices to select the dataset in the order in which it was trained (with randomly shuffled batches), like so:
dataset['train'].select(indices=[list of indices here])
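For reference, this is roughly how I collect those indices (a sketch, assuming a PyTorch DataLoader that shuffles row positions; the bookkeeping is my own, not part of the datasets API):

from torch.utils.data import DataLoader

# shuffle row positions the same way the training batches are shuffled
index_loader = DataLoader(range(len(dataset['train'])), batch_size=16, shuffle=True)

training_order = []
for batch_indices in index_loader:
    # record which rows went into this batch, in the order they were seen
    training_order.extend(batch_indices.tolist())
    # ... the forward/backward pass on the corresponding batch goes here ...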
The problem is that I also want to use HF’s scoring method for SQuAD 2.0, which needs the original dataset and its tokenized “features” to be in the same order. When the training data is tokenized it does not keep the same length: it gets longer, because long contexts are split into several overlapping features. For this reason, I cannot use the indices emitted during training to align my original training data; the sizes are not the same.
One way that I could align both datasets is to:

1. collect the indices used during training
2. use those indices to create a new training dataset in the right order: dataset['train'].select(indices=[list of indices here])
3. then, from the output of step 2, get a list of all the strings found in the id column
4. use those id strings to re-order the original dataset, matching each unique id value
I do not think this is possible judging from the docs here, but I am curious if anyone has a recommendation:
A painful way of doing what I want is through Elasticsearch (it’s very slow), but I feel there has to be a better way to filter/query a dataset:
from elasticsearch import Elasticsearch

# make sure a local Elasticsearch node is reachable, then index the `id` column
es = Elasticsearch([{'host': 'localhost'}])
assert es.ping()
dataset['train'].add_elasticsearch_index(column='id', es_client=es)

# query a single example id; search returns (scores, indices), so out1[1] holds the matching row indices
out1 = dataset['train'].search(index_name='id', query='56be85543aeaaa14008c9063')
out1[1]
# load ds
dataset = load_dataset("squad_v2").shuffle(seed=1)
# get train
train = dataset['train']
# tokenize the raw data like HF
tokenized_train = train.map(prepare_validation_features, batched=True, remove_columns=train.column_names)
# we shuffled our data during training, so let's say we trained these 4 examples
training_order = [99, 34, 2, 45]
# to score SQuAD, I need the original dataset (i.e., the non-tokenized text data) to be in the same order as the training order
# but the tokenized data and non-tokenized data are not the same length, so the indices no longer line up
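# illustrative sanity check (exact counts depend on max_length / doc_stride):
print(len(train))            # raw SQuAD v2 training examples
print(len(tokenized_train))  # tokenized features; larger, since long contexts are split into several features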
# items that ARE shared across datasets are example ids
# let's find the example ids of the examples we trained on
filtered_df = tokenized_train.select(indices=training_order) # contains 4 items now
# now what were the example ids of the trained examples?
example_ids = filtered_df['example_id']
# '572ec21cdfa6aa1500f8d34e',
# '5728ac583acd2414000dfcae',
# '5ace5a2b32bba1001ae4a387',
# '5a11c08c06e79900185c354b'
# OK, since this is my training order and thus my scoring order, I need to organize the scoring data in the same order
# STUCK: no obvious way to order the dataset according to a column's values like those strings.
# I need `train` to contain only the 4 examples whose ids match the strings above
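# the closest thing I can come up with is to hand-roll the lookup myself (a sketch,
# assuming each id appears exactly once in `train`), rather than use a built-in datasets method:
id_to_row = {ex_id: row for row, ex_id in enumerate(train['id'])}
scoring_data = train.select([id_to_row[ex_id] for ex_id in example_ids])
# scoring_data now holds the 4 raw examples, in the same order as example_ids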
from datasets import load_dataset, load_metric
from transformers import BertTokenizerFast, BertForQuestionAnswering

# assumption: any BERT checkpoint works here; shown so that `tokenizer` below is defined
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

max_length = 384  # The maximum length of a feature (question and context)
doc_stride = 128  # The authorized overlap between two parts of the context when splitting is needed.
pad_on_right = tokenizer.padding_side == "right"
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples