See this Colab notebook for runnable versions of the code snippets pasted below.
I’ve been playing with the SQuAD dataset and I’ve noticed that simply accessing features of mapped versions of the dataset is extremely slow:
from datasets import load_dataset
from transformers import AutoTokenizer

# Let's play with the validation set.
squad = load_dataset("squad")
ds = squad["validation"]
# standard tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# tokenize the dataset with standard hyperparameters
f = lambda x: tokenizer(x["question"], x["context"], max_length=384, stride=128, truncation="only_second", padding="max_length", return_overflowing_tokens=True, return_offsets_mapping=True)
mapped_ds = ds.map(f, batched=True, remove_columns=ds.column_names)
Now, simply executing the statements – just accessing the features –
attention_mask = mapped_ds["attention_mask"]
input_ids = mapped_ds["input_ids"]
offset_mapping = mapped_ds["offset_mapping"]
overflow_to_sample_mapping = mapped_ds["overflow_to_sample_mapping"]
token_type_ids = mapped_ds["token_type_ids"]
takes about 12 seconds. Why does this take so long? My understanding from the docs is that an HF dataset is stored as a collection of feature columns (in Arrow table format?). If that’s so, each assignment in the snippet above should just be assigning a pointer; there’s no data copying happening here, right?
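To sanity-check the “just a pointer” picture, a quick timing experiment along these lines is helpful. This is a sketch, and it assumes mapped_ds.data exposes the Arrow table backing the dataset and its column() accessor, which is how I read the docs:
import time

start = time.perf_counter()
arrow_col = mapped_ds.data.column("attention_mask")  # a handle on the Arrow column, no conversion
print(f"Arrow column handle: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
attention_mask = mapped_ds["attention_mask"]  # materializes the column as Python objects
print(f"Dataset indexing:    {time.perf_counter() - start:.4f}s")
If the first number is effectively zero while the second accounts for a chunk of the 12 seconds, the cost isn’t in locating the data but in converting it out of Arrow into Python objects.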
Digging a little deeper: assigning attention_mask, input_ids, and token_type_ids each takes 1-1.5 seconds; assigning overflow_to_sample_mapping takes milliseconds; and assigning offset_mapping takes 8-9 seconds. It seems that accessing features with more complex structure takes longer. I’m very curious what these assignment statements are actually doing under the hood!
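My best guess at what’s happening under the hood, and it is only a guess: with the default “python” output format, indexing a column converts the Arrow data into plain Python objects, and deeply nested columns like offset_mapping (a list of (start, end) pairs per row) are the most expensive to convert. A rough way to test that hypothesis (the equivalence below is my assumption, not something verified against the datasets source):
import time

start = time.perf_counter()
via_arrow = mapped_ds.data.column("offset_mapping").to_pylist()  # explicit Arrow -> Python conversion
print(f"to_pylist():   {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
via_dataset = mapped_ds["offset_mapping"]  # what the original snippet does
print(f"column access: {time.perf_counter() - start:.1f}s")
If the two numbers land in the same ballpark, the time is going into the Arrow-to-Python conversion rather than into any lookup.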
On a related note, looping over mapped_ds takes significantly longer than looping over its features, extracted and zipped:
for x in mapped_ds:
    pass
takes 15 seconds, while
for x in zip(attention_mask, input_ids, offset_mapping, overflow_to_sample_mapping, token_type_ids):
    pass
takes 1.5 seconds.
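A minimal timing harness for that comparison looks something like this (a sketch; exact numbers will vary, and it assumes the column variables from the earlier snippet are already materialized):
import time

def time_loop(label, iterable):
    # iterate once, discarding the rows, and report wall-clock time
    start = time.perf_counter()
    for _ in iterable:
        pass
    print(f"{label}: {time.perf_counter() - start:.1f}s")

time_loop("row-by-row over mapped_ds", mapped_ds)
time_loop("zip over pre-extracted columns",
          zip(attention_mask, input_ids, offset_mapping, overflow_to_sample_mapping, token_type_ids))
The gap makes sense if each row access on mapped_ds pays its own small Arrow-to-Python conversion, while the zip version paid the whole conversion once up front.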
This does have user-facing impact (right?): if you’re going to loop over the evaluation set in a custom evaluation script (overriding a trainer’s evaluate method, say), you’re better off storing the individual features of the evaluation dataset on the trainer instance once and for all and looping over their zip on each evaluation step (of which there can be many!), rather than iterating over the evaluation dataset itself each time. (A custom evaluation script of this sort is used to compute “exact match” and “F1” metrics on the SQuAD data; see the evaluate method of QuestionAnsweringTrainer in this file and the postprocess_qa_predictions function in this one.)
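To make the suggestion concrete, here’s a hypothetical sketch of the caching pattern. The class name and attributes are mine, not those of the linked QuestionAnsweringTrainer; treat it as an illustration of the idea rather than the actual implementation:
from transformers import Trainer

class CachedEvalTrainer(Trainer):
    # Hypothetical subclass: extract the eval features once and reuse them on every evaluate() call.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        ds = self.eval_dataset
        # Pay the column-extraction cost once, at construction time.
        self._eval_columns = {name: ds[name] for name in ds.column_names}

    def evaluate(self, *args, **kwargs):
        # Custom evaluation loop: iterate over the cached, zipped columns rather than
        # over the Dataset itself on each of the (possibly many) evaluation steps.
        for row in zip(*self._eval_columns.values()):
            pass  # post-process predictions for this feature here (e.g. SQuAD exact match / F1)
        return {}  # return the metrics dict expected by the Trainer API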