See this Colab notebook for runnable versions of the code snippets pasted below.
I’ve been playing with the SQuAD dataset and I’ve noticed that simply accessing features of mapped versions of the dataset is extremely slow:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Let's play with the validation set.
squad = load_dataset("squad")
ds = squad["validation"]

# Standard tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the dataset with standard hyperparameters.
f = lambda x: tokenizer(
    x["question"],
    x["context"],
    max_length=384,
    stride=128,
    truncation="only_second",
    padding="max_length",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
mapped_ds = ds.map(f, batched=True, remove_columns=ds.column_names)
```
Now, simply executing the statements – just accessing the features –
```python
attention_mask = mapped_ds["attention_mask"]
input_ids = mapped_ds["input_ids"]
offset_mapping = mapped_ds["offset_mapping"]
overflow_to_sample_mapping = mapped_ds["overflow_to_sample_mapping"]
token_type_ids = mapped_ds["token_type_ids"]
```
takes about 12 seconds. Why does this take so long? My understanding from the docs is that an HF dataset is stored as a collection of feature columns. (Arrow table format?) Were this so, each assignment in the above snippet should just be assigning a pointer. There’s no data copying happening here, right?
Digging a little deeper: assigning `attention_mask`, `input_ids`, and `token_type_ids` each takes 1-1.5 seconds, assigning `overflow_to_sample_mapping` takes milliseconds, and assigning `offset_mapping` takes 8-9 seconds. It seems that accessing features with more complex structure takes longer. I'm very curious what these assignment statements are actually doing under the hood!
On a related note, looping over `mapped_ds` takes significantly longer than looping over its features, extracted and zipped:
```python
for x in mapped_ds:
    pass
```
takes 15 seconds while
```python
for x in zip(attention_mask, input_ids, offset_mapping,
             overflow_to_sample_mapping, token_type_ids):
    pass
```
takes 1.5 seconds.
This does have user-facing impact (right?): if you're going to loop over the evaluation set in a custom evaluation script (overriding a trainer's `evaluate` method, say), you're better off storing the individual features of the evaluation dataset on the trainer instance once and for all and looping over their zip on each evaluation step (of which there can be many!), rather than iterating over the evaluation dataset itself each time. (A custom evaluation script of this sort is used to compute "exact match" and "F1" metrics on the SQuAD data. See the `evaluate` method of `QuestionAnsweringTrainer` in this file and the `postprocess_qa_predictions` function in this one.)
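For concreteness, a minimal sketch of that caching pattern (the mixin, method names, and `fake_ds` stand-in are illustrative only, not part of the actual `QuestionAnsweringTrainer` API):

```python
EVAL_FEATURES = (
    "attention_mask",
    "input_ids",
    "offset_mapping",
    "overflow_to_sample_mapping",
    "token_type_ids",
)

class CachingEvalMixin:
    def cache_eval_features(self, mapped_ds):
        # Pay the column-extraction cost once, up front.
        self._eval_features = tuple(mapped_ds[name] for name in EVAL_FEATURES)

    def iter_eval_features(self):
        # Cheap on every evaluation step: zip over plain Python lists.
        return zip(*self._eval_features)

# Demo with a stand-in "dataset" (a dict indexes by column name
# just like a Dataset does).
fake_ds = {
    "attention_mask": [[1, 1]],
    "input_ids": [[101, 102]],
    "offset_mapping": [[(0, 1), (1, 2)]],
    "overflow_to_sample_mapping": [0],
    "token_type_ids": [[0, 0]],
}
trainer = CachingEvalMixin()
trainer.cache_eval_features(fake_ds)
rows = list(trainer.iter_eval_features())
```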