Why is simply accessing dataset features so slow?

See this Colab notebook for runnable versions of the code snippets pasted below.

I’ve been playing with the SQuAD dataset and I’ve noticed that simply accessing features of mapped versions of the dataset is extremely slow:

from datasets import load_dataset
from transformers import AutoTokenizer

# Let's play with the validation set.
squad = load_dataset("squad")
ds = squad["validation"]

# standard tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# tokenize the dataset with standard hyperparameters.
f = lambda x: tokenizer(x["question"], x["context"], max_length=384, stride=128, truncation="only_second", padding="max_length", return_overflowing_tokens=True, return_offsets_mapping=True)
mapped_ds = ds.map(f, batched=True, remove_columns=ds.column_names)

Now, simply executing the statements – just accessing the features –

attention_mask = mapped_ds["attention_mask"]
input_ids = mapped_ds["input_ids"]
offset_mapping = mapped_ds["offset_mapping"]
overflow_to_sample_mapping = mapped_ds["overflow_to_sample_mapping"]
token_type_ids = mapped_ds["token_type_ids"]

takes about 12 seconds. Why does this take so long? My understanding from the docs is that an HF dataset is stored as a collection of feature columns. (Arrow table format?) Were this so, each assignment in the above snippet should just be assigning a pointer. There’s no data copying happening here, right?

Digging a little deeper, assigning attention_mask, input_ids, and token_type_ids each takes 1-1.5 seconds. Assigning overflow_to_sample_mapping takes milliseconds, and assigning offset_mapping takes 8-9 seconds. It seems that accessing features with more complex structure takes longer. I’m very curious what these assignment statements are actually doing under the hood!

On a related note, looping over mapped_ds takes significantly longer than looping over its features, extracted and zipped:

for x in mapped_ds:

takes 15 seconds while

for x in zip(attention_mask, input_ids, offset_mapping, overflow_to_sample_mapping, token_type_ids):

takes 1.5 seconds.

This does have user-facing impact (right?): If you’re going to loop over the evaluation set in a custom evaluation script (overriding a trainer’s evaluate method, say), you’re better off storing the individual features of the evaluation dataset on the trainer instance once and for all and looping over their zip on each evaluation step (of which there can be many!), rather than iterating over the evaluation dataset itself each time. (A custom evaluation script of this sort is used to compute “exact match” and “F1” metrics on the SQuAD data. See the evaluate method of QuestionAnsweringTrainer in this file and the postprocess_qa_predictions function in this one.)


This behavior is expected.

dataset[column] loads the entire dataset column into memory (which is why this part seems so slow to you), so operating directly on this data will be faster than operating on the dataset created with load_dataset, which is memory-mapped, i.e., stored in an Arrow file on disk by default. However, in many cases a dataset or its transforms are too big to fit in RAM, so keeping them in memory is not an option and we’d rather pay a penalty (with respect to read/write operations) for having them stored in a file.

If you are sure that you have enough RAM, you can load the dataset as follows to have it in memory:

load_dataset("squad", split="validation", keep_in_memory=True)

Makes perfect sense. Thanks! Brief follow-up: When you access a row of the dataset via index, dataset[i], it’s reading from disk unless keep_in_memory is set to true on the dataset?

Yes. You can check at any time whether your dataset is on disk or in memory with dset.cache_files, which will return an empty list in the former case.
