Image dataset performance when using map

I am attempting a simple inference task on a collection of images stored in a Dataset that I have built using the approach below.

from datasets import Dataset, Features, Image, Value

data = {"image": ["/paths/to/files.jpg"], "image_path": ["/paths/to/files.jpg"]}
features = Features({"image": Image(), "image_path": Value(dtype="string")})
dataset = Dataset.from_dict(data, features=features)

The task is essentially modeled on this example: Image Similarity with Hugging Face Datasets and Transformers

I have two related questions about performance.

  1. When I call dataset.map(extract_embeddings, batched=True, batch_size=batch_size), there is an enormous performance penalty if I fail to specify remove_columns=["image"]. A profiler shows that almost all of the additional time is spent in arrow_writer.py, so something is apparently being written after the embeddings are computed; perhaps the entire dataset is being rewritten incrementally to some staging/cache location? Passing keep_in_memory=True does not change the timing. The difference is so stark that I'm surprised it isn't called out in the example, and I wonder whether there is documentation on how a Dataset behaves when a new column is (incrementally?) added to it. Am I building the dataset wrong? Is it normal to have to add remove_columns? (A sketch of my map call is included after this list.)

  2. When I load the model onto a GPU, there is a large up-front delay (about 7 seconds, and it does not obviously depend on the data size) before the map process starts; the delay is absent when running on the CPU. A profiler attributes the difference to update_fingerprint. Why would that step behave differently? (See the second sketch after this list.)
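For reference, here is a minimal sketch of what my extract_embeddings and map call look like. The checkpoint name, the batch_size value, and the exact embedding extraction are placeholders rather than the blog post's exact code:

import torch
from transformers import AutoImageProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_ckpt = "google/vit-base-patch16-224-in21k"  # placeholder checkpoint
processor = AutoImageProcessor.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt).to(device)
batch_size = 32  # placeholder

def extract_embeddings(batch):
    # batch["image"] is a list of PIL images decoded by the Image() feature
    images = [img.convert("RGB") for img in batch["image"]]
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # CLS-token embeddings, one vector per image
    batch["embeddings"] = outputs.last_hidden_state[:, 0].cpu().numpy()
    return batch

# Fast path: dropping the "image" column so the decoded images are not
# re-encoded and written back out (which is where arrow_writer.py shows up).
embeddings_ds = dataset.map(
    extract_embeddings,
    batched=True,
    batch_size=batch_size,
    remove_columns=["image"],
)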
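On question 2, my working assumption (unverified) is that computing the new fingerprint requires hashing the transform, and since extract_embeddings closes over the model, that hashing becomes much slower once the model lives on the GPU. As a diagnostic, I believe passing an explicit new_fingerprint to map skips that hashing step; the string below is just a placeholder:

# Assumption: when new_fingerprint is supplied, it is used directly instead of the
# value computed by update_fingerprint (which would hash extract_embeddings and the
# model it closes over). This also changes caching behavior, so diagnostic use only.
embeddings_ds = dataset.map(
    extract_embeddings,
    batched=True,
    batch_size=batch_size,
    remove_columns=["image"],
    new_fingerprint="extract-embeddings-v1",  # placeholder string
)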

I’m interested in getting a better understanding of these and other performance issues, so any insight or recommended reading to understand the API better is appreciated!