I am attempting a simple inference task on a collection of images stored in a `Dataset` that I have built using the approach below:
```python
from datasets import Dataset, Features, Image, Value

paths = ["/paths/to/files.jpg"]
data = {"image": paths, "image_path": paths}
features = Features({"image": Image(), "image_path": Value(dtype="string")})
dataset = Dataset.from_dict(data, features=features)
```
The task is essentially modeled on this example: Image Similarity with Hugging Face Datasets and Transformers
I have two related questions about performance.
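For context, the function I pass to `map` looks roughly like the one in that post. Here is a minimal sketch; the checkpoint name and the mean-pooling are placeholders rather than my exact setup:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

checkpoint = "google/vit-base-patch16-224-in21k"  # placeholder checkpoint
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

def extract_embeddings(batch):
    # With the Image() feature, batch["image"] arrives as a list of decoded PIL images.
    inputs = processor(images=batch["image"], return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one vector per image (placeholder pooling).
    batch["embeddings"] = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
    return batch
```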
- When I call `dataset.map(extract_embeddings, batched=True, batch_size=batch_size)`, there is an enormous performance penalty if I fail to specify `remove_columns=["image"]`. A profiler shows that almost all of the additional time is spent in `arrow_writer.py`. Apparently something is being written after the embeddings are computed; perhaps the entire dataset is being rewritten incrementally in some staging/cache location? Passing `keep_in_memory=True` does not affect the performance. The difference is so stark that I’m surprised it isn’t called out in the example, and I wonder whether there is any documentation on how a `Dataset` behaves when a new column is (incrementally?) added to it. Am I building the dataset wrong? Is it normal to have to add `remove_columns`? (The exact calls I’m comparing are sketched after this list.)
- When I load the model onto a GPU, there is a large up-front delay (about 7 seconds, and it doesn’t obviously depend on the data size) before the `map` process starts that is not present when working on a CPU. A profiler attributes the difference to `update_fingerprint`. Why would that be different? (The comparison I’m timing is also sketched below.)
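For the first question, these are the two calls I am comparing; `remove_columns` is the only thing that changes (a minimal sketch, with `batch_size` standing in for whatever value I actually use):

```python
# Slow: the output keeps the decoded "image" column in addition to the new embeddings.
dataset_slow = dataset.map(extract_embeddings, batched=True, batch_size=batch_size)

# Fast: the "image" column is dropped, so only the path and the embeddings remain.
dataset_fast = dataset.map(
    extract_embeddings,
    batched=True,
    batch_size=batch_size,
    remove_columns=["image"],
)
```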
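For the second question, the comparison is literally the same call with the model moved to a different device beforehand; nothing in the `map` invocation itself changes (again a sketch, assuming `extract_embeddings` closes over the module-level `model` as above):

```python
# CPU: map starts producing batches almost immediately.
model = model.to("cpu")
dataset.map(extract_embeddings, batched=True, batch_size=batch_size, remove_columns=["image"])

# GPU: same call, but roughly 7 s pass (in update_fingerprint, per the profiler) before the first batch runs.
model = model.to("cuda")
dataset.map(extract_embeddings, batched=True, batch_size=batch_size, remove_columns=["image"])
```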
I’m interested in getting a better understanding of these and other performance issues, so any insight or recommended reading for understanding the API better would be appreciated!