I am attempting a simple inference task on a collection of images stored in a `Dataset` that I have built using the approach below:
```python
from datasets import Dataset, Features, Image, Value

paths = ["/paths/to/files.jpg"]
data = {"image": paths, "image_path": paths}
features = Features({"image": Image(), "image_path": Value(dtype="string")})
dataset = Dataset.from_dict(data, features=features)
```
The task is essentially modeled on this example: Image Similarity with Hugging Face Datasets and Transformers
I have two related questions about performance.
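For context, the function I pass to `map` looks roughly like the one in that post. Here is a minimal sketch; the checkpoint name and the mean-pooling are placeholders rather than my exact setup:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

checkpoint = "google/vit-base-patch16-224-in21k"  # placeholder checkpoint
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

def extract_embeddings(batch):
    # With the Image() feature, batch["image"] arrives as a list of decoded PIL images.
    inputs = processor(images=batch["image"], return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one vector per image (placeholder pooling).
    batch["embeddings"] = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
    return batch
```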
- When I call `dataset.map(extract_embeddings, batched=True, batch_size=batch_size)`, there is an enormous performance penalty if I fail to specify `remove_columns=["image"]`. A profiler shows that almost all of the additional time is spent in `arrow_writer.py`. Apparently something is being written after the embeddings are computed; perhaps the entire dataset is being rewritten incrementally in some staging/cache location? Passing `keep_in_memory=True` does not affect the performance. The difference is so stark that I’m surprised it isn’t called out in the example, and I wonder whether there is any documentation on how a `Dataset` behaves when a new column is (incrementally?) added to it. Am I building the dataset wrong? Is it normal to have to add `remove_columns`? (The exact calls I’m comparing are sketched after this list.)
- When I load the model onto a GPU, there is a large up-front delay (about 7 seconds, and it doesn’t obviously depend on the data size) before the `map` process starts that is not present when working on a CPU. A profiler attributes the difference to `update_fingerprint`. Why would that be different? (The comparison I’m timing is also sketched below.)
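For the first question, these are the two calls I am comparing; `remove_columns` is the only thing that changes (a minimal sketch, with `batch_size` standing in for whatever value I actually use):

```python
# Slow: the output keeps the decoded "image" column in addition to the new embeddings.
dataset_slow = dataset.map(extract_embeddings, batched=True, batch_size=batch_size)

# Fast: the "image" column is dropped, so only the path and the embeddings remain.
dataset_fast = dataset.map(
    extract_embeddings,
    batched=True,
    batch_size=batch_size,
    remove_columns=["image"],
)
```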
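For the second question, the comparison is literally the same call with the model moved to a different device beforehand; nothing in the `map` invocation itself changes (again a sketch, assuming `extract_embeddings` closes over the module-level `model` as above):

```python
# CPU: map starts producing batches almost immediately.
model = model.to("cpu")
dataset.map(extract_embeddings, batched=True, batch_size=batch_size, remove_columns=["image"])

# GPU: same call, but roughly 7 s pass (in update_fingerprint, per the profiler) before the first batch runs.
model = model.to("cuda")
dataset.map(extract_embeddings, batched=True, batch_size=batch_size, remove_columns=["image"])
```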
I’m interested in getting a better understanding of these and other performance issues, so any insight or recommended reading for understanding the API better would be appreciated!