Image dataset performance when using map

I am attempting a simple inference task on a collection of images stored in a Dataset that I have built using the approach below.

from datasets import Dataset, Features, Image, Value

data = {"image": ["/paths/to/files.jpg"], "image_path": ["/paths/to/files.jpg"]}
features = Features({"image": Image(), "image_path": Value(dtype="string")})
dataset = Dataset.from_dict(data, features=features)

The task is essentially modeled on this example: Image Similarity with Hugging Face Datasets and Transformers

I have two related questions about performance.

  1. When I call dataset.map(extract_embeddings, batched=True, batch_size=batch_size), there is an enormous performance penalty if I fail to specify remove_columns=["image"]. A profiler shows that almost all of the additional time is spent in arrow_writer.py, so something is apparently being written after the embeddings are computed; perhaps the entire dataset is being rewritten incrementally to some staging/cache location? Passing keep_in_memory=True does not change the timing. The difference is so stark that I'm surprised it isn't called out in the example, and I wonder whether there is documentation on how a Dataset behaves when a new column is (incrementally?) added to it. Am I building the dataset wrong? Is it normal to have to add remove_columns? (A sketch of my map call is included after this list.)

  2. When I load the model onto a GPU, there is a large up-front delay (about 7 seconds, and it does not obviously depend on the data size) before the map process starts; the delay is absent when running on the CPU. A profiler attributes the difference to update_fingerprint. Why would that step behave differently? (See the second sketch after this list.)
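For reference, here is a minimal sketch of what my extract_embeddings and map call look like. The checkpoint name, the batch_size value, and the exact embedding extraction are placeholders rather than the blog post's exact code:

import torch
from transformers import AutoImageProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_ckpt = "google/vit-base-patch16-224-in21k"  # placeholder checkpoint
processor = AutoImageProcessor.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt).to(device)
batch_size = 32  # placeholder

def extract_embeddings(batch):
    # batch["image"] is a list of PIL images decoded by the Image() feature
    images = [img.convert("RGB") for img in batch["image"]]
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # CLS-token embeddings, one vector per image
    batch["embeddings"] = outputs.last_hidden_state[:, 0].cpu().numpy()
    return batch

# Fast path: dropping the "image" column so the decoded images are not
# re-encoded and written back out (which is where arrow_writer.py shows up).
embeddings_ds = dataset.map(
    extract_embeddings,
    batched=True,
    batch_size=batch_size,
    remove_columns=["image"],
)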
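On question 2, my working assumption (unverified) is that computing the new fingerprint requires hashing the transform, and since extract_embeddings closes over the model, that hashing becomes much slower once the model lives on the GPU. As a diagnostic, I believe passing an explicit new_fingerprint to map skips that hashing step; the string below is just a placeholder:

# Assumption: when new_fingerprint is supplied, it is used directly instead of the
# value computed by update_fingerprint (which would hash extract_embeddings and the
# model it closes over). This also changes caching behavior, so diagnostic use only.
embeddings_ds = dataset.map(
    extract_embeddings,
    batched=True,
    batch_size=batch_size,
    remove_columns=["image"],
    new_fingerprint="extract-embeddings-v1",  # placeholder string
)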

I’m interested in getting a better understanding of these and other performance issues, so any insight or recommended reading to understand the API better is appreciated!