Significant performance difference between two shapes using Array2D features

galopyz · August 26, 2023, 10:25pm

Hello,

I was following Part 2 of the FastAI course and trying Hugging Face datasets with the Array2D feature on the Fashion MNIST data. I also set the format type to PyTorch to gain performance. The images are 28 by 28, so I used Array2D(shape=(28, 28), dtype='float32'), but creating dataloaders and looping through them was much slower than with Array2D(shape=(1, 28*28), dtype='float32').

When the shape was (28, 28), it took 33 seconds to loop through the dataloader, but it only took 6 seconds for (1, 28*28). Is this performance difference expected? If so, how can I get better performance with the (28, 28) shape?
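For anyone who wants to reproduce a comparison like this, here is a minimal timing sketch. It is not the code from my notebook: `time_loop` is a hypothetical helper name, and `range(1000)` only stands in for a real dataloader.

```python
import time
from typing import Iterable, Tuple


def time_loop(batches: Iterable) -> Tuple[float, int]:
    """Iterate over `batches` once; return (elapsed seconds, number of batches)."""
    start = time.perf_counter()
    count = 0
    for _ in batches:
        count += 1
    return time.perf_counter() - start, count


# Stand-in iterable; pass the actual dataloader here instead.
elapsed, n_batches = time_loop(range(1000))
print(f"{n_batches} batches in {elapsed:.4f}s")
```

Running the same helper on one dataloader per Array2D shape gives directly comparable numbers.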

Here is a Jupyter notebook showing what I did.

Thank you.

galopyz · September 4, 2023, 8:16pm

Does anybody have an answer to this question?

Thank you.

Bjornedt · September 5, 2023, 9:33pm

Hi!

The Hugging Face datasets library uses Arrow under the hood. Arrow is optimized for columnar storage, and one possible explanation for the performance difference is how Arrow handles the nesting of multi-dimensional arrays. Accessing and converting data stored as a single flat row (shape=(1, 28*28)) could be more efficient in this context than data stored as 28 rows of 28 values (shape=(28, 28)).

Some ideas you could try to optimize performance with the (28, 28) shape:

Reshape the data after loading but before feeding it into your model, so you benefit from the fast loading of the one-dimensional array.
Check that you have the latest versions of the datasets library and Arrow.
Look into parallel processing or asynchronous data loading to speed up the data feeding process.

Hope this helps!

galopyz · September 5, 2023, 10:43pm

I appreciate that.

I left the shape as (1, 28*28) for the dataloaders and reshaped each batch to (28, 28) right before using it for training.
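Roughly, the reshape step looks like the sketch below. This is an illustration rather than my exact notebook code: the random tensors stand in for the flattened Fashion MNIST column, and `to_image_batch` is a hypothetical helper name.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the flattened column: 64 rows of shape (1, 784).
flat_images = torch.rand(64, 1, 28 * 28)
labels = torch.randint(0, 10, (64,))

loader = DataLoader(TensorDataset(flat_images, labels), batch_size=16)


def to_image_batch(flat_batch: torch.Tensor) -> torch.Tensor:
    """Reshape a (batch, 1, 784) tensor to (batch, 28, 28)."""
    return flat_batch.reshape(-1, 28, 28)


shapes = []
for xb, yb in loader:
    images = to_image_batch(xb)  # reshape right before the training step
    shapes.append(tuple(images.shape))
```

Each row of 784 values becomes 28 rows of 28 values in the same order, so the pixel layout matches what the (28, 28) feature would have produced.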
