I was following Part 2 of the fast.ai course and trying Hugging Face datasets with the Array2D feature on the Fashion-MNIST data. I also set the format type to PyTorch to gain performance. The image size is 28 by 28, so I used
Array2D(shape=(28, 28), dtype='float32'), but it was very slow compared to using
Array2D(shape=(1, 28*28), dtype='float32') when I created dataloaders and looped through them.
When the shape was (28, 28), it took 33 seconds to loop through the dataloader, but it only took 6 seconds for (1, 28*28). Is this performance difference expected? If yes, how can I gain more performance when using the (28, 28) shape?
Here is a Jupyter notebook with the code, so you can see what I did.
Does anybody have an answer to this question?
The Hugging Face datasets library uses Apache Arrow under the hood. Arrow is optimized for columnar storage, and one possible explanation for the performance difference is how Arrow handles multi-dimensional arrays versus one-dimensional arrays: accessing and decoding data stored as a one-dimensional array (shape=(1, 28*28)) can be more efficient than a two-dimensional array (shape=(28, 28)).
Some ideas you could try to improve performance with the (28, 28) shape:
- Reshape the data after loading but before feeding it into your model, so you benefit from the fast loading of the one-dimensional array.
- Check that you have the latest versions of the datasets library and PyArrow.
- Look into parallel processing or asynchronous data loading to speed up the data feeding process.
Hope this helps!
I appreciate that.
I left the shape as (1, 28*28) for the dataloaders and switched to the (28, 28) shape right before using the batch for training.
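For anyone reading along, a minimal sketch of that reshape step (batch size of 64 and the random batch are made up for illustration; in practice the batch comes from your dataloader):

```python
import torch

# Stand-in for a batch produced by a dataloader whose dataset was
# stored with the flat Array2D(shape=(1, 28*28)) feature
batch = torch.rand(64, 1, 28 * 28)

# Reshape to 28x28 images right before feeding the model;
# view() is zero-copy on a contiguous tensor, so this is cheap
images = batch.view(-1, 28, 28)
print(images.shape)  # torch.Size([64, 28, 28])
```

This way the fast flat storage format is kept for loading, and the model still sees proper 2D images.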