Accessing individual MNIST examples through datasets is about 7 times slower than through torchvision:
import torchvision
import datasets

# Load MNIST three ways: datasets (memory-mapped Arrow), datasets kept in memory, and torchvision
mnist_hf = datasets.load_dataset("mnist", split="train")
mnist_hf_inmem = datasets.load_dataset("mnist", split="train", keep_in_memory=True)
mnist_tv = torchvision.datasets.MNIST("~/home", train=True, download=True)

# Access every training example once, one by one
def f(data):
    for ids in range(60000):
        data[ids]

%timeit f(mnist_hf)
%timeit f(mnist_hf_inmem)
%timeit f(mnist_tv)
5.21 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.06 s ± 86.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
770 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is this due to the storage format? Can something be done about it? At this speed, a training step for a simple convolutional network is dominated by dataset access rather than computation…
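For reference, one workaround I'm considering is reading examples in batches instead of one at a time, since slicing a Dataset returns a dict of column lists and pays the __getitem__ overhead once per batch rather than once per example. A minimal sketch (the batch size of 1000 is arbitrary, and I'm assuming the columns are named "image" and "label"; I haven't measured this variant here):

# Sketch: batched access instead of per-example access
def f_batched(data, batch_size=1000):
    for start in range(0, 60000, batch_size):
        # Slicing returns a dict of columns, e.g. {"image": [...], "label": [...]}
        batch = data[start : start + batch_size]
        for image, label in zip(batch["image"], batch["label"]):
            pass  # same examples as f(), accessed batch-wise

%timeit f_batched(mnist_hf)

Even if this is faster, it only sidesteps the per-example overhead rather than explaining it, so I'd still like to understand where the slowdown comes from.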