HF Datasets best practices

Hello everyone!

We are building a generic cookiecutter template to bootstrap PyTorch projects and to avoid writing boilerplate.

Among other things, we integrated the HF Datasets library.

This means that any new project generated with this template already has a Hugging Face dataset configured and ready to use for training a neural network :smile:

I would really appreciate any feedback on how to improve this integration – are there any best practices that we are not using?

The interesting files are the following:

  • .../data/datamodule.py, where the (torchvision) transforms are applied with set_transform and the DataLoaders are created (see the first sketch below).
  • .../utils/hf_io.py, which handles the loading of the train/val/test splits and some basic pre-processing (see the second sketch below).
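
For concreteness, here is a minimal sketch of the set_transform + DataLoader pattern the datamodule follows. The "image" column, the transform pipeline, and all parameter values are hypothetical placeholders, not the template's actual code:

```python
from torch.utils.data import DataLoader
from torchvision import transforms

train_tfms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def apply_train_tfms(batch):
    # set_transform hands the transform a dict of columns for the requested
    # items; it runs lazily on access, so nothing extra is written to disk
    batch["image"] = [train_tfms(img) for img in batch["image"]]
    return batch

def build_train_loader(train_split, batch_size=64):
    # train_split is a datasets.Dataset; it works directly as a
    # map-style PyTorch dataset
    train_split.set_transform(apply_train_tfms)
    return DataLoader(
        train_split,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
    )
```

And a rough sketch of the split loading done in hf_io.py, again with hypothetical names; carving a validation split out of train is just an illustration for hub datasets that only ship train/test, the real file may do this differently:

```python
from datasets import DatasetDict, load_dataset

def load_splits(name, val_fraction=0.1, seed=42):
    # load_dataset returns a DatasetDict with whatever splits the hub provides
    raw = load_dataset(name)
    # deterministically split off a validation set from train
    split = raw["train"].train_test_split(test_size=val_fraction, seed=seed)
    return DatasetDict(
        train=split["train"],
        validation=split["test"],
        test=raw["test"],
    )
```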

I am particularly interested in speed-related improvements.
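
To make the question concrete, this is the kind of tuning I have in mind; a rough sketch with hypothetical names (some_dataset, preprocess_fn) and illustrative parameter values, not the template's defaults:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("some_dataset", split="train")  # placeholder dataset

def preprocess_fn(batch):
    # placeholder for deterministic preprocessing (resizing, tokenizing, ...)
    return batch

# heavy, deterministic preprocessing done once with a batched, multi-process
# .map; the result is cached on disk by fingerprint
ds = ds.map(preprocess_fn, batched=True, num_proc=4)

# per-epoch augmentation stays in set_transform (lazy, not cached), while
# decoding is pushed into DataLoader worker processes
loader = DataLoader(
    ds,
    batch_size=128,
    num_workers=8,            # parallel decoding/transforms in workers
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid worker startup cost each epoch
)
```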

Thanks!