HF Datasets best practices

Hello everyone!

We are building a generic cookiecutter template to bootstrap PyTorch projects and to avoid writing boilerplate.

Among other things, we integrated the HF Datasets library.

This means that any new project generated with this template already has a Hugging Face dataset configured and ready to use for training a neural network :smile:

I would really appreciate any feedback on how to improve this integration – are there any best practices that we are not using?

The interesting files are the following:

  • .../data/datamodule.py, where the (torchvision) transforms are applied with set_transform and the DataLoaders are created (see the first sketch below).
  • .../utils/hf_io.py, which handles the loading of the train/val/test splits and some basic pre-processing (see the second sketch below).
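
For concreteness, here is a minimal sketch of the set_transform + DataLoader pattern the datamodule follows. The "image" column, the transform pipeline, and all parameter values are hypothetical placeholders, not the template's actual code:

```python
from torch.utils.data import DataLoader
from torchvision import transforms

train_tfms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def apply_train_tfms(batch):
    # set_transform hands the transform a dict of columns for the requested
    # items; it runs lazily on access, so nothing extra is written to disk
    batch["image"] = [train_tfms(img) for img in batch["image"]]
    return batch

def build_train_loader(train_split, batch_size=64):
    # train_split is a datasets.Dataset; it works directly as a
    # map-style PyTorch dataset
    train_split.set_transform(apply_train_tfms)
    return DataLoader(
        train_split,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
    )
```

And a rough sketch of the split loading done in hf_io.py, again with hypothetical names; carving a validation split out of train is just an illustration for hub datasets that only ship train/test, the real file may do this differently:

```python
from datasets import DatasetDict, load_dataset

def load_splits(name, val_fraction=0.1, seed=42):
    # load_dataset returns a DatasetDict with whatever splits the hub provides
    raw = load_dataset(name)
    # deterministically split off a validation set from train
    split = raw["train"].train_test_split(test_size=val_fraction, seed=seed)
    return DatasetDict(
        train=split["train"],
        validation=split["test"],
        test=raw["test"],
    )
```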

I am particularly interested in speed-related improvements.
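
To make the question concrete, this is the kind of tuning I have in mind; a rough sketch with hypothetical names (some_dataset, preprocess_fn) and illustrative parameter values, not the template's defaults:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("some_dataset", split="train")  # placeholder dataset

def preprocess_fn(batch):
    # placeholder for deterministic preprocessing (resizing, tokenizing, ...)
    return batch

# heavy, deterministic preprocessing done once with a batched, multi-process
# .map; the result is cached on disk by fingerprint
ds = ds.map(preprocess_fn, batched=True, num_proc=4)

# per-epoch augmentation stays in set_transform (lazy, not cached), while
# decoding is pushed into DataLoader worker processes
loader = DataLoader(
    ds,
    batch_size=128,
    num_workers=8,            # parallel decoding/transforms in workers
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid worker startup cost each epoch
)
```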

Thanks!