Using PyTorch Dataset Class with Dataset Builder

Hello everyone,

I’ve built an autoregressive sequence model that generates characters (not a traditional language model) and hosted it on GitHub. My repo includes a somewhat complex PyTorch Dataset class with optional arguments that modify the dataset (e.g., max sequence length). Instead of simply uploading the dataset via from_generator or from_dict, I’d like to give users full access to the functionality of my PyTorch class through Hugging Face. Is creating a dataset loading script (dataset builder) the best option? Also, how can I reuse my custom PyTorch Dataset class from my repo (e.g., by cloning or pip installing)?

Thank you!

Hi there! Great work on building your autoregressive sequence model and hosting it on GitHub. To address your questions:

1. Using a Dataset Loading Script (Dataset Builder)

Creating a dataset loading script is indeed the best option if you want to make your dataset accessible through the Hugging Face datasets library. This allows users to seamlessly load your dataset via the load_dataset API, benefiting from built-in functionality like caching, slicing, and batching.

A loading script is particularly useful because it can expose all the customizability of your PyTorch Dataset class (e.g., setting max_sequence_length) through the datasets library’s config parameters. Here’s how you can approach it (a rough sketch follows the list below):

  • Create a dataset_infos.json file describing the dataset metadata.
  • Write a custom loading script (my_dataset.py) that processes and prepares the dataset, mapping its parameters to a Hugging Face builder configuration.
  • Hugging Face provides a guide for creating dataset scripts. This will help you structure and publish your dataset script properly.
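
Here’s a rough sketch of what such a script could look like; the class names, the features, and the URL below are placeholders, but the pattern of exposing your PyTorch class’s arguments through a custom BuilderConfig is the key idea:

    import datasets

    class MyCharDatasetConfig(datasets.BuilderConfig):
        """Config exposing the knobs of the original PyTorch Dataset."""

        def __init__(self, max_sequence_length=512, **kwargs):
            super().__init__(**kwargs)
            self.max_sequence_length = max_sequence_length

    class MyCharDataset(datasets.GeneratorBasedBuilder):
        BUILDER_CONFIG_CLASS = MyCharDatasetConfig
        BUILDER_CONFIGS = [MyCharDatasetConfig(name="default")]

        def _info(self):
            return datasets.DatasetInfo(
                features=datasets.Features({"text": datasets.Value("string")}),
            )

        def _split_generators(self, dl_manager):
            # Placeholder URL; point this at your hosted raw files.
            path = dl_manager.download("https://example.com/train.txt")
            return [datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": path},
            )]

        def _generate_examples(self, filepath):
            max_len = self.config.max_sequence_length
            with open(filepath, encoding="utf-8") as f:
                for idx, line in enumerate(f):
                    # The config parameter is applied at generation time.
                    yield idx, {"text": line.rstrip("\n")[:max_len]}

Users could then call load_dataset("your_username/your_dataset", max_sequence_length=256): keyword arguments that load_dataset doesn’t recognize itself are forwarded to the builder config.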

2. Reusing Your PyTorch Dataset Class

To allow users to reuse your existing PyTorch Dataset class:

  • Option 1: Package and Publish
    Package your repository as a Python package and publish it on PyPI. This makes it installable via pip install your_package_name. Add detailed documentation in your repo to guide users on how to integrate your PyTorch Dataset class.

    To package your code, include a setup.py or pyproject.toml in your repo, specifying dependencies and entry points.
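
    For example, a minimal setup.py might look like the sketch below; the package name, version, and dependency list are placeholders for your repo’s actual metadata:

    from setuptools import find_packages, setup

    setup(
        name="your_package_name",  # placeholder: your real package name
        version="0.1.0",
        packages=find_packages(),
        install_requires=["torch"],  # plus anything else your class needs
    )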

  • Option 2: Direct GitHub Integration
    Users can directly clone your repo or install it using pip from GitHub:

    pip install git+https://github.com/your_username/your_repo_name.git
    

    Make sure the repo structure allows for easy importing of your Dataset class.
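
    For instance, assuming a layout like your_repo_name/your_package_name/data.py that contains the Dataset class (all names here are placeholders), users could then write:

    from your_package_name.data import CharSequenceDataset

    dataset = CharSequenceDataset("path/to/data", max_sequence_length=256)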

  • Option 3: Integrate With Your Hugging Face Script
    If your PyTorch Dataset class has functionality that complements the Hugging Face dataset, you can incorporate its logic into the dataset loading script itself or provide instructions for combining both approaches in your documentation.

Feel free to share your repo link; I’d be happy to take a look and give more specific feedback if needed. Good luck! 🚀

Hello @Alanturner2, thanks for your detailed response!

It seems that using my PyTorch Dataset class directly is not a great option because it has random data augmentations and dataset modifiers (e.g., max_sequence_length). If I simply call my PyTorch class inside _generate_examples, two bad things would happen:

  1. The dataset would be cached with the data augmentations of the first pass through the data, essentially making the data static and deterministic.
  2. Every time I call load_dataset with a change in any of the modifiers (e.g., a different max_sequence_length), the entire dataset would be downloaded and processed again.

Please correct me if I’m wrong, but it seems that the dataset builder is designed for native Hugging Face datasets, which means that my best options are:

  1. Related to your option 3: essentially ignore my PyTorch dataset and re-write all of its functionality inside the dataset loading script. I could also use set_transform for any data augmentations (first sketch below this list). I’m not sure this would fix the problem where HF downloads and processes the dataset again whenever I select a different max_sequence_length.
  2. Use HF to store the base datasets and just use the download manager to download the files; then I can read the data with my PyTorch class (second sketch below).
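
To make option 1 concrete, here is roughly what I mean by set_transform; the repo id and the toy augmentation below are placeholders. Since the transform runs lazily on every access, the augmented examples should never be written to the cache:

    import random

    from datasets import load_dataset

    def random_augment(batch):
        # Toy stand-in for my real augmentations: randomly truncate each string.
        batch["text"] = [t[:random.randint(1, len(t))] if t else t for t in batch["text"]]
        return batch

    ds = load_dataset("your_username/your_dataset", split="train")
    ds.set_transform(random_augment)  # applied on every access, never cached
    print(ds[0])  # potentially a different truncation on each access

And for option 2, a rough sketch that fetches the raw files with huggingface_hub directly (rather than the builder’s download manager, which only exists inside a loading script); the repo id, filename, and import are placeholders:

    from huggingface_hub import hf_hub_download

    from my_package.data import CharSequenceDataset  # my existing PyTorch class

    path = hf_hub_download(
        repo_id="your_username/your_dataset",
        filename="train.txt",
        repo_type="dataset",
    )
    dataset = CharSequenceDataset(path, max_sequence_length=256)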

If it helps, I’d be happy to share fuller code snippets of the dataset and my builder class; just let me know.

Please let me know what you think! I’m not sure if these are common issues or if there is an easier way. Thank you very much for your help.

Any thoughts about this? Thanks!
