How to create subset when pushing to hub

gabrielsantosrv · June 23, 2022, 5:35pm

Hey!

I have a dataset of images and text, and I am trying to upload it to the Hub using the script below. I was wondering how to create a subset, because everything is being put in the “gabrielsantosrv--pracegover” subset.

Thanks in advance

from datasets import Dataset, load_dataset
from PIL import Image
import io
import json

def try_load_image(filepath):
    try:
        with open(filepath, 'rb') as f:
            image = Image.open(io.BytesIO(f.read()))
        if isinstance(image, Image.Image):
            return image
    except Image.UnidentifiedImageError:
        return None


if __name__ == "__main__":
    split = "demo"
    filepath = "sample/dataset_sample.json"
    dataset = load_dataset('json', data_files=filepath, field="data")
    dataset[split] = dataset.pop("train")   # renaming key from train to `split`

    # map() returns a new dataset, so the result has to be assigned back
    dataset[split] = dataset[split].map(lambda example: {"img": try_load_image(f"sample/images/{example['filename']}")})

    repo = "gabrielsantosrv/pracegover"
    dataset.push_to_hub(repo)

mariosasko · June 24, 2022, 1:17pm

Hi! No-code dataset repositories created with push_to_hub (or by pushing raw data files) currently support only a single config/sub-dataset, so you’ll have to write a generation script to get more than one config.

gabrielsantosrv · June 24, 2022, 5:26pm

Hi! Thanks for your reply

Do you have any suggestions or examples of how I can write a generation script and get more than one config?

lhoestq · June 27, 2022, 2:17pm

You can find some docs on how to write a dataset script here: Create a dataset loading script

There is also a section called “Multiple configurations” that can help you.
