How to create subset when pushing to hub

Hey!

I have a dataset of images and text, and I am trying to upload it to the Hub using the script below. I was wondering how to create a subset, because everything is being put in the “gabrielsantosrv--pracegover” subset.

Thanks in advance :smiley:

from datasets import Dataset, load_dataset
from PIL import Image
import io
import json

def try_load_image(filepath):
    try:
        with open(filepath, 'rb') as f:
            image = Image.open(io.BytesIO(f.read()))
        if isinstance(image, Image.Image):
            return image
    except Image.UnidentifiedImageError:
        return None


if __name__ == "__main__":
    split = "demo"
    filepath = "sample/dataset_sample.json"
    dataset = load_dataset('json', data_files=filepath, field="data")
    dataset[split] = dataset.pop("train")   # renaming key from train to `split`

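    # map() returns a new dataset rather than modifying in place, so assign the result back
    dataset[split] = dataset[split].map(lambda example: {"img": try_load_image(f"sample/images/{example['filename']}")})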

    repo = "gabrielsantosrv/pracegover"
    dataset.push_to_hub(repo)

Hi! No-code dataset repositories created with push_to_hub (or by pushing raw data files) currently support only a single config/sub-dataset, so you’ll have to write a generation script to get more than one config.

Hi! Thanks for your reply :smiley:

Do you have any suggestions or examples of how I can write a generation script and get more than one config?

You can find some docs on how to write a dataset script here: Create a dataset loading script

There is also a section called “Multiple configurations” that can help you :wink: