Help making object detection dataset

erotemic · April 25, 2025, 9:15pm

I’ve been building an object detection dataset for the last 5 years, and I’d like to make it work with hugging-face in a plug-and-play manner.

I currently have it uploaded: erotemic/shitspotter · Datasets at Hugging Face

I have a splits.zip containing different versions of test, train, and validation COCO files. The assets folder contains zipped “cohorts” of data from different months.

From the docs (“Create an image dataset”) I can see how I would upload the data if I split everything into train/test/vali folders, but my splits can’t be cleanly separated in the current folder structure. Instead, the COCO files specify which images belong to which split. The “file_name” field in each COCO file is relative to the dataset root, so extracting the data as-is from the current upload should mean the paths resolve.
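A minimal sketch of the layout described above, assuming each split file is plain COCO JSON with an `images` list whose `file_name` entries are relative to the dataset root. The helper name `split_image_paths` and the toy file names are made up for illustration; real `.kwcoco.zip` files would need to be read through kwcoco instead.

```python
import json
import tempfile
from pathlib import Path


def split_image_paths(root, split_files):
    """Map each split name to the image paths listed in its COCO file."""
    root = Path(root)
    resolved = {}
    for split, coco_name in split_files.items():
        coco = json.loads((root / coco_name).read_text())
        # file_name is relative to the dataset root, so join it with root
        resolved[split] = [root / img["file_name"] for img in coco["images"]]
    return resolved


# Toy dataset: two splits that share one assets folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train.json").write_text(json.dumps(
        {"images": [{"id": 1, "file_name": "assets/2021-01/a.jpg"}]}))
    (root / "test.json").write_text(json.dumps(
        {"images": [{"id": 2, "file_name": "assets/2021-02/b.jpg"}]}))
    demo_paths = split_image_paths(
        root, {"train": "train.json", "test": "test.json"})
    # Show the paths relative to the root so the output is reproducible
    demo_rel = {split: [p.relative_to(root).as_posix() for p in paths]
                for split, paths in demo_paths.items()}
    print(demo_rel)
```

The point is only that split membership lives in the COCO files, not in the folder names, so the same assets tree can back every split.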

I was able to do something like this:

split_fpaths = {
    'test': 'test_imgs121_6cb3b6ff.kwcoco.zip',
    'train': 'train_imgs7797_dd191142.kwcoco.zip',
    'val': 'vali_imgs1258_8d5b0240.kwcoco.zip',
}


def main():
    import kwcoco
    for split_key, fpath in split_fpaths.items():
        dset = kwcoco.CocoDataset(fpath)
        # coco_to_hf expects the split name first, then the dataset
        coco_to_hf(split_key, dset)


def coco_to_hf(split_key, dset):
    from datasets import Dataset
    examples = []

    # Make sure we are using legacy coco segmentations
    dset.conform(legacy=True)

    for coco_img in dset.images().coco_images:
        anns = coco_img.annots().objs
        item = {
            "id": coco_img['id'],
            "file_name": coco_img["file_name"],
            "height": coco_img.get("height"),
            "width": coco_img.get("width"),
            "objects": [
                {
                    "bbox": ann["bbox"],
                    "category_id": ann["category_id"],
                    "segmentation": ann.get("segmentation", []),
                    "id": ann["id"]
                }
                for ann in anns
            ]
        }
        examples.append(item)

    hf_dset = Dataset.from_list(examples)
    hf_dset.save_to_disk(f"{split_key}-hf-native")
    hf_dset.to_parquet(f"{split_key}-hf.parquet")

But the saved parquet and native files are pretty small, indicating they don’t contain any image data.
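The small files are what one would expect here: each row built above stores only the `file_name` string, so no pixels are ever read. A minimal sketch of the difference, using only the standard library. The `{"bytes": ..., "path": ...}` layout mirrors how the `datasets` Image feature stores embedded images, but wiring these rows into `Dataset.from_list` and casting the column is an assumption that is not shown or tested here.

```python
import tempfile
from pathlib import Path


def row_with_path(file_name):
    # Only a string is stored; the image file itself is never opened
    return {"file_name": file_name}


def row_with_bytes(root, file_name):
    # The encoded image bytes travel with the row
    data = (Path(root) / file_name).read_bytes()
    return {"image": {"bytes": data, "path": file_name}}


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real JPEG: 10 kB of placeholder bytes
    (Path(tmp) / "a.jpg").write_bytes(b"\xff" * 10_000)
    light = row_with_path("a.jpg")
    heavy = row_with_bytes(tmp, "a.jpg")
    print(len(light["file_name"]), len(heavy["image"]["bytes"]))
```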

From the docs it also looks like I shouldn’t be using dataset loading scripts anyway as they are marked as “legacy”. It looks like my dataset should be formatted as a WebDataset in order to get automatic metadata and plug-and-play capability?

Am I going to need to re-upload all of the images again in a split tar format? I’m worried about space limitations if I get something wrong and have to do an upload more than once.

I’m wondering if it is possible to make my dataset plug-and-play without modifying any of the existing zipped uploads (so I suppose there would need to be some extraction step - is there a way to point at data within zip files?).
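On the narrower question of reading data inside a zip without extracting it: Python's standard `zipfile` module can do that directly. The sketch below builds a toy archive and reads one member back. The archive and member names are invented, and the example does not establish whether the Hugging Face loaders can resolve COCO `file_name` paths through zipped cohorts this way.

```python
import tempfile
import zipfile
from pathlib import Path


def read_member(zip_path, member):
    """Return the raw bytes of one file inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member)


with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "cohort.zip"
    # Build a tiny archive standing in for one zipped cohort
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("cohort/img_0001.jpg", b"fake-jpeg-bytes")
    member_bytes = read_member(zip_path, "cohort/img_0001.jpg")
    print(member_bytes)
```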

If not, what would be the best path for me to get this dataset properly formatted for use with Hugging Face loaders?

erotemic · April 26, 2025, 12:31am

It looks like I’ve got a variant of the dataset uploaded using webdataset, and it seems like it is able to extract metadata.

Here is the script I used to get things working:

# /// script
# dependencies = [
#   "Pillow",
#   "huggingface_hub",
#   "kwcoco",
#   "kwutil",
#   "scriptconfig",
#   "ubelt",
#   "webdataset",
# ]
# requires-python = ">=3.11"
# ///
"""
Convert a KWCoco dataset with train/vali/test splits to Hugging Face WebDataset format.

Example usage (locally):

    python kwcoco_to_hf_webdataset.py \
        --bundle-dir /path/to/dataset_bundle \
        --output-dir /path/to/output/webdataset_shards

Optionally push to HF:
    --push-to-hub --hf-repo erotemic/shitspotter

Example PyTorch DataLoader usage:

    >>> import webdataset as wds
    >>> import torch
    >>> from pathlib import Path
    >>> split = "train"
    >>> root = Path("webdataset_shards") / split
    >>> urls = str(root / f"{split}-{{000000..000008}}.tar")
    >>> dset = wds.WebDataset(urls).decode("pil").to_tuple("jpg", "json")
    >>> loader = torch.utils.data.DataLoader(dset.batched(2))
    >>> for imgs, metas in loader:
    >>>     print(imgs[0].size, metas[0])  # doctest: +SKIP
    >>>     break  # Only show the first batch


    >>> import webdataset as wds
    >>> import torch
    >>> from torchvision.transforms import ToTensor
    >>> from pathlib import Path
    >>> split = "train"
    >>> root = Path("webdataset_shards") / split
    >>> urls = str(root / f"{split}-{{000000..000008}}.tar")
    >>> # decode to PIL, then map PIL→Tensor
    >>> dset = (
    ...     wds.WebDataset(urls)
    ...       .decode("pil")
    ...       .to_tuple("jpg", "json")
    ...       .map_tuple(ToTensor(), lambda meta: meta)
    ... )
    >>> loader = torch.utils.data.DataLoader(dset.batched(2))
    >>> for imgs, metas in loader:
    ...     # imgs is a list of torch.Tensors, metas is a list of dicts
    ...     print(imgs[0].shape, metas[0])
    ...     break  # only show first batch

References:
    https://huggingface.co/datasets/erotemic/shitspotter
    https://discuss.huggingface.co/t/help-making-object-detection-dataset/152344
    https://discuss.huggingface.co/t/generating-croissant-metadata-for-custom-image-dataset/150255
"""

from PIL import Image
from huggingface_hub import HfApi, upload_file
from io import BytesIO
from pathlib import Path
from scriptconfig import DataConfig, Value
import json
import kwcoco
import kwutil
import os
import ubelt as ub
import webdataset as wds


class KwcocoToHFConfig(DataConfig):
    """
    Convert a KWCoco bundle (train/vali/test .kwcoco.zip files) to Hugging Face WebDataset format.
    """

    bundle_dir = Value(
        "/data/joncrall/dvc-repos/shitspotter_dvc",
        help="Directory with train/vali/test .kwcoco.zip files",
    )
    output_dir = Value(
        "/data/joncrall/dvc-repos/shitspotter_dvc/webdataset_shards",
        help="Output dir for WebDataset .tar files",
    )
    push_to_hub = Value(
        False, isflag=True, help="Push the converted shards to the Hugging Face hub"
    )
    hf_repo = Value(
        "erotemic/shitspotter", help="Optional HF repo (e.g. erotemic/shitspotter)"
    )


def convert_split(coco_fpath, out_tar, categories_out=None):
    dset = kwcoco.CocoDataset(coco_fpath)
    print(f"[INFO] Loaded {coco_fpath}: {len(dset.images())} images")

    if categories_out and not categories_out.exists():
        cats = dset.dataset.get("categories", [])
        categories_out.write_text(json.dumps(cats, indent=2))
        print(f"[INFO] Wrote categories.json with {len(cats)} categories")

    ub.Path(out_tar).parent.ensuredir()
    sink = wds.ShardWriter(str(out_tar), maxcount=1000)

    pman = kwutil.ProgressManager()
    with pman:
        for coco_img in pman.progiter(
            dset.images().coco_images, desc=f"Processing {coco_fpath}"
        ):
            image_id = coco_img.img["id"]
            img_path = coco_img.image_filepath()
            img_pil = Image.open(img_path).convert("RGB")

            # Save image to bytes
            img_bytes = BytesIO()
            img_pil.save(img_bytes, format="jpeg")
            img_bytes = img_bytes.getvalue()

            # Convert annots to basic JSON-serializable format
            anns = []
            for ann in coco_img.annots().objs:
                anns.append(
                    {
                        "bbox": ann["bbox"],
                        "category_id": ann["category_id"],
                        "segmentation": ann.get("segmentation", None),
                        "iscrowd": ann.get("iscrowd", 0),
                    }
                )

            # Save JSON metadata
            sample = {
                "__key__": str(image_id),
                "jpg": img_bytes,
                "json": json.dumps(
                    {
                        "id": image_id,
                        "file_name": os.path.basename(img_path),
                        "width": coco_img.img["width"],
                        "height": coco_img.img["height"],
                        "annotations": anns,
                    }
                ),
            }

            sink.write(sample)

    sink.close()
    print(f"Saved {out_tar}")


def upload_to_hub(hf_repo, bundle_dir, output_dir):
    api = HfApi()  # NOQA
    output_dir = Path(output_dir)

    # "**" must be a whole path component; shards live one level down in <split>/
    for file in output_dir.glob("*/*.tar"):
        print(f"[UPLOAD] Uploading {file.name} to {hf_repo}")
        upload_file(
            path_or_fileobj=str(file),
            path_in_repo=str(file.relative_to(bundle_dir)),
            repo_id=hf_repo,
            repo_type="dataset",
        )
    for categories_file in output_dir.glob("*categories.json"):
        ...
        upload_file(
            path_or_fileobj=str(categories_file),
            path_in_repo=str(categories_file.relative_to(bundle_dir)),
            repo_id=hf_repo,
            repo_type="dataset",
        )


def main():
    config = KwcocoToHFConfig.cli()
    print(f"[CONFIG]\n{ub.urepr(config, nl=1)}")

    bundle_dir = Path(config.bundle_dir)
    output_dir = Path(config.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    splits = ["train", "vali", "test"]
    categories_out = output_dir / "categories.json"

    for split in splits:
        coco_fpath = bundle_dir / f"{split}.kwcoco.zip"
        out_tar = output_dir / f"{split}.tar"
        if not coco_fpath.exists():
            raise Exception(f"Missing {split} split at {coco_fpath}")

    for split in splits:
        coco_fpath = bundle_dir / f"{split}.kwcoco.zip"
        out_tar = output_dir / f"{split}/{split}-%06d.tar"
        categories_out = output_dir / f"{split}_categories.json"
        convert_split(coco_fpath, out_tar, categories_out)

    if config.push_to_hub:
        hf_repo = config.hf_repo
        if not hf_repo:
            raise ValueError("Must specify --hf-repo when using --push-to-hub")
        upload_to_hub(hf_repo, bundle_dir, output_dir)


if __name__ == "__main__":
    main()

I’m not sure if I’ve specified all the metadata correctly. I see my annotation metadata in the “json” column, but I don’t see boxes or polygons drawn, which makes me think I don’t have annotations encoded correctly.
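One way to check what actually landed in a shard, independent of the viewer, is to read it back with the standard `tarfile` module. WebDataset shards are ordinary tar files in which members sharing a basename form one sample, so the sketch below groups members by that key. The toy shard and the `add_member` and `read_samples` helpers are illustrative only.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def add_member(tar, name, data):
    """Add an in-memory bytes payload to an open tar file."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def read_samples(tar_path):
    """Group tar members into samples keyed by name without extension."""
    samples = {}
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            key, _, ext = member.name.partition(".")
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "train-000000.tar"
    meta = {"id": 7, "annotations": [{"bbox": [1, 2, 3, 4], "category_id": 0}]}
    with tarfile.open(shard, "w") as tar:
        add_member(tar, "7.jpg", b"fake-jpeg-bytes")
        add_member(tar, "7.json", json.dumps(meta).encode())
    samples = read_samples(shard)
    decoded = json.loads(samples["7"]["json"])
    print(sorted(samples["7"]), decoded["annotations"])
```

If the annotations round-trip like this, the data is in the shard, and whether the viewer draws them is a separate question.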

I’m also not sure how I can go about updating this dataset if anything changes. If I change an annotation, the shard containing it is rewritten, which causes a huge LFS diff. That will make it difficult to tag multiple versions of the dataset as I continue to add to it or refine annotations. Any advice on how to handle that would be appreciated.
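One possible way to limit the diff, offered as a suggestion rather than anything confirmed in this thread, is to keep annotations in a small sidecar file keyed by image id and to write the image shards reproducibly, so that editing an annotation leaves the large tar files byte-identical. The sketch below uses only the standard library; the file names and helpers are hypothetical, and it does not assume the Hub loaders or the viewer would join the sidecar back to the images automatically.

```python
import hashlib
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_image_shard(path, images):
    """Write images into a tar with fixed metadata so the bytes are reproducible."""
    with tarfile.open(path, "w") as tar:
        for key in sorted(images):
            info = tarfile.TarInfo(f"{key}.jpg")
            info.size = len(images[key])
            info.mtime = 0  # constant timestamp keeps the archive byte-stable
            tar.addfile(info, io.BytesIO(images[key]))


def write_annotations(path, annots_by_image):
    """Write one JSON line per image, keyed by the same id used in the shard."""
    lines = [json.dumps({"key": k, "annotations": v}, sort_keys=True)
             for k, v in sorted(annots_by_image.items())]
    Path(path).write_text("\n".join(lines) + "\n")


def sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "train-000000.tar"
    sidecar = Path(tmp) / "train-annotations.jsonl"
    images = {"1": b"fake-jpeg-1", "2": b"fake-jpeg-2"}
    annots = {"1": [{"bbox": [0, 0, 5, 5], "category_id": 0}], "2": []}

    write_image_shard(shard, images)
    write_annotations(sidecar, annots)
    before = (sha256(shard), sha256(sidecar))

    # Edit one annotation and regenerate both files
    annots["2"].append({"bbox": [1, 1, 2, 2], "category_id": 0})
    write_image_shard(shard, images)
    write_annotations(sidecar, annots)
    after = (sha256(shard), sha256(sidecar))
    print("shard unchanged:", before[0] == after[0])
    print("sidecar unchanged:", before[1] == after[1])
```

The trade-off is that consumers have to join images and annotations by key themselves.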

John6666 · April 26, 2025, 9:16am

Hmm… This seems difficult for me. @lhoestq

erotemic · April 26, 2025, 4:56pm

I’ve created a small test dataset to attempt to get a general kwcoco-to-huggingface conversion tool working.

I tried to follow the Object detection guide to add standardized object detection information, but I’m not sure if Hugging Face is picking it up correctly. When I tried passing the column-wise “objects” dictionary to webdataset.ShardWriter as its own field, it complained that it wasn’t a recognized field, so I had to add it to the “json” portion instead. I’m not sure whether I’ve just added generic metadata or whether this really is the place to put annotation information.
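For concreteness, the column-wise “objects” layout mentioned above can be sketched as follows. The key names (`id`, `area`, `bbox`, `category`) follow the ones used in the conversion script; the `to_columnwise` helper and the fallback to the box area when an annotation has no `area` field are assumptions made for this example.

```python
import json


def to_columnwise(annotations):
    """Turn a list of COCO annotation dicts into a dict of parallel lists."""
    objects = {"id": [], "area": [], "bbox": [], "category": []}
    for ann in annotations:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        objects["id"].append(ann["id"])
        # Assumed fallback: use the box area if "area" is missing
        objects["area"].append(ann.get("area", w * h))
        objects["bbox"].append([x, y, w, h])
        objects["category"].append(ann["category_id"])
    return objects


anns = [
    {"id": 1, "bbox": [10, 20, 30, 40], "category_id": 0, "area": 900},
    {"id": 2, "bbox": [5, 5, 10, 10], "category_id": 1},
]
objects = to_columnwise(anns)
# Serialized into the sample's "json" entry, as the script ends up doing
payload = json.dumps({"image_id": 42, "objects": objects})
print(payload)
```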

If annotations are specified correctly, should I expect them to be drawn on the images in the web dataset viewer?

Is there even a standardized way to add a set of boxes / segmentations to a webdataset? Looking at other datasets on huggingface with detections / segmentations, I’m not really seeing much common structure. Is this use-case underdeveloped?

My updated conversion script is below:

#!/usr/bin/env python3
# /// script
# dependencies = [
#   "Pillow",
#   "huggingface_hub",
#   "kwcoco",
#   "kwutil",
#   "scriptconfig",
#   "ubelt",
#   "webdataset",
# ]
# requires-python = ">=3.11"
# ///
r"""
Convert a KWCoco dataset with train/vali/test splits to Hugging Face WebDataset format.

Example usage (locally):

    python kwcoco_to_hf_webdataset.py \
        --bundle_dir /data/joncrall/dvc-repos/shitspotter_dvc \
        --output_dir /data/joncrall/dvc-repos/shitspotter_dvc/webdataset_shards \
        --hf_repo erotemic/shitspotter

References:
    https://huggingface.co/datasets/erotemic/shitspotter
    https://discuss.huggingface.co/t/help-making-object-detection-dataset/152344
    https://discuss.huggingface.co/t/generating-croissant-metadata-for-custom-image-dataset/150255
    https://chatgpt.com/c/680be71a-4a0c-8002-a31e-bd9c17b5ac05

Example:
    >>> # Demo of full conversion
    >>> from kwcoco_to_hf_webdataset import *  # NOQA
    >>> import ubelt as ub
    >>> import kwcoco
    >>> dpath = ub.Path.appdir('kwcoco/demo/hf-convert').ensuredir()
    >>> full_dset = kwcoco.CocoDataset.demo('shapes32')
    >>> full_dset.reroot(absolute=True)
    >>> # Create splits
    >>> split_names = ['train', 'validation', 'test']
    >>> imgid_chunks = list(ub.chunks(full_dset.images(), nchunks=3))
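    >>> for split_name, gids in zip(split_names, imgid_chunks):
    >>>     # Build the per-split dataset from its chunk of image ids
    >>>     sub_dset = full_dset.subset(gids)
    >>>     sub_dset.fpath = dpath / (split_name + '.kwcoco.zip')
    >>>     sub_dset.dump()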
    >>> # Call conversion script
    >>> config = KwcocoToHFConfig(
    >>>     bundle_dir=dpath,
    >>>     output_dir=dpath / 'webds',
    >>>     hf_repo=None,
    >>>     #hf_repo='erotemic/shapes',
    >>> )
    >>> KwcocoToHFConfig.main(argv=False, **config)
    >>> # Test conversion can be read by a torch dataloader
    >>> check_webdataset_as_torch(dpath / 'webds/train/*.tar')
    >>> # xdoctest: +REQUIRES(--upload)
    >>> # Test upload
    >>> hf_repo = 'erotemic/shapes'
    >>> upload_to_hub(hf_repo, config.bundle_dir, config.output_dir)
"""

import json
import kwcoco
import kwutil
import os
import ubelt as ub
import webdataset
from PIL import Image
from huggingface_hub import HfApi, upload_file
from io import BytesIO
import scriptconfig as scfg


class KwcocoToHFConfig(scfg.DataConfig):
    """
    Convert a KWCoco bundle (train/vali/test .kwcoco.zip files) to Hugging Face WebDataset format.
    """

    bundle_dir = scfg.Value(
        None,
        help=ub.paragraph(
            """
            Directory with train/vali/test .kwcoco.zip files
            """
        ),
    )
    output_dir = scfg.Value(
        None,
        help=ub.paragraph(
            """
            Output dir for WebDataset .tar files
            """
        ),
    )
    hf_repo = scfg.Value(
        None,
        help=ub.paragraph(
            """
            If specified, push to this huggingface repo.
            (e.g. erotemic/shitspotter)
            """
        ),
    )

    @classmethod
    def main(cls, argv=None, **kwargs):
        import rich
        from rich.markup import escape
        config = cls.cli(argv=argv, data=kwargs, strict=True)
        rich.print('config = ' + escape(ub.urepr(config, nl=1)))

        bundle_dir = ub.Path(config.bundle_dir)
        output_dir = ub.Path(config.output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)

        splits = ["train", "validation", "test"]
        categories_out = output_dir / "categories.json"

        for split in splits:
            coco_fpath = bundle_dir / f"{split}.kwcoco.zip"
            out_tar = output_dir / f"{split}.tar"
            if not coco_fpath.exists():
                raise Exception(f"Missing {split} split at {coco_fpath}")

        for split in splits:
            coco_fpath = bundle_dir / f"{split}.kwcoco.zip"
            out_tar = output_dir / f"{split}/{split}-%06d.tar"
            categories_out = output_dir / f"{split}_categories.json"
            convert_coco_to_webdataset(coco_fpath, out_tar, categories_out)

        if config.hf_repo is not None:
            hf_repo = config.hf_repo
            upload_to_hub(hf_repo, bundle_dir, output_dir)


def check_webdataset_as_torch(urls):
    """
    Args:
        urls (str):
            glob pattern matching the tar files or directory containing them.
    """
    # Once converted, test that we can use a pytorch dataloader:
    import webdataset as wds
    import torch
    from torchvision.transforms import ToTensor
    import kwutil

    urls = list(map(os.fspath, kwutil.util_path.coerce_patterned_paths(urls, expected_extension='.tar')))
    print(f'urls = {ub.urepr(urls, nl=1)}')
    assert urls

    # decode to PIL, then map PIL→Tensor
    dset = (
        wds.WebDataset(urls)
        .decode("pil")
        .to_tuple("jpg", "json")
        .map_tuple(ToTensor(), lambda meta: meta)
    )
    loader = torch.utils.data.DataLoader(dset.batched(2))
    for imgs, metas in loader:
        # imgs is a list of torch.Tensors, metas is a list of dicts
        print(imgs[0].shape, metas[0])
        break


def convert_coco_to_webdataset(coco_dset, out_tar, categories_out=None):
    """
    Convert a coco dataset to a webdataset suitable for huggingface.

    Args:
        coco_dset (str | PathLike | CocoDataset):
            path to the coco dataset or the coco dataset itself.

        out_tar (str | PathLike): this is the patterned path
            to write sharded tar files to.

        categories_out (str | PathLike | None):
            if given, write out the category json file to this path

    Example:
        >>> from kwcoco_to_hf_webdataset import *  # NOQA
        >>> import ubelt as ub
        >>> import kwcoco
        >>> dpath = ub.Path.appdir('kwcoco/test/hf-convert').ensuredir()
        >>> coco_dset = kwcoco.CocoDataset.demo('shapes8')
        >>> out_tar = dpath / f"test_wds/test-wds-%06d.tar"
        >>> categories_out = dpath / f"test_wds_categories.json"
        >>> urls = written_files = convert_coco_to_webdataset(coco_dset, out_tar, categories_out)
        >>> check_webdataset_as_torch(urls)
    """
    dset = kwcoco.CocoDataset.coerce(coco_dset)
    print(f"[INFO] Loaded {coco_dset}")

    if categories_out and not categories_out.exists():
        cats = dset.dataset.get("categories", [])
        categories_out.write_text(json.dumps(cats, indent=2))
        print(f"[INFO] Wrote categories.json with {len(cats)} categories")

    ub.Path(out_tar).parent.ensuredir()
    sink = webdataset.ShardWriter(pattern=str(out_tar), maxcount=1000)

    dset.conform(legacy=True)

    written_files = ub.oset()

    pman = kwutil.ProgressManager()
    with pman:
        coco_images = dset.images().coco_images
        prog_iter = pman.progiter(coco_images, desc=f"Processing {dset.tag}")
        for coco_img in prog_iter:
            image_id = coco_img.img["id"]
            img_path = coco_img.image_filepath()
            img_pil = Image.open(img_path).convert("RGB")

            # Save image to bytes
            img_bytes = BytesIO()
            img_pil.save(img_bytes, format="jpeg")
            img_bytes = img_bytes.getvalue()

            # Convert annots to basic JSON-serializable format

            # Attempt to make dataset object detection ready.
            # https://huggingface.co/docs/datasets/v2.14.5/en/object_detection
            objects = {
                "area": [],
                "bbox": [],
                "category": [],
                "id": [],
            }
            for ann in coco_img.annots().objs:
                objects["area"].append(int(ann["area"]))
                objects["bbox"].append(ann["bbox"])
                objects["category"].append(ann["category_id"])
                objects["id"].append(ann["id"])

            anns = []
            for ann in coco_img.annots().objs:
                anns.append(
                    {
                        "bbox": ann["bbox"],
                        "category_id": ann["category_id"],
                        "segmentation": ann.get("segmentation", None),
                        "iscrowd": ann.get("iscrowd", 0),
                    }
                )

            # Save JSON metadata
            sample = {
                "__key__": str(image_id),
                "jpg": img_bytes,
                # "image_id": image_id,
                # "width": coco_img.img["width"],
                # "height": coco_img.img["height"],
                # "objects": objects,
                "json": json.dumps(
                    {
                        "id": image_id,
                        "image_id": image_id,
                        "file_name": os.path.basename(img_path),
                        "width": coco_img.img["width"],
                        "height": coco_img.img["height"],
                        "objects": objects,
                        "annotations": anns,
                    }
                ),
            }

            sink.write(sample)
            written_files.append(sink.fname)

    sink.close()
    written_files = list(written_files)
    print(f"Saved {written_files}")
    return written_files


def upload_to_hub(hf_repo, bundle_dir, output_dir):
    api = HfApi()  # NOQA
    output_dir = ub.Path(output_dir)

    # "**" must be a whole path component; shards live one level down in <split>/
    for file in output_dir.glob("*/*.tar"):
        print(f"[UPLOAD] Uploading {file.name} to {hf_repo}")
        upload_file(
            path_or_fileobj=str(file),
            path_in_repo=str(file.relative_to(bundle_dir)),
            repo_id=hf_repo,
            repo_type="dataset",
        )
    for categories_file in output_dir.glob("*categories.json"):
        upload_file(
            path_or_fileobj=str(categories_file),
            path_in_repo=str(categories_file.relative_to(bundle_dir)),
            repo_id=hf_repo,
            repo_type="dataset",
        )


if __name__ == "__main__":
    KwcocoToHFConfig.main()

John6666 · April 26, 2025, 11:58pm

Most information about DatasetViewer is available on the DatasetViewer GitHub page, but here are some additional restrictions. There are certain limitations on size and file formats.
