Generating Croissant Metadata for Custom Image Dataset

I tried generating a script to write Croissant files for existing Hugging Face datasets. It seems to work for now, but it probably needs improvement…


Goal: generate a valid croissant.json for an existing Hugging Face dataset repo that does not already expose Croissant.

Summary:

  • First try auto-Croissant. If your repo is Parquet or ImageFolder-like, Hugging Face exposes /croissant automatically. If it’s not exposed, author croissant.json yourself, validate with mlcroissant, and commit it at the repo root. (Hugging Face)

Background you need

  • Croissant is JSON-LD for ML datasets. It specifies four layers: metadata, resources, structure, and ML semantics. The stable spec is 1.0; the reference repo’s latest release is v1.0.22 (2025-08-25). (docs.mlcommons.org)
  • Hugging Face publishes Croissant for datasets that can be converted to Parquet or follow ImageFolder. The API endpoint is documented and the JSON-LD is also embedded in dataset pages. (Hugging Face)

Way 0: check auto-Croissant (fast path)

  • Try either endpoint. Use whichever you prefer.

    • https://huggingface.co/api/datasets/<OWNER>/<REPO>/croissant (documented by MLCommons as an HF API example). (docs.mlcommons.org)
    • https://datasets-server.huggingface.co/croissant?dataset=<OWNER>/<REPO> (dataset viewer API doc). (Hugging Face)
  • If it returns JSON-LD, you are done. If it 404s, your repo likely isn’t Parquet/ImageFolder-convertible. Convert or proceed to manual. (Hugging Face)
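The probe step above can be sketched with the standard library. This is a sketch; the endpoint shape comes from the docs cited above, and `croissant_api_url`/`probe_croissant` are hypothetical helper names:

```python
# Probe Hugging Face's auto-generated Croissant endpoint for a dataset repo.
import json
import urllib.error
import urllib.request


def croissant_api_url(repo: str) -> str:
    """Build the documented /croissant endpoint URL for an OWNER/REPO string."""
    return f"https://huggingface.co/api/datasets/{repo}/croissant"


def probe_croissant(repo: str):
    """Return the JSON-LD dict if auto-Croissant exists, else None on 404."""
    try:
        with urllib.request.urlopen(croissant_api_url(repo), timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # likely not Parquet/ImageFolder-convertible; go manual
        raise


# Example:
# jsonld = probe_croissant("OWNER/REPO")
# print("auto-Croissant available" if jsonld else "404: author croissant.json yourself")
```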

Way 1: author and commit croissant.json (manual, reliable)

What must be in the file

Minimum, with names per spec:

  • @context and @type: "Dataset".
  • name, url, and conformsTo: "http://mlcommons.org/croissant/1.0".
  • distribution: a list of cr:FileObject or cr:FileSet entries.
  • One or more recordSet entries with cr:Field items mapping columns to sources.
  • See the spec’s minimal example, which uses contentUrl, encodingFormat, and optional sha256. (docs.mlcommons.org)
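As a concrete sketch, a minimal file with those keys could look like the following. Names, column, and URL are placeholders, and the @context is abbreviated; when validating with mlcroissant, use the full @context from the spec’s minimal example:

```json
{
  "@context": { "@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/" },
  "@type": "Dataset",
  "name": "my-dataset",
  "url": "https://huggingface.co/datasets/OWNER/REPO",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "data.csv",
      "name": "data.csv",
      "contentUrl": "https://huggingface.co/datasets/OWNER/REPO/resolve/main/data.csv",
      "encodingFormat": "text/csv"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "samples",
      "name": "samples",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "samples/label",
          "name": "label",
          "dataType": "Text",
          "source": { "fileObject": { "@id": "data.csv" }, "extract": { "column": "label" } }
        }
      ]
    }
  ]
}
```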

Where to put it

  • Commit croissant.json at the repo root. Many public datasets do this. Examples: CharXiv, TerraIncognita, worldmodelbench. Inspect their structure for patterns. (Hugging Face)

How to generate it quickly (scriptable)

Use the Hub to list files, build distribution, and infer simple recordSets from CSV headers.

# Generates a skeleton croissant.json for an HF dataset repo.
# References:
# - HfApi/HfFileSystem and upload: https://huggingface.co/docs/huggingface_hub/guides/upload
# - Build raw URLs with hf_hub_url: https://huggingface.co/docs/huggingface_hub/guides/download
# - Croissant minimal keys example: https://docs.mlcommons.org/croissant/
from huggingface_hub import HfApi, HfFileSystem, hf_hub_url
import json, re
import pandas as pd

REPO = "OWNER/REPO"  # e.g., "BGLab/TerraIncognita"
api = HfApi()
fs = HfFileSystem()

# List every file in the dataset repo (paths relative to the repo root;
# fs.find returns files only, so no directory entries slip in)
prefix = f"datasets/{REPO}/"
files = [p[len(prefix):] for p in fs.find(f"datasets/{REPO}")]

# Buckets
csvs = [f for f in files if f.lower().endswith((".csv", ".tsv"))]
images = [f for f in files if re.search(r"\.(jpg|jpeg|png|tif|tiff|bmp|gif)$", f, re.I)]

# Build distribution
dist = []
if images:
    # One glob per top-level folder; "." collects any root-level images
    topdirs = sorted({f.split("/")[0] if "/" in f else "." for f in images})
    for d in topdirs:
        includes = f"{d}/**/*" if d != "." else "*"
        dist.append({
            "@type": "cr:FileSet",
            "@id": f"images-{d}".replace("/", "_"),
            "name": f"images-{d}",
            "encodingFormat": "image/*",
            "includes": includes
        })

for c in csvs:
    dist.append({
        "@type": "cr:FileObject",
        "@id": c,
        "name": c,
        "contentUrl": hf_hub_url(repo_id=REPO, filename=c, repo_type="dataset"),  # raw resolve URL
        "encodingFormat": "text/csv" if c.lower().endswith(".csv") else "text/tab-separated-values",
    })

# Basic recordSet from first CSV (optional but useful)
record_sets = []
if csvs:
    url = hf_hub_url(repo_id=REPO, filename=csvs[0], repo_type="dataset")
    sep = "\t" if csvs[0].lower().endswith(".tsv") else ","
    cols = list(pd.read_csv(url, nrows=0, sep=sep).columns)
    fields = [{
        "@type": "cr:Field",
        "@id": f"samples/{col}",
        "name": col,
        "dataType": "Text",
        "source": {"fileObject": {"@id": csvs[0]}, "extract": {"column": col}}
    } for col in cols]
    record_sets.append({"@type": "cr:RecordSet", "@id": "samples", "name": "samples", "field": fields})

croissant = {
  "@context": { "@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/" },
  "@type": "Dataset",
  "name": REPO.split("/")[-1],
  "url": f"https://huggingface.co/datasets/{REPO}",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "distribution": dist,
  "recordSet": record_sets
}

with open("croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)
print("Wrote croissant.json")

Notes:

  • Use hf_hub_url(..., filename=..., repo_type="dataset") to produce a raw .../resolve/... URL. Do not point contentUrl at the web UI. (Hugging Face)
  • Prefer cr:FileSet with includes for folders and cr:FileObject with contentUrl for single files. The spec’s minimal example shows both patterns and sha256 support. (docs.mlcommons.org)
  • The @context in the script is abbreviated. The spec’s full @context defines terms such as extract, column, fileObject, and includes, so paste it in before validating. (docs.mlcommons.org)

Validate and iterate

  • Install and validate:

    • pip install "mlcroissant[parquet]" then python -c "import mlcroissant; print('ok')" to confirm the install
    • Load/validate in code or from CLI. The library docs show loading a Croissant URL and are kept current in HF docs. (Hugging Face)
  • Practical loop:

    1. Generate croissant.json.
    2. Validate by constructing an mlcroissant.Dataset(jsonld=...) and iterating records. The README and docs show exact calls. (github.com)
    3. Commit croissant.json to the dataset repo via upload_file or CLI. (Hugging Face)
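Step 2 of the loop can be sketched like this. It assumes mlcroissant is installed; `validate_croissant` is a hypothetical helper name, not part of the library:

```python
# Hypothetical helper that validates a croissant.json with mlcroissant.
def validate_croissant(jsonld_path, record_set="samples", limit=3):
    """Parse/validate the JSON-LD, then pull a few records to smoke-test the mapping."""
    import mlcroissant as mlc  # pip install "mlcroissant[parquet]"

    # Dataset(...) parses and validates the file; it raises on spec violations.
    ds = mlc.Dataset(jsonld=jsonld_path)
    for i, record in enumerate(ds.records(record_set=record_set)):
        print(record)
        if i + 1 >= limit:
            break
    return ds


# Example:
# validate_croissant("croissant.json", record_set="samples")
```

If this raises, fix the reported key or mapping and regenerate before committing.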

Optional GUI instead of code

  • Use the Croissant Editor Space. It infers resources and RecordSets from your files and lets you export JSON-LD. Good for images or nested folders. (Hugging Face)

Way 2: restructure to get auto-Croissant (no manual file)

  • Convert to Parquet or organize as ImageFolder (+ simple metadata file). HF will auto-publish Parquet and then /croissant appears. Docs and maintainer posts confirm this behavior in 2025. (Hugging Face)
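One common ImageFolder layout (a sketch; names are placeholders) looks like this, with labels inferred from the class folder names:

```
my-dataset/
├── train/
│   ├── cat/
│   │   └── img_0001.jpg
│   └── dog/
│       └── img_0002.jpg
└── test/
    └── cat/
        └── img_0103.jpg
```

Alternatively, put a metadata.csv with a file_name column next to the images to attach captions or labels instead of class folders.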

Working example repos to copy from

  • princeton-nlp/CharXiv uses a top-level croissant.json with recordSet and RAI keys. Shows complete structure. Updated 2024-06-11. (Hugging Face)
  • BGLab/TerraIncognita shows CSV-based distribution and fields. Updated ~5 months ago. (Hugging Face)
  • Efficient-Large-Model/worldmodelbench shows recent practice. Updated ~5 months ago. (Hugging Face)

Common pitfalls to avoid

  • Missing required top-level keys or wrong vocabulary prefix. Follow the minimal example in the spec. (docs.mlcommons.org)
  • contentUrl pointing to the web UI instead of the raw .../resolve/.... Use hf_hub_url. (Hugging Face)
  • Expecting your manual croissant.json to change HF’s /croissant endpoint. That endpoint is auto-generated from Parquet/ImageFolder. Your file serves external tools and users. (Hugging Face)
  • Big private repos: dataset viewer Parquet publishing has limits and requirements. See size and visibility rules. (Hugging Face)

End-to-end checklist (redundant on purpose)

  1. Probe /croissant. If present, reuse it. If absent, proceed. (Hugging Face)

  2. Generate croissant.json:

    • List files from the repo.
    • Build distribution with cr:FileSet globs and cr:FileObject URLs from hf_hub_url.
    • Create a recordSet per main table with cr:Fields mapped by column. (Hugging Face)
  3. Validate with mlcroissant and load a few records. Fix errors. (Hugging Face)

  4. Commit croissant.json at the repo root with upload_file or CLI. (Hugging Face)

  5. Optional: rebuild the repo as Parquet or ImageFolder to get HF auto-Croissant as well. (Hugging Face)
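Step 4 of the checklist can be sketched with huggingface_hub. It assumes you have a write token configured (e.g., via huggingface-cli login or HF_TOKEN); `commit_croissant` is a hypothetical wrapper name:

```python
# Hypothetical wrapper around HfApi.upload_file to commit croissant.json.
def commit_croissant(repo_id, local_path="croissant.json"):
    """Upload croissant.json to the root of a dataset repo."""
    from huggingface_hub import HfApi  # pip install huggingface_hub

    api = HfApi()  # picks up the token from huggingface-cli login / HF_TOKEN
    return api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo="croissant.json",  # repo root, as recommended above
        repo_id=repo_id,
        repo_type="dataset",
        commit_message="Add Croissant metadata",
    )


# Example:
# commit_croissant("OWNER/REPO")
```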

Supplemental materials (curated, dated)

Spec and core

  • MLCommons Croissant docs and minimal example. Accessed 2025-10-15. (docs.mlcommons.org)
  • Croissant repo README and examples. Latest release v1.0.22 on 2025-08-25. (github.com)

Hugging Face APIs

  • Get Croissant metadata via dataset viewer. Accessed 2025-10-15. (Hugging Face)
  • Dataset viewer overview and Parquet auto-publish rules. Updated 2024-2025. (Hugging Face)
  • mlcroissant usage on HF docs. Accessed 2025-10-15. (Hugging Face)
  • Upload files programmatically and with CLI. Updated 2024-07-22 and 2024-??. (Hugging Face)
  • Build raw URLs with hf_hub_url. Accessed 2025-10-15. (Hugging Face)

HF forum guidance

  • Auto-Croissant for Parquet/ImageFolder. Posts from 2025-04-14 and 2025-05-12. (Hugging Face Forums)

Editor

  • Croissant Editor Space. Accessed 2025-10-15. (Hugging Face)