I tried generating a script to write Croissant files for existing Hugging Face datasets. It seems to work for now, but it probably needs improvement…
Goal: generate a valid croissant.json for an existing Hugging Face dataset repo that does not already expose Croissant.
Summary:
- First try auto-Croissant. If your repo is Parquet or ImageFolder-like, Hugging Face exposes /croissant automatically. If it’s not exposed, author croissant.json yourself, validate with mlcroissant, and commit it at the repo root. (Hugging Face)
Background you need
- Croissant is JSON-LD for ML datasets. It wraps four layers: metadata, resources, structure, ML semantics. Stable spec is 1.0; the reference repo’s latest release is v1.0.22 (2025-08-25). (docs.mlcommons.org)
- Hugging Face publishes Croissant for datasets that can be converted to Parquet or follow ImageFolder. The API endpoint is documented and the JSON-LD is also embedded in dataset pages. (Hugging Face)
Way 0: check auto-Croissant (fast path)
- Try either endpoint. Use whichever you prefer.
  - https://huggingface.co/api/datasets/<OWNER>/<REPO>/croissant (documented by MLCommons as an HF API example). (docs.mlcommons.org)
  - https://datasets-server.huggingface.co/croissant?dataset=<OWNER>/<REPO> (dataset viewer API doc). (Hugging Face)
- If it returns JSON-LD, you are done. If it 404s, your repo likely isn’t Parquet/ImageFolder-convertible. Convert or proceed to manual. (Hugging Face)
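The probe can be scripted. This is a minimal sketch that assumes only the two endpoint URLs quoted in this post and the Python standard library. The helper names croissant_urls and fetch_croissant are made up for illustration, and a private or gated repo would also need an auth header that is not shown here.

```python
import json
import urllib.request


def croissant_urls(repo_id):
    """Return the two auto-Croissant endpoints mentioned in this post."""
    return [
        f"https://huggingface.co/api/datasets/{repo_id}/croissant",
        f"https://datasets-server.huggingface.co/croissant?dataset={repo_id}",
    ]


def fetch_croissant(repo_id, timeout=10.0):
    """Return parsed JSON-LD from the first endpoint that answers, else None."""
    for url in croissant_urls(repo_id):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp)
        except (OSError, ValueError):
            # OSError covers HTTP errors (e.g. 404) and timeouts;
            # ValueError covers a body that is not valid JSON.
            continue
    return None


# Usage (needs network access):
#   doc = fetch_croissant("OWNER/REPO")
#   print("auto-Croissant available" if doc else "not exposed, go manual")
```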
Way 1: author and commit croissant.json (manual, reliable)
What must be in the file
Minimum, with names per spec:
@context, @type: "Dataset", name, url, conformsTo, distribution (list of cr:FileObject or cr:FileSet), and one or more recordSet with cr:Fields mapping columns to sources. See the spec’s minimal example using contentUrl, encodingFormat, and optional sha256. Use conformsTo: "http://mlcommons.org/croissant/1.0". (docs.mlcommons.org)
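Before running a real validator, a quick pre-check for these keys can catch obvious omissions. The sketch below only tests the minimum listed in this post. It is not a substitute for mlcroissant and does not implement the spec. The function name check_minimum is hypothetical, and accepting "sc:Dataset" alongside "Dataset" is an assumption based on files seen in the wild.

```python
# Top-level keys this post lists as the minimum; NOT the full spec.
REQUIRED_KEYS = ("@context", "@type", "name", "url", "conformsTo", "distribution")
CONFORMS_TO = "http://mlcommons.org/croissant/1.0"


def check_minimum(doc):
    """Return a list of human-readable problems; an empty list means the pre-check passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in doc]
    # Assumption: both spellings are acceptable depending on the @context used.
    if "@type" in doc and doc["@type"] not in ("Dataset", "sc:Dataset"):
        problems.append('"@type" should be "Dataset"')
    if "conformsTo" in doc and doc["conformsTo"] != CONFORMS_TO:
        problems.append(f'"conformsTo" should be "{CONFORMS_TO}"')
    for res in doc.get("distribution", []):
        if res.get("@type") not in ("cr:FileObject", "cr:FileSet"):
            problems.append(f"unexpected distribution type: {res.get('@type')}")
    if not doc.get("recordSet"):
        problems.append("no recordSet entries")
    return problems


# Usage:
#   import json
#   with open("croissant.json", encoding="utf-8") as f:
#       print(check_minimum(json.load(f)) or "looks complete")
```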
Where to put it
- Commit croissant.json at the repo root. Many public datasets do this. Examples: CharXiv, TerraIncognita, worldmodelbench. Inspect their structure for patterns. (Hugging Face)
How to generate it quickly (scriptable)
Use the Hub to list files, build distribution, and infer simple recordSets from CSV headers.
```python
# Generates a skeleton croissant.json for an HF dataset repo.
# References:
# - HfApi and upload: https://huggingface.co/docs/huggingface_hub/guides/upload
# - Build raw URLs with hf_hub_url: https://huggingface.co/docs/huggingface_hub/guides/download
# - Croissant minimal keys example: https://docs.mlcommons.org/croissant/
import json
import re

import pandas as pd
from huggingface_hub import HfApi, hf_hub_url

REPO = "OWNER/REPO"  # e.g., "BGLab/TerraIncognita"
api = HfApi()

# List every file in the dataset repo as repo-relative paths.
# (Globbing "datasets/{REPO}/**" with HfFileSystem returns paths that still
# carry the OWNER/REPO prefix, which breaks hf_hub_url and the folder grouping.)
files = api.list_repo_files(REPO, repo_type="dataset")

# Buckets
csvs = [f for f in files if f.lower().endswith((".csv", ".tsv"))]
images = [f for f in files if re.search(r"\.(jpg|jpeg|png|tif|tiff|bmp|gif)$", f, re.I)]

# Build distribution
dist = []
if images:
    # Group globs per top folder
    topdirs = sorted({f.split("/")[0] for f in images if "/" in f} or {"."})
    for d in topdirs:
        includes = f"{d}/**/*" if d != "." else "**/*"
        dist.append({
            "@type": "cr:FileSet",
            "@id": f"images-{d}".replace("/", "_"),
            "name": f"images-{d}",
            "encodingFormat": "image/*",
            "includes": includes,
        })

for c in csvs:
    dist.append({
        "@type": "cr:FileObject",
        "@id": c,
        "name": c,
        "contentUrl": hf_hub_url(repo_id=REPO, filename=c, repo_type="dataset"),  # raw resolve URL
        "encodingFormat": "text/csv" if c.lower().endswith(".csv") else "text/tab-separated-values",
    })

# Basic recordSet from first CSV/TSV (optional but useful)
record_sets = []
if csvs:
    url = hf_hub_url(repo_id=REPO, filename=csvs[0], repo_type="dataset")
    sep = "\t" if csvs[0].lower().endswith(".tsv") else ","
    cols = list(pd.read_csv(url, sep=sep, nrows=0).columns)  # header row only
    fields = [{
        "@type": "cr:Field",
        "@id": f"samples/{col}",
        "name": col,
        "dataType": "Text",  # every column is typed as text; refine by hand
        "source": {"fileObject": {"@id": csvs[0]}, "extract": {"column": col}},
    } for col in cols]
    record_sets.append({"@type": "cr:RecordSet", "@id": "samples", "name": "samples", "field": fields})

croissant = {
    # Minimal context kept from the original sketch. mlcroissant may warn about
    # or reject it; if so, copy the full @context from the Croissant spec.
    "@context": {"@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    "@type": "Dataset",
    "name": REPO.split("/")[-1],
    "url": f"https://huggingface.co/datasets/{REPO}",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "distribution": dist,
    "recordSet": record_sets,
}

with open("croissant.json", "w", encoding="utf-8") as f:
    json.dump(croissant, f, indent=2)
print("Wrote croissant.json")
```
Notes:
- Use hf_hub_url(..., filename=..., repo_type="dataset") to produce a raw .../resolve/... URL. Do not point contentUrl at the web UI. (Hugging Face)
- Prefer cr:FileSet with includes for folders and cr:FileObject with contentUrl for single files. The spec’s minimal example shows both patterns and sha256 support. (docs.mlcommons.org)
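The script above leaves out the optional sha256. If you have a local copy of each file, filling it in takes a few lines. This is a sketch using only hashlib. The helper names sha256_of and with_sha256 are hypothetical, and it assumes the local file is byte-identical to what the repo serves.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a local file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def with_sha256(file_object, local_path):
    """Return a copy of a cr:FileObject dict with its sha256 key filled in."""
    return {**file_object, "sha256": sha256_of(local_path)}


# Usage, assuming data.csv was downloaded next to the script:
#   entry = with_sha256({"@type": "cr:FileObject", "@id": "data.csv"}, "data.csv")
```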
Validate and iterate
- Install and validate:
  - pip install "mlcroissant[parquet]" then python -c "import mlcroissant, sys; print('ok')"
  - Load/validate in code or from CLI. The library docs show loading a Croissant URL and are kept current in HF docs. (Hugging Face)
- Practical loop:
  - Generate croissant.json.
  - Validate by constructing an mlcroissant.Dataset(jsonld=...) and iterating records. The README and docs show exact calls. (github.com)
  - Commit croissant.json to the dataset repo via upload_file or CLI. (Hugging Face)
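The validate-then-commit loop could look roughly like this. It is a sketch, not tested against every mlcroissant release. It assumes the record set is named "samples" as in the generator script, that mlcroissant.Dataset raises when the metadata has errors, and that you are logged in with a write token. The function names are made up, and the third-party imports are kept inside the functions so the file still loads when those packages are missing.

```python
from itertools import islice


def preview(records, limit=3):
    """Materialize at most `limit` items from any iterable of records."""
    return list(islice(records, limit))


def validate_croissant(path="croissant.json", record_set="samples", limit=3):
    """Load the file with mlcroissant and return the first few records."""
    import mlcroissant as mlc  # lazy import: pip install "mlcroissant[parquet]"

    dataset = mlc.Dataset(jsonld=path)  # assumed to raise on invalid metadata
    return preview(dataset.records(record_set=record_set), limit)


def commit_croissant(repo_id, path="croissant.json"):
    """Upload the validated file to the root of the dataset repo."""
    from huggingface_hub import HfApi  # lazy import; requires a write token

    return HfApi().upload_file(
        path_or_fileobj=path,
        path_in_repo="croissant.json",
        repo_id=repo_id,
        repo_type="dataset",
        commit_message="Add croissant.json",
    )


# Usage:
#   for row in validate_croissant():
#       print(row)
#   commit_croissant("OWNER/REPO")
```

Iterating a few records matters because loading the JSON-LD alone does not prove that the URLs resolve or that the column names match.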
Optional GUI instead of code
- Use the Croissant Editor Space. It infers resources and RecordSets from your files and lets you export JSON-LD. Good for images or nested folders. (Hugging Face)
Way 2: restructure to get auto-Croissant (no manual file)
- Convert to Parquet or organize as ImageFolder (+ simple metadata file). HF will auto-publish Parquet and then /croissant appears. Docs and maintainer posts confirm this behavior in 2025. (Hugging Face)
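For the ImageFolder route, the "simple metadata file" is a metadata.csv with a file_name column holding paths relative to that file. Here is a sketch that writes one from an existing folder tree. The label column and the rule "label = parent folder name" are assumptions for illustration only, so adapt them to your data.

```python
import csv
from pathlib import Path

# Same extensions as the regex in the generator script.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".gif"}


def write_imagefolder_metadata(root, label_of=lambda p: p.parent.name):
    """Write root/metadata.csv listing every image under root; return its path."""
    root = Path(root)
    images = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    out = root / "metadata.csv"
    with open(out, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "label"])  # "label" is a hypothetical column
        for p in images:
            writer.writerow([p.relative_to(root).as_posix(), label_of(p)])
    return out


# Usage:
#   write_imagefolder_metadata("my_dataset/train")
```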
Working example repos to copy from
- princeton-nlp/CharXiv uses a top-level croissant.json with recordSet and RAI keys. Shows complete structure. Updated 2024-06-11. (Hugging Face)
- BGLab/TerraIncognita shows CSV-based distribution and fields. Updated ~5 months ago. (Hugging Face)
- Efficient-Large-Model/worldmodelbench shows recent practice. Updated ~5 months ago. (Hugging Face)
Common pitfalls to avoid
- Missing required top-level keys or wrong vocabulary prefix. Follow the minimal example in the spec. (docs.mlcommons.org)
- contentUrl pointing to the web UI instead of the raw .../resolve/... URL. Use hf_hub_url. (Hugging Face)
- Expecting your manual croissant.json to change HF’s /croissant endpoint. That endpoint is auto-generated from Parquet/ImageFolder. Your file serves external tools and users. (Hugging Face)
- Big private repos: dataset viewer Parquet publishing has limits and requirements. See size and visibility rules. (Hugging Face)
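The contentUrl pitfall is easy to lint for. On the Hub, web UI file links contain /blob/ where raw links contain /resolve/. The sketch below flags and rewrites such URLs. It is a plain string heuristic with hypothetical helper names, so a repo path that itself contains "/blob/" would confuse it.

```python
def to_resolve_url(url):
    """Rewrite a Hub web-UI file URL (/blob/) to its raw form (/resolve/)."""
    return url.replace("/blob/", "/resolve/", 1)


def web_ui_content_urls(croissant):
    """Return the @id of every distribution entry whose contentUrl looks like a web-UI link."""
    return [
        res.get("@id", "<no @id>")
        for res in croissant.get("distribution", [])
        if "/blob/" in res.get("contentUrl", "")
    ]


# Usage:
#   import json
#   with open("croissant.json", encoding="utf-8") as f:
#       print(web_ui_content_urls(json.load(f)))
```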
End-to-end checklist (redundant on purpose)
- Probe /croissant. If present, reuse it. If absent, proceed. (Hugging Face)
- Generate croissant.json:
  - List files from the repo.
  - Build distribution with cr:FileSet globs and cr:FileObject URLs from hf_hub_url.
  - Create a recordSet per main table with cr:Fields mapped by column. (Hugging Face)
- Validate with mlcroissant and load a few records. Fix errors. (Hugging Face)
- Commit croissant.json at the repo root with upload_file or CLI. (Hugging Face)
- Optional: rebuild the repo as Parquet or ImageFolder to get HF auto-Croissant as well. (Hugging Face)
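The generator script only builds a recordSet for the first CSV. To get one per main table, the field-building step can be pulled into a helper along these lines. This is a sketch. The function name build_record_set and the choice to name each record set after the file stem are assumptions, and two tables with the same stem would collide.

```python
def build_record_set(file_id, columns, name=None):
    """Build a cr:RecordSet dict mapping each column of one table to a Text field."""
    # Default name: file stem, e.g. "data/train.csv" -> "train" (an assumption).
    name = name or file_id.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"{name}/{col}",
            "name": col,
            "dataType": "Text",  # placeholder type; refine per column by hand
            "source": {"fileObject": {"@id": file_id}, "extract": {"column": col}},
        }
        for col in columns
    ]
    return {"@type": "cr:RecordSet", "@id": name, "name": name, "field": fields}


# Usage, given headers you already read for each table:
#   headers = {"data/train.csv": ["id", "text"], "data/test.csv": ["id", "text"]}
#   record_sets = [build_record_set(f, cols) for f, cols in headers.items()]
```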
Supplemental materials (curated, dated)
Spec and core
- MLCommons Croissant docs and minimal example. Accessed 2025-10-15. (docs.mlcommons.org)
- Croissant repo README and examples. Latest release v1.0.22 on 2025-08-25. (github.com)
Hugging Face APIs
- Get Croissant metadata via dataset viewer. Accessed 2025-10-15. (Hugging Face)
- Dataset viewer overview and Parquet auto-publish rules. Updated 2024-2025. (Hugging Face)
- mlcroissant usage on HF docs. Accessed 2025-10-15. (Hugging Face)
- Upload files programmatically and with CLI. Updated 2024-07-22 and 2024-??. (Hugging Face)
- Build raw URLs with hf_hub_url. Accessed 2025-10-15. (Hugging Face)
HF forum guidance
- Auto-Croissant for Parquet/ImageFolder. Posts from 2025-04-14 and 2025-05-12. (Hugging Face Forums)
Editor
- Croissant Editor Space. Accessed 2025-10-15. (Hugging Face)