It seems best to rewrite Croissant as little as possible.
Use Auto-Croissant unless you must hand-author. Upload your dataset to the Hub, let the Dataset Viewer publish Parquet and expose /croissant, then load it in code with mlcroissant. Keep the repo-level contentUrl the Hub generates (.../tree/refs%2Fconvert%2Fparquet, encodingFormat: "git+https"). Select files via FileSet.includes globs. Don’t swap tree→resolve unless you are linking a single, concrete file in a manual Croissant. (Hugging Face)
Background
- Croissant = a JSON-LD vocabulary for ML datasets. It describes resources (FileObject or FileSet) and how to extract records (RecordSet + fields + optional regex transforms). Tools can load the data from this metadata. (docs.mlcommons.org)
- Hugging Face auto-generates Croissant for datasets the Viewer can convert to Parquet (or ImageFolder-like). You fetch it at /api/datasets/<owner>/<repo>/croissant. The metadata contains a repo-level entry pointing to the Parquet branch and FileSet.includes patterns for each subset. (Hugging Face)
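To make that structure concrete, here is a minimal sketch that summarizes a Croissant document already parsed into a Python dict. The SAMPLE document is a hand-written, abbreviated illustration (not real Hub output), and summarize_croissant is a hypothetical helper name.

```python
def summarize_croissant(doc: dict) -> dict:
    """Group distribution entries by type and list the RecordSet names."""
    summary = {"file_objects": [], "file_sets": [], "record_sets": []}
    for node in doc.get("distribution", []):
        kind = node.get("@type", "")
        if kind.endswith("FileObject"):
            # Resources that point at one concrete location (file or repo).
            summary["file_objects"].append(node.get("contentUrl"))
        elif kind.endswith("FileSet"):
            # Resources selected by glob patterns inside a container.
            summary["file_sets"].append(node.get("includes"))
    for record_set in doc.get("recordSet", []):
        summary["record_sets"].append(record_set.get("name"))
    return summary


# Abbreviated, hand-written example; the contentUrl is left elided on purpose.
SAMPLE = {
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "repo",
            "contentUrl": ".../tree/refs%2Fconvert%2Fparquet",
            "encodingFormat": "git+https",
        },
        {
            "@type": "cr:FileSet",
            "@id": "#fs-subset",
            "containedIn": {"@id": "repo"},
            "encodingFormat": "application/x-parquet",
            "includes": "default/*/*.parquet",
        },
    ],
    "recordSet": [{"@type": "cr:RecordSet", "name": "default", "field": []}],
}

summary = summarize_croissant(SAMPLE)
print(summary)
```

Running the same helper on the JSON returned by the /croissant endpoint should show one repo-level FileObject and one FileSet per subset, if the layout matches what is described above.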
Default path: “upload → use”
- Push data to a dataset repo. The Viewer auto-converts public datasets ≤5 GB to Parquet and publishes them; private datasets require PRO/Enterprise. (Hugging Face)
- Confirm availability
  - List Parquet files: /parquet?dataset=<owner>/<repo>.
  - List splits/subsets: /splits?dataset=<owner>/<repo>.
  - Fetch Croissant JSON-LD: /api/datasets/<owner>/<repo>/croissant. (Hugging Face)
- Load in code
# docs:
# - mlcroissant loader: https://huggingface.co/docs/dataset-viewer/en/mlcroissant
# - Croissant endpoint: https://huggingface.co/docs/dataset-viewer/en/croissant
# install once:
# pip install "mlcroissant[parquet]" GitPython
from mlcroissant import Dataset
repo = "owner/repo"
ds = Dataset(jsonld=f"https://huggingface.co/api/datasets/{repo}/croissant")
print([rs["name"] for rs in ds.metadata.to_json()["recordSet"]]) # choose a RecordSet
for i, rec in enumerate(ds.records(record_set="<recordset-name>")):
    if i == 5:  # stop after the first five records
        break
    print(rec)
mlcroissant[parquet] and GitPython are required to read Parquet over git+https. (Hugging Face)
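The three endpoints from the "Confirm availability" step can be assembled in code. This is a minimal sketch: viewer_urls and fetch_json are hypothetical helper names, and it assumes the /parquet and /splits endpoints are served from the dataset viewer API host datasets-server.huggingface.co. Nothing is fetched unless you call fetch_json yourself.

```python
import json
from urllib.request import urlopen

VIEWER_API = "https://datasets-server.huggingface.co"  # assumed viewer API host
HUB_API = "https://huggingface.co/api/datasets"


def viewer_urls(repo: str) -> dict:
    """Build the three URLs to check for a dataset repo given as 'owner/repo'."""
    return {
        "parquet": f"{VIEWER_API}/parquet?dataset={repo}",
        "splits": f"{VIEWER_API}/splits?dataset={repo}",
        "croissant": f"{HUB_API}/{repo}/croissant",
    }


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Fetch and decode a JSON response (network call, not run here)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


urls = viewer_urls("owner/repo")
for name, url in urls.items():
    print(f"{name}: {url}")
```

If fetch_json(urls["croissant"]) raises an HTTP error, the Viewer has likely not produced Parquet for the repo yet, which is worth ruling out before touching any Croissant by hand.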
When you should edit Croissant
Only if a RecordSet’s file-match misses your files. Fix patterns, not contentUrl.
- Globs first. Prefer tolerant includes globs that match your Viewer layout: "<subset>/*/*.parquet" or "<subset>/**/*.parquet". This is how the Hub’s own example is structured. (Hugging Face)
- Regex only to extract fields. If you need a split field from the path, use a permissive regex transform, e.g.: ^<subset>/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$. The spec shows includes for matching and transform.regex for parsing. (docs.mlcommons.org)
- Per-file links. If you hand-author a Croissant that references one file, build raw URLs with hf_hub_url(..., repo_type="dataset") which returns /resolve/<rev>/<path>. Use this with cr:FileObject. Do not use /resolve/ for the repo-container that Auto-Croissant emits. (Hugging Face)
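Before editing an includes pattern, it can help to test it against the paths you expect. The sketch below uses Python's fnmatch as a stand-in matcher; the matcher mlcroissant applies internally may differ in edge cases, and match_includes plus the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase


def match_includes(paths, includes):
    """Return the paths selected by one glob or a list of globs."""
    patterns = [includes] if isinstance(includes, str) else list(includes)
    return [p for p in paths if any(fnmatchcase(p, pat) for pat in patterns)]


# Made-up paths, relative to the Parquet branch root.
paths = [
    "single_true_multi_choice_qa/train/0000.parquet",
    "single_true_multi_choice_qa/partial-train/0000.parquet",
    "other_subset/test/0000.parquet",
]

matched = match_includes(paths, "single_true_multi_choice_qa/*/*.parquet")
print(matched)
```

An empty result for a pattern is the signal to widen the glob rather than to change contentUrl.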
Why users hit the “Could not match … regex” warning
The generated RecordSet expected e.g. .../<subset>/(partial-)?train/..., but your Parquet layout lacked that folder or used different split names. The file-match fails, so the iterator is empty or warns. Fix the includes and any split-capturing regex, or use a working RecordSet. (Hugging Face)
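The mismatch is easy to reproduce with plain re. This is a sketch only: it applies the permissive pattern from above to made-up paths and does not show how mlcroissant runs the transform internally. extract_split is a hypothetical helper.

```python
import re

# Permissive pattern: any split folder, optional "partial-" prefix.
SPLIT_PATTERN = re.compile(
    r"^single_true_multi_choice_qa/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$"
)


def extract_split(path: str):
    """Return the split captured from the path, or None when nothing matches."""
    match = SPLIT_PATTERN.match(path)
    return match.group("split") if match else None


for path in [
    "single_true_multi_choice_qa/partial-train/0000.parquet",
    "single_true_multi_choice_qa/validation/0001.parquet",
    "single_true_multi_choice_qa/0000.parquet",  # no split folder
]:
    print(path, "->", extract_split(path))
```

The last path has no split folder, so the pattern cannot match it; that is the kind of layout difference that triggers the warning.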
Debug quickly
- See what exists: call /parquet and compare the returned paths to your includes. (Hugging Face)
- See how the Hub structures Croissant: open /croissant and note the repo-level FileObject with contentUrl: .../tree/refs%2Fconvert%2Fparquet and the per-subset FileSet.includes. (Hugging Face)
- Check splits/subsets before writing regex: /splits. (Hugging Face)
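The first check can be scripted once you have the /parquet response as a dict. The sketch below assumes the response carries a "parquet_files" list whose entries have a "url" containing /resolve/refs%2Fconvert%2Fparquet/; treat that payload shape, the sample payload, and the helper names as assumptions to verify against a real response.

```python
PARQUET_REF = "/resolve/refs%2Fconvert%2Fparquet/"


def relative_parquet_paths(payload: dict) -> list:
    """Extract paths relative to the Parquet branch from a /parquet payload."""
    paths = []
    for entry in payload.get("parquet_files", []):
        _, found, relative = entry.get("url", "").partition(PARQUET_REF)
        if found:
            paths.append(relative)
    return sorted(paths)


def layout(paths) -> dict:
    """Map each subset folder to the split folders found under it."""
    tree = {}
    for path in paths:
        parts = path.split("/")
        if len(parts) >= 3:  # <subset>/<split-folder>/<file>
            tree.setdefault(parts[0], set()).add(parts[1])
    return {subset: sorted(splits) for subset, splits in tree.items()}


# Made-up payload in the assumed shape.
sample_payload = {
    "parquet_files": [
        {"url": "https://huggingface.co/datasets/owner/repo/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet"},
        {"url": "https://huggingface.co/datasets/owner/repo/resolve/refs%2Fconvert%2Fparquet/default/partial-test/0000.parquet"},
    ]
}

rel_paths = relative_parquet_paths(sample_payload)
print(layout(rel_paths))
```

The folder names printed here are the ones your includes globs and split regex need to accept.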
Private/gated repos
mlcroissant reads the repo over git+https, so private or gated repos need credentials. Set:
CROISSANT_GIT_USERNAME=<hf-username> and CROISSANT_GIT_PASSWORD=<hf-access-token>. (PyPI)
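If you prefer to set these from Python instead of the shell, a minimal sketch follows. configure_croissant_git_auth is a hypothetical helper, and it assumes the variables are read when the dataset is loaded, so call it before constructing the mlcroissant Dataset.

```python
import os


def configure_croissant_git_auth(username: str, token: str) -> None:
    """Export the credentials mlcroissant is documented to read for git+https."""
    os.environ["CROISSANT_GIT_USERNAME"] = username
    os.environ["CROISSANT_GIT_PASSWORD"] = token


# Example call (placeholders, not real credentials):
# configure_croissant_git_auth("<hf-username>", "<hf-access-token>")
```

Avoid hard-coding the token in source files; reading it from a secret store or an existing environment variable keeps it out of version control.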
Decision tree
- Parquet or ImageFolder-like? Use Auto-Croissant. Load via /croissant. (Hugging Face Forums)
- Auto-Croissant missing or wrong subset? Adjust includes/regex in a manual Croissant, or restructure the repo so the Viewer emits the expected layout. Keep the repo-level contentUrl. (Hugging Face)
- Need a single file? Use hf_hub_url and /resolve/ in a FileObject. (Hugging Face)
Minimal, version-safe templates
A. Tolerant FileSet (auto-Croissant style)
{
  "@type": "cr:FileSet",
  "@id": "#fs-subset",
  "name": "single_true_multi_choice_qa",
  "containedIn": { "@id": "repo" },
  "encodingFormat": "application/x-parquet",
  "includes": "single_true_multi_choice_qa/*/*.parquet"
}
Here containedIn points at the repo-level FileObject. The Hub example uses exactly this pattern, with contentUrl: ".../tree/refs%2Fconvert%2Fparquet" on that repo object. (Hugging Face)
B. Extract split from the file path with regex
{
  "@type": "cr:Field",
  "name": "split",
  "source": {
    "fileSet": { "@id": "#fs-subset" },
    "extract": { "fileProperty": "fullpath" },
    "transform": { "regex": "^single_true_multi_choice_qa/(?:partial-)?(?P<split>[^/]+)/.+\\.parquet$" }
  }
}
Regex transforms in fields are standard. (docs.mlcommons.org)
C. Single file (manual FileObject)
{
  "@type": "cr:FileObject",
  "@id": "#tbl",
  "contentUrl": "https://huggingface.co/datasets/<owner>/<repo>/resolve/<rev>/data/table.parquet",
  "encodingFormat": "application/x-parquet"
}
Build this URL with hf_hub_url in code. (Hugging Face)
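As a rough sketch of what that looks like, the block below builds the /resolve/ URL with the standard library and wraps it in a FileObject dict. dataset_resolve_url is a hypothetical stand-in so the example runs without huggingface_hub installed; it assumes dataset files live under /datasets/<owner>/<repo>/resolve/<rev>/<path>.

```python
from urllib.parse import quote


def dataset_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Stand-in URL builder for a single file in a dataset repo."""
    rev = quote(revision, safe="")  # e.g. refs/convert/parquet -> refs%2Fconvert%2Fparquet
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{rev}/{quote(filename)}"


def parquet_file_object(node_id: str, url: str) -> dict:
    """Wrap a single Parquet file URL in a cr:FileObject node."""
    return {
        "@type": "cr:FileObject",
        "@id": node_id,
        "contentUrl": url,
        "encodingFormat": "application/x-parquet",
    }


url = dataset_resolve_url("owner/repo", "data/table.parquet")
node = parquet_file_object("#tbl", url)
print(node)
```

In real code, prefer hf_hub_url(repo_id=..., filename=..., repo_type="dataset", revision=...) from huggingface_hub, which is expected to produce the same shape and also honours a custom endpoint.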
Common pitfalls and fixes
- Pitfall: Changing contentUrl to /resolve/ on Auto-Croissant’s repo container.
  Fix: Leave it as .../tree/refs%2Fconvert%2Fparquet with git+https. Use includes to select files. (Hugging Face)
- Pitfall: Hard-coding train in regex.
  Fix: Capture any first subdir or enumerate all splits; allow partial- shards. (docs.mlcommons.org)
- Pitfall: Missing extras or auth.
  Fix: Install mlcroissant[parquet] + GitPython; set CROISSANT_GIT_* for private repos. (Hugging Face)
- Pitfall: Assuming Auto-Croissant works for script-only datasets.
  Fix: Convert to Parquet or ImageFolder, or hand-author Croissant. Maintainer guidance confirms this. (Hugging Face Forums)
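The "missing extras or auth" pitfall can be caught with a small preflight check before loading. This is a sketch under assumptions: missing_modules and missing_auth_vars are hypothetical helpers, and "git" is the import name of GitPython.

```python
import os
from importlib.util import find_spec


def missing_modules(names=("mlcroissant", "git")) -> list:
    """Return the top-level modules from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


def missing_auth_vars(env=None) -> list:
    """Return the CROISSANT_GIT_* variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("CROISSANT_GIT_USERNAME", "CROISSANT_GIT_PASSWORD")
    return [name for name in required if not env.get(name)]


print("missing modules:", missing_modules())
print("missing auth vars:", missing_auth_vars())
```

The auth check only matters for private or gated repos; for public datasets an empty environment is fine.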
Short checklist
- Upload. Wait for Parquet. Confirm /parquet, /splits, /croissant. (Hugging Face)
- Load with mlcroissant and pick a working RecordSet. (Hugging Face)
- If a RecordSet fails, widen includes and relax the regex. Keep the repo-level contentUrl. (Hugging Face)
Curated references
- HF docs: Auto-Croissant + example JSON (shows repo-level FileObject with tree/refs%2Fconvert%2Fparquet, plus FileSet.includes). (Hugging Face)
- HF docs: Parquet auto-publish rules and supported backends; private repo requirements. (Hugging Face)
- HF docs: /parquet endpoint. HF docs: dataset viewer quickstart and endpoints. (Hugging Face)
- Loader: mlcroissant usage and the git+https requirement. (Hugging Face)
- Spec: Croissant FileSet.includes and transform.regex usage. (docs.mlcommons.org)
- Forums: Auto-Croissant appears for Parquet/ImageFolder; script-only repos don’t get it. (Hugging Face Forums)
- Hub utils: Build raw /resolve/... URLs with hf_hub_url. (Hugging Face)
Bottom line: upload, rely on Auto-Croissant, load via /croissant. If a subset warns or yields no rows, align includes and regex with the Viewer’s Parquet paths. Keep the repo-level contentUrl. (Hugging Face)