It seems best to avoid rewriting Croissant wherever possible.
Use Auto-Croissant unless you must hand-author. Upload your dataset to the Hub, let the Dataset Viewer publish Parquet and expose /croissant, then load it in code with mlcroissant. Keep the repo-level contentUrl the Hub generates (.../tree/refs%2Fconvert%2Fparquet, encodingFormat: "git+https"). Select files via FileSet.includes globs. Don’t swap tree→resolve unless you are linking a single, concrete file in a manual Croissant. (Hugging Face)
Background
- Croissant = a JSON-LD vocabulary for ML datasets. It describes resources (FileObject or FileSet) and how to extract records (RecordSet + fields + optional regex transforms). Tools can load the data from this metadata. (docs.mlcommons.org)
- Hugging Face auto-generates Croissant for datasets the Viewer can convert to Parquet (or ImageFolder-like). You fetch it at /api/datasets/<owner>/<repo>/croissant. The metadata contains a repo-level entry pointing to the Parquet branch and FileSet.includes patterns for each subset. (Hugging Face)
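To make that structure concrete, here is a minimal sketch that lists the resources and RecordSets in a Croissant document. The `sample` dictionary is a hand-written, heavily trimmed stand-in for what /croissant returns (the ids and values are illustrative, not copied from a real dataset), and `summarize_croissant` is a hypothetical helper, not part of mlcroissant.

```python
def summarize_croissant(doc):
    """Return (resources, record_set_names) from a Croissant JSON-LD dict."""
    resources = []
    for item in doc.get("distribution", []):
        resources.append({
            "type": item.get("@type"),
            "id": item.get("@id"),
            # A FileObject carries contentUrl; a FileSet carries includes.
            "contentUrl": item.get("contentUrl"),
            "includes": item.get("includes"),
        })
    record_sets = [rs.get("name") or rs.get("@id") for rs in doc.get("recordSet", [])]
    return resources, record_sets


# Hypothetical, trimmed document in the shape described above.
sample = {
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "repo",
            "contentUrl": "https://huggingface.co/datasets/owner/repo/tree/refs%2Fconvert%2Fparquet",
            "encodingFormat": "git+https",
        },
        {
            "@type": "cr:FileSet",
            "@id": "parquet-files-for-config-default",
            "containedIn": {"@id": "repo"},
            "encodingFormat": "application/x-parquet",
            "includes": "default/*/*.parquet",
        },
    ],
    "recordSet": [{"@type": "cr:RecordSet", "@id": "default", "name": "default"}],
}

resources, record_sets = summarize_croissant(sample)
for r in resources:
    print(r["type"], r["id"], r["contentUrl"] or r["includes"])
print(record_sets)
```

Running the same helper on the JSON you fetch from /croissant should show one repo-level FileObject and one FileSet per subset, if the Hub output follows the shape described here.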
Default path: “upload → use”
- Push data to a dataset repo. The Viewer auto-converts public datasets ≤5 GB to Parquet and publishes them; private requires PRO/Enterprise. (Hugging Face)
- Confirm availability
  - List Parquet files: /parquet?dataset=<owner>/<repo>.
  - List splits/subsets: /splits?dataset=<owner>/<repo>.
  - Fetch Croissant JSON-LD: /api/datasets/<owner>/<repo>/croissant. (Hugging Face)
- Load in code
# docs:
# - mlcroissant loader: https://huggingface.co/docs/dataset-viewer/en/mlcroissant
# - Croissant endpoint: https://huggingface.co/docs/dataset-viewer/en/croissant
# install once:
# pip install "mlcroissant[parquet]" GitPython
from mlcroissant import Dataset

repo = "owner/repo"
ds = Dataset(jsonld=f"https://huggingface.co/api/datasets/{repo}/croissant")

# List the available RecordSets, then choose one by name.
print([rs["name"] for rs in ds.metadata.to_json()["recordSet"]])

# Print the first five records of the chosen RecordSet.
for i, rec in enumerate(ds.records(record_set="<recordset-name>")):
    if i == 5:
        break
    print(rec)
mlcroissant[parquet] and GitPython are required to read Parquet over git+https. (Hugging Face)
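For the "confirm availability" step, a small sketch that builds the three URLs for a given repo. The host `https://datasets-server.huggingface.co` for /parquet and /splits is my assumption of the dataset viewer API host (it is not stated above), and `endpoint_urls` is a hypothetical helper. It only builds strings; fetch them with whatever HTTP client you prefer.

```python
from urllib.parse import urlencode

VIEWER_API = "https://datasets-server.huggingface.co"  # assumed dataset viewer API host
HUB = "https://huggingface.co"


def endpoint_urls(repo):
    """Build the URLs used to confirm availability for an `owner/repo` dataset."""
    query = urlencode({"dataset": repo})  # percent-encodes "/" as %2F
    return {
        "parquet": f"{VIEWER_API}/parquet?{query}",
        "splits": f"{VIEWER_API}/splits?{query}",
        "croissant": f"{HUB}/api/datasets/{repo}/croissant",
    }


urls = endpoint_urls("owner/repo")
for name, url in urls.items():
    print(name, url)
```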
When you should edit Croissant
Only if a RecordSet’s file-match misses your files. Fix patterns, not contentUrl.
- Globs first. Prefer tolerant includes globs that match your Viewer layout: "<subset>/*/*.parquet" or "<subset>/**/*.parquet". This is how the Hub’s own example is structured. (Hugging Face)
- Regex only to extract fields. If you need a split field from the path, use a permissive regex transform, e.g.: ^<subset>/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$. The spec shows includes for matching and transform.regex for parsing. (docs.mlcommons.org)
- Per-file links. If you hand-author a Croissant that references one file, build raw URLs with hf_hub_url(..., repo_type="dataset"), which returns /resolve/<rev>/<path>. Use this with cr:FileObject. Do not use /resolve/ for the repo-container that Auto-Croissant emits. (Hugging Face)
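To check a glob before editing anything, you can test it offline against the paths you expect. The sketch below is an approximation of common glob semantics (`*` stays inside one path segment, `**/` spans zero or more directories); it is not mlcroissant's own matcher, so treat a mismatch as a hint rather than proof. The subset name `qa` and the paths are made up.

```python
import re


def glob_to_regex(pattern):
    """Translate a path glob into a compiled regex (approximate glob semantics)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # anything within a single path segment
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")  # one character, never a slash
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def matches(pattern, path):
    """True if the whole path matches the glob."""
    return glob_to_regex(pattern).match(path) is not None


# Hypothetical Parquet paths for a subset called "qa".
paths = [
    "qa/train/0000.parquet",
    "qa/partial-train/0000.parquet",
    "qa/0000.parquet",
    "other/train/0000.parquet",
]
for pattern in ("qa/*/*.parquet", "qa/**/*.parquet"):
    print(pattern, [p for p in paths if matches(pattern, p)])
```

Under these semantics the first pattern requires exactly one directory level below the subset, while the second also accepts files sitting directly in the subset folder.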
Why users hit the “Could not match … regex” warning
The generated RecordSet expected e.g. .../<subset>/(partial-)?train/..., but your Parquet layout lacked that folder or used different split names. The file-match fails, so the iterator is empty or warns. Fix the includes and any split-capturing regex, or use a working RecordSet. (Hugging Face)
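The failure mode is easy to reproduce with plain `re`. In this sketch, `STRICT` stands in for a generated pattern that hard-codes `train`, and `PERMISSIVE` is the relaxed form suggested earlier; the subset name `qa` and the paths are hypothetical.

```python
import re

# A pattern that only accepts a literal "train" folder.
STRICT = re.compile(r"^qa/(?P<split>train)/.+\.parquet$")
# A pattern that accepts any first subfolder and an optional "partial-" prefix.
PERMISSIVE = re.compile(r"^qa/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$")


def split_of(path, pattern):
    """Return the captured split name, or None when the path does not match."""
    m = pattern.match(path)
    return m.group("split") if m else None


paths = [
    "qa/train/0000.parquet",
    "qa/partial-train/0000.parquet",
    "qa/validation/0000.parquet",
    "qa/0000.parquet",  # no split folder at all
]
for p in paths:
    print(p, "strict:", split_of(p, STRICT), "permissive:", split_of(p, PERMISSIVE))
```

Note that the last path matches neither pattern: when the split folder is missing entirely, relaxing the regex is not enough and the includes glob or the repo layout has to change.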
Debug quickly
- See what exists: call /parquet and compare the returned paths to your includes. (Hugging Face)
- See how the Hub structures Croissant: open /croissant and note the repo-level FileObject with contentUrl: .../tree/refs%2Fconvert%2Fparquet and the per-subset FileSet.includes. (Hugging Face)
- Check splits/subsets before writing regex: /splits. (Hugging Face)
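A rough sketch of the first check: turn a /parquet response into branch-relative paths so they can be compared with includes by eye. I am assuming the response has a `parquet_files` list whose items carry a `url` pointing into `refs%2Fconvert%2Fparquet`; the `sample` payload is hand-written and trimmed, and both helpers are hypothetical.

```python
from urllib.parse import unquote

MARKER = "refs%2Fconvert%2Fparquet/"  # everything after this is the path in the Parquet branch


def relative_paths(parquet_response):
    """Extract sorted branch-relative file paths from a /parquet response dict."""
    paths = []
    for f in parquet_response.get("parquet_files", []):
        url = f.get("url", "")
        if MARKER in url:
            paths.append(unquote(url.split(MARKER, 1)[1]))
    return sorted(paths)


def directories(paths):
    """Return the distinct folders, i.e. the <subset>/<split> layout the globs must match."""
    return sorted({p.rsplit("/", 1)[0] for p in paths if "/" in p})


# Hypothetical, trimmed /parquet response.
base = "https://huggingface.co/datasets/owner/repo/resolve/refs%2Fconvert%2Fparquet"
sample = {
    "parquet_files": [
        {"config": "default", "split": "train", "url": f"{base}/default/train/0000.parquet"},
        {"config": "default", "split": "test", "url": f"{base}/default/test/0000.parquet"},
    ]
}

paths = relative_paths(sample)
print(paths)
print(directories(paths))
```

If the folders printed here do not look like what your includes glob expects, that mismatch is the first thing to fix.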
Private/gated repos
mlcroissant reads git+https. Set:
CROISSANT_GIT_USERNAME=<hf-username> and CROISSANT_GIT_PASSWORD=<hf-access-token>. (PyPI)
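If you would rather set these from Python than from the shell, a minimal sketch follows. `set_croissant_git_credentials` is a hypothetical helper and the username and token values are placeholders; I am assuming the variables need to be in place before the Dataset is created.

```python
import os


def set_croissant_git_credentials(username, token, env=os.environ):
    """Store the git+https credentials that mlcroissant reads from the environment."""
    env["CROISSANT_GIT_USERNAME"] = username
    env["CROISSANT_GIT_PASSWORD"] = token


# Demo against a plain dict so the real environment is left untouched.
demo_env = {}
set_croissant_git_credentials("<hf-username>", "<hf-access-token>", env=demo_env)
print(sorted(demo_env))
```

Call it without the `env` argument to write to the real process environment, and avoid hard-coding the token in source files.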
Decision tree
- Parquet or ImageFolder-like? Use Auto-Croissant. Load via /croissant. (Hugging Face Forums)
- Auto-Croissant missing or wrong subset? Adjust includes/regex in a manual Croissant, or restructure the repo so the Viewer emits the expected layout. Keep the repo-level contentUrl. (Hugging Face)
- Need a single file? Use hf_hub_url and /resolve/ in a FileObject. (Hugging Face)
Minimal, version-safe templates
A. Tolerant FileSet (auto-Croissant style)
{
  "@type": "cr:FileSet",
  "@id": "#fs-subset",
  "name": "single_true_multi_choice_qa",
  "containedIn": { "@id": "repo" },
  "encodingFormat": "application/x-parquet",
  "includes": "single_true_multi_choice_qa/*/*.parquet"
}
Here containedIn points to the repo-level FileObject. The Hub example uses exactly this pattern with contentUrl: ".../tree/refs%2Fconvert%2Fparquet" on the repo object. (Hugging Face)
B. Extract split from the file path with regex
{
  "@type": "cr:Field",
  "name": "split",
  "source": {
    "fileSet": { "@id": "#fs-subset" },
    "extract": { "fileProperty": "fullpath" },
    "transform": { "regex": "^single_true_multi_choice_qa/(?:partial-)?(?P<split>[^/]+)/.+\\.parquet$" }
  }
}
Regex transforms in fields are standard. The pattern contains directory components, so it is applied to fullpath rather than the bare filename, and the transform sits inside source. (docs.mlcommons.org)
C. Single file (manual FileObject)
{
  "@type": "cr:FileObject",
  "@id": "#tbl",
  "contentUrl": "https://huggingface.co/datasets/<owner>/<repo>/resolve/<rev>/data/table.parquet",
  "encodingFormat": "application/x-parquet"
}
Build this URL with hf_hub_url in code. (Hugging Face)
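As a dependency-free illustration of the URL shape, the sketch below assembles a dataset /resolve/ URL and wraps it in a FileObject dict. Both helpers are hypothetical. In real code, prefer `hf_hub_url(repo_id=..., filename=..., repo_type="dataset", revision=...)` from huggingface_hub, which should produce the same kind of URL; this version only mirrors it so the example runs without extra packages.

```python
from urllib.parse import quote


def dataset_resolve_url(repo_id, filename, revision="main"):
    """Assemble a raw-file URL for a dataset repo (mirrors the /resolve/ shape)."""
    return (
        "https://huggingface.co/datasets/"
        f"{repo_id}/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


def file_object(obj_id, repo_id, filename, revision="main"):
    """Build a single-file cr:FileObject entry for a hand-authored Croissant."""
    return {
        "@type": "cr:FileObject",
        "@id": obj_id,
        "contentUrl": dataset_resolve_url(repo_id, filename, revision),
        "encodingFormat": "application/x-parquet",
    }


obj = file_object("#tbl", "owner/repo", "data/table.parquet")
print(obj["contentUrl"])
```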
Common pitfalls and fixes
- Pitfall: Changing contentUrl to /resolve/ on Auto-Croissant’s repo container.
  Fix: Leave it as .../tree/refs%2Fconvert%2Fparquet with git+https. Use includes to select files. (Hugging Face)
- Pitfall: Hard-coding train in regex.
  Fix: Capture any first subdir or enumerate all splits; allow partial- shards. (docs.mlcommons.org)
- Pitfall: Missing extras or auth.
  Fix: Install mlcroissant[parquet] + GitPython; set CROISSANT_GIT_* for private repos. (Hugging Face)
- Pitfall: Assuming Auto-Croissant works for script-only datasets.
  Fix: Convert to Parquet or ImageFolder, or hand-author Croissant. Maintainer guidance confirms this. (Hugging Face Forums)
Short checklist
- Upload. Wait for Parquet. Confirm /parquet, /splits, /croissant. (Hugging Face)
- Load with mlcroissant and pick a working RecordSet. (Hugging Face)
- If a RecordSet fails, widen includes and relax the regex. Keep the repo-level contentUrl. (Hugging Face)
Curated references
- HF docs: Auto-Croissant + example JSON (shows repo-level FileObject with tree/refs%2Fconvert%2Fparquet, plus FileSet.includes). (Hugging Face)
- HF docs: Parquet auto-publish rules and supported backends; private repo requirements. (Hugging Face)
- HF docs: /parquet endpoint. HF docs: dataset viewer quickstart and endpoints. (Hugging Face)
- Loader: mlcroissant usage and the git+https requirement. (Hugging Face)
- Spec: Croissant FileSet.includes and transform.regex usage. (docs.mlcommons.org)
- Forums: Auto-Croissant appears for Parquet/ImageFolder; script-only repos don’t get it. (Hugging Face Forums)
- Hub utils: Build raw /resolve/... URLs with hf_hub_url. (Hugging Face)
Bottom line: upload, rely on Auto-Croissant, load via /croissant. If a subset warns or yields no rows, align includes and regex with the Viewer’s Parquet paths. Keep the repo-level contentUrl. (Hugging Face)