It seems best to rewrite the Croissant metadata as little as possible.
Use Auto-Croissant unless you must hand-author. Upload your dataset to the Hub, let the Dataset Viewer publish Parquet and expose /croissant, then load it in code with mlcroissant. Keep the repo-level contentUrl the Hub generates (.../tree/refs%2Fconvert%2Fparquet, encodingFormat: "git+https"). Select files via FileSet.includes globs. Don’t swap tree→resolve unless you are linking a single, concrete file in a manual Croissant. (Hugging Face)
Background
- Croissant = a JSON-LD vocabulary for ML datasets. It describes resources (FileObject or FileSet) and how to extract records (RecordSet + fields + optional regex transforms). Tools can load the data from this metadata. (docs.mlcommons.org)
- Hugging Face auto-generates Croissant for datasets the Viewer can convert to Parquet (or ImageFolder-like). You fetch it at /api/datasets/<owner>/<repo>/croissant. The metadata contains a repo-level entry pointing to the Parquet branch and FileSet.includes patterns for each subset. (Hugging Face)
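To make that structure concrete, the sketch below walks a Croissant document's distribution list and separates the repo-level container from the per-subset FileSet patterns. The sample dictionary is hand-written and heavily trimmed: the @id values are placeholders, the contentUrl is left elided as in the text, and summarize_distribution is a hypothetical helper, not part of any library.

```python
# Trimmed, hand-written sample; real Hub output carries many more keys.
SAMPLE = {
    "@type": "sc:Dataset",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "repo",
            # Left elided on purpose, exactly as the text shows it.
            "contentUrl": ".../tree/refs%2Fconvert%2Fparquet",
            "encodingFormat": "git+https",
        },
        {
            "@type": "cr:FileSet",
            "@id": "parquet-files-for-config-default",  # placeholder id
            "containedIn": {"@id": "repo"},
            "encodingFormat": "application/x-parquet",
            "includes": "default/*/*.parquet",
        },
    ],
}


def summarize_distribution(croissant: dict) -> dict:
    """Split the distribution into repo-level containers and FileSet globs."""
    containers = {}
    file_sets = {}
    for entry in croissant.get("distribution", []):
        if entry.get("@type") == "cr:FileObject":
            containers[entry["@id"]] = entry.get("contentUrl")
        elif entry.get("@type") == "cr:FileSet":
            file_sets[entry["@id"]] = entry.get("includes")
    return {"containers": containers, "file_sets": file_sets}


summary = summarize_distribution(SAMPLE)
print(summary["containers"])
print(summary["file_sets"])
```

The same function can be pointed at the JSON fetched from the /croissant endpoint to see which includes patterns the Hub generated for each subset.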
Default path: “upload → use”
- Push data to a dataset repo. The Viewer auto-converts public datasets ≤5 GB to Parquet and publishes them; private repos require PRO/Enterprise. (Hugging Face)
- Confirm availability:
  - List Parquet files: /parquet?dataset=<owner>/<repo>.
  - List splits/subsets: /splits?dataset=<owner>/<repo>.
  - Fetch Croissant JSON-LD: /api/datasets/<owner>/<repo>/croissant. (Hugging Face)
- Load in code:
# docs:
# - mlcroissant loader: https://huggingface.co/docs/dataset-viewer/en/mlcroissant
# - Croissant endpoint: https://huggingface.co/docs/dataset-viewer/en/croissant
# install once:
#   pip install "mlcroissant[parquet]" GitPython
from mlcroissant import Dataset

repo = "owner/repo"
ds = Dataset(jsonld=f"https://huggingface.co/api/datasets/{repo}/croissant")

# List the RecordSet names, then choose one of them below.
print([rs["name"] for rs in ds.metadata.to_json()["recordSet"]])

# Print the first five records of the chosen RecordSet.
for i, rec in enumerate(ds.records(record_set="<recordset-name>")):
    if i == 5:
        break
    print(rec)
mlcroissant[parquet] and GitPython are required to read Parquet over git+https. (Hugging Face)
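The "confirm availability" step can be scripted. The sketch below only builds the three URLs for a repo; it assumes the Dataset Viewer API is served from https://datasets-server.huggingface.co (the text gives only the relative /parquet and /splits paths), and viewer_urls and fetch_json are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL of the public Dataset Viewer API.
VIEWER_API = "https://datasets-server.huggingface.co"
HUB = "https://huggingface.co"


def viewer_urls(repo: str) -> dict:
    """Build the three URLs to check for an "owner/repo" dataset."""
    query = urllib.parse.urlencode({"dataset": repo})
    return {
        "parquet": f"{VIEWER_API}/parquet?{query}",
        "splits": f"{VIEWER_API}/splits?{query}",
        "croissant": f"{HUB}/api/datasets/{repo}/croissant",
    }


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and decode the JSON body (works for public datasets)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


urls = viewer_urls("owner/repo")  # placeholder repo id
for name, url in urls.items():
    print(name, url)
# To query the endpoints for a real repo, call fetch_json(url) on each entry.
```

Keeping URL construction separate from the network call makes it easy to inspect the URLs first and to swap in an authenticated client for private repos.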
When you should edit Croissant
Only if a RecordSet’s file-match misses your files. Fix patterns, not contentUrl.
- Globs first. Prefer tolerant includes globs that match your Viewer layout: "<subset>/*/*.parquet" or "<subset>/**/*.parquet". This is how the Hub’s own example is structured. (Hugging Face)
- Regex only to extract fields. If you need a split field from the path, use a permissive regex transform, e.g.: ^<subset>/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$. The spec shows includes for matching and transform.regex for parsing. (docs.mlcommons.org)
- Per-file links. If you hand-author a Croissant that references one file, build raw URLs with hf_hub_url(..., repo_type="dataset"), which returns /resolve/<rev>/<path>. Use this with cr:FileObject. Do not use /resolve/ for the repo-container that Auto-Croissant emits. (Hugging Face)
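The division of labor between globs and regex can be tried offline. In the sketch below, PurePosixPath.match stands in for the loader's glob matching (an approximation, not mlcroissant's actual implementation), "default" is a placeholder subset name, and the sample paths are invented for illustration.

```python
import re
from pathlib import PurePosixPath

SUBSET = "default"  # placeholder subset name
INCLUDES = f"{SUBSET}/*/*.parquet"
# Same shape as the permissive regex above: optional "partial-" prefix,
# then any first-level directory captured as the split.
SPLIT_RE = re.compile(
    rf"^{re.escape(SUBSET)}/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$"
)


def matches_includes(path: str, pattern: str = INCLUDES) -> bool:
    """Approximate the includes glob; '*' does not cross '/' here."""
    return PurePosixPath(path).match(pattern)


def extract_split(path: str):
    """Return the split captured from the path, or None if it does not match."""
    m = SPLIT_RE.match(path)
    return m.group("split") if m else None


paths = [
    "default/train/0000.parquet",
    "default/partial-train/0000.parquet",
    "default/validation/0001.parquet",
    "default/0000.parquet",  # no split folder: neither glob nor regex matches
]
for p in paths:
    print(p, matches_includes(p), extract_split(p))
```

The last path illustrates the failure mode: a layout without a split folder is rejected by both the glob and the regex, which is what the pattern fix has to account for.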
Why users hit the “Could not match … regex” warning
The generated RecordSet expected e.g. .../<subset>/(partial-)?train/..., but your Parquet layout lacked that folder or used different split names. The file-match fails, so the iterator is empty or warns. Fix the includes and any split-capturing regex, or use a working RecordSet. (Hugging Face)
Debug quickly
- See what exists: call /parquet and compare the returned paths to your includes. (Hugging Face)
- See how the Hub structures Croissant: open /croissant and note the repo-level FileObject with contentUrl: .../tree/refs%2Fconvert%2Fparquet and the per-subset FileSet.includes. (Hugging Face)
- Check splits/subsets before writing regex: /splits. (Hugging Face)
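The first check can be automated roughly as follows. This is a sketch under two assumptions that are not stated in the text: that a /parquet response carries a "parquet_files" list whose entries have a "url" field, and that each URL contains /resolve/refs%2Fconvert%2Fparquet/ followed by the path inside the Parquet branch. The sample response is hand-written, and branch_paths and unmatched are hypothetical helpers.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote

# Assumed marker separating the repo URL from the path inside the Parquet branch.
MARKER = "/resolve/refs%2Fconvert%2Fparquet/"

# Hand-written sample shaped like a /parquet response; values are placeholders.
SAMPLE_RESPONSE = {
    "parquet_files": [
        {"url": "https://huggingface.co/datasets/owner/repo"
                "/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet"},
        {"url": "https://huggingface.co/datasets/owner/repo"
                "/resolve/refs%2Fconvert%2Fparquet/extra/train/0000.parquet"},
    ]
}


def branch_paths(response: dict) -> list:
    """Extract the in-branch path of every listed Parquet file."""
    paths = []
    for entry in response.get("parquet_files", []):
        _, sep, tail = entry["url"].partition(MARKER)
        if sep:  # keep only URLs that contain the marker
            paths.append(unquote(tail))
    return paths


def unmatched(paths: list, includes: str) -> list:
    """Return the paths that an includes glob would miss (approximate matching)."""
    return [p for p in paths if not PurePosixPath(p).match(includes)]


paths = branch_paths(SAMPLE_RESPONSE)
print(paths)
print(unmatched(paths, "default/*/*.parquet"))
```

Any path reported by unmatched is a file the RecordSet will never see, so it points directly at the includes pattern that needs widening.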
Private/gated repos
mlcroissant reads the repo over git+https, so it needs Git credentials. Set CROISSANT_GIT_USERNAME=<hf-username> and CROISSANT_GIT_PASSWORD=<hf-access-token>. (PyPI)
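If you prefer to set the credentials from Python rather than the shell, a minimal sketch looks like this. configure_croissant_git_auth is a hypothetical helper; only the two environment variable names come from the text, and the values shown are placeholders.

```python
import os


def configure_croissant_git_auth(username: str, token: str) -> None:
    """Export the credentials mlcroissant reads for git+https access."""
    os.environ["CROISSANT_GIT_USERNAME"] = username
    os.environ["CROISSANT_GIT_PASSWORD"] = token


# Placeholders: substitute your Hub username and an access token.
configure_croissant_git_auth("<hf-username>", "<hf-access-token>")
print(sorted(k for k in os.environ if k.startswith("CROISSANT_GIT_")))
```

Call the helper before constructing the mlcroissant Dataset so the variables are in place when the repository is fetched, and avoid hard-coding a real token in source files.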
Decision tree
- Parquet or ImageFolder-like? Use Auto-Croissant. Load via /croissant. (Hugging Face Forums)
- Auto-Croissant missing or wrong subset? Adjust includes/regex in a manual Croissant, or restructure the repo so the Viewer emits the expected layout. Keep the repo-level contentUrl. (Hugging Face)
- Need a single file? Use hf_hub_url and /resolve/ in a FileObject. (Hugging Face)
Minimal, version-safe templates
A. Tolerant FileSet (auto-Croissant style)
{
  "@type": "cr:FileSet",
  "@id": "#fs-subset",
  "name": "single_true_multi_choice_qa",
  "containedIn": { "@id": "repo" },
  "encodingFormat": "application/x-parquet",
  "includes": "single_true_multi_choice_qa/*/*.parquet"
}
Here "repo" is the @id of the repo-level FileObject. The Hub example uses exactly this pattern with contentUrl: ".../tree/refs%2Fconvert%2Fparquet" on the repo object. (Hugging Face)
B. Extract split from the file path with regex
{
  "@type": "cr:Field",
  "name": "split",
  "source": {
    "fileSet": { "@id": "#fs-subset" },
    "extract": { "fileProperty": "fullpath" },
    "transform": { "regex": "^single_true_multi_choice_qa/(?:partial-)?(?P<split>[^/]+)/.+\\.parquet$" }
  }
}
Regex transforms in fields are standard. (docs.mlcommons.org)
C. Single file (manual FileObject)
{
  "@type": "cr:FileObject",
  "@id": "#tbl",
  "contentUrl": "https://huggingface.co/datasets/<owner>/<repo>/resolve/<rev>/data/table.parquet",
  "encodingFormat": "application/x-parquet"
}
Build this URL with hf_hub_url in code. (Hugging Face)
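For readers who want to see the URL shape without installing anything, here is a standard-library sketch. dataset_resolve_url is a hypothetical stand-in that mirrors the /resolve/<rev>/<path> form hf_hub_url produces for dataset repos (note the datasets/ path segment); in real code, prefer hf_hub_url itself. The repo id and file path are placeholders.

```python
import json
from urllib.parse import quote


def dataset_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Hypothetical stand-in for hf_hub_url(..., repo_type="dataset")."""
    return (
        "https://huggingface.co/datasets/"
        f"{repo_id}/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


def file_object(repo_id: str, filename: str, revision: str = "main") -> dict:
    """Build a manual cr:FileObject entry pointing at one concrete file."""
    return {
        "@type": "cr:FileObject",
        "@id": "#tbl",
        "contentUrl": dataset_resolve_url(repo_id, filename, revision),
        "encodingFormat": "application/x-parquet",
    }


# Placeholder repo id and path, matching the template above.
obj = file_object("owner/repo", "data/table.parquet")
print(json.dumps(obj, indent=2))
```

Percent-encoding the revision matters for refs that contain a slash, such as the Parquet conversion branch, which is why the revision is quoted with no safe characters.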
Common pitfalls and fixes
- Pitfall: Changing contentUrl to /resolve/ on Auto-Croissant’s repo container.
  Fix: Leave it as .../tree/refs%2Fconvert%2Fparquet with git+https. Use includes to select files. (Hugging Face)
- Pitfall: Hard-coding train in regex.
  Fix: Capture any first subdir or enumerate all splits; allow partial- shards. (docs.mlcommons.org)
- Pitfall: Missing extras or auth.
  Fix: Install mlcroissant[parquet] + GitPython; set CROISSANT_GIT_* for private repos. (Hugging Face)
- Pitfall: Assuming Auto-Croissant works for script-only datasets.
  Fix: Convert to Parquet or ImageFolder, or hand-author Croissant. Maintainer guidance confirms this. (Hugging Face Forums)
Short checklist
- Upload. Wait for Parquet. Confirm /parquet, /splits, /croissant. (Hugging Face)
- Load with mlcroissant and pick a working RecordSet. (Hugging Face)
- If a RecordSet fails, widen includes and relax the regex. Keep the repo-level contentUrl. (Hugging Face)
Curated references
- HF docs: Auto-Croissant + example JSON (shows repo-level FileObject with tree/refs%2Fconvert%2Fparquet, plus FileSet.includes). (Hugging Face)
- HF docs: Parquet auto-publish rules and supported backends; private repo requirements. (Hugging Face)
- HF docs: /parquet endpoint. HF docs: dataset viewer quickstart and endpoints. (Hugging Face)
- Loader: mlcroissant usage and the git+https requirement. (Hugging Face)
- Spec: Croissant FileSet.includes and transform.regex usage. (docs.mlcommons.org)
- Forums: Auto-Croissant appears for Parquet/ImageFolder; script-only repos don’t get it. (Hugging Face Forums)
- Hub utils: Build raw /resolve/... URLs with hf_hub_url. (Hugging Face)
Bottom line: upload, rely on Auto-Croissant, load via /croissant. If a subset warns or yields no rows, align includes and regex with the Viewer’s Parquet paths. Keep the repo-level contentUrl. (Hugging Face)