Possible issue of contentUrl in croissant file of the dataset

John6666 · October 18, 2025, 12:01am

It seems best to avoid rewriting the Croissant file wherever possible.

Use Auto-Croissant unless you must hand-author. Upload your dataset to the Hub, let the Dataset Viewer publish Parquet and expose /croissant, then load it in code with mlcroissant. Keep the repo-level contentUrl the Hub generates (.../tree/refs%2Fconvert%2Fparquet, encodingFormat: "git+https"). Select files via FileSet.includes globs. Don’t swap tree→resolve unless you are linking a single, concrete file in a manual Croissant. (Hugging Face)

Background

Croissant = a JSON-LD vocabulary for ML datasets. It describes resources (FileObject or FileSet) and how to extract records (RecordSet + fields + optional regex transforms). Tools can load the data from this metadata. (docs.mlcommons.org)
Hugging Face auto-generates Croissant for datasets the Viewer can convert to Parquet (or ImageFolder-like). You fetch it at /api/datasets/<owner>/<repo>/croissant. The metadata contains a repo-level entry pointing to the Parquet branch and FileSet.includes patterns for each subset. (Hugging Face)

Default path: “upload → use”

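Push data to a dataset repo. For public datasets the Viewer converts the data to Parquet automatically and publishes it; larger datasets that are not already Parquet get only a partial conversion (roughly the first 5 GB), placed under split directories prefixed with partial-. Private datasets require a PRO or Enterprise plan. (Hugging Face)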
Confirm availability
- List Parquet files: https://datasets-server.huggingface.co/parquet?dataset=<owner>/<repo>.
- List splits/subsets: https://datasets-server.huggingface.co/splits?dataset=<owner>/<repo>.
- Fetch Croissant JSON-LD: /api/datasets/<owner>/<repo>/croissant. (Hugging Face)
Load in code

# docs:
# - mlcroissant loader: https://huggingface.co/docs/dataset-viewer/en/mlcroissant
# - Croissant endpoint: https://huggingface.co/docs/dataset-viewer/en/croissant
# install once:
#   pip install "mlcroissant[parquet]" GitPython
from mlcroissant import Dataset

repo = "owner/repo"
ds = Dataset(jsonld=f"https://huggingface.co/api/datasets/{repo}/croissant")
print([rs["name"] for rs in ds.metadata.to_json()["recordSet"]])  # choose a RecordSet
# print the first 5 records, then stop
for i, rec in enumerate(ds.records(record_set="<recordset-name>")):
    if i == 5:
        break
    print(rec)

mlcroissant[parquet] and GitPython are required to read Parquet over git+https. (Hugging Face)
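
The availability checks in the "Confirm availability" step can be scripted before loading. The following is a minimal sketch, not an official client: it assumes the viewer API is served from https://datasets-server.huggingface.co, that the /parquet response carries a parquet_files list, and that each file URL contains /resolve/refs%2Fconvert%2Fparquet/ followed by the path inside the Parquet branch. The helper names (endpoint_urls, fetch_json, parquet_paths) are illustrative only.

```python
import json
from urllib.request import urlopen

VIEWER_API = "https://datasets-server.huggingface.co"  # assumed public viewer host
HUB_API = "https://huggingface.co/api/datasets"
# Assumed marker separating the repo URL from the path inside the Parquet branch.
PARQUET_MARKER = "/resolve/refs%2Fconvert%2Fparquet/"


def endpoint_urls(repo: str) -> dict:
    """Build the three URLs used to confirm that a dataset is ready."""
    return {
        "parquet": f"{VIEWER_API}/parquet?dataset={repo}",
        "splits": f"{VIEWER_API}/splits?dataset={repo}",
        "croissant": f"{HUB_API}/{repo}/croissant",
    }


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """GET a URL and decode the JSON body (public datasets only, no auth header)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def parquet_paths(parquet_response: dict) -> list:
    """Extract '<subset>/<split>/<file>' paths from a /parquet response."""
    paths = []
    for entry in parquet_response.get("parquet_files", []):
        _, marker, tail = entry.get("url", "").partition(PARQUET_MARKER)
        if marker:  # keep only URLs that follow the assumed layout
            paths.append(tail)
    return paths
```

Typical use would be parquet_paths(fetch_json(endpoint_urls("owner/repo")["parquet"])), which gives the paths that a FileSet's includes pattern has to match.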

When you should edit Croissant

Only if a RecordSet’s file-match misses your files. Fix patterns, not contentUrl.

Globs first. Prefer tolerant includes globs that match your Viewer layout:
"<subset>/*/*.parquet" or "<subset>/**/*.parquet". This is how the Hub’s own example is structured. (Hugging Face)
Regex only to extract fields. If you need a split field from the path, use a permissive regex transform, e.g.:
^<subset>/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$. The spec shows includes for matching and transform.regex for parsing. (docs.mlcommons.org)
Per-file links. If you hand-author a Croissant that references one file, build raw URLs with hf_hub_url(..., repo_type="dataset") which returns /resolve/<rev>/<path>. Use this with cr:FileObject. Do not use /resolve/ for the repo-container that Auto-Croissant emits. (Hugging Face)
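
Before putting a split-capturing regex into a Croissant file, it can help to try it against a few real paths. This is a small sketch using Python's re module with the permissive pattern from above; the function names and the sample paths are made up for illustration, and mlcroissant's own regex handling may differ in details.

```python
import re
from typing import Optional


def split_regex(subset: str) -> "re.Pattern[str]":
    """Permissive pattern: any first subdirectory is the split, 'partial-' is optional."""
    return re.compile(
        rf"^{re.escape(subset)}/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$"
    )


def extract_split(subset: str, path: str) -> Optional[str]:
    """Return the split captured from a Parquet path, or None if the path does not match."""
    match = split_regex(subset).match(path)
    return match.group("split") if match else None


if __name__ == "__main__":
    subset = "single_true_multi_choice_qa"
    for path in (
        f"{subset}/train/0000.parquet",
        f"{subset}/partial-train/0000.parquet",
        f"{subset}/0000.parquet",  # no split folder, so no match
    ):
        print(path, "->", extract_split(subset, path))
```

A path that returns None here is a path the field regex would fail on, which is the situation behind the "Could not match" warning.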

Why users hit the “Could not match … regex” warning

The generated RecordSet expected a path such as .../<subset>/(partial-)?train/..., but your Parquet layout lacked that folder or used different split names. The file match fails, so the record iterator yields nothing or logs a warning. Fix the includes and any split-capturing regex, or use a working RecordSet. (Hugging Face)

Debug quickly

See what exists: call /parquet and compare the returned paths to your includes. (Hugging Face)
See how the Hub structures Croissant: open /croissant and note the repo-level FileObject with contentUrl: .../tree/refs%2Fconvert%2Fparquet and the per-subset FileSet.includes. (Hugging Face)
Check splits/subsets before writing regex: /splits. (Hugging Face)
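
Comparing the /parquet paths against includes can be done mechanically. The sketch below is an approximation under stated assumptions: it treats "*" as matching within one path segment and "**/" as matching zero or more directories, which may not be exactly how mlcroissant evaluates includes. The function names and sample paths are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a path glob into a regex (assumed semantics, see the note above)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")  # trailing '**' matches anything, including '/'
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")  # stays inside one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def unmatched_paths(includes: list, paths: list) -> list:
    """Return the paths that none of the includes patterns match."""
    regexes = [glob_to_regex(p) for p in includes]
    return [p for p in paths if not any(r.fullmatch(p) for r in regexes)]


if __name__ == "__main__":
    paths = ["default/train/0000.parquet", "default/0000.parquet"]
    print(unmatched_paths(["default/*/*.parquet"], paths))
    print(unmatched_paths(["default/**/*.parquet"], paths))
```

Any path left in the result is a file the RecordSet will never see, so it points at the pattern that needs widening.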

Private/gated repos

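mlcroissant reads the Parquet branch over git+https, so private or gated repos need Git credentials. Set these environment variables before loading:
CROISSANT_GIT_USERNAME=<hf-username> and CROISSANT_GIT_PASSWORD=<hf-access-token>. (PyPI)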
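
If you prefer to set the credentials from Python rather than the shell, something like this should work, as long as it runs before the dataset is loaded. It is a minimal sketch: the helper name is hypothetical, and reading the token from an HF_TOKEN environment variable is an assumption about your setup, not something mlcroissant requires.

```python
import os
from typing import Optional


def configure_croissant_git_auth(
    username: str,
    token: Optional[str] = None,
    token_var: str = "HF_TOKEN",  # assumed name of the variable holding your access token
) -> None:
    """Export the two variables mlcroissant reads for git+https authentication."""
    token = token or os.environ.get(token_var)
    if not token:
        raise RuntimeError(f"No access token given and {token_var} is not set")
    os.environ["CROISSANT_GIT_USERNAME"] = username
    os.environ["CROISSANT_GIT_PASSWORD"] = token
```

Call configure_croissant_git_auth("<hf-username>") first, then create the mlcroissant Dataset as usual.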

Decision tree

Parquet or ImageFolder-like? Use Auto-Croissant. Load via /croissant. (Hugging Face Forums)
Auto-Croissant missing or wrong subset? Adjust includes/regex in a manual Croissant, or restructure the repo so the Viewer emits the expected layout. Keep the repo-level contentUrl. (Hugging Face)
Need a single file? Use hf_hub_url and /resolve/ in a FileObject. (Hugging Face)

Minimal, version-safe templates

A. Tolerant FileSet (auto-Croissant style)

{
  "@type": "cr:FileSet",
  "@id": "#fs-subset",
  "name": "single_true_multi_choice_qa",
  "containedIn": { "@id": "repo" },
  "encodingFormat": "application/x-parquet",
  "includes": "single_true_multi_choice_qa/*/*.parquet"
}

Here "repo" is the @id of the repo-level FileObject (JSON does not allow inline comments, so the note lives here). The Hub example uses exactly this pattern with contentUrl: ".../tree/refs%2Fconvert%2Fparquet" on the repo object. (Hugging Face)

B. Extract split from the file path with regex

{
  "@type": "cr:Field",
  "name": "split",
  "source": { "fileSet": { "@id": "#fs-subset" }, "extract": { "fileProperty": "fullpath" } },
  "transform": { "regex": "^single_true_multi_choice_qa/(?:partial-)?(?P<split>[^/]+)/.+\\.parquet$" }
}

The regex matches directory components, so it must run against fullpath; filename is the bare file name without its directories.

Regex transforms in fields are standard. (docs.mlcommons.org)

C. Single file (manual FileObject)

{
  "@type": "cr:FileObject",
  "@id": "#tbl",
  "contentUrl": "https://huggingface.co/datasets/<owner>/<repo>/resolve/<rev>/data/table.parquet",
  "encodingFormat": "application/x-parquet"
}

Build this URL with hf_hub_url in code. (Hugging Face)
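
As a rough illustration of template C, the sketch below builds the URL with hf_hub_url from huggingface_hub and wraps it in a FileObject dictionary. The two helper functions, the "#tbl" id, and the data/table.parquet path are placeholders for illustration; only hf_hub_url itself is a real library call.

```python
import json


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a raw /resolve/ URL for one file in a dataset repo."""
    # Imported here so the dictionary helper below works without huggingface_hub installed.
    from huggingface_hub import hf_hub_url

    return hf_hub_url(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
        revision=revision,
    )


def file_object(content_url: str, file_id: str = "#tbl") -> dict:
    """Wrap a single-file URL in a minimal cr:FileObject, mirroring template C."""
    return {
        "@type": "cr:FileObject",
        "@id": file_id,
        "contentUrl": content_url,
        "encodingFormat": "application/x-parquet",
    }


if __name__ == "__main__":
    url = resolve_url("owner/repo", "data/table.parquet")  # placeholder repo and path
    print(json.dumps(file_object(url), indent=2))
```

Keeping the URL construction in hf_hub_url means the /datasets/ prefix and the revision encoding are handled by the library rather than by hand.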

Common pitfalls and fixes

Pitfall: Changing contentUrl to /resolve/ on Auto-Croissant’s repo container.
Fix: Leave it as .../tree/refs%2Fconvert%2Fparquet with git+https. Use includes to select files. (Hugging Face)
Pitfall: Hard-coding train in regex.
Fix: Capture any first subdir or enumerate all splits; allow partial- shards. (docs.mlcommons.org)
Pitfall: Missing extras or auth.
Fix: Install mlcroissant[parquet] + GitPython; set CROISSANT_GIT_* for private repos. (Hugging Face)
Pitfall: Assuming Auto-Croissant works for script-only datasets.
Fix: Convert to Parquet or ImageFolder, or hand-author Croissant. Maintainer guidance confirms this. (Hugging Face Forums)

Short checklist

Upload. Wait for Parquet. Confirm /parquet, /splits, /croissant. (Hugging Face)
Load with mlcroissant and pick a working RecordSet. (Hugging Face)
If a RecordSet fails, widen includes and relax the regex. Keep the repo-level contentUrl. (Hugging Face)

Curated references

HF docs: Auto-Croissant + example JSON (shows repo-level FileObject with tree/refs%2Fconvert%2Fparquet, plus FileSet.includes). (Hugging Face)
HF docs: Parquet auto-publish rules and supported backends; private repo requirements. (Hugging Face)
HF docs: /parquet endpoint. HF docs: dataset viewer quickstart and endpoints. (Hugging Face)
Loader: mlcroissant usage and the git+https requirement. (Hugging Face)
Spec: Croissant FileSet.includes and transform.regex usage. (docs.mlcommons.org)
Forums: Auto-Croissant appears for Parquet/ImageFolder; script-only repos don’t get it. (Hugging Face Forums)
Hub utils: Build raw /resolve/... URLs with hf_hub_url. (Hugging Face)

Bottom line: upload, rely on Auto-Croissant, load via /croissant. If a subset warns or yields no rows, align includes and regex with the Viewer’s Parquet paths. Keep the repo-level contentUrl. (Hugging Face)
