Possible issue of contentUrl in croissant file of the dataset

from mlcroissant import Dataset

ds = Dataset(jsonld="https://huggingface.co/api/datasets/ibm-research/FailureSensorIQ/croissant")
records = ds.records(record_set="single_true_multi_choice_qa")

for i, record in enumerate(records):
    print(f"Record {i}:")
    print(f"  Subject: {record.get('subject')}")
    print(f"  Question: {record.get('question')}")
    print(f"  Options: {record.get('options')}")
    print(f"  Correct: {record.get('correct')}")
    if i >= 4:  # Stop after the first 5 records
        break

Hi HF community, I ran into an error when trying to load my dataset with the auto-generated Croissant file. The code above prints the following warning and does not load any data:

WARNING:root:Could not match re.compile('single_true_multi_choice_qa/(?:partial-)?(train)/.+parquet$') in train

After investigating, I suspect that

```
"contentUrl": "https://huggingface.co/datasets/ibm-research/FailureSensorIQ/tree/refs%2Fconvert%2Fparquet"
```

in the Croissant file might be wrong. I found that contentUrl should contain resolve (Generating Croissant Metadata for Custom Image Dataset - #15 by John6666).

I tried to fix it by replacing tree with resolve, but it still does not work. Can anyone help me with this issue? Thanks!


Just an update: the following code works for me.

from mlcroissant import Dataset

dataset = Dataset(jsonld="https://huggingface.co/api/datasets/ibm-research/FailureSensorIQ/croissant")

records = dataset.records(record_set="multi_true_multi_choice_qa")

for i, record in enumerate(records):
    if i > 5:
        break
    print(f"\nRecord {i+1}:")
    for key, value in record.items():
        print(f"  {key}: {value}")

Although I still get the warning, the data loads successfully.


It seems best to avoid rewriting the Croissant file as far as possible…?


Use Auto-Croissant unless you must hand-author. Upload your dataset to the Hub, let the Dataset Viewer publish Parquet and expose /croissant, then load it in code with mlcroissant. Keep the repo-level contentUrl the Hub generates (.../tree/refs%2Fconvert%2Fparquet, encodingFormat: "git+https"). Select files via FileSet.includes globs. Don’t swap tree→resolve unless you are linking a single, concrete file in a manual Croissant. (Hugging Face)

Background

  • Croissant = a JSON-LD vocabulary for ML datasets. It describes resources (FileObject or FileSet) and how to extract records (RecordSet + fields + optional regex transforms). Tools can load the data from this metadata. (docs.mlcommons.org)
  • Hugging Face auto-generates Croissant for datasets the Viewer can convert to Parquet (or ImageFolder-like). You fetch it at /api/datasets/<owner>/<repo>/croissant. The metadata contains a repo-level entry pointing to the Parquet branch and FileSet.includes patterns for each subset. (Hugging Face)
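To make that structure concrete, here is a minimal sketch that summarizes a Croissant JSON-LD document. The `sample` dict is a hand-written, heavily abridged assumption modeled on the fields quoted in this thread (`contentUrl`, `encodingFormat`, `includes`); it is not the file the Hub actually serves, and `summarize_croissant` is a hypothetical helper, not part of mlcroissant.

```python
from typing import Any


def summarize_croissant(doc: dict[str, Any]) -> dict[str, list]:
    """Collect the resources and record-set names from a Croissant JSON-LD dict."""
    resources = []
    for item in doc.get("distribution", []):
        resources.append(
            {
                "type": item.get("@type"),
                "id": item.get("@id"),
                # FileObject entries carry contentUrl; FileSet entries carry includes.
                "contentUrl": item.get("contentUrl"),
                "includes": item.get("includes"),
            }
        )
    record_sets = [rs.get("name") or rs.get("@id") for rs in doc.get("recordSet", [])]
    return {"resources": resources, "record_sets": record_sets}


# Abridged, hand-written sample (an assumption for illustration only).
sample = {
    "@type": "sc:Dataset",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "repo",
            "contentUrl": "https://huggingface.co/datasets/<owner>/<repo>/tree/refs%2Fconvert%2Fparquet",
            "encodingFormat": "git+https",
        },
        {
            "@type": "cr:FileSet",
            "@id": "parquet-files-for-config-default",
            "containedIn": {"@id": "repo"},
            "encodingFormat": "application/x-parquet",
            "includes": "default/*/*.parquet",
        },
    ],
    "recordSet": [{"@type": "cr:RecordSet", "@id": "default", "name": "default"}],
}

summary = summarize_croissant(sample)
for res in summary["resources"]:
    print(res["type"], res["id"], res["contentUrl"] or res["includes"])
print("record sets:", summary["record_sets"])
```

Running the same helper on the JSON fetched from the /croissant endpoint should show one repo-level FileObject and one FileSet per subset, assuming the real document follows this layout.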

Default path: “upload → use”

  1. Push data to a dataset repo. The Viewer auto-converts public datasets ≤5 GB to Parquet and publishes them; private requires PRO/Enterprise. (Hugging Face)

  2. Confirm availability

    • List Parquet files: /parquet?dataset=<owner>/<repo>.
    • List splits/subsets: /splits?dataset=<owner>/<repo>.
    • Fetch Croissant JSON-LD: /api/datasets/<owner>/<repo>/croissant. (Hugging Face)
  3. Load in code

# docs:
# - mlcroissant loader: https://huggingface.co/docs/dataset-viewer/en/mlcroissant
# - Croissant endpoint: https://huggingface.co/docs/dataset-viewer/en/croissant
# install once:
#   pip install "mlcroissant[parquet]" GitPython
from mlcroissant import Dataset

repo = "owner/repo"
ds = Dataset(jsonld=f"https://huggingface.co/api/datasets/{repo}/croissant")
print([rs["name"] for rs in ds.metadata.to_json()["recordSet"]])  # choose a RecordSet
for i, rec in enumerate(ds.records(record_set="<recordset-name>")):
    if i == 5:  # stop after the first 5 records
        break
    print(rec)

mlcroissant[parquet] and GitPython are required to read Parquet over git+https. (Hugging Face)
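The three checks in step 2 can be scripted. The sketch below only builds the URLs; it assumes the /parquet and /splits endpoints are served from the dataset viewer host `https://datasets-server.huggingface.co`, and `viewer_urls` is a hypothetical helper name. Fetch the results with whatever HTTP client you prefer.

```python
from urllib.parse import quote, urlencode

VIEWER_API = "https://datasets-server.huggingface.co"  # assumed dataset viewer host
HUB = "https://huggingface.co"


def viewer_urls(repo: str) -> dict[str, str]:
    """Build the URLs to check before loading: Parquet files, splits, Croissant."""
    query = urlencode({"dataset": repo})  # percent-encodes the slash in owner/repo
    return {
        "parquet": f"{VIEWER_API}/parquet?{query}",
        "splits": f"{VIEWER_API}/splits?{query}",
        "croissant": f"{HUB}/api/datasets/{quote(repo)}/croissant",
    }


urls = viewer_urls("ibm-research/FailureSensorIQ")
for name, url in urls.items():
    print(f"{name}: {url}")
```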

When you should edit Croissant

Only if a RecordSet’s file-match misses your files. Fix patterns, not contentUrl.

  • Globs first. Prefer tolerant includes globs that match your Viewer layout:
    "<subset>/*/*.parquet" or "<subset>/**/*.parquet". This is how the Hub’s own example is structured. (Hugging Face)
  • Regex only to extract fields. If you need a split field from the path, use a permissive regex transform, e.g.:
    ^<subset>/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$. The spec shows includes for matching and transform.regex for parsing. (docs.mlcommons.org)
  • Per-file links. If you hand-author a Croissant that references one file, build raw URLs with hf_hub_url(..., repo_type="dataset") which returns /resolve/<rev>/<path>. Use this with cr:FileObject. Do not use /resolve/ for the repo-container that Auto-Croissant emits. (Hugging Face)
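A small sketch of the "glob selects, regex parses" split described above. The paths are invented for illustration, and `PurePosixPath.match` is used here as a stand-in for glob matching; mlcroissant's own matching may differ in detail.

```python
import re
from pathlib import PurePosixPath

SUBSET = "single_true_multi_choice_qa"
INCLUDES = f"{SUBSET}/*/*.parquet"
# Permissive pattern: any split name, optional "partial-" prefix.
SPLIT_RE = re.compile(rf"^{SUBSET}/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$")

# Hypothetical paths, made up for illustration.
paths = [
    f"{SUBSET}/train/0000.parquet",
    f"{SUBSET}/partial-test/0000.parquet",
    "multi_true_multi_choice_qa/train/0000.parquet",
]

splits = {}
for path in paths:
    if not PurePosixPath(path).match(INCLUDES):  # step 1: the glob selects files
        print(path, "-> not selected")
        continue
    found = SPLIT_RE.match(path)  # step 2: the regex extracts the split field
    splits[path] = found.group("split")
    print(path, "->", splits[path])
```

The glob decides which files belong to the subset; the regex is only asked to pull the split name out of paths that were already selected.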

Why users hit the “Could not match … regex” warning

The generated RecordSet expected e.g. .../<subset>/(partial-)?train/..., but your Parquet layout lacked that folder or used different split names. The file-match fails, so the iterator is empty or warns. Fix the includes and any split-capturing regex, or use a working RecordSet. (Hugging Face)

Debug quickly

  • See what exists: call /parquet and compare the returned paths to your includes. (Hugging Face)
  • See how the Hub structures Croissant: open /croissant and note the repo-level FileObject with contentUrl: .../tree/refs%2Fconvert%2Fparquet and the per-subset FileSet.includes. (Hugging Face)
  • Check splits/subsets before writing regex: /splits. (Hugging Face)
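One way to run the first check is to rebuild `config/split/filename` paths from a /parquet response and test them against the generated regex. This is a sketch under assumptions: the payload below is a hand-written stand-in with invented values, the `parquet_files`, `config`, `split` and `filename` keys are assumed to match the endpoint's response shape, and `unmatched_files` is a hypothetical helper.

```python
import re

# Regex copied from the warning in the original post ("train" is hard-coded).
GENERATED_RE = re.compile(r"single_true_multi_choice_qa/(?:partial-)?(train)/.+parquet$")


def unmatched_files(payload: dict, pattern: re.Pattern) -> list[str]:
    """Return config/split/filename paths that the pattern does not match."""
    missing = []
    for entry in payload.get("parquet_files", []):
        path = f"{entry['config']}/{entry['split']}/{entry['filename']}"
        if not pattern.search(path):
            missing.append(path)
    return missing


# Hand-written stand-in for a /parquet response (assumed shape, invented values).
payload = {
    "parquet_files": [
        {"config": "single_true_multi_choice_qa", "split": "train", "filename": "0000.parquet"},
        {"config": "single_true_multi_choice_qa", "split": "test", "filename": "0000.parquet"},
    ]
}

missing = unmatched_files(payload, GENERATED_RE)
print("not matched:", missing)
```

Any path reported here is a candidate for the "Could not match" warning, and tells you which split name or folder the pattern needs to allow.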

Private/gated repos

mlcroissant reads git+https. Set:
CROISSANT_GIT_USERNAME=<hf-username> and CROISSANT_GIT_PASSWORD=<hf-access-token>. (PyPI)
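If you prefer to set these from Python rather than the shell, a minimal sketch follows. It assumes the variables only need to be present in the process environment before the dataset is loaded; `set_croissant_git_auth` is a hypothetical helper, and reading the token from an `HF_TOKEN` variable is an assumption about your setup.

```python
import os


def set_croissant_git_auth(username: str, token: str) -> None:
    """Export the credentials named above into the current process environment."""
    os.environ["CROISSANT_GIT_USERNAME"] = username
    os.environ["CROISSANT_GIT_PASSWORD"] = token


# Placeholder values: never hard-code a real token in source files.
# HF_TOKEN is an assumed variable name for wherever you keep the access token.
set_croissant_git_auth("<hf-username>", os.environ.get("HF_TOKEN", "<hf-access-token>"))

print("username set:", os.environ["CROISSANT_GIT_USERNAME"])
```

Call this before constructing the `Dataset` so the credentials are in place when the files are fetched.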

Decision tree

  • Parquet or ImageFolder-like? Use Auto-Croissant. Load via /croissant. (Hugging Face Forums)
  • Auto-Croissant missing or wrong subset? Adjust includes/regex in a manual Croissant, or restructure the repo so the Viewer emits the expected layout. Keep the repo-level contentUrl. (Hugging Face)
  • Need a single file? Use hf_hub_url and /resolve/ in a FileObject. (Hugging Face)

Minimal, version-safe templates

A. Tolerant FileSet (auto-Croissant style)

{
  "@type": "cr:FileSet",
  "@id": "#fs-subset",
  "name": "single_true_multi_choice_qa",
  "containedIn": { "@id": "repo" },
  "encodingFormat": "application/x-parquet",
  "includes": "single_true_multi_choice_qa/*/*.parquet"
}

The Hub example uses exactly this pattern with contentUrl: ".../tree/refs%2Fconvert%2Fparquet" on the repo object. (Hugging Face)

B. Extract split from filename with regex

{
  "@type": "cr:Field",
  "name": "split",
  "source": {
    "fileSet": { "@id": "#fs-subset" },
    "extract": { "fileProperty": "fullpath" },
    "transform": { "regex": "^single_true_multi_choice_qa/(?:partial-)?(?P<split>[^/]+)/.+\\.parquet$" }
  }
}

Regex transforms in fields are standard. (docs.mlcommons.org)

C. Single file (manual FileObject)

{
  "@type": "cr:FileObject",
  "@id": "#tbl",
  "contentUrl": "https://huggingface.co/datasets/<owner>/<repo>/resolve/<rev>/data/table.parquet",
  "encodingFormat": "application/x-parquet"
}

Build this URL with hf_hub_url in code. (Hugging Face)
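For reference, this is a stdlib-only sketch of the URL shape that `hf_hub_url(..., repo_type="dataset")` is expected to produce, wrapped into a FileObject dict. `dataset_file_url` and `file_object` are hypothetical helpers and the repo and file names are placeholders; in real code, prefer calling `hf_hub_url` from `huggingface_hub` instead of rebuilding the URL by hand.

```python
import json
from urllib.parse import quote


def dataset_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Stand-in for hf_hub_url(repo_id, filename, repo_type="dataset", revision=...)."""
    # The revision is fully percent-encoded so refs like refs/convert/parquet stay one path segment.
    return (
        f"https://huggingface.co/datasets/{repo_id}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


def file_object(file_id: str, url: str) -> dict:
    """Build a minimal cr:FileObject entry pointing at one Parquet file."""
    return {
        "@type": "cr:FileObject",
        "@id": file_id,
        "contentUrl": url,
        "encodingFormat": "application/x-parquet",
    }


# Placeholder repo and path, for illustration only.
url = dataset_file_url("owner/repo", "data/table.parquet")
print(json.dumps(file_object("#tbl", url), indent=2))
```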

Common pitfalls and fixes

  • Pitfall: Changing contentUrl to /resolve/ on Auto-Croissant’s repo container.
    Fix: Leave it as .../tree/refs%2Fconvert%2Fparquet with git+https. Use includes to select files. (Hugging Face)
  • Pitfall: Hard-coding train in regex.
    Fix: Capture any first subdir or enumerate all splits; allow partial- shards. (docs.mlcommons.org)
  • Pitfall: Missing extras or auth.
    Fix: Install mlcroissant[parquet] + GitPython; set CROISSANT_GIT_* for private repos. (Hugging Face)
  • Pitfall: Assuming Auto-Croissant works for script-only datasets.
    Fix: Convert to Parquet or ImageFolder, or hand-author Croissant. Maintainer guidance confirms this. (Hugging Face Forums)

Short checklist

  • Upload. Wait for Parquet. Confirm /parquet, /splits, /croissant. (Hugging Face)
  • Load with mlcroissant and pick a working RecordSet. (Hugging Face)
  • If a RecordSet fails, widen includes and relax the regex. Keep the repo-level contentUrl. (Hugging Face)

Curated references

  • HF docs: Auto-Croissant + example JSON (shows repo-level FileObject with tree/refs%2Fconvert%2Fparquet, plus FileSet.includes). (Hugging Face)
  • HF docs: Parquet auto-publish rules and supported backends; private repo requirements. (Hugging Face)
  • HF docs: /parquet endpoint. HF docs: dataset viewer quickstart and endpoints. (Hugging Face)
  • Loader: mlcroissant usage and the git+https requirement. (Hugging Face)
  • Spec: Croissant FileSet.includes and transform.regex usage. (docs.mlcommons.org)
  • Forums: Auto-Croissant appears for Parquet/ImageFolder; script-only repos don’t get it. (Hugging Face Forums)
  • Hub utils: Build raw /resolve/... URLs with hf_hub_url. (Hugging Face)

Bottom line: upload, rely on Auto-Croissant, load via /croissant. If a subset warns or yields no rows, align includes and regex with the Viewer’s Parquet paths. Keep the repo-level contentUrl. (Hugging Face)