Problem with dataset formatting and Croissant metadata

Hi :waving_hand:

I am uploading my first dataset to HF and I am running into an issue: the “Use this dataset” button does not appear. I am particularly interested in the generated Croissant file. Right now, I can manually type the API endpoint, but I get a nearly empty Croissant file with no record set.

My dataset is a bit unusual, as it is a graph dataset with hundreds of graphs (each associated with two CSV files). So I wonder whether the problem comes from data formatting expectations that I am not aware of. Or maybe it is just a matter of time [I uploaded the dataset a few hours ago]?

I am completely new to the HF data processing pipeline, so any help would be welcome! BTW, here is the dataset: https://huggingface.co/datasets/MarcDamie/Fedivertex-Reduced/tree/main

This applies to the dataset itself, but especially to the dataset viewer: the README.md file serves as a configuration file. If a model or dataset isn’t recognized automatically, editing the YAML section at the beginning of the README.md file may resolve the issue.

However, there’s always a possibility of a bug, so don’t worry if it doesn’t work…


What is going wrong

Hugging Face is currently treating your repo like one CSV dataset, but your repo actually contains two different kinds of tables:

  • instances.csv = node table
  • interactions.csv = edge table

Those two files do not have the same columns. Your dataset page already shows that exact problem: Hugging Face inferred only one subset (default) and one split (train), then failed with DatasetGenerationCastError because one group of files has columns Source, Target, Weight while another has host, version, registration_enabled, Id, Label. (Hugging Face)
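The mismatch can be reproduced locally before pushing anything. The following is a minimal sketch, assuming a local clone of the repo; the function name `group_csv_files_by_header` is made up for illustration. It groups every CSV file by its header row, so more than one group means the files cannot share a single schema.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_csv_files_by_header(root):
    """Map each distinct header (tuple of column names) to the CSV files using it."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # The first record is the header; an empty file yields an empty tuple.
            header = tuple(next(csv.reader(handle), ()))
        groups[header].append(path.relative_to(root).as_posix())
    return dict(groups)


if __name__ == "__main__":
    # Run from the root of a local clone of the dataset repo.
    for header, files in group_csv_files_by_header(".").items():
        print(len(files), "file(s) with columns", list(header))
```

If this prints two groups (one for the node files, one for the edge files), that is the same split the cast error is complaining about.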

So is it just a matter of time?

Probably not.

This does not look like “the upload is still processing.” It looks like a real schema error. The page is already showing a specific failure, not just a temporary loading state. (Hugging Face)

Why this happens

The Hugging Face dataset viewer is built around a tabular idea: one data point is one row, and features are columns. If it auto-detects many CSV files as belonging to one dataset split, it expects them to share one schema. Your graph data breaks that assumption because each graph is stored as two different tables with different columns. (Hugging Face)

Why the Croissant file is almost empty

On Hugging Face, generated Croissant metadata is built from the dataset-viewer / Parquet pipeline. The official Croissant example shows recordSet entries tied to Hugging Face–converted Parquet files for each config. So if the viewer cannot cleanly build the dataset first, the generated Croissant will often be thin or missing useful recordSet entries. (Hugging Face)
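To see how thin a generated Croissant file is, it can help to count its record sets. This is a rough sketch that assumes the file has already been downloaded and parsed as JSON; `summarize_croissant` is a hypothetical helper, and the sample document below is made up for illustration, not taken from the actual dataset.

```python
def summarize_croissant(metadata):
    """Return the record set names and the number of distribution entries."""
    record_sets = metadata.get("recordSet") or []
    distribution = metadata.get("distribution") or []
    return {
        "record_sets": [rs.get("name") or rs.get("@id") for rs in record_sets],
        "distribution_entries": len(distribution),
    }


if __name__ == "__main__":
    # Made-up example of a Croissant document without any record set.
    nearly_empty = {
        "@type": "sc:Dataset",
        "name": "example",
        "distribution": [{"@type": "cr:FileObject", "@id": "repo"}],
    }
    print(summarize_croissant(nearly_empty))
```

An empty `record_sets` list corresponds to the "no record set" symptom described in the question.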

The real cause, in one sentence

Your dataset is not failing because it is a graph dataset.

It is failing because Hugging Face is currently reading it as one mixed CSV dataset with incompatible schemas. (Hugging Face)

The easiest fix

Tell Hugging Face explicitly that these are two separate dataset parts.

Use manual config in README.md, with one config for instances.csv and one for interactions.csv. The docs show that dataset configs use config_name, data_files, split, and path. (Hugging Face)

A simple starting point is:

---
configs:
  - config_name: instances
    data_files:
      - split: train
        path: "*/*/*/instances.csv"

  - config_name: interactions
    data_files:
      - split: train
        path: "*/*/*/interactions.csv"
---

That tells Hugging Face: “do not mix these files together.” This is exactly the kind of problem manual configuration is for. (Hugging Face)
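Before pushing, the two patterns can be sanity-checked against a local clone. The sketch below uses `pathlib` globbing, which only approximates how the Hub resolves `path` patterns, so treat it as a quick check rather than a guarantee; `files_per_config` is a name invented for this example.

```python
from pathlib import Path

# Patterns copied from the YAML above; the keys are the config names.
CONFIG_PATTERNS = {
    "instances": "*/*/*/instances.csv",
    "interactions": "*/*/*/interactions.csv",
}


def files_per_config(root, patterns=CONFIG_PATTERNS):
    """Return, for each config name, the relative paths its pattern matches locally."""
    root = Path(root)
    return {
        name: sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))
        for name, pattern in patterns.items()
    }


if __name__ == "__main__":
    # Run from the root of a local clone of the dataset repo.
    for name, files in files_per_config(".").items():
        print(f"{name}: {len(files)} file(s)")
        for example in files[:3]:
            print("  ", example)
```

Once the README is pushed, each config should be loadable on its own, for example with `load_dataset("MarcDamie/Fedivertex-Reduced", "instances", split="train")` from the `datasets` library, assuming the viewer accepts the configuration.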

An even better fix

Add a few columns to both table types so every row says which graph it belongs to, for example:

  • graph_id
  • software
  • graph_type
  • snapshot_date

That way:

  • all node rows can live in one clean schema
  • all edge rows can live in one clean schema
  • users can filter by graph
  • Croissant has a much better chance of becoming meaningful

The docs do not explicitly require this, but it matches the viewer’s row/column design much better. (Hugging Face)
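One possible way to add those columns is sketched below. It assumes, purely for illustration, that the three folder levels matched by `*/*/*/` correspond to software, graph type, and snapshot date; check that against the real layout before using it. The helper name `add_graph_columns` and the output file name are also hypothetical.

```python
import csv
from pathlib import Path

ID_COLUMNS = ["graph_id", "software", "graph_type", "snapshot_date"]


def add_graph_columns(root, filename, output_path):
    """Merge every `filename` found three folders deep into one CSV.

    Each row is extended with graph-identifying columns derived from the
    folder names (assumed order: software / graph_type / snapshot_date).
    Returns the number of data rows written.
    """
    root = Path(root)
    writer = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        for path in sorted(root.glob(f"*/*/*/{filename}")):
            software, graph_type, snapshot_date = path.relative_to(root).parts[:3]
            with path.open(newline="", encoding="utf-8") as handle:
                reader = csv.DictReader(handle)
                if reader.fieldnames is None:
                    continue  # skip empty files
                if writer is None:
                    # The first file fixes the schema; DictWriter raises a
                    # ValueError if a later file brings unexpected columns.
                    fieldnames = ID_COLUMNS + list(reader.fieldnames)
                    writer = csv.DictWriter(out, fieldnames=fieldnames)
                    writer.writeheader()
                for row in reader:
                    row.update(
                        graph_id=f"{software}/{graph_type}/{snapshot_date}",
                        software=software,
                        graph_type=graph_type,
                        snapshot_date=snapshot_date,
                    )
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


if __name__ == "__main__":
    # Write the merged file outside the three-level folder tree.
    print(add_graph_columns(".", "interactions.csv", "all_interactions.csv"))
```

Running it once per table type gives one node file and one edge file, each with a single schema, at the cost of repeating the graph metadata on every row.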

Best long-term design

The most Hugging Face-friendly structure is often:

one row = one graph snapshot

For example, one processed dataset where each row contains:

  • graph metadata
  • counts
  • maybe paths to node/edge files
  • or another structured representation

That works better with the viewer because the viewer is fundamentally row-based. (Hugging Face)
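As a rough illustration of the "one row = one graph snapshot" idea, the sketch below builds an index with one entry per graph folder, holding node and edge counts plus relative paths to the two CSV files. The three-level folder layout and the field names (`graph_id`, `num_nodes`, `num_edges`, `nodes_path`, `edges_path`) are assumptions made for this example.

```python
import csv
from pathlib import Path


def count_data_rows(path):
    """Count CSV records, excluding the header line."""
    with open(path, newline="", encoding="utf-8") as handle:
        return max(sum(1 for _ in csv.reader(handle)) - 1, 0)


def build_graph_index(root):
    """Return one dict per graph folder found three levels below `root`."""
    root = Path(root)
    index = []
    for nodes_path in sorted(root.glob("*/*/*/instances.csv")):
        graph_dir = nodes_path.parent
        edges_path = graph_dir / "interactions.csv"
        index.append({
            "graph_id": graph_dir.relative_to(root).as_posix(),
            "num_nodes": count_data_rows(nodes_path),
            "num_edges": count_data_rows(edges_path) if edges_path.exists() else 0,
            "nodes_path": nodes_path.relative_to(root).as_posix(),
            "edges_path": edges_path.relative_to(root).as_posix(),
        })
    return index


if __name__ == "__main__":
    for entry in build_graph_index("."):
        print(entry)
```

Such an index could be written to a single CSV or Parquet file and exposed as its own config, which keeps the original per-graph files untouched.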

About the missing “Use this dataset” button

I would treat that as a symptom, not the main problem.

First fix the dataset structure so the viewer can understand it. Then check the page again. Right now the clearer signal is the cast error on the page itself. (Hugging Face)

What to do next

  1. Add manual configs to separate instances.csv and interactions.csv. (Hugging Face)

  2. Re-push the repo.

  3. Check Hugging Face’s dataset server endpoints, which the docs recommend for checking validity, available configs/splits, and preview rows: (Hugging Face)

    • /is-valid
    • /splits
    • /first-rows

  4. Only after that, check /croissant again. (Hugging Face)
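The checks in the list above can be scripted. This is a minimal sketch using only the standard library; the base URLs are the ones I understand the Hugging Face docs to use for the dataset viewer API and the Croissant endpoint, so verify them against the current docs. The helper names `viewer_urls` and `fetch_json` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the dataset viewer API; check the docs if it has moved.
DATASETS_SERVER = "https://datasets-server.huggingface.co"


def viewer_urls(dataset, config=None, split=None):
    """Build the URLs for the validity, splits, first-rows and Croissant checks."""
    base_query = urlencode({"dataset": dataset})
    urls = {
        "is-valid": f"{DATASETS_SERVER}/is-valid?{base_query}",
        "splits": f"{DATASETS_SERVER}/splits?{base_query}",
    }
    if config and split:
        # /first-rows needs a specific config and split.
        rows_query = urlencode({"dataset": dataset, "config": config, "split": split})
        urls["first-rows"] = f"{DATASETS_SERVER}/first-rows?{rows_query}"
    urls["croissant"] = f"https://huggingface.co/api/datasets/{dataset}/croissant"
    return urls


def fetch_json(url):
    """Download and parse a JSON response; HTTP errors raise an exception."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    urls = viewer_urls("MarcDamie/Fedivertex-Reduced", config="instances", split="train")
    for name in ("is-valid", "splits"):
        print(name, fetch_json(urls[name]))
```

The `instances` config used here only exists once the manual configuration from the README has been pushed.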

Bottom line

You are close.

The issue is not that your dataset is “too unusual” for Hugging Face. The issue is that Hugging Face needs clearer instructions for how to separate your two table types. Once you stop the node CSVs and edge CSVs from being merged into one inferred split, the viewer should be able to process the dataset, and the Croissant output will likely improve too. (Hugging Face)

Thanks for this detailed answer. It is super helpful and I now better understand the philosophy of Hugging Face when it comes to dataset formatting.

I will start with the easy fix and think about the more complete options. They may create redundancies in the dataset, so I’ll weigh the pros and cons in the upcoming days/weeks.

Thank you again!