Hi,
I’m encountering an issue I can’t find the solution to. I have created a dataset ( iguanodon-ai/kubhist2 · Datasets at Hugging Face ).
It has one split (train) but several configs, and for doing so I have followed the documentation here: Structure your repository
- config_name: '1710'
  data_files:
  - split: train
    path: data/1710/train/*.parquet
- config_name: '1720'
  data_files:
  - split: train
    path: data/1720/train/*.parquet
The structure is the same as this HF-released dataset ( HuggingFaceFW/finepdfs at main ).
The problem I’m having is that on my dataset the viewer shows only the split and not the subset:
but on the HF dataset, the viewer does show all the subsets:
What is going on? Can someone please enlighten me? Is it because I’m uploading parquet files and the automated parquet-converter tries to change the data?
Thanks!
Oh. It seems to be working now (maybe thanks to your own commit).
Hi! Thanks, but I don’t see any change – the viewer only shows one split (train), and not the subsets (1640, 1650, 1660, etc.). Can you please provide a screenshot of what you see?
Thanks. Unfortunately that’s the part I don’t understand. I have exactly the same config as HuggingFaceFW/finepdfs · Datasets at Hugging Face (in the README.md), and yet my result is different from theirs.
How about this…?
Why it still shows one subset: your README YAML is using metadata: dataset_info: rather than the viewer’s configs: schema. The Hub viewer reads configs: for subsets. Your README currently lists decades under metadata → dataset_info, plus an all config marked default: true. That does not populate the viewer’s subset dropdown. (Hugging Face)
Fix precisely:
- Put a YAML front-matter block at the very top of README.md using configs:.
- Keep your all entry if you want it as default, but list every decade under configs: too.
- Remove the metadata: wrapper. Optional: keep a separate dataset_info: if you want, but it does not drive the viewer.
Minimal example to paste at the very top of README.md:
---
configs:
- config_name: "1640"
  data_files:
  - split: train
    path: data/1640/train/*.parquet
- config_name: "1650"
  data_files:
  - split: train
    path: data/1650/train/*.parquet
# …repeat for all decades…
- config_name: "all"
  default: true
  data_files:
  - split: train
    path: data/all/train/*.parquet
---
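Since the block has to be repeated for every decade, it may be easier to generate it than to type it by hand. Below is a minimal sketch in plain Python (no extra dependencies). The function name build_configs_yaml and the decade range in the usage line are made up for illustration; the path pattern data/<decade>/train/*.parquet is the one from the example above, so adjust both to match your actual repository layout.

```python
def build_configs_yaml(decades, include_all=True):
    """Return README front matter declaring one config per decade.

    Hypothetical helper: it assumes files live under
    data/<decade>/train/*.parquet, as in the example above.
    """
    lines = ["---", "configs:"]
    for decade in decades:
        lines += [
            f'- config_name: "{decade}"',
            "  data_files:",
            "  - split: train",
            f"    path: data/{decade}/train/*.parquet",
        ]
    if include_all:
        # Optional catch-all config, marked as the default subset.
        lines += [
            '- config_name: "all"',
            "  default: true",
            "  data_files:",
            "  - split: train",
            "    path: data/all/train/*.parquet",
        ]
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative range only; substitute the decades your dataset really has.
    print(build_configs_yaml(range(1640, 1670, 10)))
```

Paste the printed output at the very top of README.md, replacing any existing front matter there (and merge back any other metadata keys you want to keep).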
Reference syntax and behavior are defined in the Hub docs: use configs: with data_files for manual subsets; this is distinct from the metadata block. (Hugging Face)
If you commit that change to main, the viewer should show a “Subset” dropdown with all decades.
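If you want to sanity-check the README before committing, something like the following rough sketch can help. It is not a real YAML parser: it only looks at un-indented keys in the front matter and reports whether configs is one of them, which is the condition described above. The function names and the two sample README strings are hypothetical.

```python
def front_matter_top_level_keys(readme_text):
    """Return the un-indented keys found in the leading '---' block.

    Deliberately simplistic: it treats any non-indented line containing
    a colon as a top-level key and does not validate the YAML.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # front matter must start on the very first line
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # reached the closing delimiter
        is_top_level = (
            line
            and not line[0].isspace()
            and not line.startswith(("-", "#"))
            and ":" in line
        )
        if is_top_level:
            keys.append(line.split(":", 1)[0].strip())
    return []  # no closing '---' found


def has_top_level_configs(readme_text):
    """True if 'configs' is a top-level key rather than nested under another key."""
    return "configs" in front_matter_top_level_keys(readme_text)


# Hypothetical README snippets for illustration.
nested = "---\nmetadata:\n  configs:\n  - config_name: '1640'\n---\n# Title\n"
fixed = "---\nconfigs:\n- config_name: '1640'\n---\n# Title\n"

if __name__ == "__main__":
    print(has_top_level_configs(nested))
    print(has_top_level_configs(fixed))
```

Once the change is pushed, the datasets library's get_dataset_config_names function should also list every decade for the repository, which is another way to confirm the subsets are being picked up.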
That did it! Thank you very much John!
Looks like I failed to edit the heading when migrating from the older dataset_info to the newer configs, my bad.