I hope it helps.
I believe so. Just for people Googling in the future.
For a dataset repository to use the right data, you must consider:
- You must have a YAML configuration inside the `README.md` markdown document that points to your training splits. The `README.md` document is not prepared by `save_to_disk`. `save_to_disk` will create a `dataset_info.json` and `state.json`, but that doesn't do anything as far as the UI is concerned.
- The UI will ignore the file extension/files (`.arrow`) that are produced by `save_to_disk` and instead relies on a hierarchy of extensions to find data files while crawling the repository.
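
To make the first point concrete, here is a minimal sketch of the YAML block that would sit at the top of `README.md`. It only builds the text and does not touch the Hub. The helper name `build_readme_front_matter` and the glob `data/train/*.arrow` are made up for illustration, and I am assuming the `configs` / `data_files` / `split` / `path` layout from the Hub's manual configuration docs. Whether the viewer then counts rows correctly for `.arrow` files is exactly what I am unsure about.

```python
def build_readme_front_matter(splits: dict[str, str], config_name: str = "default") -> str:
    """Build a README.md YAML header mapping split names to file globs.

    `splits` maps a split name (e.g. "train") to a glob relative to the
    repository root. The globs used here are hypothetical placeholders.
    """
    lines = [
        "---",
        "configs:",
        f"- config_name: {config_name}",
        "  data_files:",
    ]
    for split, pattern in splits.items():
        lines.append(f"  - split: {split}")
        lines.append(f'    path: "{pattern}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


# Hypothetical layout: all training shards live under data/train/.
front_matter = build_readme_front_matter({"train": "data/train/*.arrow"})
print(front_matter)
```

The printed block would go at the very top of `README.md`, above the dataset card text.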
Do I have this correct? This was unexpected to me, but if this is the way it works, this is the way it works.
I have updated the README.md to reflect the arrow files, but it still reports the wrong number of rows:
https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext/raw/main/README.md