Dataset viewer shows the wrong number of rows

I hope it helps.

I believe so. Just for people Googling in the future.

For the dataset viewer to use the right data in a dataset repository, you must consider:

  • You must have a YAML configuration inside the README.md markdown document that points to your training splits. save_to_disk does not create this README.md.
  • save_to_disk will create a dataset_info.json and a state.json, but those have no effect as far as the UI is concerned.
  • The UI will ignore the .arrow files that are produced by save_to_disk and instead relies on a hierarchy of file extensions to find data while crawling the repository.
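
For anyone landing here later, this is a rough sketch of what the YAML block at the top of README.md could look like, generated with plain Python so no extra libraries are needed. The helper name build_front_matter, the config name "default", and the glob data/train/*.arrow are all made up for illustration; adjust them to your repository layout. Whether the viewer then accepts .arrow files at that path is not something this thread confirms.

```python
def build_front_matter(config_name, split_patterns):
    """Return a README.md YAML header mapping each split to a file pattern.

    Hypothetical helper: it only formats text, it does not touch the Hub.
    """
    lines = ["---", "configs:", f"- config_name: {config_name}", "  data_files:"]
    for split, pattern in split_patterns.items():
        # One entry per split; the pattern is a glob relative to the repo root.
        lines.append(f"  - split: {split}")
        lines.append(f'    path: "{pattern}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


# Assumed layout: the training shards live under data/train/ in the repo.
front_matter = build_front_matter("default", {"train": "data/train/*.arrow"})
print(front_matter)
```

The printed text goes at the very top of README.md, before any markdown content.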

Do I have this correct? This was unexpected to me, but if this is the way it works, this is the way it works.

I have updated the README.md to reflect the arrow files, but it still reports the wrong number of rows:

https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext/raw/main/README.md