Dataset viewer shows the wrong number of rows

Hi,
I'm experiencing a strange problem. I have a dataset of type DatasetDict; here is its basic info:

DatasetDict({
    train: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
    valid: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
})

I use the following code to upload it to the Hub:

dataset_small.push_to_hub(
    hub_path, private=False, commit_message="Upload example dataset."
)

After it is converted to Parquet and uploaded to the Hub, the dataset viewer shows the wrong number of rows.

But if I use the following code to download the dataset:

from datasets import load_dataset
dataset_demo = load_dataset(hub_path)
print(dataset_demo)

I will get the correct info (1k/1k/1k for each split and 3k in total).

Is there something wrong with my usage of datasets and the Hugging Face Hub?

Update:
I also got the wrong number when I used the datasets-server API described in "Get the number of rows and the size in bytes" to fetch the dataset size. It seems the backend of the datasets-server API has a bug.
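
For reference, a minimal sketch of the call I made, assuming the /size endpoint described on that page (the dataset name below is a placeholder):

import requests
# Ask the dataset viewer backend for the reported size of a dataset.
# Replace "username/my-dataset" with the actual repo id.
API_URL = "https://datasets-server.huggingface.co/size"
response = requests.get(API_URL, params={"dataset": "username/my-dataset"})
response.raise_for_status()
payload = response.json()
# The totals (including the row count) are reported under payload["size"];
# inspect the raw payload if the field names differ in your API version.
print(payload["size"]["dataset"])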

Looking forward to any suggestions and thanks a lot in advance!

well spotted, thanks for opening the issue (The API returns the wrong row number · Issue #2581 · huggingface/datasets-server · GitHub), we're on it

fixed. Thanks again for the investigation!

Thank you for the reply and fix~

Still displaying the wrong number of rows: CaptionEmporium/coyo-hd-11m-llavanext · Datasets at Hugging Face

>>> len(df)
11397144

Is it also autogenerating Parquet files for the Parquet files I already uploaded?

@lhoestq:

I guess we regenerate Parquet files because of the different row group size? The number of rows is estimated on the autogenerated Parquet files, which is why it's not exact. In this case, we could get the exact value very easily, maybe something to fix?
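
For what it's worth, the exact count is stored in each Parquet file's footer, so it can be checked locally without scanning the data. A minimal sketch with pyarrow, assuming a local copy of one of the uploaded Parquet files (the filename is a placeholder):

import pyarrow.parquet as pq
# The Parquet footer records the total row count and the row group layout,
# so this reads only metadata, not the data pages themselves.
parquet_file = pq.ParquetFile("train_full.parquet")
print(parquet_file.metadata.num_rows)        # exact number of rows
print(parquet_file.metadata.num_row_groups)  # row groups (the viewer regenerates files with its own row group size)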

I used the datasets library, uploaded the result to the data/ folder, and it still doesn't give the right number of rows.

import gc
import orjsonl
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import Dataset
# Load the compressed JSONL and write it out as a single Parquet file
foo = orjsonl.load('train.jsonl.gz')
df = pd.DataFrame(foo)
table = pa.Table.from_pandas(df)
pq.write_table(table, './train_full.parquet')
del foo, df, table
gc.collect()
# Build a Dataset from the Parquet file and serialize it as Arrow files on disk
dataset = Dataset.from_parquet('./train_full.parquet')
dataset.save_to_disk('./huggingface')

The repository currently contains three kinds of data: Parquet files, json.gz files, and Arrow files. Moreover, the README YAML does not contain a configuration to help us determine which data files should be used for which split, so it's a best guess using heuristics.


To get the expected results, you can add a configs: field to the README frontmatter. We have detailed documentation at Data files Configuration, and example datasets with the supported structures/configurations that you can replicate at datasets-examples.
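
For example, a minimal frontmatter sketch, assuming hypothetical Parquet files under data/ with one pattern per split (adjust the config name, split names, and path patterns to your actual layout):

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: validation
    path: data/valid-*.parquet
  - split: test
    path: data/test-*.parquet
---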

I hope it helps.


I believe so. Just for people Googling this in the future:

For a dataset repository to use the right data, you must consider:

  • You must have a YAML configuration inside the README markdown document (the frontmatter) that points to your splits. That README.md file is not prepared by save_to_disk.
  • save_to_disk will create a dataset_info.json and a state.json, but those don't do anything as far as the UI is concerned.
  • The UI ignores the .arrow files produced by save_to_disk and instead relies on a hierarchy of extensions it looks for while crawling the repository.

Do I have this correct? This was unexpected for me, but if this is the way it works, this is the way it works.

I have updated the README.md to reflect the arrow files, but it still reports the wrong number of rows:

https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext/raw/main/README.md

I deleted the data folder and re-uploaded only the Arrow files, with only those Arrow files listed in the config. That seems to have broken the auto-Parquet bot.

Job manager crashed while running this job (missing heartbeats).

Error code:   JobManagerCrashedError

At this point I'm not sure what is going wrong, but my hope, without delving deeply, is that at least loading the dataset using the HF library works.

Note that .push_to_hub() (see Main classes) does the whole job: it uploads the Dataset object to a repository on the Hub as Parquet files, and it prepares the README.md file accordingly.

For dataset_info.json and state.json, and more generally for the recommended way to prepare your dataset, I'll let @lhoestq or @albertvillanova answer, as I'm not an expert.

dataset_info.json and state.json are not meant to be uploaded to HF.

They are part of the save_to_disk() output, which is meant for local Arrow serialization of a dataset to be reused in the same environment.

You should use push_to_hub() if you want to upload to HF.
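
For example, a minimal sketch of the two paths, using a placeholder repo id and the train_full.parquet file from earlier in the thread:

from datasets import Dataset, load_dataset, load_from_disk
dataset = Dataset.from_parquet("./train_full.parquet")
# Local Arrow serialization: reload it in the same environment with load_from_disk
dataset.save_to_disk("./huggingface")
reloaded_locally = load_from_disk("./huggingface")
# Hub upload (requires being logged in, e.g. via huggingface-cli login):
# writes Parquet shards and prepares the README.md so the viewer can report
# the splits and row counts
dataset.push_to_hub("username/my-dataset")
reloaded_from_hub = load_dataset("username/my-dataset")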