Dataset viewer shows the wrong number of rows

Hi,
I'm experiencing a strange problem. I have a dataset of type DatasetDict; here is its basic info:

DatasetDict({
    train: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
    valid: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
})

I use the following code to upload it to the Hub:

dataset_small.push_to_hub(
    hub_path, private=False, commit_message="Upload example dataset."
)

After it is converted to Parquet and uploaded to the Hub, the dataset viewer shows the wrong number of rows.

But if I use the following code to download the dataset:

from datasets import load_dataset
dataset_demo = load_dataset(hub_path)
print(dataset_demo)

I will get the correct info (1k/1k/1k for each split and 3k in total).

Is there something wrong with my usage of datasets and the Hugging Face Hub?

Update:
I also got the wrong number when I used the datasets-server API described in "Get the number of rows and the size in bytes" to fetch the dataset size. It seems the backend of the datasets-server API has a bug.
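
For reference, a minimal sketch of the call I made, assuming the /size endpoint described on that page (the dataset name below is a placeholder):

import requests
# Ask the dataset viewer backend for the reported size of a dataset.
# Replace "username/my-dataset" with the actual repo id.
API_URL = "https://datasets-server.huggingface.co/size"
response = requests.get(API_URL, params={"dataset": "username/my-dataset"})
response.raise_for_status()
payload = response.json()
# The totals (including the row count) are reported under payload["size"];
# inspect the raw payload if the field names differ in your API version.
print(payload["size"]["dataset"])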

Looking forward to any suggestions and thanks a lot in advance!

well spotted, thanks for opening the issue (The API returns the wrong row number · Issue #2581 · huggingface/datasets-server · GitHub), we're on it

fixed. Thanks again for the investigation!

Thank you for the reply and fix~

Still displaying the wrong number of rows: CaptionEmporium/coyo-hd-11m-llavanext · Datasets at Hugging Face

>>> len(df)
11397144

Is it also autogenerating Parquet files for the Parquet files I already uploaded?

@lhoestq:

I guess we regenerate Parquet files because of the different row group size? The number of rows is estimated on the autogenerated Parquet files, which is why it's not exact. In this case, we could get the exact value very easily, maybe something to fix?
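
For what it's worth, the exact count is stored in each Parquet file's footer, so it can be checked locally without scanning the data. A minimal sketch with pyarrow, assuming a local copy of one of the uploaded Parquet files (the filename is a placeholder):

import pyarrow.parquet as pq
# The Parquet footer records the total row count and the row group layout,
# so this reads only metadata, not the data pages themselves.
parquet_file = pq.ParquetFile("train_full.parquet")
print(parquet_file.metadata.num_rows)        # exact number of rows
print(parquet_file.metadata.num_row_groups)  # row groups (the viewer regenerates files with its own row group size)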

I used the datasets library, uploaded the result to the data/ folder, and it still doesn't give the right number of rows.

import gc
import orjsonl
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import Dataset
# Load the compressed JSONL and write it out as a single Parquet file
foo = orjsonl.load('train.jsonl.gz')
df = pd.DataFrame(foo)
table = pa.Table.from_pandas(df)
pq.write_table(table, './train_full.parquet')
del foo, df, table
gc.collect()
# Build a Dataset from the Parquet file and serialize it as Arrow files on disk
dataset = Dataset.from_parquet('./train_full.parquet')
dataset.save_to_disk('./huggingface')

The repository currently contains three kinds of data: Parquet files, json.gz files, and Arrow files. Moreover, the README YAML does not contain a configuration to help us determine which data files should be used for which split, so it's a best guess using heuristics.


To get the expected results, you can add a configs: field to the README frontmatter. We have detailed documentation at Data files Configuration, and example datasets with the supported structures/configurations that you can replicate at datasets-examples.
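
For example, a minimal frontmatter sketch, assuming hypothetical Parquet files under data/ with one pattern per split (adjust the config name, split names, and path patterns to your actual layout):

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: validation
    path: data/valid-*.parquet
  - split: test
    path: data/test-*.parquet
---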

I hope it helps.


I believe so. Just for people Googling this in the future:

For a dataset repository to use the right data, you must consider:

  • You must have a YAML configuration inside the README markdown document (the frontmatter) that points to your splits. That README.md file is not prepared by save_to_disk.
  • save_to_disk will create a dataset_info.json and a state.json, but those don't do anything as far as the UI is concerned.
  • The UI ignores the .arrow files produced by save_to_disk and instead relies on a hierarchy of extensions it looks for while crawling the repository.

Do I have this correct? This was unexpected for me, but if this is the way it works, this is the way it works.

I have updated the README.md to reflect the arrow files, but it still reports the wrong number of rows:

https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext/raw/main/README.md

I deleted the data folder and re-uploaded only the Arrow files, with only those Arrow files listed in the config. That seems to have broken the auto-Parquet bot.

Job manager crashed while running this job (missing heartbeats).

Error code:   JobManagerCrashedError

At this point I'm not sure what is going wrong, but my hope, without delving deeply, is that at least loading the dataset using the HF library works.

Note that .push_to_hub() (see Main classes) does the whole job: it uploads the Dataset object to a repository on the Hub as Parquet files, and it prepares the README.md file accordingly.

For dataset_info.json and state.json, and more generally for the recommended way to prepare your dataset, I'll let @lhoestq or @albertvillanova answer, as I'm not an expert.

dataset_info.json and state.json are not meant to be uploaded to HF.

They are part of the save_to_disk() output, which is meant for local Arrow serialization of a dataset to be reused in the same environment.

You should use push_to_hub() if you want to upload to HF.
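
For example, a minimal sketch of the two paths, using a placeholder repo id and the train_full.parquet file from earlier in the thread:

from datasets import Dataset, load_dataset, load_from_disk
dataset = Dataset.from_parquet("./train_full.parquet")
# Local Arrow serialization: reload it in the same environment with load_from_disk
dataset.save_to_disk("./huggingface")
reloaded_locally = load_from_disk("./huggingface")
# Hub upload (requires being logged in, e.g. via huggingface-cli login):
# writes Parquet shards and prepares the README.md so the viewer can report
# the splits and row counts
dataset.push_to_hub("username/my-dataset")
reloaded_from_hub = load_dataset("username/my-dataset")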