Dataset viewer shows the wrong number of rows

Hi,
I’m experiencing a strange problem. I have a dataset with the type DatasetDict, here is its basic info:

DatasetDict({
    train: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
    valid: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
})

I use the following code to upload it to the Hub:

dataset_small.push_to_hub(
    hub_path, private=False, commit_message="Upload example dataset."
)

After it is converted to Parquet and uploaded to the Hub, the dataset viewer shows the wrong number of rows:

But if I use the following code to download the dataset:

from datasets import load_dataset
dataset_demo = load_dataset(hub_path)
print(dataset_demo)

I will get the correct info (1k/1k/1k for each split and 3k in total).
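
For reference, the per-split counts can also be read directly from the loaded DatasetDict; a quick check along these lines (using the same dataset_demo as above):

# Sketch: confirm per-split row counts on the downloaded copy.
print(dataset_demo.num_rows)                 # e.g. {'train': 1000, 'valid': 1000, 'test': 1000}
print(sum(dataset_demo.num_rows.values()))   # 3000 in total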

Is there something wrong with my usage of datasets and huggingface hub?

Update:
I also got the wrong number when I used the datasets-server API (as described in “Get the number of rows and the size in bytes”) to get the dataset size. It seems that the backend of the datasets-server API has a bug.
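
A minimal sketch of such a /size call (the dataset name is a placeholder, and the response field names follow my reading of the API docs):

# Sketch: query the datasets-server /size endpoint.
import requests

resp = requests.get(
    "https://datasets-server.huggingface.co/size",
    params={"dataset": "user/my-dataset"},  # placeholder repository id
)
resp.raise_for_status()
size_info = resp.json()
# The reported totals live under size.dataset (per my reading of the docs).
print(size_info["size"]["dataset"]["num_rows"])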

Looking forward to any suggestions and thanks a lot in advance!

well spotted, thanks for opening the issue (The API returns the wrong row number · Issue #2581 · huggingface/datasets-server · GitHub), we’re on it

fixed. Thanks again for the investigation!

Thank you for the reply and fix~

Still displaying the wrong number of rows: CaptionEmporium/coyo-hd-11m-llavanext · Datasets at Hugging Face

>>> len(df)
11397144

It is also autogenerating Parquet files for the Parquet files I already uploaded?

@lhoestq:

I guess we regenerate Parquet files because of the different row group size? The number of rows is estimated on the autogenerated Parquet files, which is why it’s not exact. In this case, we could get the exact value very easily, maybe something to fix?
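
For context, the exact row count is available from the Parquet footer without scanning the data, and the row group size can be set explicitly when writing. A minimal pyarrow sketch (file names and the row group size are placeholders):

# Sketch: read the exact row count from the Parquet footer, then rewrite
# the file with an explicit row group size.
import pyarrow.parquet as pq

meta = pq.ParquetFile("train_full.parquet").metadata
print(meta.num_rows, meta.num_row_groups)  # exact row count, number of row groups

table = pq.read_table("train_full.parquet")
pq.write_table(table, "train_rewritten.parquet", row_group_size=100_000)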

I used the datasets library, uploaded the output to the data folder, and it still doesn’t give the right number of rows.

import gc

import orjsonl
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import Dataset

# Load the gzipped JSONL and write it out as a single Parquet file.
foo = orjsonl.load('train.jsonl.gz')
df = pd.DataFrame(foo)
table = pa.Table.from_pandas(df)
pq.write_table(table, './train_full.parquet')

# Free the intermediate objects before reloading.
del foo, df, table
gc.collect()

# Reload the Parquet file with datasets and serialize it to Arrow on disk.
dataset = Dataset.from_parquet('./train_full.parquet')
dataset.save_to_disk('./huggingface')

The repository currently contains three kinds of data: Parquet files, json.gz files, and Arrow files. Moreover, the README YAML does not contain a configuration to help us determine which data files should be used for which split, so it’s a best guess using heuristics.


To get the expected results, you can add a configs: field to the README front matter. We have detailed documentation at Data files Configuration, and example datasets with the supported structures/configurations that you can replicate at datasets-examples.
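
For example, a minimal configs: block in the README.md front matter could look like this (split names and paths are placeholders; adapt them to your actual file layout):

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---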

I hope it helps.

I believe so. Just for people Googling in the future.

For a dataset repository to use the right data, you must consider:

  • You must have a YAML configuration inside the README markdown document (the front matter) that points to your training splits. This README.md is not prepared by save_to_disk.
  • save_to_disk will create a dataset_info.json and state.json, but those don’t do anything as far as the UI is concerned.
  • The UI ignores the .arrow files produced by save_to_disk and instead relies on a hierarchy of file extensions it finds while crawling the repository.

Do I have this correct? This was unexpected to me, but if this is the way it works, this is the way it works.

I have updated the README.md to reflect the arrow files, but it still reports the wrong number of rows:

https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext/raw/main/README.md

I deleted the data folder and re-uploaded only the arrow files, and only the arrow files listed in the config. That seems to have broken the auto-parquet bot.

Job manager crashed while running this job (missing heartbeats).

Error code:   JobManagerCrashedError

At this point I’m not sure what is going wrong, but my hope, without delving too deeply, is that at least loading the dataset with the HF library works.

Note that .push_to_hub() (Main classes) does the whole job: it uploads the Dataset object to a repository on the Hub as Parquet files, and it prepares the README.md file accordingly.

For dataset_info.json and state.json, and more generally on the recommended way to prepare your dataset, I’ll let @lhoestq or @albertvillanova answer, as I’m not an expert.

dataset_info.json and state.json are not meant to be uploaded to HF.

They are part of the save_to_disk() output which is meant for local Arrow serialization of a dataset to be reused in the same environment.
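
For example, that output is reloaded locally with load_from_disk(), not through the Hub (a minimal sketch; the path is the one used in the snippet above):

# Sketch: save_to_disk() output is meant to be reloaded locally with load_from_disk().
from datasets import load_from_disk

dataset = load_from_disk("./huggingface")  # directory written by save_to_disk() above
print(dataset.num_rows)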

You should use push_to_hub() if you want to upload to HF.
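
For instance, starting from the Parquet file built earlier, something along these lines should be enough (the repository id is a placeholder):

# Sketch: push_to_hub() uploads Parquet files and prepares the README.md metadata.
from datasets import Dataset

dataset = Dataset.from_parquet("./train_full.parquet")
dataset.push_to_hub("user/my-dataset")  # placeholder repository id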