Hi,
I have a DatasetDict with 2 splits and 4 features:
DatasetDict({
    train: Dataset({
        features: ['screenshot', 'raw_html', 'clean_html', 'metadata'],
        num_rows: 3996
    })
    test: Dataset({
        features: ['screenshot', 'raw_html', 'clean_html', 'metadata'],
        num_rows: 444
    })
})
The screenshot feature is a PIL image, the HTML features are strings, and the metadata feature is a dict whose values are strings and numbers.
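For reference, the schema looks roughly like this (simplified; the metadata keys below are just placeholders, the real ones are longer):

from datasets import Features, Image, Value

# Simplified sketch of the schema; the metadata keys are placeholders.
features = Features({
    'screenshot': Image(),             # PIL image
    'raw_html': Value('string'),
    'clean_html': Value('string'),
    'metadata': {                      # nested dict of strings and numbers
        'company': Value('string'),
        'filing_year': Value('int64'),
    },
})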
I am trying to push it to the Hub with:
dataset_dict.push_to_hub(
    'GeneralCognition/SEC_Tables_Lite',
    token='<MY_WRITE_TOKEN>',
    private=True
)
The Parquet files then show up in the Hub UI.
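I have mainly checked in the web UI, but I assume something like this would list the same files via the API:

from huggingface_hub import HfApi

# List the files in the (private) dataset repo; repo_type must be 'dataset'.
api = HfApi(token='<MY_READ_TOKEN>')
print(api.list_repo_files('GeneralCognition/SEC_Tables_Lite', repo_type='dataset'))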
But when I go to Colab and try to load the dataset:
!pip install datasets --quiet
from datasets import load_dataset
ds = load_dataset(
    'GeneralCognition/SEC_Tables_Lite',
    token='<MY_READ_TOKEN>',
    # verification_mode='no_checks',
    download_mode='force_redownload'
)
print(ds)
I get the following error (formatted for clarity):
NonMatchingSplitsSizesError: [
    {
        "expected": SplitInfo(
            name="train",
            num_bytes=157882062.6,
            num_examples=3996,
            shard_lengths=None,
            dataset_name=None,
        ),
        "recorded": SplitInfo(
            name="train",
            num_bytes=0,
            num_examples=0,
            shard_lengths=None,
            dataset_name="sec_tables_lite",
        ),
    },
    {
        "expected": SplitInfo(
            name="test",
            num_bytes=17542451.4,
            num_examples=444,
            shard_lengths=None,
            dataset_name=None,
        ),
        "recorded": SplitInfo(
            name="test",
            num_bytes=0,
            num_examples=0,
            shard_lengths=None,
            dataset_name="sec_tables_lite",
        ),
    },
]
If I instead load it with verification_mode='no_checks', I get an empty dataset:
DatasetDict({
    test: Dataset({
        features: ['screenshot', 'raw_html', 'clean_html', 'metadata'],
        num_rows: 0
    })
    train: Dataset({
        features: ['screenshot', 'raw_html', 'clean_html', 'metadata'],
        num_rows: 0
    })
})
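I am not sure whether reading the Parquet shards directly would behave differently; if it helps, this is roughly what I would try (the shard path pattern is a guess based on what push_to_hub normally creates under data/):

# Guessing at the shard locations; this bypasses the split size metadata entirely.
ds = load_dataset(
    'parquet',
    data_files={
        'train': 'hf://datasets/GeneralCognition/SEC_Tables_Lite/data/train-*.parquet',
        'test': 'hf://datasets/GeneralCognition/SEC_Tables_Lite/data/test-*.parquet',
    },
    token='<MY_READ_TOKEN>',
)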
This is my first time uploading a dataset to the Hub. Why is this not working, and what am I doing wrong? Thanks for the help.
Note: I did try using the same token for reading and writing, uploading a single Dataset instead of a DatasetDict, and adjusting max_shard_size (roughly as in the sketch below), all to no avail.
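For reference, the max_shard_size attempt looked roughly like this (the exact value is from memory):

dataset_dict.push_to_hub(
    'GeneralCognition/SEC_Tables_Lite',
    token='<MY_WRITE_TOKEN>',
    private=True,
    max_shard_size='200MB'
)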