Hi,
I have a DatasetDict with 2 splits and 4 features:
DatasetDict({
    train: Dataset({
        features: ['screenshot', 'raw_html', 'clean_html', 'metadata'],
        num_rows: 3996
    })
    test: Dataset({
        features: ['screenshot', 'raw_html', 'clean_html', 'metadata'],
        num_rows: 444
    })
})
The screenshot feature is a PIL image, the HTML features are strings, and the metadata feature is a dict whose values are strings and numbers.
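For reference, the schema looks roughly like this (simplified; the metadata keys below are just placeholders, the real ones are longer):

from datasets import Features, Image, Value

# Simplified sketch of the schema; the metadata keys are placeholders.
features = Features({
    'screenshot': Image(),             # PIL image
    'raw_html': Value('string'),
    'clean_html': Value('string'),
    'metadata': {                      # nested dict of strings and numbers
        'company': Value('string'),
        'filing_year': Value('int64'),
    },
})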
I am trying to push it to the Hub with:
dataset_dict.push_to_hub(
    'GeneralCognition/SEC_Tables_Lite',
    token='<MY_WRITE_TOKEN>',
    private=True
)
The Parquet files then show up in the Hub UI.
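I have mainly checked in the web UI, but I assume something like this would list the same files via the API:

from huggingface_hub import HfApi

# List the files in the (private) dataset repo; repo_type must be 'dataset'.
api = HfApi(token='<MY_READ_TOKEN>')
print(api.list_repo_files('GeneralCognition/SEC_Tables_Lite', repo_type='dataset'))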
But when I go to Colab and try to load the dataset:
!pip install datasets --quiet
from datasets import load_dataset
ds = load_dataset(
    'GeneralCognition/SEC_Tables_Lite',
    token='<MY_READ_TOKEN>',
    # verification_mode='no_checks',
    download_mode='force_redownload'
)
print(ds)
I get the following error (formatted for clarity):
NonMatchingSplitsSizesError: [
    {
        "expected": SplitInfo(
            name="train",
            num_bytes=157882062.6,
            num_examples=3996,
            shard_lengths=None,
            dataset_name=None,
        ),
        "recorded": SplitInfo(
            name="train",
            num_bytes=0,
            num_examples=0,
            shard_lengths=None,
            dataset_name="sec_tables_lite",
        ),
    },
    {
        "expected": SplitInfo(
            name="test",
            num_bytes=17542451.4,
            num_examples=444,
            shard_lengths=None,
            dataset_name=None,
        ),
        "recorded": SplitInfo(
            name="test",
            num_bytes=0,
            num_examples=0,
            shard_lengths=None,
            dataset_name="sec_tables_lite",
        ),
    },
]
If I instead load it with verification_mode='no_checks', I get an empty dataset:
DatasetDict({
    test: Dataset({
        features: ['screenshot', 'raw_html', 'clean_html', 'metadata'],
        num_rows: 0
    })
    train: Dataset({
        features: ['screenshot', 'raw_html', 'clean_html', 'metadata'],
        num_rows: 0
    })
})
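I am not sure whether reading the Parquet shards directly would behave differently; if it helps, this is roughly what I would try (the shard path pattern is a guess based on what push_to_hub normally creates under data/):

# Guessing at the shard locations; this bypasses the split size metadata entirely.
ds = load_dataset(
    'parquet',
    data_files={
        'train': 'hf://datasets/GeneralCognition/SEC_Tables_Lite/data/train-*.parquet',
        'test': 'hf://datasets/GeneralCognition/SEC_Tables_Lite/data/test-*.parquet',
    },
    token='<MY_READ_TOKEN>',
)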
This is my first time uploading a dataset to the Hub. Why is this not working, and what am I doing wrong? Thanks for the help.
Note: I did try using the same token for reading and writing, uploading a single Dataset instead of a DatasetDict, and adjusting max_shard_size (roughly as in the sketch below), all to no avail.
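For reference, the max_shard_size attempt looked roughly like this (the exact value is from memory):

dataset_dict.push_to_hub(
    'GeneralCognition/SEC_Tables_Lite',
    token='<MY_WRITE_TOKEN>',
    private=True,
    max_shard_size='200MB'
)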