Hi,
I'm experiencing a strange problem. I have a dataset of type DatasetDict; here is its basic info:
DatasetDict({
    train: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
    valid: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['subject', 'grade', 'skill', 'pic_choice', 'pic_prob', 'problem', 'problem_pic', 'choices', 'choices_pic', 'answer_idx'],
        num_rows: 1000
    })
})
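For reference, the counts check out locally before uploading (a minimal sketch; dataset_small is the DatasetDict shown above):

for split, ds in dataset_small.items():
    # DatasetDict behaves like a dict of split name -> Dataset
    print(split, ds.num_rows)  # train 1000 / valid 1000 / test 1000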
I use the following code to upload it to the Hub:
dataset_small.push_to_hub(
    hub_path, private=False, commit_message="Upload example dataset."
)
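To double-check what push_to_hub actually wrote, I can also list the files in the repo (a sketch using huggingface_hub's HfApi; hub_path is the same repo id as above):

from huggingface_hub import HfApi

api = HfApi()
# The Parquet shards created by push_to_hub should show up here
for filename in api.list_repo_files(hub_path, repo_type="dataset"):
    print(filename)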
After it is converted to Parquet and uploaded to the Hub, the dataset viewer shows the wrong number of rows.
But if I use the following code to download the dataset:
from datasets import load_dataset
dataset_demo = load_dataset(hub_path)
print(dataset_demo)
I get the correct info (1k/1k/1k for each split, 3k rows in total).
Is there something wrong with my usage of the datasets library and the Hugging Face Hub?
Update:
I also got the wrong number when I used the datasets-server API (the "Get the number of rows and the size in bytes" endpoint) to fetch the dataset size. It seems the datasets-server backend has a bug.
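For completeness, this is roughly how I queried it (a sketch of the /size endpoint from that docs page; the response layout below follows the docs and may change):

import requests

# datasets-server endpoint described in "Get the number of rows and the size in bytes"
API_URL = "https://datasets-server.huggingface.co/size"
resp = requests.get(API_URL, params={"dataset": hub_path})
resp.raise_for_status()
size_info = resp.json()
# Per-split row counts are reported under size -> splits
for split in size_info["size"]["splits"]:
    print(split["split"], split["num_rows"])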
Looking forward to any suggestions and thanks a lot in advance!