Dataset preview not showing for uploaded DatasetDict

I created a DatasetDict and pushed it here.

I’m getting the message

Server Error
Status code:   400
Exception:     Status400Error
Message:       could not get the config name for this dataset

Am I supposed to create a config file somewhere that I missed so the dataset viewer works?

Thanks!

Hey @dansbecker as far as I know, the dataset viewer currently supports datasets with a loading script (example) or raw data files in common formats like JSON and CSV (example).

I agree that being able to view the contents of DatasetDict objects would be a nice feature, so I’m tagging @lhoestq and @severo in case they have any additional insights here :slight_smile:

I created an issue on GitHub: Dataset viewer issue for `dansbecker/hackernews_hiring_posts` (Issue #3392, huggingface/datasets)

Hi! Please use my_dataset.push_to_hub() to save your dataset on the Hub. To reload it, you can then use load_dataset(). You can see the documentation here.

Datasets saved with save_to_disk and uploaded manually to the Hub are not supported (yet). This is because saving locally uses the Arrow format. While this format allows a dataset to be reloaded immediately, it’s not the preferred format for storing data in the cloud, since it’s uncompressed and therefore requires more bandwidth.

Do you all have a preference for continuing the conversation here vs in the issue?

I’d like to make sure I understand the advice above:

I tried the steps @lhoestq suggested

repo_url = 'https://huggingface.co/datasets/dansbecker/hackernews_hiring_posts'
repo = Repository(local_dir=".", clone_from=repo_url)
all_datasets.push_to_hub(repo)

That gives an error AttributeError: 'Repository' object has no attribute 'split'

I assume the split attribute is specified in the loading script that @lewtun mentioned? I see a _split_generators method in that example loading script. Does that create the split attribute in some way I’m missing?

Hi,

the repo argument should be of type str, so try this instead: all_datasets.push_to_hub("hackernews_hiring_posts")
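
To make the fix concrete, here is a minimal sketch of a guard around push_to_hub. The helper name push_dataset is hypothetical (it is not part of the datasets library); it only checks that the repo argument is a plain string before delegating to the dataset's own push_to_hub method. The AttributeError above most likely comes from push_to_hub calling the string method .split() on its argument, which a Repository object does not have, rather than from anything defined in a loading script.

```python
def push_dataset(dataset, repo_id):
    """Hypothetical helper: push `dataset` to the Hub under `repo_id`.

    `dataset` is assumed to be a Dataset or DatasetDict (anything that
    exposes a `push_to_hub` method). `repo_id` must be a string such as
    "hackernews_hiring_posts", not a `Repository` object.
    """
    if not isinstance(repo_id, str):
        raise TypeError(
            f"repo_id must be a str, got {type(repo_id).__name__}; "
            "pass the repository name instead of a Repository object"
        )
    # Delegate to the library method; this is the call suggested above.
    dataset.push_to_hub(repo_id)
    return repo_id


# Usage (requires being logged in to the Hub, so it is left commented out):
# push_dataset(all_datasets, "hackernews_hiring_posts")
```

With this guard, passing a Repository object fails immediately with a readable TypeError instead of the less obvious AttributeError about 'split'.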

Thanks @mariosasko. That works great.