Load a dataset that was automatically processed by AutoNLP

I'm trying to load a dataset that was automatically processed by AutoNLP, as shown in the example:

from datasets import load_dataset
dataset = load_dataset("giggio/autonlp-data-farm", use_auth_token=True)

When I run this command, I get this error:

ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
child 0, item: struct<filename: string>
child 0, filename: string
_fingerprint: string
_format_columns: list<item: string>
child 0, item: string
_format_kwargs: struct<>
_format_type: null
_indexes: struct<>
_indices_data_files: null
_output_all_columns: bool
_split: null
to
{'builder_name': Value(dtype='null', id=None), 'citation': Value(dtype='string', id=None), 'config_name': Value(dtype='null', id=None), 'dataset_size': Value(dtype='null', id=None), 'description': Value(dtype='string', id=None), 'download_checksums': Value(dtype='null', id=None), 'download_size': Value(dtype='null', id=None), 'features': {'tags': {'feature': {'num_classes': Value(dtype='int64', id=None), 'names': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'names_file': Value(dtype='null', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'length': Value(dtype='int64', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'tokens': {'feature': {'dtype': Value(dtype='string', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'length': Value(dtype='int64', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}}, 'homepage': Value(dtype='string', id=None), 'license': Value(dtype='string', id=None), 'post_processed': Value(dtype='null', id=None), 'post_processing_size': Value(dtype='null', id=None), 'size_in_bytes': Value(dtype='null', id=None), 'splits': {'train': {'name': Value(dtype='string', id=None), 'num_bytes': Value(dtype='int64', id=None), 'num_examples': Value(dtype='int64', id=None), 'dataset_name': Value(dtype='null', id=None)}}, 'supervised_keys': Value(dtype='null', id=None), 'task_templates': Valu…
because column names don't match

Is there anything I can do?

Hi! Currently, to load a dataset created with AutoNLP, you have to clone your dataset repository locally and then do:

from datasets import load_from_disk

d = load_from_disk("path/to/repository")
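For context, the clone step looks something like this (a sketch assuming git and git-lfs are installed; the URL follows the Hub's https://huggingface.co/datasets/<repo_id> convention):

# In a terminal (a private repo will prompt for your Hub credentials):
#   git lfs install
#   git clone https://huggingface.co/datasets/giggio/autonlp-data-farm

from datasets import load_from_disk

# Load the serialized Arrow data from the cloned repository folder
d = load_from_disk("autonlp-data-farm")
print(d)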

This is because AutoNLP stores datasets using on-disk serialization rather than the Hub format produced by push_to_hub (cc @sbrandeis).
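To make the distinction concrete, here is a minimal sketch of the two round trips (the repo name "username/my-dataset" is hypothetical):

from datasets import Dataset, load_dataset, load_from_disk

ds = Dataset.from_dict({"tokens": [["hello", "world"]], "tags": [[0, 1]]})

# On-disk serialization, which is what AutoNLP produces:
# save_to_disk writes Arrow files plus info/state JSON, read back with load_from_disk
ds.save_to_disk("my_dataset")
d1 = load_from_disk("my_dataset")

# Hub format: push_to_hub uploads data files that load_dataset can read directly
# ds.push_to_hub("username/my-dataset")  # hypothetical repo; requires authentication
# d2 = load_dataset("username/my-dataset", use_auth_token=True)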