Load a dataset that was automatically processed by AutoNLP

I’m trying to load a dataset that was automatically processed by AutoNLP, as shown in the example:

from datasets import load_dataset
dataset = load_dataset("giggio/autonlp-data-farm", use_auth_token=True)

When I run this command, I get this error:

ValueError: Couldn’t cast
_data_files: list<item: struct<filename: string>>
child 0, item: struct<filename: string>
child 0, filename: string
_fingerprint: string
_format_columns: list<item: string>
child 0, item: string
_format_kwargs: struct<>
_format_type: null
_indexes: struct<>
_indices_data_files: null
_output_all_columns: bool
_split: null
to
{'builder_name': Value(dtype='null', id=None), 'citation': Value(dtype='string', id=None), 'config_name': Value(dtype='null', id=None), 'dataset_size': Value(dtype='null', id=None), 'description': Value(dtype='string', id=None), 'download_checksums': Value(dtype='null', id=None), 'download_size': Value(dtype='null', id=None), 'features': {'tags': {'feature': {'num_classes': Value(dtype='int64', id=None), 'names': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'names_file': Value(dtype='null', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'length': Value(dtype='int64', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'tokens': {'feature': {'dtype': Value(dtype='string', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'length': Value(dtype='int64', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}}, 'homepage': Value(dtype='string', id=None), 'license': Value(dtype='string', id=None), 'post_processed': Value(dtype='null', id=None), 'post_processing_size': Value(dtype='null', id=None), 'size_in_bytes': Value(dtype='null', id=None), 'splits': {'train': {'name': Value(dtype='string', id=None), 'num_bytes': Value(dtype='int64', id=None), 'num_examples': Value(dtype='int64', id=None), 'dataset_name': Value(dtype='null', id=None)}}, 'supervised_keys': Value(dtype='null', id=None), 'task_templates': Valu…
because column names don’t match

Is there anything I can do?

Hi! Currently, to load a dataset created with AutoNLP, you have to clone your dataset repository locally and then do:

from datasets import load_from_disk

d = load_from_disk("path/to/repository")

This is because AutoNLP serializes datasets to disk (the format read by load_from_disk) instead of uploading them in the Hub format produced by push_to_hub (cc @sbrandeis).
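
For completeness, here is a minimal end-to-end sketch. The clone URL follows the usual Hub pattern for the repository named in the question; the local folder name, and the assumption that the serialized dataset sits at the root of the clone, are mine:

# In a shell, clone the dataset repository first. git-lfs and your Hub
# credentials are needed, since AutoNLP data repositories are private:
#   git lfs install
#   git clone https://huggingface.co/datasets/giggio/autonlp-data-farm

from datasets import load_from_disk

# Load the on-disk Arrow serialization from the cloned folder
# (adjust the path if the dataset lives in a subdirectory).
d = load_from_disk("autonlp-data-farm")

# d behaves like a regular Dataset/DatasetDict, so you can inspect
# the splits and the "tokens"/"tags" columns mentioned in the error:
print(d)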