Load a dataset that was automatically processed by AutoNLP

I’m trying to load a dataset that was automatically processed by AutoNLP, as shown in the example:

from datasets import load_dataset
dataset = load_dataset("giggio/autonlp-data-farm", use_auth_token=True)

When I run this command, I get this error:

ValueError: Couldn’t cast
_data_files: list<item: struct<filename: string>>
child 0, item: struct<filename: string>
child 0, filename: string
_fingerprint: string
_format_columns: list<item: string>
child 0, item: string
_format_kwargs: struct<>
_format_type: null
_indexes: struct<>
_indices_data_files: null
_output_all_columns: bool
_split: null
to
{'builder_name': Value(dtype='null', id=None), 'citation': Value(dtype='string', id=None), 'config_name': Value(dtype='null', id=None), 'dataset_size': Value(dtype='null', id=None), 'description': Value(dtype='string', id=None), 'download_checksums': Value(dtype='null', id=None), 'download_size': Value(dtype='null', id=None), 'features': {'tags': {'feature': {'num_classes': Value(dtype='int64', id=None), 'names': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'names_file': Value(dtype='null', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'length': Value(dtype='int64', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'tokens': {'feature': {'dtype': Value(dtype='string', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}, 'length': Value(dtype='int64', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}}, 'homepage': Value(dtype='string', id=None), 'license': Value(dtype='string', id=None), 'post_processed': Value(dtype='null', id=None), 'post_processing_size': Value(dtype='null', id=None), 'size_in_bytes': Value(dtype='null', id=None), 'splits': {'train': {'name': Value(dtype='string', id=None), 'num_bytes': Value(dtype='int64', id=None), 'num_examples': Value(dtype='int64', id=None), 'dataset_name': Value(dtype='null', id=None)}}, 'supervised_keys': Value(dtype='null', id=None), 'task_templates': Valu…
because column names don’t match

Is there anything I can do?

Hi! Currently, to load a dataset created with AutoNLP, you have to clone your dataset repository locally and then do:

from datasets import load_from_disk

d = load_from_disk("path/to/repository")

This is because AutoNLP serializes datasets to disk (the format read by load_from_disk) instead of uploading them in the Hub format produced by push_to_hub (cc @sbrandeis).
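
For completeness, here is a minimal end-to-end sketch. The clone URL follows the usual Hub pattern for the repository named in the question; the local folder name, and the assumption that the serialized dataset sits at the root of the clone, are mine:

# In a shell, clone the dataset repository first. git-lfs and your Hub
# credentials are needed, since AutoNLP data repositories are private:
#   git lfs install
#   git clone https://huggingface.co/datasets/giggio/autonlp-data-farm

from datasets import load_from_disk

# Load the on-disk Arrow serialization from the cloned folder
# (adjust the path if the dataset lives in a subdirectory).
d = load_from_disk("autonlp-data-farm")

# d behaves like a regular Dataset/DatasetDict, so you can inspect
# the splits and the "tokens"/"tags" columns mentioned in the error:
print(d)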