Can't load script-based dataset, clearly I'm doing something wrong

So I followed the documentation for dataset loading from script as closely as I could. I’ve got a dataset that loads from compressed numpy files (npz) into Array2D features ultimately output as PyTorch tensors.

I can run the dataset test & metadata generation just fine, but then when I actually try to load the dataset using:

ds = datasets.load_dataset('./asl_embeddings/', "default")

I get a YAML exception deep in the library code:

File "~/project/dataset_test.py", line 3, in <module>
    ds = datasets.load_dataset('./asl_embeddings/', "default")
  File "~/project/venv/lib/python3.9/site-packages/datasets/load.py", line 2128, in load_dataset
    builder_instance = load_dataset_builder(
  File "~/project/venv/lib/python3.9/site-packages/datasets/load.py", line 1851, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "~/project/venv/lib/python3.9/site-packages/datasets/builder.py", line 383, in __init__
    info = self.get_exported_dataset_info()
  File "~/project/venv/lib/python3.9/site-packages/datasets/builder.py", line 507, in get_exported_dataset_info
    return self.get_all_exported_dataset_infos().get(self.config.name, DatasetInfo())
  File "~/project/venv/lib/python3.9/site-packages/datasets/builder.py", line 493, in get_all_exported_dataset_infos
    return DatasetInfosDict.from_directory(cls.get_imported_module_dir())
  File "~/project/venv/lib/python3.9/site-packages/datasets/info.py", line 430, in from_directory
    dataset_card_data = DatasetCard.load(Path(dataset_infos_dir) / "README.md").data
  File "~/project/venv/lib/python3.9/site-packages/huggingface_hub/repocard.py", line 186, in load
    return cls(f.read(), ignore_metadata_errors=ignore_metadata_errors)
  File "~/project/venv/lib/python3.9/site-packages/huggingface_hub/repocard.py", line 77, in __init__
    self.content = content
  File "~/project/venv/lib/python3.9/site-packages/huggingface_hub/repocard.py", line 95, in content
    data_dict = yaml.safe_load(yaml_block)
  File "~/project/venv/lib/python3.9/site-packages/yaml/__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
  File "~/project/venv/lib/python3.9/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "~/project/venv/lib/python3.9/site-packages/yaml/constructor.py", line 51, in get_single_data
    return self.construct_document(node)
  File "~/project/venv/lib/python3.9/site-packages/yaml/constructor.py", line 60, in construct_document
    for dummy in generator:
  File "~/project/venv/lib/python3.9/site-packages/yaml/constructor.py", line 413, in construct_yaml_map
    value = self.construct_mapping(node)
  File "~/project/venv/lib/python3.9/site-packages/yaml/constructor.py", line 218, in construct_mapping
    return super().construct_mapping(node, deep=deep)
  File "~/project/venv/lib/python3.9/site-packages/yaml/constructor.py", line 143, in construct_mapping
    value = self.construct_object(value_node, deep=deep)
  File "~/project/venv/lib/python3.9/site-packages/yaml/constructor.py", line 100, in construct_object
    data = constructor(self, node)
  File "~/project/venv/lib/python3.9/site-packages/yaml/constructor.py", line 427, in construct_undefined
    raise ConstructorError(None, None,
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'
  in "<unicode string>", line 10, column 16:
            shape: !!python/tuple

It seems to be choking on the metadata declaring the 2D array's shape, but I don’t understand the nitty-gritty enough to grok it. Any thoughts on what I’m doing wrong?
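To narrow it down, here is a minimal sketch that scans the YAML front matter of a README for Python-specific tags like the one in the error. The helper name find_python_tags and the sample metadata (feature name, key names, shape values) are made up for illustration; my real README will differ.

```python
def find_python_tags(readme_text):
    """Return (line_number, line) pairs from the YAML front matter that carry a !!python/ tag."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter block at the top of the file
    hits = []
    for number, line in enumerate(lines[1:], start=2):
        if line.strip() == "---":
            break  # closing delimiter: the front matter ends here
        if "!!python/" in line:
            hits.append((number, line.strip()))
    return hits


# Hypothetical metadata, shaped like what the error message points at.
sample = """---
dataset_info:
  features:
  - name: embedding
    array2_d:
      shape: !!python/tuple
      - 16
      - 512
      dtype: float32
---
# ASL embeddings
"""

print(find_python_tags(sample))  # [(6, 'shape: !!python/tuple')]
```

Any hit is a line that yaml.safe_load will refuse, since the safe loader has no constructor for Python-specific tags.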

Hey @amharrison, for a cleaner workaround, you could go into your Hugging Face dataset repo and remove the line that has !!python/tuple. That way you’ll never have to go into the cache manually to remove the metadata again. Here’s how I do it, in case it helps.
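If you would rather script the cleanup, here is a rough sketch of one way to read that advice. It assumes the tag sits in the YAML front matter of the dataset's README.md (the file the traceback shows being loaded), and that dropping only the tag while keeping the key is enough, so the items underneath parse as a plain list. The function names and the README path are hypothetical, and I have not checked that every datasets version accepts a list where it wrote a tuple.

```python
import re
from pathlib import Path

# Matches the tag, plus surrounding whitespace, at the end of a line.
TUPLE_TAG = re.compile(r"\s*!!python/tuple\s*$")


def strip_tuple_tags(readme_text):
    """Drop !!python/tuple tags inside the YAML front matter; leave the body untouched."""
    lines = readme_text.split("\n")
    if not lines or lines[0].strip() != "---":
        return readme_text  # no front matter, nothing to do
    in_front_matter = True
    cleaned = [lines[0]]
    for line in lines[1:]:
        if in_front_matter and line.strip() == "---":
            in_front_matter = False  # closing delimiter reached
        elif in_front_matter:
            line = TUPLE_TAG.sub("", line)  # "shape: !!python/tuple" -> "shape:"
        cleaned.append(line)
    return "\n".join(cleaned)


def fix_readme(path):
    """Rewrite the README at `path` in place (hypothetical helper; back up the file first)."""
    readme = Path(path)
    readme.write_text(strip_tuple_tags(readme.read_text(encoding="utf-8")), encoding="utf-8")
```

For example, fix_readme("./asl_embeddings/README.md") would clean the local copy, assuming that is where the metadata lives. Regenerating the metadata may write the tag back, in which case the cleanup has to be repeated.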