Datasets-cli test failed when generating metadata due to the use of Array2D

jherng · December 30, 2023, 7:40am

Hi, I’ve followed the documentation closely to write my dataset loading script. Loading the dataset with datasets.load_dataset works fine.

But when I attempted to generate the dataset metadata the way specified in this link, the following error occurred:

➜  datasets-cli test jherng/xd-violence --save_info --all_configs
Loading Dataset Infos from C:\Users\Jia Herng\.cache\huggingface\modules\datasets_modules\datasets\jherng--xd-violence\364a20a2942b6ff05e759ca668d2770b88448d3a7aaff11abb07ede7a7b56f8e
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\Jia Herng\.cache\huggingface\datasets/jherng___xd-violence/video/0.0.0/364a20a2942b6ff05e759ca668d2770b88448d3a7aaff11abb07ede7a7b56f8e
Testing builder 'video' (1/4)
Generating dataset xd-violence (C:/Users/Jia Herng/.cache/huggingface/datasets/jherng___xd-violence/video/0.0.0/364a20a2942b6ff05e759ca668d2770b88448d3a7aaff11abb07ede7a7b56f8e)
Downloading and preparing dataset xd-violence/video (download: 79.64 GiB, generated: 929.76 KiB, post-processed: Unknown size, total: 79.64 GiB) to C:/Users/Jia Herng/.cache/huggingface/datasets/jherng___xd-violence/video/0.0.0/364a20a2942b6ff05e759ca668d2770b88448d3a7aaff11abb07ede7a7b56f8e...
Downloading took 0.0 min
Checksum Computation took 0.0 min
Downloading took 0.0 min
Checksum Computation took 0.0 min
Downloading took 0.0 min
Checksum Computation took 0.0 min
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████| 3950/3950 [00:03<00:00, 1192.69it/s]
Downloading took 0.0 min
Checksum Computation took 0.0 min
Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████| 800/800 [00:00<00:00, 1220.36it/s]
Downloading took 0.0 min
Checksum Computation took 0.0 min
Generating train split
Generating train split: 100%|███████████████████████████████████████████████████████████████████████████| 3950/3950 [00:00<00:00, 21280.08 examples/s]
Generating test split
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████████| 800/800 [00:00<00:00, 10255.68 examples/s]
Dataset xd-violence downloaded and prepared to C:/Users/Jia Herng/.cache/huggingface/datasets/jherng___xd-violence/video/0.0.0/364a20a2942b6ff05e759ca668d2770b88448d3a7aaff11abb07ede7a7b56f8e. Subsequent calls will reuse this data.
Loading Dataset Infos from C:\Users\Jia Herng\.cache\huggingface\modules\datasets_modules\datasets\jherng--xd-violence\364a20a2942b6ff05e759ca668d2770b88448d3a7aaff11abb07ede7a7b56f8e
Dataset card saved at C:\Users\Jia Herng\.cache\huggingface\modules\datasets_modules\datasets\jherng--xd-violence\364a20a2942b6ff05e759ca668d2770b88448d3a7aaff11abb07ede7a7b56f8e\README.md
Loading Dataset Infos from C:\Users\Jia Herng\.cache\huggingface\modules\datasets_modules\datasets\jherng--xd-violence\364a20a2942b6ff05e759ca668d2770b88448d3a7aaff11abb07ede7a7b56f8e
Traceback (most recent call last):
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\Scripts\datasets-cli.exe\__main__.py", line 7, in <module>
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\datasets\commands\datasets_cli.py", line 39, in main
    service.run()
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\datasets\commands\test.py", line 141, in run
    for j, builder in enumerate(get_builders()):
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\datasets\commands\test.py", line 124, in get_builders
    yield builder_cls(
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\datasets\builder.py", line 383, in __init__
    info = self.get_exported_dataset_info()
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\datasets\builder.py", line 507, in get_exported_dataset_info
    return self.get_all_exported_dataset_infos().get(self.config.name, DatasetInfo())
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\datasets\builder.py", line 493, in get_all_exported_dataset_infos
    return DatasetInfosDict.from_directory(cls.get_imported_module_dir())
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\datasets\info.py", line 430, in from_directory
    dataset_card_data = DatasetCard.load(Path(dataset_infos_dir) / "README.md").data
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\huggingface_hub\repocard.py", line 186, in load
    return cls(f.read(), ignore_metadata_errors=ignore_metadata_errors)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\huggingface_hub\repocard.py", line 77, in __init__
    self.content = content
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\huggingface_hub\repocard.py", line 95, in content
    data_dict = yaml.safe_load(yaml_block)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\__init__.py", line 81, in load
    return loader.get_single_data()
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\constructor.py", line 51, in get_single_data
    return self.construct_document(node)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\constructor.py", line 60, in construct_document
    for dummy in generator:
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\constructor.py", line 413, in construct_yaml_map
    value = self.construct_mapping(node)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\constructor.py", line 218, in construct_mapping
    return super().construct_mapping(node, deep=deep)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\constructor.py", line 143, in construct_mapping
    value = self.construct_object(value_node, deep=deep)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\constructor.py", line 100, in construct_object
    data = constructor(self, node)
  File "C:\Users\Jia Herng\miniconda3\envs\fyp-env\lib\site-packages\yaml\constructor.py", line 427, in construct_undefined
    raise ConstructorError(None, None,
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'
  in "<unicode string>", line 10, column 16:
            shape: !!python/tuple
                   ^

I suspect it’s due to the use of Array2D in the dataset: its shape is written to the metadata file with a “!!python/tuple” tag, and the underlying datasets implementation reads the file back with yaml.safe_load(), which rejects that tag and raises this error.

This is the partially generated dataset metadata README.md:
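A minimal sketch that reproduces the suspected mechanism with PyYAML alone, outside of datasets. It assumes the metadata is written with PyYAML's default (non-safe) dumper; the dictionary and its shape value below are hypothetical stand-ins, not the real dataset info.

```python
import yaml

# Hypothetical stand-in for a feature whose shape is stored as a Python tuple.
info = {"shape": (1, 2048), "dtype": "float32"}

# The default dumper tags tuples with a Python-specific YAML tag.
dumped = yaml.dump(info)
print(dumped)

# safe_load only accepts standard YAML tags, so it rejects the tuple tag.
try:
    yaml.safe_load(dumped)
    safe_load_failed = False
except yaml.constructor.ConstructorError as err:
    safe_load_failed = True
    print(err)

# Storing the shape as a plain list round-trips through the safe API.
portable = yaml.safe_dump({"shape": list(info["shape"]), "dtype": "float32"})
roundtrip = yaml.safe_load(portable)
```

If this matches what happens inside datasets, it would explain why deleting the tag from the README is enough: without the tag, the shape is read back as an ordinary YAML list.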

---
dataset_info:
- config_name: i3d_rgb
  features:
  - name: id
    dtype: string
  - name: feature
    dtype:
      array2_d:
        shape: !!python/tuple
        - 2048
        dtype: float32
  - name: binary_target
    dtype:
      class_label:
        names:
          '0': Non-violence
          '1': Violence
  - name: multilabel_target
    sequence:
      class_label:
        names:
          '0': Normal
          '1': Fighting
          '2': Shooting
          '3': Riot
          '4': Abuse
          '5': Car accident
          '6': Explosion
  - name: frame_annotations
    sequence:
    - name: start
      dtype: int32
    - name: end
      dtype: int32
  splits:
  - name: train
    num_bytes: 10535081525
    num_examples: 19750
  - name: test
    num_bytes: 1512537525
    num_examples: 4000
  download_size: 12040668091
  dataset_size: 12047619050
- config_name: video
  features:
  - name: id
    dtype: string
  - name: path
    dtype: string
  - name: binary_target
    dtype:
      class_label:
        names:
          '0': Non-violence
          '1': Violence
  - name: multilabel_target
    sequence:
      class_label:
        names:
          '0': Normal
          '1': Fighting
          '2': Shooting
          '3': Riot
          '4': Abuse
          '5': Car accident
          '6': Explosion
  - name: frame_annotations
    sequence:
    - name: start
      dtype: int32
    - name: end
      dtype: int32
  splits:
  - name: train
    num_bytes: 782565
    num_examples: 3950
  - name: test
    num_bytes: 169505
    num_examples: 800
  download_size: 85510639707
  dataset_size: 952070
---

Appreciate any help!

amharrison · December 30, 2023, 3:02pm

I can’t add anything useful, other than that this appears to be the same issue (at least it’s the same exception) that I’ve run into. I can run datasets-cli, but I can’t load the dataset because I get the same exception. It appears that the first run of the dataset works, but any subsequent attempts fail.

amharrison · January 5, 2024, 1:55pm

@jherng I’ve found a hack fix. After the metadata is generated and the dataset is cached, delete the metadata section of the README.md, and somehow the tuple exception is circumvented.

jherng · January 6, 2024, 6:33am

Yes, I can confirm that this works when there is only one configuration! First, run the script to generate the metadata in the README, then remove the !!python/tuple tags from the README and upload it to the Hugging Face Hub. After that, loading the dataset with datasets.load_dataset() just works.

However, in my case I have 4 configurations (i.e., video, i3d_rgb, c3d_rgb, swin_rgb). Running the following command stops partway through because of this issue, and it can’t be worked around by removing !!python/tuple in the middle of the run:

datasets-cli test jherng/xd-violence --save_info --all_configs

My current workaround is to run each of the datasets-cli test separately as follows:

datasets-cli test jherng/xd-violence --save_info --name video
datasets-cli test jherng/xd-violence --save_info --name i3d_rgb
datasets-cli test jherng/xd-violence --save_info --name c3d_rgb
datasets-cli test jherng/xd-violence --save_info --name swin_rgb
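The same four runs could also be driven from a short Python script. This is only a sketch: the helper names (build_command, run_all) are made up for illustration, and it uses nothing beyond the CLI flags shown above. A non-zero exit code is recorded rather than treated as fatal, so one failing configuration does not stop the rest.

```python
import subprocess

REPO = "jherng/xd-violence"
CONFIGS = ["video", "i3d_rgb", "c3d_rgb", "swin_rgb"]


def build_command(config, repo=REPO):
    """Build the datasets-cli invocation for a single configuration."""
    return ["datasets-cli", "test", repo, "--save_info", "--name", config]


def run_all(configs=CONFIGS, runner=subprocess.run):
    """Run every configuration in turn and return {config: exit code}."""
    results = {}
    for config in configs:
        # check=False: keep going even if this run ends with an exception.
        completed = runner(build_command(config), check=False)
        results[config] = completed.returncode
    return results
```

Calling run_all() then runs the configurations one after another and returns the exit code of each.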

Each of them still ends with the yaml.constructor.ConstructorError, but that’s fine, since the metadata file is generated anyway. I copy the metadata from each run and combine them into one file as in this (of course, I removed all the !!python/tuple tags from the file).
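Removing the tags by hand gets tedious with several configurations, so here is a rough sketch of doing it with a regular expression. The function name strip_tuple_tags is made up for illustration, and the sample text is a trimmed copy of the metadata posted earlier in this thread.

```python
import re

# Matches the tuple tag together with the spaces that precede it on the line.
TUPLE_TAG = re.compile(r"[ \t]*!!python/tuple\b")


def strip_tuple_tags(readme_text: str) -> str:
    """Return the README text with every !!python/tuple tag removed."""
    return TUPLE_TAG.sub("", readme_text)


sample = """---
dataset_info:
- config_name: i3d_rgb
  features:
  - name: feature
    dtype:
      array2_d:
        shape: !!python/tuple
        - 2048
        dtype: float32
---
"""

cleaned = strip_tuple_tags(sample)
print(cleaned)
```

With the tag gone, the shape entry is left as an ordinary YAML list, which is the form that loaded without the exception in the single-configuration case.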

Hope this helps!

system · January 24, 2024, 3:10am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.
