Unable to Load Dataset Using `load_dataset`

I converted ImageNet and its corresponding depth images into the Arrow format with save_to_disk, storing them as a DatasetDict. I can successfully load the dataset using load_from_disk as follows:

from datasets import load_from_disk

ds = load_from_disk("/defaultShare/pubdata/ImageNet_arrow_rgbdpa")
ds

This returns:

DatasetDict({
    train: Dataset({
        features: ['rgb', 'd', 'label'],
        num_rows: 1281167
    })
    val: Dataset({
        features: ['rgb', 'd', 'label'],
        num_rows: 50000
    })
})

However, during training the data loading intermittently stalls for a few iterations: loading is generally fast, but it randomly pauses for several seconds. To work around this, I tried loading the dataset with load_dataset instead, but encountered the following error:

from datasets import load_dataset

ds = load_dataset("/defaultShare/pubdata/ImageNet_arrow_rgbdpa")
Failed to read file '/defaultShare/pubdata/ImageNet_arrow_rgbdpa/train/data-00000-of-00096.arrow' with error <class 'datasets.table.CastError'>: Couldn't cast
rgb: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
d: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
label: int64
-- schema metadata --
huggingface: '{"info": {"features": {"rgb": {"mode": "RGB", "_type": "Ima' + 24766
to
{'indices': Value(dtype='uint64', id=None)}
because column names don't match

I have not found a solution to this issue yet.


The detailed traceback is:

---------------------------------------------------------------------------
CastError                                 Traceback (most recent call last)
File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/builder.py:1854, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1853 _time = time.time()
-> 1854 for _, table in generator:
   1855     if max_shard_size is not None and writer._num_bytes > max_shard_size:

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/packaged_modules/arrow/arrow.py:76, in Arrow._generate_tables(self, files)
     73         # Uncomment for debugging (will print the Arrow table size and elements)
     74         # logger.warning(f"pa_table: {pa_table} num rows: {pa_table.num_rows}")
     75         # logger.warning('\n'.join(str(pa_table.slice(i, 1).to_pydict()) for i in range(pa_table.num_rows)))
---> 76         yield f"{file_idx}_{batch_idx}", self._cast_table(pa_table)
     77 except ValueError as e:

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/packaged_modules/arrow/arrow.py:59, in Arrow._cast_table(self, pa_table)
     56 if self.info.features is not None:
     57     # more expensive cast to support nested features with keys in a different order
     58     # allows str <-> int/float or str to Audio for example
---> 59     pa_table = table_cast(pa_table, self.info.features.arrow_schema)
     60 return pa_table

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/table.py:2292, in table_cast(table, schema)
   2291 if table.schema != schema:
-> 2292     return cast_table_to_schema(table, schema)
   2293 elif table.schema.metadata != schema.metadata:

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/table.py:2240, in cast_table_to_schema(table, schema)
   2239 if not table_column_names <= set(schema.names):
-> 2240     raise CastError(
   2241         f"Couldn't cast\n{_short_str(table.schema)}\nto\n{_short_str(features)}\nbecause column names don't match",
   2242         table_column_names=table.column_names,
   2243         requested_column_names=list(features),
   2244     )
   2245 arrays = [
   2246     cast_array_to_feature(
   2247         table[name] if name in table_column_names else pa.array([None] * len(table), type=schema.field(name).type),
   (...)   2250     for name, feature in features.items()
   2251 ]

CastError: Couldn't cast
rgb: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
d: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
label: int64
-- schema metadata --
huggingface: '{"info": {"features": {"rgb": {"mode": "RGB", "_type": "Ima' + 24766
to
{'indices': Value(dtype='uint64', id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
Cell In[2], line 3
      1 from datasets import load_dataset
----> 3 ds = load_dataset("/defaultShare/pubdata/ImageNet_arrow_rgbdpa")

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/load.py:2151, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2148     return builder_instance.as_streaming_dataset(split=split)
   2150 # Download and prepare data
-> 2151 builder_instance.download_and_prepare(
   2152     download_config=download_config,
   2153     download_mode=download_mode,
   2154     verification_mode=verification_mode,
   2155     num_proc=num_proc,
   2156     storage_options=storage_options,
   2157 )
   2159 # Build dataset for splits
   2160 keep_in_memory = (
   2161     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2162 )

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/builder.py:924, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    922 if num_proc is not None:
    923     prepare_split_kwargs["num_proc"] = num_proc
--> 924 self._download_and_prepare(
    925     dl_manager=dl_manager,
    926     verification_mode=verification_mode,
    927     **prepare_split_kwargs,
    928     **download_and_prepare_kwargs,
    929 )
    930 # Sync info
    931 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/builder.py:1000, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    996 split_dict.add(split_generator.split_info)
    998 try:
    999     # Prepare split will record examples associated to the split
-> 1000     self._prepare_split(split_generator, **prepare_split_kwargs)
   1001 except OSError as e:
   1002     raise OSError(
   1003         "Cannot find data file. "
   1004         + (self.manual_download_instructions or "")
   1005         + "\nOriginal error:\n"
   1006         + str(e)
   1007     ) from None

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/builder.py:1741, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1739 job_id = 0
   1740 with pbar:
-> 1741     for job_id, done, content in self._prepare_split_single(
   1742         gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1743     ):
   1744         if done:
   1745             result = content

File /opt/conda/envs/cuda118/lib/python3.12/site-packages/datasets/builder.py:1897, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1895     if isinstance(e, DatasetGenerationError):
   1896         raise
-> 1897     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1899 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

The load_dataset() function in the Hugging Face datasets library is for building a dataset from raw data files (Parquet, CSV, JSON, image folders, and so on) or from a Hub repository, not for a directory written by save_to_disk. Either keep the data in one of those source formats so load_dataset can read it, or reload the saved directory with load_from_disk.
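
If you specifically want load_dataset, one possible workaround (untested here, and the glob paths below are only illustrative) is to point the packaged "arrow" builder directly at the data shards, so that any other .arrow files sitting in the save_to_disk directory are not picked up:

from datasets import load_dataset

# Illustrative: read only the data shards written by save_to_disk,
# ignoring metadata/cache files elsewhere in the directory.
ds = load_dataset(
    "arrow",
    data_files={
        "train": "/defaultShare/pubdata/ImageNet_arrow_rgbdpa/train/data-*.arrow",
        "val": "/defaultShare/pubdata/ImageNet_arrow_rgbdpa/val/data-*.arrow",
    },
)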


To resolve the data loading issue, follow these steps:

  1. Use the Correct Loading Function: Since your data was saved in the Arrow format with save_to_disk, use load_from_disk to reload it. This function is designed for that on-disk layout and restores the DatasetDict structure.

    from datasets import load_from_disk
    
    ds = load_from_disk("/defaultShare/pubdata/ImageNet_arrow_rgbdpa")
    
  2. Avoid Pointing load_dataset at a save_to_disk Directory: load_dataset is intended for raw data files such as Parquet, CSV, or JSON (or Hub repositories), not for directories written by save_to_disk. Pointing it at such a directory can pick up metadata or cache files alongside the data shards and produce schema mismatches like the CastError above.

  3. Investigate Data Loading Performance: If you’re experiencing stalling during training, consider the following:

    • Caching: load_from_disk memory-maps the Arrow shards, so read speed depends heavily on the filesystem and page cache; a cold cache or networked storage can cause sporadic slow reads.
    • Disk I/O: Check whether the storage holding the data has high latency or is shared with other jobs; faster or local storage may help.
    • Data Sharding: If the Arrow shards are very large, re-save with a smaller max_shard_size so files can be read in parallel more evenly.
    • Batching: Tune how data is batched and prefetched during training to hide I/O latency (a rough DataLoader sketch follows at the end of this reply).
  4. Consider Converting to Parquet: If performance remains an issue, you can convert your DatasetDict to Parquet format for potentially faster access. This involves saving each split as a Parquet file and then loading using load_dataset with the Parquet option.

    # Convert and save each split to Parquet
    ds['train'].to_parquet('/path/to/train.parquet')
    ds['val'].to_parquet('/path/to/val.parquet')
    
    # Load both splits back into a single DatasetDict
    ds_parquet = load_dataset('parquet', data_files={'train': '/path/to/train.parquet',
                                                     'val': '/path/to/val.parquet'})
    

By adhering to these steps, you ensure compatibility with your data format and address potential performance issues during training.
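
To make the batching point in step 3 concrete, here is a rough PyTorch sketch; the transform, worker count, and prefetch settings are illustrative assumptions to tune for your storage, not a verified fix for the stalls:

import torch
from torchvision import transforms
from datasets import load_from_disk

ds = load_from_disk("/defaultShare/pubdata/ImageNet_arrow_rgbdpa")

# Decode and resize on the fly so every example collates to a fixed shape.
to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def preprocess(batch):
    batch["rgb"] = [to_tensor(img) for img in batch["rgb"]]
    batch["d"] = [to_tensor(img) for img in batch["d"]]
    return batch

train_ds = ds["train"].with_transform(preprocess)

# Illustrative loader settings; num_workers and prefetch_factor are the usual knobs
# when loading is fast on average but stalls intermittently.
loader = torch.utils.data.DataLoader(
    train_ds,
    batch_size=256,
    shuffle=True,
    num_workers=16,
    prefetch_factor=4,
    persistent_workers=True,
    pin_memory=True,
)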

Thank you for your response. However, the dataset is already stored in the Arrow format produced by save_to_disk, which should be compatible with Hugging Face, so this error shouldn’t occur. Additionally, even after converting to Parquet, the training process still randomly pauses for several seconds. Do you have any ideas about it?


Hmm…
Maybe it would be better to shard the dataset.

Thanks again, but actually, when saving the dataset, I already sharded each split into 96 shards using:

imagenet.save_to_disk("./Imagenet_arrow_rgbdpa", num_proc=96, max_shard_size="8GB")

Therefore, I have no clear explanation for the performance issues or the errors encountered.


The complete conversion script is as follows:

from datasets import Dataset, DatasetDict, Image, ClassLabel

# rgb_paths_* and d_paths_* are lists of image file paths; labels_* hold the corresponding labels
imagenet_train = Dataset.from_dict({"rgb": rgb_paths_train, "d": d_paths_train, "label": labels_train})
imagenet_val = Dataset.from_dict({"rgb": rgb_paths_val, "d": d_paths_val, "label": labels_val})

# Convert columns to appropriate data types
imagenet_train = imagenet_train.cast_column("rgb", Image(mode="RGB"))
imagenet_train = imagenet_train.cast_column("d", Image(mode="L"))
imagenet_val = imagenet_val.cast_column("rgb", Image(mode="RGB"))
imagenet_val = imagenet_val.cast_column("d", Image(mode="L"))

# Assign class labels
imagenet_train = imagenet_train.cast_column("label", ClassLabel(names=list(IMAGENET2012_CLASSES.keys())))
imagenet_train = imagenet_train.cast_column("label", ClassLabel(names=list(IMAGENET2012_CLASSES.values())))
imagenet_val = imagenet_val.cast_column("label", ClassLabel(names=list(IMAGENET2012_CLASSES.keys())))
imagenet_val = imagenet_val.cast_column("label", ClassLabel(names=list(IMAGENET2012_CLASSES.values())))

# Create DatasetDict and save to disk
imagenet = DatasetDict({"train": imagenet_train, "val": imagenet_val})
imagenet.save_to_disk("./Imagenet_arrow_rgbdpa", num_proc=96, max_shard_size="8GB")

This setup ensures the dataset is properly structured and efficiently sharded, yet the performance issues and errors persist.
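
For completeness, a quick sanity check after saving (using the same path as in the script above) should show the two splits and the expected features:

from datasets import load_from_disk

ds = load_from_disk("./Imagenet_arrow_rgbdpa")
print(ds)                    # DatasetDict with train/val splits
print(ds["train"].features)  # rgb: Image, d: Image, label: ClassLabel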


max_shard_size may be too large.

Thank you very much! I regenerated the dataset with max_shard_size="1GB", and now it can be loaded successfully using both load_dataset and load_from_disk.
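
For reference, the regeneration presumably only changes the shard size in the earlier save call:

imagenet.save_to_disk("./Imagenet_arrow_rgbdpa", num_proc=96, max_shard_size="1GB")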

However, the training stalls remain unresolved and may be related to hardware issues. I have also opened a discussion about this in the timm repository: Inconsistent Training Throughput Across Epochs · huggingface/pytorch-image-models · Discussion #2449


Unless it’s simply a case of not having enough VRAM, it could be that the trainer’s optimization options are causing the problem. If you’re using Lightning, that could also be a factor.

It could also be a data type/format issue or a cache issue.

