### System Info
- `transformers` version: 4.37.2
- Platform: Linux-5.14.0-16…2.6.1.el9_1.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.7
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.26.1
- Deepspeed version: 0.13.1
- Flash-attention version: 2.5.2
- Datasets version: 2.16.1
- PyTorch version (GPU?): 2.1.2+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
### Who can help?
@pacman100
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Reproduction
I am continuing pre-training of Llama2-7b-chat-hf on a 3,273,686,325-token corpus of my own data, but training fails with an out-of-memory error at seemingly inconsistent points.
My cluster contains GPU nodes with 4 x A100-80GB GPUs each. The step at which the out-of-memory error occurs varies with the number of GPUs used.
Here is the training script:
```
import datasets
import os
import torch
import argparse
from mpi4py import MPI
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForSeq2Seq, default_data_collator
torch.backends.cuda.matmul.allow_tf32 = True
def set_mpi(masteradd):
    """
    Set Open MPI environment variables
    :param masteradd: Value for setting MASTER_ADDR environment variable
    :type masteradd: String
    :return: None
    """
    comm = MPI.COMM_WORLD
    os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
    os.environ["RANK"] = str(comm.Get_rank())
    os.environ['WORLD_SIZE'] = str(comm.Get_size())
    os.environ["MASTER_ADDR"] = masteradd
    os.environ["MASTER_PORT"] = "9978"

def main():
    """
    Set training parameters and train model
    :return: None
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--master_add", dest="masteradd")
    args = parser.parse_args()
    set_mpi(args.masteradd)
    experiment_name = ""
    tokenizer_name = 'resized_tokenizer/'
    model_name = 'llama2-7b-chat-hf/'
    out_dir = 'out/'
    os.makedirs(out_dir, exist_ok=True)
    dataset_path = "datasets/"
    dataset_files = [os.path.join(dataset_path, x) for x in os.listdir(dataset_path)]
    dataset = datasets.load_dataset('json', data_files=dataset_files, split='train', cache_dir="cache/")
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=False)
    training_args = TrainingArguments(
        output_dir=out_dir,
        deepspeed='multi_node_7b.json',
        do_eval=False,
        logging_strategy="steps",
        logging_steps=10,
        learning_rate=2e-5,
        warmup_steps=1000,
        gradient_checkpointing=False,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        tf32=True,
        bf16=True,
        weight_decay=0.1,
        save_total_limit=40,
        push_to_hub=False,
        save_strategy="steps",
        num_train_epochs=1,
        save_steps=1000,
        report_to="tensorboard"
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        do_sample=True,
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer)
    )
    trainer.train(
        resume_from_checkpoint=False,
    )
    trainer.save_model()

if __name__ == "__main__":
    main()
```
Here is the Deepspeed config:
```
{
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "none"
        },
        "offload_param": {
            "device": "none"
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto"
    },
    "gradient_accumulation_steps": 4,
    "gradient_clipping": "auto",
    "gradient_checkpointing": false,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 200,
    "wall_clock_breakdown": false
}
```
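For context on the memory budget as I understand it: with ZeRO stage 1, only the optimizer states are sharded, so every rank still holds a full copy of the bf16 parameters and gradients. Below is a rough back-of-the-envelope estimate, not a measurement; the parameter count and world size are assumptions for my 3-node x 4-GPU run, and it ignores activations, communication buckets, and allocator fragmentation.
```
# Rough per-GPU memory estimate under ZeRO stage 1 with bf16 training.
# All numbers here are assumptions for illustration, not measured values.
n_params = 7e9     # ~7B parameters for llama2-7b-chat-hf
world_size = 12    # 3 nodes x 4 GPUs, matching the "nranks 12" in the NCCL log

params_gb = n_params * 2 / 1e9   # bf16 weights, replicated on every rank
grads_gb = n_params * 2 / 1e9    # bf16 gradients, also replicated under stage 1
# AdamW keeps roughly 12 bytes/param of state (fp32 master weights, momentum,
# variance); ZeRO stage 1 shards this state across ranks.
optim_gb = n_params * 12 / 1e9 / world_size

print(f"params ~{params_gb:.0f} GB, grads ~{grads_gb:.0f} GB, optimizer shard ~{optim_gb:.0f} GB")
# ~14 + 14 + 7 = ~35 GB before activations, leaving roughly 45 GB of headroom
# on an 80 GB card for activations, buffers, and fragmentation.
```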
I launch training from a bash script. Here is the relevant line:
```
deepspeed -H hostfile --master_port 9978 --master_addr $PARENT --no_ssh_check --launcher OPENMPI --launcher_args '--oversubscribe ' deepspeed_7b_finetune.py -m $PARENT
```
Here is the traceback from the failing run:
```
19%|█▉ | 3237/16700 [3:34:12<38:35:22, 10.32s/it]Traceback (most recent call last):
File "/home/user/Hope-Alpha/src/scripts/deepspeed_7b_finetune.py", line 87, in <module>
main()
File "/home/user/Hope-Alpha/src/scripts/deepspeed_7b_finetune.py", line 80, in main
trainer.train(
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/transformers/trainer.py", line 2772, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/transformers/trainer.py", line 2795, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1842, in forward
loss = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
outputs = self.model(
^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1070, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 795, in forward
hidden_states = self.input_layernorm(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/train-transformers/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 116, in forward
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 116.00 MiB. GPU 3 has a total capacty of 79.32 GiB of which 101.56 MiB is free. Including non-PyTorch memory, this process has 79.22 GiB memory in use. Of the allocated memory 75.96 GiB is allocated by PyTorch, and 1.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
g-10-01:2356899:2357762 [3] NCCL INFO [Service thread] Connection closed by localRank 3
g-10-01:2356899:2356899 [3] NCCL INFO comm 0x9e8f6ea0 rank 3 nranks 12 cudaDev 3 busId e3000 - Abort COMPLETE
```
The dataset consists of 12 `.json` files, which are assembled and cached. Training completes without error on any one of the 12 files individually; when all 12 are assembled, however, the out-of-memory error above occurs. If the files are re-ordered (e.g. `[2,0,1,3,4,5,6,7,8,9,10,11]`), the step at which training fails shifts slightly. If training is resumed from a saved checkpoint using `resume_from_checkpoint = 'checkpoint_dir'`, it runs out of memory at exactly the same step.
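One thing I still want to rule out is a handful of unusually long examples in the assembled corpus, since `DataCollatorForSeq2Seq` with no `max_length` pads each batch to its longest sequence, so activation size varies from step to step. A quick check along these lines (a sketch only; it assumes the `.json` files already contain an `input_ids` column, since the training script passes the loaded dataset to the `Trainer` without further tokenization):
```
import os
import datasets

# Assemble the corpus exactly as in the training script.
dataset_path = "datasets/"
dataset_files = [os.path.join(dataset_path, x) for x in os.listdir(dataset_path)]
dataset = datasets.load_dataset("json", data_files=dataset_files, split="train", cache_dir="cache/")

# Distribution of pre-tokenized sequence lengths across the assembled corpus.
lengths = sorted(len(ex["input_ids"]) for ex in dataset)
n = len(lengths)
print("examples:", n)
print("p50 / p95 / p99 / max:",
      lengths[n // 2], lengths[int(n * 0.95)], lengths[int(n * 0.99)], lengths[-1])
```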
Training on the same dataset using `accelerate` and FSDP completes without issue.
I am at a loss for what could be causing this.
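To narrow down whether memory grows steadily or spikes on particular batches, I am considering logging peak CUDA memory per step with a small callback along these lines (a sketch; the callback name is mine and it is untested in the multi-node run):
```
import torch
from transformers import TrainerCallback

class CudaMemoryLogger(TrainerCallback):
    """Print peak and reserved CUDA memory after every optimizer step."""

    def on_step_end(self, args, state, control, **kwargs):
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        reserved_gib = torch.cuda.memory_reserved() / 1024**3
        print(f"step {state.global_step}: peak {peak_gib:.2f} GiB, reserved {reserved_gib:.2f} GiB")
        # Reset so the next step reports its own peak rather than the running max.
        torch.cuda.reset_peak_memory_stats()

# Registered before training with: trainer.add_callback(CudaMemoryLogger())
```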
### Expected behavior
The expected behavior is that training completes a single epoch without running out of memory at seemingly random steps.