"OSError: [Errno 27] File too large" on AFS when caching the dataset

Hi, I’m training an ASR model by fine-tuning Wav2Vec2 with the Common Voice dataset, and I’m encountering OSError: [Errno 27] File too large: '../../.cache/downloads/extracted/704d7c5b5c20bb11fc3c9632e47ae1406eb27b1056839928f95e52f8b6051802/ja_other_0/common_voice_ja_36362701.mp3' when loading the data.

I’m pretty sure the cause is the file name (path) being too long, exceeding the limit allowed in AFS. The file system itself still has plenty of free storage.
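As a quick sanity check on that theory: errno 27 is EFBIG, which the kernel raises for a file-size limit, while an over-long name would normally surface as ENAMETOOLONG. The mapping can be confirmed with the standard library:

```python
import errno
import os

# Errno 27 is EFBIG ("File too large") -- a file-size limit.
# An over-long path would normally raise ENAMETOOLONG (36 on Linux).
print(errno.EFBIG, os.strerror(errno.EFBIG))
print(errno.ENAMETOOLONG, os.strerror(errno.ENAMETOOLONG))
```

So the error may point at an AFS file-size restriction rather than the path length, which is worth ruling out as well.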

Environment

  • OS: Linux
  • File system: AFS

Code

from typing import Dict

from datasets import Dataset, load_dataset


def load_data(dataset: str,
              lang: str,
              cache_dir: str,
              enable_caching: bool) -> Dict[str, Dataset]:
    """Load data from Hugging Face Datasets."""
    if not enable_caching:
        print("Caching disabled; redownloading the data...")
        download_mode = "force_redownload"
    else:
        print("Caching enabled; using the cached data...")
        download_mode = "reuse_dataset_if_exists"  # default
    train = load_dataset(dataset,
                         lang,
                         split="train",
                         trust_remote_code=True,
                         cache_dir=cache_dir,
                         download_mode=download_mode)
    valid = load_dataset(dataset,
                         lang,
                         split="valid",
                         trust_remote_code=True,
                         cache_dir=cache_dir,
                         download_mode=download_mode)
    return {"train": train,
            "valid": valid}

To prevent caching, I forced a redownload of the data and disabled caching globally by calling datasets.disable_caching(), but the library still seems to store the data locally (at least temporarily), and the same error occurs.
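One workaround that sidesteps AFS entirely, sketched under the assumption that a local scratch directory such as /tmp exists outside AFS on your nodes: point the datasets cache there before importing the library (the same effect can be had by passing that path as cache_dir to load_dataset).

```python
import os

# Hypothetical scratch location outside AFS; adjust to your cluster.
scratch = "/tmp/hf_datasets_cache"
os.makedirs(scratch, exist_ok=True)

# Set this *before* `datasets` is imported, since the library resolves
# its cache root at import time; alternatively, pass cache_dir=scratch
# to every load_dataset() call.
os.environ["HF_DATASETS_CACHE"] = scratch
```

The downloaded and extracted files (including the long 64-character hash directories) then live on the local filesystem instead of AFS.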

I know that this can be avoided by using streaming mode, but can the file name somehow be shortened? If not, I’d love to have an option where the 704d7c5b5c20bb11fc3c9632e47ae1406eb27b1056839928f95e52f8b6051802 part can be shortened safely.
Or are there any other solutions to this error?
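For completeness, a streaming variant of the loader above (a sketch only; it reuses the same dataset and lang arguments) never materializes the audio files in the cache, so no long paths are created at all:

```python
from typing import Dict


def load_data_streaming(dataset: str, lang: str) -> Dict[str, object]:
    """Load train/valid splits in streaming mode (no local cache files)."""
    from datasets import load_dataset  # imported lazily in this sketch

    splits = {}
    for split in ("train", "valid"):
        # streaming=True returns an IterableDataset; samples are fetched
        # and decoded on the fly instead of being written to disk.
        splits[split] = load_dataset(dataset,
                                     lang,
                                     split=split,
                                     trust_remote_code=True,
                                     streaming=True)
    return splits
```

The trade-off is that an IterableDataset has no random access and no length, so any code relying on indexing or len() would need adjusting.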


Perhaps this?