"OSError: [Errno 27] File too large" on AFS when caching the dataset

Hi, I’m training an ASR model by fine-tuning Wav2Vec2 with the Common Voice dataset, and I’m encountering OSError: [Errno 27] File too large: '../../.cache/downloads/extracted/704d7c5b5c20bb11fc3c9632e47ae1406eb27b1056839928f95e52f8b6051802/ja_other_0/common_voice_ja_36362701.mp3' when loading the data.

I’m pretty sure the cause is the file name (path) being too long, exceeding the limit allowed in AFS. The file system itself still has plenty of free storage.
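As a quick sanity check on that theory: errno 27 is EFBIG, which the kernel raises for a file-size limit, while an over-long name would normally surface as ENAMETOOLONG. The mapping can be confirmed with the standard library:

```python
import errno
import os

# Errno 27 is EFBIG ("File too large") -- a file-size limit.
# An over-long path would normally raise ENAMETOOLONG (36 on Linux).
print(errno.EFBIG, os.strerror(errno.EFBIG))
print(errno.ENAMETOOLONG, os.strerror(errno.ENAMETOOLONG))
```

So the error may point at an AFS file-size restriction rather than the path length, which is worth ruling out as well.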

Environment

  • OS: Linux
  • File system: AFS

Code

from typing import Dict

from datasets import Dataset, load_dataset


def load_data(dataset: str,
              lang: str,
              cache_dir: str,
              enable_caching: bool) -> Dict[str, Dataset]:
    """Load data from Hugging Face Datasets."""
    if not enable_caching:
        print("Caching disabled; redownloading the data...")
        download_mode = "force_redownload"
    else:
        print("Caching enabled; using the cached data...")
        download_mode = "reuse_dataset_if_exists"  # default
    train = load_dataset(dataset,
                         lang,
                         split="train",
                         trust_remote_code=True,
                         cache_dir=cache_dir,
                         download_mode=download_mode)
    valid = load_dataset(dataset,
                         lang,
                         split="valid",
                         trust_remote_code=True,
                         cache_dir=cache_dir,
                         download_mode=download_mode)
    return {"train": train,
            "valid": valid}

To prevent caching, I forced a redownload of the data and disabled caching globally by calling datasets.disable_caching(), but the library still seems to store the data locally (at least temporarily), and the same error occurs.
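One workaround that sidesteps AFS entirely, sketched under the assumption that a local scratch directory such as /tmp exists outside AFS on your nodes: point the datasets cache there before importing the library (the same effect can be had by passing that path as cache_dir to load_dataset).

```python
import os

# Hypothetical scratch location outside AFS; adjust to your cluster.
scratch = "/tmp/hf_datasets_cache"
os.makedirs(scratch, exist_ok=True)

# Set this *before* `datasets` is imported, since the library resolves
# its cache root at import time; alternatively, pass cache_dir=scratch
# to every load_dataset() call.
os.environ["HF_DATASETS_CACHE"] = scratch
```

The downloaded and extracted files (including the long 64-character hash directories) then live on the local filesystem instead of AFS.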

I know that this can be avoided by using streaming mode, but can the file name somehow be shortened? If not, I’d love to have an option where the 704d7c5b5c20bb11fc3c9632e47ae1406eb27b1056839928f95e52f8b6051802 part can be shortened safely.
Or are there any other solutions to this error?
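For completeness, a streaming variant of the loader above (a sketch only; it reuses the same dataset and lang arguments) never materializes the audio files in the cache, so no long paths are created at all:

```python
from typing import Dict


def load_data_streaming(dataset: str, lang: str) -> Dict[str, object]:
    """Load train/valid splits in streaming mode (no local cache files)."""
    from datasets import load_dataset  # imported lazily in this sketch

    splits = {}
    for split in ("train", "valid"):
        # streaming=True returns an IterableDataset; samples are fetched
        # and decoded on the fly instead of being written to disk.
        splits[split] = load_dataset(dataset,
                                     lang,
                                     split=split,
                                     trust_remote_code=True,
                                     streaming=True)
    return splits
```

The trade-off is that an IterableDataset has no random access and no length, so any code relying on indexing or len() would need adjusting.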


Perhaps this?