Hi, I’m training an ASR model by fine-tuning Wav2Vec2 on the Common Voice dataset, and I’m encountering the following error when loading the data:

```
OSError: [Errno 27] File too large: '../../.cache/downloads/extracted/704d7c5b5c20bb11fc3c9632e47ae1406eb27b1056839928f95e52f8b6051802/ja_other_0/common_voice_ja_36362701.mp3'
```
I’m pretty sure the cause is the file name (path) being too long, exceeding the limit AFS allows. The file system itself still has plenty of free storage.
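For what it’s worth, this is how I checked the path length (a quick sketch using the path from the traceback; since the cache path is relative, it’s the resolved absolute path that would hit the AFS limit):

```python
import os

# Path copied from the traceback, relative to my working directory on AFS.
p = ("../../.cache/downloads/extracted/"
     "704d7c5b5c20bb11fc3c9632e47ae1406eb27b1056839928f95e52f8b6051802/"
     "ja_other_0/common_voice_ja_36362701.mp3")
print(len(os.path.abspath(p)))  # length of the fully resolved path
```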
Environment
- OS: Linux
- File system: AFS
Code
```python
from typing import Dict

from datasets import Dataset, load_dataset


def load_data(dataset: str,
              lang: str,
              cache_dir: str,
              enable_caching: bool) -> Dict[str, Dataset]:
    """Load data from Hugging Face Datasets."""
    if not enable_caching:
        print("Caching disabled; redownloading the data...")
        download_mode = "force_redownload"
    else:
        print("Caching enabled; using the cached data...")
        download_mode = "reuse_dataset_if_exists"  # the default
    train = load_dataset(dataset,
                         lang,
                         split="train",
                         trust_remote_code=True,
                         cache_dir=cache_dir,
                         download_mode=download_mode)
    valid = load_dataset(dataset,
                         lang,
                         split="valid",
                         trust_remote_code=True,
                         cache_dir=cache_dir,
                         download_mode=download_mode)
    return {"train": train,
            "valid": valid}
```
To prevent caching, I forced a redownload of the data and disabled caching globally by calling datasets.disable_caching(), but the library still seems to store the data locally (at least temporarily), and the same error occurs.
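For reference, this is the call I’m making (datasets.disable_caching() is the library’s actual API):

```python
import datasets

# Disables caching of transformed datasets (e.g. map/filter results) globally.
# As far as I can tell, it does not stop load_dataset from writing the raw
# download and the extracted audio files under the cache directory.
datasets.disable_caching()
```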
I know this can be avoided by using streaming mode, but can the file name somehow be shortened? If not, I’d love to have an option where the 704d7c5b5c20bb11fc3c9632e47ae1406eb27b1056839928f95e52f8b6051802 part can be safely shortened. Or are there any other solutions to this error?
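For completeness, the streaming workaround I’m referring to would look roughly like this (a minimal sketch; the dataset name and language are hypothetical placeholders for the dataset and lang arguments of load_data above):

```python
from datasets import load_dataset

# Hypothetical values for illustration.
dataset = "mozilla-foundation/common_voice_13_0"
lang = "ja"

# Streaming iterates over the remote data on the fly, so the long extracted
# file paths are never created in the local cache.
train = load_dataset(dataset,
                     lang,
                     split="train",
                     streaming=True,
                     trust_remote_code=True)
print(next(iter(train))["sentence"])  # fetch one example to sanity-check
```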