Why does load_dataset on an audiofolder with metadata return a FileNotFoundError?

I am using Kaggle notebook kernels. I have a folder 'traindata' in my working directory that contains audio files and a metadata.csv file (you can see the audio files and the .csv file in the output list below). The .csv file contains a column 'file_name' listing the file names of all the audio files. In my first notebook, when I load this data using the load_dataset function (Load audio data), there is no problem at all; it loads exactly as expected. However, when I run exactly the same code in a 2nd Kaggle notebook, I get a FileNotFoundError. Full details are below. Can someone @sanchit-gandhi help me with this? I intend to use the 2nd notebook for a different prediction model/approach using Whisper (I tried Wav2Vec2 in my 1st notebook and am not getting the desired result).

os.listdir("/kaggle/working/traindata")

['17d469e3c0f8.mp3',
 '7f64f5ad7c72.mp3',
 '792590a0c97b.mp3',
 '40a932482ffa.mp3',
 '20bff6808089.mp3',
 '072c952790a8.mp3',
 'b960faf6e6c9.mp3',
 '7144e5a3951c.mp3',
 'metadata.csv',
 'e101861d7fc6.mp3',
 'dec206df575b.mp3',
 'fd9da8a487a6.mp3',
 'd4bf563e8d74.mp3',
 'a1db8eecfa15.mp3',
 'c4de30d87c19.mp3',
 '35ab3905df36.mp3',
 'd0dcd7a9aa9d.mp3',
 'd510c4a0f4c3.mp3',
 '67a9e9be989d.mp3',
 '665f9d30c16c.mp3',
 '6cc0c4fcd376.mp3',
 'bc11e6300bab.mp3',
 'f4f100cc5126.mp3',
 '82b81f884c22.mp3',
 'e3e147532ab4.mp3',
 'c4eb00849950.mp3',
 '86c743dbdffc.mp3',
 '1dc891cad82d.mp3',
 '05bb928d483e.mp3',
 'dcca4da43c55.mp3',
 '2ff894872320.mp3',
 '3df8624d57a5.mp3',
 'bdce6383b6a3.mp3',
 '9e156c339843.mp3',
 '45c2362f6c24.mp3',
 '802b1445767e.mp3',
 'c6d7f0e0d016.mp3',
 '151abc026c93.mp3',
 'a1317f179adb.mp3',
 '83e1b5ce808a.mp3',
 '278ed2187132.mp3',
 '90b8207f80a3.mp3',
 '7989b39c6806.mp3',
 '77738ec9edbc.mp3',
 '3ea951d7af47.mp3',
 'e3665ff03a0c.mp3',
 '6c4e4d9823ac.mp3',
 '27846b8f8edd.mp3',
 '8df70760f935.mp3',
 'd6f06a5c0e02.mp3',
 'e2ab17915a45.mp3',
 '16841afc8002.mp3',
 '284eeb420025.mp3',
 '99e567500082.mp3',
 'aa101d9351a2.mp3',
 '2d97709e1321.mp3',
 '1bcdd2ab7204.mp3',
 'f01dd698a636.mp3',
 '52cb9dc45a60.mp3',
 '56497258f4d4.mp3',
 'b81628311b82.mp3',
 'af9cfe48184c.mp3',
 '87961540a611.mp3',
 'aad63a719baf.mp3',
 '67ff0d4f0abe.mp3',
 'ba00881866dd.mp3',
 'b3793565c709.mp3',
 '7ceca0306fa1.mp3',
 '9db958779825.mp3',
 '24bc00853dfd.mp3',
 '82facfcaf4af.mp3',
 '9e849a13f4d2.mp3',
 '5b7577a65f36.mp3',
 '0e0cd7ae0a4b.mp3',
 '948012d60dbc.mp3',
 'dce2b585586b.mp3',
 'cc29d450c5fc.mp3',
 '178c4fbf1765.mp3',
 'ab2905e4bc54.mp3',
 'ef8953b95e6a.mp3',
 'c6eda3ea8c01.mp3',
 '244d9567f4ba.mp3',
 '078b6f9629b2.mp3',
 '75fb7aa16b1c.mp3',
 'c7d2473e379f.mp3',
 'd29197658275.mp3',
 '9680b1e57366.mp3',
 '403f13f1b957.mp3',
 '85391aa9a25f.mp3',
 '36229829da97.mp3',
 'ec9c15cb1e78.mp3',
 '84eb69de8f29.mp3',
 '973f2efec47a.mp3',
 'd4541f36bb70.mp3',
 '15b565b3a352.mp3',
 '9574a61cc33c.mp3',
 'a8d6bf1285d5.mp3',
 '62e2a304f67d.mp3',
 'e7b575cc3a88.mp3',
 '69a85e880d81.mp3',
 '3beed37e15c5.mp3']

train_dataset = load_dataset("audiofolder", data_dir="/kaggle/working/traindata")
train_dataset

Traceback (most recent call last):

in <module>:1
❱ 1 train_dataset = load_dataset("audiofolder", data_dir="/kaggle/working/traindata", drop_l
  2
  3 train_dataset

/opt/conda/lib/python3.10/site-packages/datasets/load.py:1664 in load_dataset
  1661     ignore_verifications = ignore_verifications or save_infos
  1662
  1663     # Create a dataset builder
❱ 1664     builder_instance = load_dataset_builder(
  1665         path=path,
  1666         name=name,
  1667         data_dir=data_dir,

/opt/conda/lib/python3.10/site-packages/datasets/load.py:1490 in load_dataset_builder
  1487     if use_auth_token is not None:
  1488         download_config = download_config.copy() if download_config else DownloadConfig(
  1489         download_config.use_auth_token = use_auth_token
❱ 1490     dataset_module = dataset_module_factory(
  1491         path,
  1492         revision=revision,
  1493         download_config=download_config,

/opt/conda/lib/python3.10/site-packages/datasets/load.py:1238 in dataset_module_factory
  1235                 if isinstance(e1, OfflineModeIsEnabled):
  1236                     raise ConnectionError(f"Couln't reach the Hugging Face Hub for datas
  1237                 if isinstance(e1, FileNotFoundError):
❱ 1238                     raise FileNotFoundError(
  1239                         f"Couldn't find a dataset script at {relative_to_absolute_path(c
  1240                         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e
  1241                     ) from None
FileNotFoundError: Couldn't find a dataset script at /kaggle/working/audiofolder/audiofolder.py or any data file in
the same directory. Couldn't find 'audiofolder' on the Hugging Face Hub either: FileNotFoundError: Couldn't find 
file at https://raw.githubusercontent.com/huggingface/datasets/master/datasets/audiofolder/audiofolder.py

@lewtun @Rocketknight1 @sgugger, can you help with the query in my post above?

You can find a solution to this issue here

(Kaggle installs an outdated datasets version by default, which does not contain the audiofolder loader)


@mariosasko thanks for the clarification. As far as I remember, I did update datasets using !pip install -U datasets huggingface-hub in both notebooks. I will check again by creating a new notebook (I have since deleted my 2nd notebook, as I could not move forward :smiling_face_with_tear:)

@mariosasko, I created a new Kaggle notebook (a 2nd notebook for the same competition, for the reason mentioned earlier) and tried all of the following. However, the problem remains.

1. !pip install -U datasets huggingface-hub
2. !pip install -U datasets
3. !pip install -U "datasets>=2.5.0" (the version specifier needs quoting, otherwise the shell treats >= as output redirection)
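The stale-import behaviour described in this thread can be diagnosed directly in a cell: compare the version pip has installed on disk with the version the already-imported module reports. A minimal sketch; the helper name `version_mismatch` is my own, not part of `datasets` or pip:

```python
# Compare the version recorded in the installed package's metadata (what pip
# put on disk) with the version the running interpreter has imported. If they
# differ, the kernel is still holding the old module and needs a restart.
import importlib.metadata


def version_mismatch(package: str, imported_version: str) -> bool:
    """Return True if the imported module's version differs from the
    version of the distribution currently installed on disk."""
    installed = importlib.metadata.version(package)
    return installed != imported_version
```

For example, `version_mismatch("datasets", datasets.__version__)` returning True would confirm that the upgrade succeeded on disk but the notebook is still running the old code.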

import datasets
from datasets import load_dataset

train_dataset = load_dataset("audiofolder", data_dir="/kaggle/working/traindata")

FileNotFoundError: Couldn't find a dataset script at /kaggle/working/audiofolder/audiofolder.py or any data file in
the same directory. Couldn't find 'audiofolder' on the Hugging Face Hub either: FileNotFoundError: Couldn't find 
file at https://raw.githubusercontent.com/huggingface/datasets/master/datasets/audiofolder/audiofolder.py

As I have mentioned, this code works fine in the 1st notebook.

To add some more data: below is the output I get when I first check the version, then update datasets, and re-check the version after import. Clearly it is not changing to the new version; it remains stuck on the old '2.1.0'. I would appreciate any advice on what can be done here.

datasets.__version__
'2.1.0'

!pip install -U datasets
Requirement already satisfied: datasets in /opt/conda/lib/python3.10/site-packages (2.14.4)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.10/site-packages (from datasets) (1.23.5)
Requirement already satisfied: pyarrow>=8.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (9.0.0)
Requirement already satisfied: dill<0.3.8,>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (0.3.6)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from datasets) (1.5.3)
Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (2.31.0)
Requirement already satisfied: tqdm>=4.62.1 in /opt/conda/lib/python3.10/site-packages (from datasets) (4.65.0)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.10/site-packages (from datasets) (3.2.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.10/site-packages (from datasets) (0.70.14)
Requirement already satisfied: fsspec[http]>=2021.11.1 in /opt/conda/lib/python3.10/site-packages (from datasets) (2023.6.0)
Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets) (3.8.4)
Requirement already satisfied: huggingface-hub<1.0.0,>=0.14.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (0.16.4)
Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from datasets) (21.3)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from datasets) (6.0)
Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (23.1.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (3.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (6.0.4)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (4.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from huggingface-hub<1.0.0,>=0.14.0->datasets) (3.12.2)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.10/site-packages (from huggingface-hub<1.0.0,>=0.14.0->datasets) (4.6.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.10/site-packages (from packaging->datasets) (3.0.9)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (2023.5.7)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)

import datasets
datasets.__version__
'2.1.0'

I am now facing this problem even with my first notebook :upside_down_face: :roll_eyes: :unamused:

After several attempts, my understanding is that an inconsistency in the Kaggle environment, or in the dependencies/priorities among the various packages I am installing and importing, is causing the issue. After clearing all outputs and restarting the kernel, the datasets version sometimes changes to 2.14.0, but most of the time it remains at the default 2.1.0.

However, as @mariosasko suggested, the 'audiofolder' loader does not work with 2.1.0; it works with 2.14.0 and other sufficiently recent versions.
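Given how misleading the FileNotFoundError is, one way to avoid losing time to this again is to fail fast: check the imported datasets version before calling load_dataset, since the audiofolder loader requires a newer release (hence the `datasets>=2.5.0` pin tried above). A hedged sketch; `require_min_version` is a hypothetical helper, not a datasets API, and the 2.5.0 floor is taken from the pin used earlier in this thread:

```python
# Fail fast with a clear message when the imported `datasets` module is too
# old for the "audiofolder" loader, instead of hitting the misleading
# FileNotFoundError shown earlier in the thread.


def require_min_version(current: str, minimum: str = "2.5.0") -> None:
    def as_tuple(version: str) -> tuple:
        # Compare only the numeric major.minor.patch components.
        return tuple(int(part) for part in version.split(".")[:3])

    if as_tuple(current) < as_tuple(minimum):
        raise RuntimeError(
            f"datasets {current} predates the 'audiofolder' loader "
            f"(needs >= {minimum}). Run `pip install -U datasets`, then "
            "RESTART the kernel so the new version is actually imported."
        )
```

Usage would be `require_min_version(datasets.__version__)` right after `import datasets`, so the notebook stops with an actionable message instead of a confusing traceback.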