Why does load_dataset on an audiofolder with metadata return a FileNotFoundError?

I am using Kaggle notebook kernels. I have a folder 'traindata' in my working directory that contains audio files and a metadata.csv file (you can see the audio files and the .csv file in the output list below). The .csv file contains a column 'file_name' listing the file names of all the audio files. In my first notebook, when I load this data using the load_dataset function (Load audio data), there is no problem at all; it loads exactly as expected. However, when I run exactly the same code in a 2nd Kaggle notebook, I get a FileNotFoundError. Full details are below. Can someone @sanchit-gandhi help me with this? I intend to use the 2nd notebook for a different prediction model/approach using Whisper (I tried Wav2Vec2 in my 1st notebook and am not getting the desired result).

os.listdir("/kaggle/working/traindata")

['17d469e3c0f8.mp3',
 '7f64f5ad7c72.mp3',
 '792590a0c97b.mp3',
 '40a932482ffa.mp3',
 '20bff6808089.mp3',
 '072c952790a8.mp3',
 'b960faf6e6c9.mp3',
 '7144e5a3951c.mp3',
 'metadata.csv',
 'e101861d7fc6.mp3',
 'dec206df575b.mp3',
 'fd9da8a487a6.mp3',
 'd4bf563e8d74.mp3',
 'a1db8eecfa15.mp3',
 'c4de30d87c19.mp3',
 '35ab3905df36.mp3',
 'd0dcd7a9aa9d.mp3',
 'd510c4a0f4c3.mp3',
 '67a9e9be989d.mp3',
 '665f9d30c16c.mp3',
 '6cc0c4fcd376.mp3',
 'bc11e6300bab.mp3',
 'f4f100cc5126.mp3',
 '82b81f884c22.mp3',
 'e3e147532ab4.mp3',
 'c4eb00849950.mp3',
 '86c743dbdffc.mp3',
 '1dc891cad82d.mp3',
 '05bb928d483e.mp3',
 'dcca4da43c55.mp3',
 '2ff894872320.mp3',
 '3df8624d57a5.mp3',
 'bdce6383b6a3.mp3',
 '9e156c339843.mp3',
 '45c2362f6c24.mp3',
 '802b1445767e.mp3',
 'c6d7f0e0d016.mp3',
 '151abc026c93.mp3',
 'a1317f179adb.mp3',
 '83e1b5ce808a.mp3',
 '278ed2187132.mp3',
 '90b8207f80a3.mp3',
 '7989b39c6806.mp3',
 '77738ec9edbc.mp3',
 '3ea951d7af47.mp3',
 'e3665ff03a0c.mp3',
 '6c4e4d9823ac.mp3',
 '27846b8f8edd.mp3',
 '8df70760f935.mp3',
 'd6f06a5c0e02.mp3',
 'e2ab17915a45.mp3',
 '16841afc8002.mp3',
 '284eeb420025.mp3',
 '99e567500082.mp3',
 'aa101d9351a2.mp3',
 '2d97709e1321.mp3',
 '1bcdd2ab7204.mp3',
 'f01dd698a636.mp3',
 '52cb9dc45a60.mp3',
 '56497258f4d4.mp3',
 'b81628311b82.mp3',
 'af9cfe48184c.mp3',
 '87961540a611.mp3',
 'aad63a719baf.mp3',
 '67ff0d4f0abe.mp3',
 'ba00881866dd.mp3',
 'b3793565c709.mp3',
 '7ceca0306fa1.mp3',
 '9db958779825.mp3',
 '24bc00853dfd.mp3',
 '82facfcaf4af.mp3',
 '9e849a13f4d2.mp3',
 '5b7577a65f36.mp3',
 '0e0cd7ae0a4b.mp3',
 '948012d60dbc.mp3',
 'dce2b585586b.mp3',
 'cc29d450c5fc.mp3',
 '178c4fbf1765.mp3',
 'ab2905e4bc54.mp3',
 'ef8953b95e6a.mp3',
 'c6eda3ea8c01.mp3',
 '244d9567f4ba.mp3',
 '078b6f9629b2.mp3',
 '75fb7aa16b1c.mp3',
 'c7d2473e379f.mp3',
 'd29197658275.mp3',
 '9680b1e57366.mp3',
 '403f13f1b957.mp3',
 '85391aa9a25f.mp3',
 '36229829da97.mp3',
 'ec9c15cb1e78.mp3',
 '84eb69de8f29.mp3',
 '973f2efec47a.mp3',
 'd4541f36bb70.mp3',
 '15b565b3a352.mp3',
 '9574a61cc33c.mp3',
 'a8d6bf1285d5.mp3',
 '62e2a304f67d.mp3',
 'e7b575cc3a88.mp3',
 '69a85e880d81.mp3',
 '3beed37e15c5.mp3']

train_dataset = load_dataset("audiofolder", data_dir="/kaggle/working/traindata")
train_dataset

Traceback (most recent call last):

in <module>:1
❱ 1 train_dataset = load_dataset("audiofolder", data_dir="/kaggle/working/traindata", drop_l
  2
  3 train_dataset

/opt/conda/lib/python3.10/site-packages/datasets/load.py:1664 in load_dataset
  1661     ignore_verifications = ignore_verifications or save_infos
  1662
  1663     # Create a dataset builder
❱ 1664     builder_instance = load_dataset_builder(
  1665         path=path,
  1666         name=name,
  1667         data_dir=data_dir,

/opt/conda/lib/python3.10/site-packages/datasets/load.py:1490 in load_dataset_builder
  1487     if use_auth_token is not None:
  1488         download_config = download_config.copy() if download_config else DownloadConfig(
  1489         download_config.use_auth_token = use_auth_token
❱ 1490     dataset_module = dataset_module_factory(
  1491         path,
  1492         revision=revision,
  1493         download_config=download_config,

/opt/conda/lib/python3.10/site-packages/datasets/load.py:1238 in dataset_module_factory
  1235                 if isinstance(e1, OfflineModeIsEnabled):
  1236                     raise ConnectionError(f"Couln't reach the Hugging Face Hub for datas
  1237                 if isinstance(e1, FileNotFoundError):
❱ 1238                     raise FileNotFoundError(
  1239                         f"Couldn't find a dataset script at {relative_to_absolute_path(c
  1240                         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e
  1241                     ) from None
FileNotFoundError: Couldn't find a dataset script at /kaggle/working/audiofolder/audiofolder.py or any data file in
the same directory. Couldn't find 'audiofolder' on the Hugging Face Hub either: FileNotFoundError: Couldn't find 
file at https://raw.githubusercontent.com/huggingface/datasets/master/datasets/audiofolder/audiofolder.py

@lewtun @Rocketknight1 @sgugger, can you help with the query in my post above?

You can find a solution to this issue here

(Kaggle installs an outdated datasets version by default, which does not contain the audiofolder loader)


@mariosasko thanks for the clarification. As far as I remember, I did update datasets using !pip install -U datasets huggingface-hub in both notebooks. I will check again by creating a new notebook (I have since deleted my 2nd notebook, as I could not move forward :smiling_face_with_tear:)

@mariosasko, I created a new Kaggle notebook (a 2nd notebook for the same competition, for the reason mentioned earlier) and tried all of the following. However, the problem remains.

1. !pip install -U datasets huggingface-hub
2. !pip install -U datasets
3. !pip install -U "datasets>=2.5.0" (the version specifier needs quoting, otherwise the shell treats >= as output redirection)
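The stale-import behaviour described in this thread can be diagnosed directly in a cell: compare the version pip has installed on disk with the version the already-imported module reports. A minimal sketch; the helper name `version_mismatch` is my own, not part of `datasets` or pip:

```python
# Compare the version recorded in the installed package's metadata (what pip
# put on disk) with the version the running interpreter has imported. If they
# differ, the kernel is still holding the old module and needs a restart.
import importlib.metadata


def version_mismatch(package: str, imported_version: str) -> bool:
    """Return True if the imported module's version differs from the
    version of the distribution currently installed on disk."""
    installed = importlib.metadata.version(package)
    return installed != imported_version
```

For example, `version_mismatch("datasets", datasets.__version__)` returning True would confirm that the upgrade succeeded on disk but the notebook is still running the old code.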

import datasets
from datasets import load_dataset

train_dataset = load_dataset("audiofolder", data_dir="/kaggle/working/traindata")

FileNotFoundError: Couldn't find a dataset script at /kaggle/working/audiofolder/audiofolder.py or any data file in
the same directory. Couldn't find 'audiofolder' on the Hugging Face Hub either: FileNotFoundError: Couldn't find 
file at https://raw.githubusercontent.com/huggingface/datasets/master/datasets/audiofolder/audiofolder.py

As I have mentioned, this code works fine in the 1st notebook.

To add some more data: below is the output I get when I first check the version, then update datasets, and re-check the version after import. Clearly it is not changing to the new version; it remains stuck on the old '2.1.0'. I would appreciate any advice on what can be done here.

datasets.__version__
'2.1.0'

!pip install -U datasets
Requirement already satisfied: datasets in /opt/conda/lib/python3.10/site-packages (2.14.4)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.10/site-packages (from datasets) (1.23.5)
Requirement already satisfied: pyarrow>=8.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (9.0.0)
Requirement already satisfied: dill<0.3.8,>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (0.3.6)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from datasets) (1.5.3)
Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (2.31.0)
Requirement already satisfied: tqdm>=4.62.1 in /opt/conda/lib/python3.10/site-packages (from datasets) (4.65.0)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.10/site-packages (from datasets) (3.2.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.10/site-packages (from datasets) (0.70.14)
Requirement already satisfied: fsspec[http]>=2021.11.1 in /opt/conda/lib/python3.10/site-packages (from datasets) (2023.6.0)
Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets) (3.8.4)
Requirement already satisfied: huggingface-hub<1.0.0,>=0.14.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (0.16.4)
Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from datasets) (21.3)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from datasets) (6.0)
Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (23.1.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (3.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (6.0.4)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (4.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from huggingface-hub<1.0.0,>=0.14.0->datasets) (3.12.2)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.10/site-packages (from huggingface-hub<1.0.0,>=0.14.0->datasets) (4.6.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.10/site-packages (from packaging->datasets) (3.0.9)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (2023.5.7)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)

import datasets
datasets.__version__
'2.1.0'

I am now facing this problem even with my first notebook :upside_down_face: :roll_eyes: :unamused:

After several attempts, my understanding is that an inconsistency in the Kaggle environment, or in the dependencies/priorities among the various packages I am installing and importing, is causing the issue. After clearing all outputs and restarting the kernel, the datasets version sometimes changes to 2.14.0, but most of the time it remains at the default 2.1.0.

However, as @mariosasko suggested, the 'audiofolder' loader does not work with 2.1.0; it works with 2.14.0 and other sufficiently recent versions.
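Given how misleading the FileNotFoundError is, one way to avoid losing time to this again is to fail fast: check the imported datasets version before calling load_dataset, since the audiofolder loader requires a newer release (hence the `datasets>=2.5.0` pin tried above). A hedged sketch; `require_min_version` is a hypothetical helper, not a datasets API, and the 2.5.0 floor is taken from the pin used earlier in this thread:

```python
# Fail fast with a clear message when the imported `datasets` module is too
# old for the "audiofolder" loader, instead of hitting the misleading
# FileNotFoundError shown earlier in the thread.


def require_min_version(current: str, minimum: str = "2.5.0") -> None:
    def as_tuple(version: str) -> tuple:
        # Compare only the numeric major.minor.patch components.
        return tuple(int(part) for part in version.split(".")[:3])

    if as_tuple(current) < as_tuple(minimum):
        raise RuntimeError(
            f"datasets {current} predates the 'audiofolder' loader "
            f"(needs >= {minimum}). Run `pip install -U datasets`, then "
            "RESTART the kernel so the new version is actually imported."
        )
```

Usage would be `require_min_version(datasets.__version__)` right after `import datasets`, so the notebook stops with an actionable message instead of a confusing traceback.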