Stuck in 'Preprocessing audio data' in HF's Audio Course while following 'Filtering the dataset'

Hello,

I’ve come across many tutorial articles, but this is one of the most well-organized and insightful courses I’ve encountered so far. Amazing work, and I really appreciate the time you put into assembling these goodies.

While I was cherishing every bit of information in the course, I ran into a situation (likely a problem with my local file structure) where I wasn’t able to run:

new_column = [librosa.get_duration(path=x) for x in minds["path"]]

because librosa can’t find the file at the path provided by minds, outputting the following error message:

/var/folders/4b/xjs02jw50cn561qgx_9d2z200000gn/T/ipykernel_78132/3624879537.py:2: FutureWarning: PySoundFile failed. Trying audioread instead.
	Audioread support is deprecated in librosa 0.10.0 and will be removed in version 1.0.
  new_column = [librosa.get_duration(path=x) for x in minds["path"]]
---------------------------------------------------------------------------
LibsndfileError                           Traceback (most recent call last)
File /opt/homebrew/lib/python3.11/site-packages/librosa/core/audio.py:795, in get_duration(y, sr, S, n_fft, hop_length, center, path, filename)
    794 try:
--> 795     return sf.info(path).duration  # type: ignore
    796 except sf.SoundFileRuntimeError:

File /opt/homebrew/lib/python3.11/site-packages/soundfile.py:467, in info(file, verbose)
    460 """Returns an object with information about a `SoundFile`.
    461 
    462 Parameters
   (...)
    465     Whether to print additional information.
    466 """
--> 467 return _SoundFileInfo(file, verbose)

File /opt/homebrew/lib/python3.11/site-packages/soundfile.py:412, in _SoundFileInfo.__init__(self, file, verbose)
    411 self.verbose = verbose
--> 412 with SoundFile(file) as f:
    413     self.name = f.name

File /opt/homebrew/lib/python3.11/site-packages/soundfile.py:658, in SoundFile.__init__(self, file, mode, samplerate, channels, subtype, endian, format, closefd)
    656 self._info = _create_info_struct(file, mode, samplerate, channels,
    657                                  format, subtype, endian)
--> 658 self._file = self._open(file, mode_int, closefd)
    659 if set(mode).issuperset('r+') and self.seekable():
    660     # Move write position to 0 (like in Python file objects)

File /opt/homebrew/lib/python3.11/site-packages/soundfile.py:1216, in SoundFile._open(self, file, mode_int, closefd)
   1215     err = _snd.sf_error(file_ptr)
-> 1216     raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
   1217 if mode_int == _snd.SFM_WRITE:
   1218     # Due to a bug in libsndfile version <= 1.0.25, frames != 0
   1219     # when opening a named pipe in SFM_WRITE mode.
   1220     # See http://github.com/erikd/libsndfile/issues/77.

LibsndfileError: Error opening '/storage/hf-datasets-cache/all/datasets/27907695716030-config-parquet-and-info-PolyAI-minds14-941a5af2/downloads/extracted/a87e442545495cdb67dfdcbc9d4f35d234c9f8e471449b2db58d7c81b62f001a/en-AU~PAY_BILL/response_4.wav': System error.

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
Cell In[14], line 2
      1 # use librosa to get example's duration from the audio file
----> 2 new_column = [librosa.get_duration(path=x) for x in minds["path"]]
      3 minds = minds.add_column("duration", new_column)
      5 # # use 🤗 Datasets' `filter` method to apply the filtering function
      6 # minds = minds.filter(is_audio_length_in_range, input_columns=["duration"])
      7 
      8 # # remove the temporary helper column
      9 # minds = minds.remove_columns(["duration"])
     10 # minds

Cell In[14], line 2, in <listcomp>(.0)
      1 # use librosa to get example's duration from the audio file
----> 2 new_column = [librosa.get_duration(path=x) for x in minds["path"]]
      3 minds = minds.add_column("duration", new_column)
      5 # # use 🤗 Datasets' `filter` method to apply the filtering function
      6 # minds = minds.filter(is_audio_length_in_range, input_columns=["duration"])
      7 
      8 # # remove the temporary helper column
      9 # minds = minds.remove_columns(["duration"])
     10 # minds

File /opt/homebrew/lib/python3.11/site-packages/librosa/core/audio.py:804, in get_duration(y, sr, S, n_fft, hop_length, center, path, filename)
    796     except sf.SoundFileRuntimeError:
    797         warnings.warn(
    798             "PySoundFile failed. Trying audioread instead."
    799             "\n\tAudioread support is deprecated in librosa 0.10.0"
   (...)
    802             category=FutureWarning,
    803         )
--> 804         with audioread.audio_open(path) as fdesc:
    805             return fdesc.duration  # type: ignore
    807 if y is None:

File /opt/homebrew/lib/python3.11/site-packages/audioread/__init__.py:127, in audio_open(path, backends)
    125 for BackendClass in backends:
    126     try:
--> 127         return BackendClass(path)
    128     except DecodeError:
    129         pass

File /opt/homebrew/lib/python3.11/site-packages/audioread/rawread.py:59, in RawAudioFile.__init__(self, filename)
     58 def __init__(self, filename):
---> 59     self._fh = open(filename, 'rb')
     61     try:
     62         self._file = aifc.open(self._fh)

FileNotFoundError: [Errno 2] No such file or directory: '/storage/hf-datasets-cache/all/datasets/27907695716030-config-parquet-and-info-PolyAI-minds14-941a5af2/downloads/extracted/a87e442545495cdb67dfdcbc9d4f35d234c9f8e471449b2db58d7c81b62f001a/en-AU~PAY_BILL/response_4.wav'

To debug the issue, I printed minds[0]['path'] and got:

/storage/hf-datasets-cache/all/datasets/27907695716030-config-parquet-and-info-PolyAI-minds14-941a5af2/downloads/extracted/a87e442545495cdb67dfdcbc9d4f35d234c9f8e471449b2db58d7c81b62f001a/en-AU~PAY_BILL/response_4.wav

That doesn’t look unusual, and it is the same path librosa is trying to open.
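
One way to confirm whether the file is really absent from the local machine is to test the path directly. Below is a minimal sketch using only the standard library; `find_missing_paths` is a hypothetical helper name, not part of librosa or 🤗 Datasets:

```python
import os


def find_missing_paths(paths):
    """Return the entries of `paths` that are not existing files on this machine."""
    return [p for p in paths if not os.path.isfile(p)]


# With the dataset loaded, a few entries could be checked like this
# (assumes `minds` has the "path" column used above):
# missing = find_missing_paths(minds["path"][:5])
# print(len(missing), "of the first 5 paths do not exist locally")
```

If every path is reported as missing, the "path" column most likely refers to a location on another machine, which would be consistent with the FileNotFoundError above.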

What could be the issue?

Thank you for your help

I have the same issue. Any ideas on how to fix this? Do the data files actually get downloaded?
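
If the "path" column points to a location that does not exist locally, one possible workaround is to compute the duration from the decoded audio instead of from the file. The sketch below assumes each example exposes its audio as a dict with an "array" of samples and a "sampling_rate"; `duration_in_seconds` is a hypothetical helper, and the example data is a stand-in rather than real MINDS-14 audio:

```python
def duration_in_seconds(audio):
    """Duration of one decoded clip: number of samples divided by samples per second."""
    return len(audio["array"]) / audio["sampling_rate"]


# Stand-in example with the assumed layout: 16,000 samples at 8 kHz.
example_audio = {"array": [0.0] * 16000, "sampling_rate": 8000}
print(duration_in_seconds(example_audio))  # 2.0

# Under the same assumption, the duration column might be built as:
# new_column = [duration_in_seconds(a) for a in minds["audio"]]
```

This avoids touching the file system, at the cost of decoding every clip, so it may be slower than reading the duration from file headers.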