Stuck in 'Preprocessing audio data' in HF's Audio Course while following 'Filtering the dataset'

mannerism · January 4, 2024, 4:28am

Hello,

I’ve come across many tutorial articles, but this is one of the most well-organized and insightful courses I’ve encountered so far. Amazing work, and I really appreciate the time you spent putting these goodies together.

While I was cherishing every bit of information in the course, I ran into a situation (likely a problem with my local file structure) where I wasn’t able to run:

new_column = [librosa.get_duration(path=x) for x in minds["path"]]

because librosa can’t find the file at the path provided by minds, outputting the following error message:

/var/folders/4b/xjs02jw50cn561qgx_9d2z200000gn/T/ipykernel_78132/3624879537.py:2: FutureWarning: PySoundFile failed. Trying audioread instead.
	Audioread support is deprecated in librosa 0.10.0 and will be removed in version 1.0.
  new_column = [librosa.get_duration(path=x) for x in minds["path"]]
---------------------------------------------------------------------------
LibsndfileError                           Traceback (most recent call last)
File /opt/homebrew/lib/python3.11/site-packages/librosa/core/audio.py:795, in get_duration(y, sr, S, n_fft, hop_length, center, path, filename)
    794 try:
--> 795     return sf.info(path).duration  # type: ignore
    796 except sf.SoundFileRuntimeError:

File /opt/homebrew/lib/python3.11/site-packages/soundfile.py:467, in info(file, verbose)
    460 """Returns an object with information about a `SoundFile`.
    461 
    462 Parameters
   (...)
    465     Whether to print additional information.
    466 """
--> 467 return _SoundFileInfo(file, verbose)

File /opt/homebrew/lib/python3.11/site-packages/soundfile.py:412, in _SoundFileInfo.__init__(self, file, verbose)
    411 self.verbose = verbose
--> 412 with SoundFile(file) as f:
    413     self.name = f.name

File /opt/homebrew/lib/python3.11/site-packages/soundfile.py:658, in SoundFile.__init__(self, file, mode, samplerate, channels, subtype, endian, format, closefd)
    656 self._info = _create_info_struct(file, mode, samplerate, channels,
    657                                  format, subtype, endian)
--> 658 self._file = self._open(file, mode_int, closefd)
    659 if set(mode).issuperset('r+') and self.seekable():
    660     # Move write position to 0 (like in Python file objects)

File /opt/homebrew/lib/python3.11/site-packages/soundfile.py:1216, in SoundFile._open(self, file, mode_int, closefd)
   1215     err = _snd.sf_error(file_ptr)
-> 1216     raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
   1217 if mode_int == _snd.SFM_WRITE:
   1218     # Due to a bug in libsndfile version <= 1.0.25, frames != 0
   1219     # when opening a named pipe in SFM_WRITE mode.
   1220     # See http://github.com/erikd/libsndfile/issues/77.

LibsndfileError: Error opening '/storage/hf-datasets-cache/all/datasets/27907695716030-config-parquet-and-info-PolyAI-minds14-941a5af2/downloads/extracted/a87e442545495cdb67dfdcbc9d4f35d234c9f8e471449b2db58d7c81b62f001a/en-AU~PAY_BILL/response_4.wav': System error.

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
Cell In[14], line 2
      1 # use librosa to get example's duration from the audio file
----> 2 new_column = [librosa.get_duration(path=x) for x in minds["path"]]
      3 minds = minds.add_column("duration", new_column)
      5 # # use 🤗 Datasets' `filter` method to apply the filtering function
      6 # minds = minds.filter(is_audio_length_in_range, input_columns=["duration"])
      7 
      8 # # remove the temporary helper column
      9 # minds = minds.remove_columns(["duration"])
     10 # minds

Cell In[14], line 2, in <listcomp>(.0)
      1 # use librosa to get example's duration from the audio file
----> 2 new_column = [librosa.get_duration(path=x) for x in minds["path"]]
      3 minds = minds.add_column("duration", new_column)
      5 # # use 🤗 Datasets' `filter` method to apply the filtering function
      6 # minds = minds.filter(is_audio_length_in_range, input_columns=["duration"])
      7 
      8 # # remove the temporary helper column
      9 # minds = minds.remove_columns(["duration"])
     10 # minds

File /opt/homebrew/lib/python3.11/site-packages/librosa/core/audio.py:804, in get_duration(y, sr, S, n_fft, hop_length, center, path, filename)
    796     except sf.SoundFileRuntimeError:
    797         warnings.warn(
    798             "PySoundFile failed. Trying audioread instead."
    799             "\n\tAudioread support is deprecated in librosa 0.10.0"
   (...)
    802             category=FutureWarning,
    803         )
--> 804         with audioread.audio_open(path) as fdesc:
    805             return fdesc.duration  # type: ignore
    807 if y is None:

File /opt/homebrew/lib/python3.11/site-packages/audioread/__init__.py:127, in audio_open(path, backends)
    125 for BackendClass in backends:
    126     try:
--> 127         return BackendClass(path)
    128     except DecodeError:
    129         pass

File /opt/homebrew/lib/python3.11/site-packages/audioread/rawread.py:59, in RawAudioFile.__init__(self, filename)
     58 def __init__(self, filename):
---> 59     self._fh = open(filename, 'rb')
     61     try:
     62         self._file = aifc.open(self._fh)

FileNotFoundError: [Errno 2] No such file or directory: '/storage/hf-datasets-cache/all/datasets/27907695716030-config-parquet-and-info-PolyAI-minds14-941a5af2/downloads/extracted/a87e442545495cdb67dfdcbc9d4f35d234c9f8e471449b2db58d7c81b62f001a/en-AU~PAY_BILL/response_4.wav'

To debug the issue, I printed minds[0]['path'] and got /storage/hf-datasets-cache/all/datasets/27907695716030-config-parquet-and-info-PolyAI-minds14-941a5af2/downloads/extracted/a87e442545495cdb67dfdcbc9d4f35d234c9f8e471449b2db58d7c81b62f001a/en-AU~PAY_BILL/response_4.wav.

That path doesn’t look unusual, and it is exactly the path librosa is trying to open, yet the file is not found there.
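A quick way to narrow this down is to check whether the paths stored in the dataset exist on the local disk at all. The snippet below is a minimal sketch of that check; `missing_paths` is a hypothetical helper name, and the commented usage assumes `minds` is the dataset loaded as in the course.

```python
import os


def missing_paths(paths):
    """Return the paths from `paths` that do not exist on the local filesystem."""
    return [p for p in paths if not os.path.exists(p)]


# Hypothetical usage with the dataset from this post (not executed here):
# missing = missing_paths(list(minds["path"])[:10])
# print(len(missing), "of the first 10 paths are missing locally")
```

If every path comes back as missing, the `path` column is probably not pointing at files on this machine, which would match the FileNotFoundError in the traceback above.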

What could be the issue?

Thank you for your help

samlansley · January 14, 2024, 6:58pm

I have the same issue, any ideas on how to fix this? Do the data files actually get downloaded?

LJulio · January 15, 2025, 6:30am

May I ask if your issue has been resolved? I also encountered the same problem, thank you~
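One possible workaround, offered as a sketch and not a confirmed fix: compute the duration from the decoded audio instead of re-opening the file by path. This assumes each example's `audio` entry is a dict with a one-dimensional (mono) `array` and a `sampling_rate`, as the course shows for this dataset. `duration_in_seconds` is a hypothetical helper name.

```python
def duration_in_seconds(audio):
    """Duration of a decoded mono clip: number of samples divided by the sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical usage in place of the librosa call (assumes `minds` is loaded
# as in the course and that its "audio" column decodes successfully):
# new_column = [duration_in_seconds(example["audio"]) for example in minds]
# minds = minds.add_column("duration", new_column)
```

Because this never touches the `path` value, it does not depend on that file existing locally. librosa.get_duration also accepts the samples directly through its y and sr arguments, which should give the same number.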
