How to access path to audio recording in datasets 4.0?

Hello everyone,

I’m wondering if someone can show me how to access the audio path in datasets 4.0 for an audio dataset that has no metadata to cross-reference.

In datasets==3.6, with soundfile handling the audio, the loaded audio dataset has properties such as path and audio, which makes it possible to run inference on recordings that have no metadata. With the recent update, the path property seems to be gone:

from datasets import Dataset, Audio

audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
audio = audio_dataset[0]["audio"]
audio
# <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()
# actual array samples

But the "path/to/audio_1" is somehow missing in this process. Can anyone show me how to access this info? Thanks in advance!

The way audio datasets are handled seems to have changed quite a bit.


Short answer: cast the column with decode=False and read example["audio"]["path"]. The AudioDecoder object in Datasets 4.x doesn’t expose a file path; it only decodes samples. (PyTorch Documentation)

Minimal fixes

Local files (your case):

from datasets import Dataset, Audio

ds = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2"]})
ds = ds.cast_column("audio", Audio(decode=False))     # keep path/bytes
print(ds[0]["audio"]["path"])                         # "path/to/audio_1"

Docs show this exact pattern and return structure. (Hugging Face)

Keep path + still use decoders:

# 1) expose path
ds = ds.cast_column("audio", Audio(decode=False))
ds = ds.map(lambda ex: {"audio_path": ex["audio"]["path"]})

# 2) switch back to decoder objects for modeling
ds = ds.cast_column("audio", Audio())                 # now AudioDecoder
# audio_path column stays available

The docs describe this behavior and recommend decode=False when you need the path or bytes. (Hugging Face)

Streaming datasets:

from datasets import load_dataset
ds = load_dataset("username/dataset", split="train", streaming=True).decode(False)
first = next(iter(ds))
print(first["audio"]["path"])                          # path or None if only bytes

.decode(False) disables feature decoding on a streaming dataset, so you can iterate over paths/bytes instead of decoder objects. (Hugging Face)

Notes

  • In v4, audio_dataset[0]["audio"] returns a TorchCodec AudioDecoder. Use .get_all_samples() for samples, but do not expect a path on that object. (Hugging Face)
  • Depending on the dataset, you may see a cache path or raw bytes when decoding is disabled. The docs show both possibilities. (Hugging Face)
  • v4 moved audio decoding from SoundFile to TorchCodec; the release notes confirm that AudioDecoder is the new default and that legacy dict-style indexing is kept only for the array and sampling rate. (GitHub)
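
Since an undecoded entry can carry a path, raw bytes, or both, a small helper can normalize the two cases. This is only a sketch: resolve_audio_path is a hypothetical name, not part of datasets, and it assumes the entry is a dict shaped like {"path": ..., "bytes": ...}, as returned with decode=False. The ".wav" suffix is also an assumption, because the real container format is not known from the bytes alone.

```python
import os
import tempfile
from typing import Optional


def resolve_audio_path(audio: dict, suffix: str = ".wav") -> Optional[str]:
    """Return a file path for an undecoded audio entry, or None.

    Hypothetical helper: `audio` is assumed to look like
    {"path": str | None, "bytes": bytes | None}.
    """
    path = audio.get("path")
    if path:
        # A path is already recorded, so use it as-is.
        return path

    data = audio.get("bytes")
    if data is None:
        # Neither a path nor bytes: nothing to resolve.
        return None

    # Only bytes are available: write them to a temporary file.
    # The suffix is a guess; adjust it to the actual audio format.
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return tmp_path
```

With the dataset cast to Audio(decode=False), this could be called as resolve_audio_path(ds[0]["audio"]). Temporary files created this way are not cleaned up automatically, so remove them once you are done.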

Helpful refs: HF “Load audio data” and “Dataset features” pages and the v4.0 release notes. (Hugging Face)