How to access path to audio recording in datasets 4.0?

alerio · September 17, 2025, 7:37pm

Hello everyone,

I’m wondering if someone can show me how to access the audio path in datasets 4.0 for an audio dataset that has no metadata to cross-reference.

In datasets==3.6, with soundfile handling audio decoding, each loaded audio example has fields such as path and array, which makes it possible to run inference on recordings that have no metadata and still know which file each one came from. With the recent update, the path field seems to be gone:

from datasets import Dataset, Audio

audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
audio = audio_dataset[0]["audio"]
audio
# <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()
# actual array samples

But the "path/to/audio_1" is somehow missing in this process. Can anyone show me how to access this info? Thanks in advance!

John6666 · September 17, 2025, 9:40pm

The way audio datasets are handled seems to have changed quite a bit.

Short answer: cast the column with decode=False and read example["audio"]["path"]. The AudioDecoder object in Datasets 4.x doesn’t expose a file path; it only decodes samples. (PyTorch Documentation)

Minimal fixes

Local files (your case):

from datasets import Dataset, Audio

ds = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2"]})
ds = ds.cast_column("audio", Audio(decode=False))     # keep path/bytes
print(ds[0]["audio"]["path"])                         # "path/to/audio_1"

Docs show this exact pattern and return structure. (Hugging Face)

Keep path + still use decoders:

# 1) expose path
ds = ds.cast_column("audio", Audio(decode=False))
ds = ds.map(lambda ex: {"audio_path": ex["audio"]["path"]})

# 2) switch back to decoder objects for modeling
ds = ds.cast_column("audio", Audio())                 # now AudioDecoder
# audio_path column stays available

The docs describe this behavior and recommend decode=False when you need the path or bytes. (Hugging Face)
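If the paths are already in hand when the dataset is built, another option is to store them twice up front: once in the column that gets cast to Audio and once as a plain string column. This is only a sketch of that idea, not something taken from the docs. `with_path_column` and `build_dataset` are hypothetical helper names, and `audio_path` is simply the column name used in the snippet above.

```python
def with_path_column(paths, audio_column="audio", path_column="audio_path"):
    """Return a column dict holding each path twice: once to decode, once as text."""
    paths = list(paths)
    return {audio_column: paths, path_column: list(paths)}


def build_dataset(paths):
    """Hypothetical helper: build a dataset whose audio column decodes lazily."""
    # Imported lazily so with_path_column stays usable without `datasets` installed.
    from datasets import Audio, Dataset

    ds = Dataset.from_dict(with_path_column(paths))
    # Only "audio" is cast; "audio_path" remains a plain string column.
    return ds.cast_column("audio", Audio())
```

With this layout, each row carries the path in `audio_path` and the decoder in `audio`, so no second cast is needed.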

Streaming datasets:

from datasets import load_dataset
ds = load_dataset("username/dataset", split="train", streaming=True).decode(False)
first = next(iter(ds))
print(first["audio"]["path"])                          # path or None if only bytes

.decode(False) disables feature decoding on streaming so you can iterate paths/bytes. (Hugging Face)
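To collect paths from more than the first row, a small generator can wrap the iteration. This is a minimal sketch that assumes each row's audio field is a dict with a "path" key, as it is when decoding is disabled. `iter_audio_paths` is a hypothetical helper name, and it accepts any iterable of rows, not only a streaming dataset.

```python
from itertools import islice


def iter_audio_paths(examples, column="audio", limit=None):
    """Yield the stored path (or None) for each row in an iterable of examples."""
    # islice with limit=None walks the whole iterable; an integer stops early.
    for example in islice(examples, limit):
        audio = example[column]
        # With decoding disabled, the audio field is a dict with "path" and "bytes".
        yield audio.get("path")
```

For a streaming dataset this would be called as `list(iter_audio_paths(ds, limit=10))`, which stops after ten rows.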

Notes

In v4, audio_dataset[0]["audio"] returns a TorchCodec AudioDecoder. Use .get_all_samples() for samples, but do not expect a path on that object. (Hugging Face)
Depending on the dataset, you may see a cache path or raw bytes when decoding is disabled. The docs show both possibilities. (Hugging Face)
v4 moved audio decoding from SoundFile to TorchCodec; release notes confirm the new AudioDecoder default and legacy indexing only for array and sampling rate. (GitHub)
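Since a row may carry a path, raw bytes, or both when decoding is disabled, code that needs a file on disk has to handle each case. The sketch below is one way to do that under stated assumptions: `resolve_audio_path` is a hypothetical helper, and the ".wav" fallback suffix is a guess that only applies when no original extension is known.

```python
import os
import tempfile


def resolve_audio_path(audio, default_suffix=".wav"):
    """Return a file path for an undecoded audio dict ({"path": ..., "bytes": ...})."""
    path = audio.get("path")
    if path and os.path.exists(path):
        # The stored path points at a real file (local or cached): use it directly.
        return path
    data = audio.get("bytes")
    if data is None:
        # Nothing embedded, so the stored path (possibly None) is all there is.
        return path
    # Bytes are embedded: write them out, reusing the original extension if known.
    suffix = os.path.splitext(path)[1] if path else ""
    fd, tmp_path = tempfile.mkstemp(suffix=suffix or default_suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return tmp_path
```

The caller is responsible for deleting any temporary file this creates.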

Helpful refs: HF “Load audio data” and “Dataset features” pages and the v4.0 release notes. (Hugging Face)

