The way audio datasets are handled seems to have changed quite a bit.
Short answer: cast the column with decode=False and read example["audio"]["path"]. The AudioDecoder object in Datasets 4.x doesn’t expose a file path; it only decodes samples. (PyTorch Documentation)
Minimal fixes
Local files (your case):
from datasets import Dataset, Audio
ds = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2"]})
ds = ds.cast_column("audio", Audio(decode=False)) # keep path/bytes
print(ds[0]["audio"]["path"]) # "path/to/audio_1"
Docs show this exact pattern and return structure. (Hugging Face)
Keep path + still use decoders:
# 1) expose path
ds = ds.cast_column("audio", Audio(decode=False))
ds = ds.map(lambda ex: {"audio_path": ex["audio"]["path"]})
# 2) switch back to decoder objects for modeling
ds = ds.cast_column("audio", Audio()) # now AudioDecoder
# audio_path column stays available
This behavior, and the recommendation to use decode=False to get the path/bytes, are documented. (Hugging Face)
Streaming datasets:
from datasets import load_dataset
ds = load_dataset("username/dataset", split="train", streaming=True).decode(False)
first = next(iter(ds))
print(first["audio"]["path"]) # path or None if only bytes
.decode(False) disables feature decoding on streaming datasets, so you can iterate over paths/bytes. (Hugging Face)
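With decoding disabled, the audio field is a plain dict, and either the path or the bytes may be missing depending on the source. The following is a minimal sketch for handling both cases, assuming the dict has the shape {"path": ..., "bytes": ...}. resolve_audio_path is a hypothetical helper, not part of datasets, and the ".wav" fallback suffix is an assumption.

```python
import os
import tempfile
from typing import Optional


def resolve_audio_path(audio: dict, suffix: str = ".wav") -> Optional[str]:
    """Return a usable file path for an undecoded audio dict.

    Hypothetical helper (not part of `datasets`). Assumes `audio` looks like
    {"path": str | None, "bytes": bytes | None}, as seen with decode=False.
    """
    path = audio.get("path")
    if path and os.path.exists(path):
        return path  # a real file on disk (local path or cache path)

    data = audio.get("bytes")
    if data is not None:
        # Only bytes are usable: write them to a temporary file.
        # Reuse the extension from `path` when it hints at one.
        ext = os.path.splitext(path)[1] if path else ""
        with tempfile.NamedTemporaryFile(delete=False, suffix=ext or suffix) as tmp:
            tmp.write(data)
            return tmp.name

    return path  # None, or a path that does not exist locally
```

Usage would look like resolve_audio_path(first["audio"]). The temporary file is not deleted automatically, so remove it once you are done with it.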
Notes
- In v4, audio_dataset[0]["audio"] returns a TorchCodec AudioDecoder. Use .get_all_samples() for samples, but do not expect a path on that object. (Hugging Face)
- Depending on the dataset, you may see a cache path or raw bytes when decoding is disabled. The docs show both possibilities. (Hugging Face)
- v4 moved audio decoding from SoundFile to TorchCodec; the release notes confirm that AudioDecoder is the new default and that legacy dict-style indexing is kept only for the array and sampling rate. (GitHub)
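The two access styles mentioned in the notes can be wrapped in one duck-typed function. This is only a sketch: read_samples is a hypothetical name, and it assumes that .get_all_samples() returns an object with data and sample_rate attributes (as TorchCodec's AudioSamples does), falling back to the legacy "array" and "sampling_rate" keys otherwise.

```python
from typing import Any, Tuple


def read_samples(audio: Any) -> Tuple[Any, int]:
    """Return (samples, sample_rate) from a decoder object or a legacy dict.

    Hypothetical helper. Assumes a v4-style decoder exposes get_all_samples()
    whose result has `.data` and `.sample_rate`; anything else is treated as
    a mapping with "array" and "sampling_rate" keys.
    """
    if hasattr(audio, "get_all_samples"):
        samples = audio.get_all_samples()
        return samples.data, samples.sample_rate
    # Legacy-style access by key.
    return audio["array"], audio["sampling_rate"]
```

Neither branch yields a file path; keep a separate audio_path column, as shown earlier, if you need one.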
Helpful refs: HF “Load audio data” and “Dataset features” pages and the v4.0 release notes. (Hugging Face)