Problem with Dataset Preview with audio files

j-krzywdziak · February 10, 2023, 9:38pm

Hi, I am looking for help with my test dataset. I plan to upload a large dataset to Hugging Face, but first I wanted to experiment and learn how to stream audio data. I have a problem with the Dataset Preview: I cannot get the audio samples to play in the preview. My dataset will consist of audio in mp3 format packed in a tar.gz file, plus metadata in a .tsv file with the audio name and some description. Can you please help me and maybe tell me what I did wrong in my dataloader (test.py · j-krzywdziak/test at main) that prevents the audio from being previewed correctly? Any tip will be helpful, because this is my first time with Hugging Face and I am still a little unsure about how it works. Unfortunately, the documentation did not help. Thanks in advance!

lhoestq · February 23, 2023, 11:39am

cc @severo do you know what could be the issue?

The dataset script is correct and the dataset is streamable, but for some reason the preview seems to return one mp3 and one wav.

polinaeterna · February 24, 2023, 12:09pm

@lhoestq @j-krzywdziak hey! This is weird, but I tried changing the config name from “Test Dataset” to “test-dataset” (to remove the space) and it worked. I cloned the repo, you can check here: polinaeterna/test-user · Datasets at Hugging Face

severo · February 27, 2023, 8:33am

I created an issue: Support all the characters in dataset, config and split · Issue #853 · huggingface/datasets-server · GitHub

gcjavi · March 5, 2024, 1:42pm

Hi @polinaeterna, I need to publish an ASR dataset and I would like to show the audio in the dataset viewer along with the .tsv data, but I don’t know how. After several unsuccessful attempts, I found your repo and tried to replicate it as a first step (gcjavi/dataviewer-tests), only replacing the content of the .zip file with some audio files from my dataset and the .tsv file with my transcriptions. However, the viewer only shows the audio fragments and the .tsv files are ignored. Do you know why this could be happening? Thanks in advance!

polinaeterna · March 7, 2024, 11:15am

Hi @gcjavi! The currently recommended approach is no-code dataset configuration, without custom dataset scripts. In your case you can use the AudioFolder structure for your repository to make the viewer work correctly. You need to structure your data according to the documentation. Note that the file with transcriptions must be called metadata.csv or metadata.jsonl, and the column names should be strictly file_name and transcription. You should also delete the Python script, and update or delete the dataset_info field in the README, to avoid a mismatch between features and config names.
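
For anyone converting an existing .tsv file, here is a minimal sketch of producing a metadata.csv with the two column names mentioned above. It uses only the standard library. The function name and the input column names `audio_name` and `description` are assumptions for illustration, so adjust them to match your own file.

```python
import csv
from pathlib import Path


def tsv_to_metadata_csv(tsv_path, out_dir,
                        name_column="audio_name",      # assumed TSV header
                        text_column="description"):    # assumed TSV header
    """Write out_dir/metadata.csv with file_name and transcription columns."""
    out_path = Path(out_dir) / "metadata.csv"
    with open(tsv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=["file_name", "transcription"])
        writer.writeheader()
        for row in reader:
            # file_name should be the audio file's path relative to metadata.csv
            writer.writerow({
                "file_name": row[name_column],
                "transcription": row[text_column],
            })
    return out_path
```

The resulting metadata.csv goes in the same folder as the audio files it refers to (or a parent folder, with file_name holding the relative path).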

This should work unless you have a really huge dataset. In that case I recommend first creating the dataset locally with your custom Python code (you might find Dataset.from_generator() useful) and then using .push_to_hub() to push the data to a Hub repo in Parquet format.
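
A rough sketch of that second route, assuming the `datasets` library is installed and you are logged in to the Hub. The helper names, the TSV column names (`file_name`, `transcription`), and the repo id are placeholders rather than anything from this thread.

```python
import csv
from pathlib import Path


def generate_examples(audio_dir, tsv_path):
    """Yield one example per TSV row: an audio file path plus its transcription."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield {
                "audio": str(Path(audio_dir) / row["file_name"]),
                "transcription": row["transcription"],
            }


def build_and_push(audio_dir, tsv_path, repo_id):
    """Build the dataset locally, then upload it to the Hub as Parquet."""
    from datasets import Audio, Dataset  # third-party: pip install datasets

    ds = Dataset.from_generator(
        generate_examples,
        gen_kwargs={"audio_dir": audio_dir, "tsv_path": tsv_path},
    )
    # Mark the path column as audio so the viewer can render a player.
    ds = ds.cast_column("audio", Audio())
    ds.push_to_hub(repo_id)  # e.g. "your-username/your-dataset" (placeholder)
```

Keeping the generator free of `datasets` imports makes it easy to check the rows on their own, for example with `next(generate_examples(...))`, before starting a long upload.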

Let me know if that worked

gcjavi · March 7, 2024, 12:58pm

Thank you very much for your quick reply. I tried it on a small dataset and this structure works perfectly. The next step is to upload a bigger dataset; I will test the push_to_hub() method as you suggested. Thanks!

pr0mila-gh0sh · April 17, 2025, 12:26pm

I’ve developed a project that streams audio in the datasets viewer on Hugging Face using Parquet format.
