Problem with Dataset Preview with audio files

Hi, I am looking for help with my test dataset. I plan to upload a large dataset to Hugging Face, but first I wanted to experiment and learn how to stream audio data. I have a problem with the Dataset Preview: I cannot get the audio samples to play in the preview. My dataset will consist of audio in mp3 format packed in a tar.gz file, plus metadata in a .tsv file with the audio file name and a description. Can you please help me and tell me what I did wrong in my dataloader (test.py · j-krzywdziak/test at main) that prevents the audio from previewing correctly? Any tip will be helpful, because this is my first time with Hugging Face and I am still a little unsure how it works. Unfortunately, the documentation did not help. Thanks in advance!

cc @severo do you know what the issue could be?

The dataset script is correct and the dataset is streamable, but for some reason the preview seems to return one mp3 and one wav.

@lhoestq @j-krzywdziak hey! This is weird, but I tried changing the config name from “Test Dataset” to “test-dataset” (to remove the space) and it worked. I cloned the repo; you can check here: polinaeterna/test-user · Datasets at Hugging Face
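For anyone hitting the same thing, here is a minimal sketch of how a config name could be normalized before it goes into a loading script. The helper name `safe_config_name` is hypothetical (it is not part of `datasets`), and the rule "lowercase letters, digits and hyphens only" is just a conservative assumption, not a documented requirement of the viewer.

```python
import re


def safe_config_name(name: str) -> str:
    """Hypothetical helper: turn a free-form config name into a conservative slug.

    Lowercases the name and collapses every run of characters other than
    a-z and 0-9 into a single hyphen, so spaces never reach the viewer.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    return slug.strip("-")


if __name__ == "__main__":
    print(safe_config_name("Test Dataset"))
```

The resulting string is what you would pass as the `name` of the config in the dataset script instead of the original name with a space.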

I created an issue: Support all the characters in dataset, config and split · Issue #853 · huggingface/datasets-server · GitHub

Hi @polinaeterna, I need to publish an ASR dataset and I would like to show the audio in the dataset viewer along with the .tsv data, but I don’t know how. After several unsuccessful attempts, I found your repo and, as a first step, tried to replicate it in gcjavi/dataviewer-tests, changing only the content of the .zip file (to some audio files from my dataset) and the .tsv file (to my transcriptions). However, the viewer only shows the audio fragments and the .tsv files are ignored. Do you know why this could be happening? Thanks in advance!

Hi @gcjavi! The currently recommended approach is no-code dataset configuration, without a custom dataset script. In your case you can use the AudioFolder structure for your repository to make the viewer work correctly. You need to structure your data according to the documentation. Note that the file with transcriptions must be called metadata.csv or metadata.jsonl, and the column names should be exactly file_name and transcription. You should also delete the Python script and update or delete the dataset_info field in the README, to avoid a mismatch between features and config names.
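To make the metadata part concrete, here is a rough sketch of converting an existing .tsv into the metadata.csv described above, using only the standard library. The function name and the source column names `audio_name` and `description` are assumptions for illustration; adjust them to whatever your .tsv actually contains.

```python
import csv
from pathlib import Path


def tsv_to_audiofolder_metadata(tsv_path, out_dir,
                                name_col="audio_name", text_col="description"):
    """Write out_dir/metadata.csv with the columns file_name and transcription.

    name_col and text_col are the (assumed) column names in the source .tsv.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "metadata.csv"
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=["file_name", "transcription"])
        writer.writeheader()
        for row in reader:
            # file_name is the audio file's path relative to metadata.csv
            writer.writerow({"file_name": row[name_col],
                             "transcription": row[text_col]})
    return out_path
```

Put the generated metadata.csv in the same folder as the audio files. If you have `datasets` installed, something like `load_dataset("audiofolder", data_dir="path/to/folder")` should let you check the result locally before uploading.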

This should work unless you have a really huge dataset. In that case I recommend first creating the dataset locally with your custom Python code (you might find Dataset.from_generator() useful) and then using .push_to_hub() to push the data to a Hub repo in Parquet format.
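A minimal sketch of that second route, assuming the layout from the original question (mp3 files in a tar.gz plus a .tsv with metadata). The function names, the column names `audio_name` and `description`, and the repo id are placeholders, and `datasets` is imported lazily so the generator can be tried on its own.

```python
import csv
import tarfile


def iter_examples(archive_path, tsv_path,
                  name_col="audio_name", text_col="description"):
    """Yield one example per audio file in the archive that is listed in the .tsv."""
    # Map audio file name -> description (column names are assumptions).
    with open(tsv_path, newline="", encoding="utf-8") as f:
        texts = {row[name_col]: row[text_col]
                 for row in csv.DictReader(f, delimiter="\t")}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            name = member.name.rsplit("/", 1)[-1]  # drop any folder prefix
            if name not in texts:
                continue  # skip files without metadata
            yield {
                "audio": {"path": name, "bytes": tar.extractfile(member).read()},
                "transcription": texts[name],
            }


def build_and_push(archive_path, tsv_path, repo_id):
    """Build the dataset locally, then push it to the Hub."""
    from datasets import Audio, Dataset, Features, Value  # needs `pip install datasets`

    features = Features({"audio": Audio(), "transcription": Value("string")})
    ds = Dataset.from_generator(
        iter_examples,
        gen_kwargs={"archive_path": archive_path, "tsv_path": tsv_path},
        features=features,
    )
    ds.push_to_hub(repo_id)  # requires being logged in to the Hub
    return ds
```

Usage would look like `build_and_push("audio.tar.gz", "metadata.tsv", "your-username/your-dataset")`, with the paths and repo id replaced by your own.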

Let me know if that worked :slight_smile:

Thank you very much for your quick reply. I tried it on a small dataset and this structure works perfectly. The next step is to upload a bigger dataset; I will test the push_to_hub() method as you suggested. Thanks! :grin:
