Create the Mozilla Common Voice Data

Owos · November 14, 2022, 12:59pm

So I have these audio files and their corresponding CSV file, and I would like to turn them into a dataset like the Mozilla Common Voice dataset, which looks like this when you read it in Python:

{
  'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5', 
  'path': 'et/clips/common_voice_et_18318995.mp3', 
  'audio': {
    'path': 'et/clips/common_voice_et_18318995.mp3', 
    'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32), 
    'sampling_rate': 48000
  }, 
  'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.', 
  'up_votes': 2, 
  'down_votes': 0, 
  'age': 'twenties', 
  'gender': 'male', 
  'accent': '', 
  'locale': 'et', 
  'segment': ''
}

My question is: how do I create the audio column, and how was the array feature generated?

polinaeterna · November 15, 2022, 9:46am

Hi @Owos! To convert audio files to arrays, datasets has an Audio feature that decodes audio on the fly.

I’m not sure I understand your question, but if you want to create your own custom audio dataset from your files, similar to Common Voice, you can check out our guide about audio datasets and the other docs in the Audio section. Feel free to ask more questions if it’s not clear enough, or open an issue if you think something should be changed in the docs.
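For illustration, here is a minimal sketch of how that could look with audio files plus a CSV. It makes some assumptions that are not from this thread: the CSV is assumed to have a `file_name` column and a `sentence` column, and the helper names `build_columns` and `make_dataset` are hypothetical. The `Dataset.from_dict`, `cast_column` and `Audio` calls are from the datasets library; the exact object you get back for a decoded row may differ between datasets versions.

```python
import csv
from pathlib import Path


def build_columns(csv_path, clips_dir):
    """Read the metadata CSV into a dict of columns.

    Assumption: the CSV has a `file_name` column (the clip's file name)
    and a `sentence` column (its transcript). Adjust to your own header.
    """
    columns = {"path": [], "audio": [], "sentence": []}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            clip = str(Path(clips_dir) / row["file_name"])
            columns["path"].append(clip)
            columns["audio"].append(clip)  # still a plain string at this point
            columns["sentence"].append(row["sentence"])
    return columns


def make_dataset(columns, sampling_rate=None):
    """Build a Dataset whose `audio` column is decoded lazily."""
    # Imported here so the CSV helper above works without `datasets` installed.
    from datasets import Audio, Dataset

    ds = Dataset.from_dict(columns)
    # Casting marks the column as audio. The files are only decoded when a
    # row is read, not when the dataset is created.
    return ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
```

With this sketch, `ds = make_dataset(build_columns("metadata.csv", "clips"))` would give a dataset where reading `ds[0]["audio"]` triggers the decoding. In the datasets versions current at the time of this thread, that access returns the `path`, `array` and `sampling_rate` entries shown in the question, so the array is simply the decoded waveform of the file rather than something you have to compute and store yourself.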

Owos · November 15, 2022, 2:16pm

Thank you so much @polinaeterna, I’ve been able to figure it out using the Audio data loader provided by Hugging Face!
