How to create a dataset like Common Voice?

Harveenchadha · January 20, 2022, 5:21pm

Hi,

I am trying to create an Indic version of Common Voice. I am new to datasets, so I am not sure how to proceed with this.

Can anyone please help me decide on the structure and format of the files? I have data in 6 languages. For every language I have a train, dev and test split.

What I am thinking is this:

--hi
-- --train
-- --dev
-- --test

--mr
-- --train
-- --dev
-- --test

I am planning to upload zip files for all the train, dev and test sets. Will zips be supported, or do I have to upload individual files?
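
To make the proposed layout concrete, here is a minimal sketch using only the Python standard library. It creates the `<lang>/<split>` folders from the tree above and packs one split into a zip. The helper names (`make_layout`, `zip_split`, `demo`) and the placeholder file `clip_0001.wav` are made up for illustration, and only two of the six language codes are listed. Whether zip archives will be accepted on the hosting side is exactly the open question here, so this only shows how the archives could be produced.

```python
import tempfile
import zipfile
from pathlib import Path

LANGUAGES = ["hi", "mr"]            # two of the six language codes from the post
SPLITS = ["train", "dev", "test"]


def make_layout(root):
    """Create <root>/<lang>/<split>/ for every language and split."""
    for lang in LANGUAGES:
        for split in SPLITS:
            (root / lang / split).mkdir(parents=True, exist_ok=True)


def zip_split(split_dir):
    """Pack one split directory into <lang>/<split>.zip and return the archive path."""
    archive = split_dir.parent / (split_dir.name + ".zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(split_dir.rglob("*")):
            if file.is_file():
                # store paths relative to the split folder, with forward slashes
                zf.write(file, file.relative_to(split_dir).as_posix())
    return archive


def demo():
    """Build the layout in a temp dir, zip hi/train, and list the archive contents."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_layout(root)
        # hypothetical placeholder standing in for a real audio clip
        (root / "hi" / "train" / "clip_0001.wav").write_bytes(b"")
        archive = zip_split(root / "hi" / "train")
        with zipfile.ZipFile(archive) as zf:
            return archive.name, zf.namelist()
```

With one zip per split, each language folder would end up holding `train.zip`, `dev.zip` and `test.zip`.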

cc: @patrickvonplaten

Yehor · January 20, 2022, 8:23pm

You can use the same structure that Common Voice uses. It works well, so I don’t think you need to create your own.

lhoestq · January 31, 2022, 4:12pm

Hi! We haven’t decided yet which structure we’re going to support natively. What would be the most convenient structure in your opinion?
