How to create a dataset like Common Voice?

Hi,

I am trying to create an Indic version of Common Voice. I am new to datasets, so I am not sure how to proceed.

Can anyone please help me decide on the structure and format of the files? I have data in 6 languages, and for every language I have train, dev, and test splits.

What I am thinking is this:

--hi
-- --train
-- --dev
-- --test

--mr
-- --train
-- --dev
-- --test

I am planning to upload one zip file for each train, dev, and test set. Will zips be supported, or do I have to upload individual files?
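
For reference, here is a minimal sketch of the layout above with one zip per split, using only the Python standard library. The `placeholder.txt` file and the function names are hypothetical and only there for illustration; the real splits would hold the audio clips and transcripts. It does not assume anything about which format the datasets library will accept.

```python
import tempfile
import zipfile
from pathlib import Path

# Only the two language codes shown in the post; the other four would be added here.
LANGUAGES = ["hi", "mr"]
SPLITS = ["train", "dev", "test"]


def build_layout(root: Path) -> list[Path]:
    """Create <root>/<lang>/<split>/ directories and return them."""
    split_dirs = []
    for lang in LANGUAGES:
        for split in SPLITS:
            split_dir = root / lang / split
            split_dir.mkdir(parents=True, exist_ok=True)
            split_dirs.append(split_dir)
    return split_dirs


def zip_split(split_dir: Path) -> Path:
    """Pack one split directory into <lang>/<split>.zip, next to the directory."""
    archive = split_dir.parent / f"{split_dir.name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(split_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the language folder, e.g. "train/clip.mp3".
                zf.write(path, path.relative_to(split_dir.parent))
    return archive


def demo() -> list[str]:
    """Build the layout in a temporary folder and list the archives created."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for split_dir in build_layout(root):
            # Hypothetical placeholder standing in for real audio and transcripts.
            (split_dir / "placeholder.txt").write_text("audio and transcripts go here\n")
            zip_split(split_dir)
        return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.zip"))


if __name__ == "__main__":
    print(demo())
```

This gives one archive per language and split (`hi/train.zip`, `hi/dev.zip`, and so on), so a single split can be downloaded without fetching the rest.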

cc: @patrickvonplaten

You can use the same structure that Common Voice uses. It works well, so I don’t think you need to create your own.

Hi! We haven’t decided yet which structure we’re going to support natively. What would be the most convenient structure in your opinion?