Hi @gcjavi! The currently recommended approach is to use no-code dataset configuration, without a custom dataset script. In your case, you can use the `AudioFolder` structure for your repository so that the viewer works correctly. You need to structure your data according to the documentation. Note that the file with transcriptions must be called `metadata.csv` or `metadata.jsonl`, and its column names must be exactly `file_name` and `transcription`. You should also delete the Python script, and update or delete the `dataset_info` field in the README, to avoid a mismatch between features and config names.
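In case it helps, here is a minimal sketch of generating the metadata file with plain Python. The helper name `write_metadata`, the `data/` folder, and the example file names are all hypothetical; only the file name `metadata.csv` and the column names `file_name` and `transcription` come from the requirements above.

```python
import csv
from pathlib import Path


def write_metadata(data_dir, transcriptions):
    """Write metadata.csv next to the audio files.

    `transcriptions` maps each audio path (relative to `data_dir`)
    to its transcription text. This helper is illustrative only.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    metadata_path = data_dir / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Column names must match exactly.
        writer.writerow(["file_name", "transcription"])
        for file_name, text in sorted(transcriptions.items()):
            writer.writerow([file_name, text])
    return metadata_path
```

For example, `write_metadata("data", {"clip_0001.wav": "hello world"})` would produce `data/metadata.csv` with one row. As far as I know, you can then check the result locally with `load_dataset("audiofolder", data_dir="data")` before uploading the folder.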
This should work unless you have a really huge dataset. In that case, I recommend first creating the dataset locally with your custom Python code (you might find `Dataset.from_generator()` useful) and then using `.push_to_hub()` to push the data to a Hub repo in Parquet format.
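A rough sketch of that second route, assuming your local data already sits in a folder with a `metadata.csv` as described earlier. The function names `iter_examples` and `build_and_push`, the `data_dir` layout, and the repo id you pass in are placeholders; replace the generator body with whatever your custom loading code does.

```python
import csv
from pathlib import Path


def iter_examples(data_dir):
    """Yield one example per row of metadata.csv (hypothetical local layout)."""
    data_dir = Path(data_dir)
    with (data_dir / "metadata.csv").open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {
                "audio": str(data_dir / row["file_name"]),
                "transcription": row["transcription"],
            }


def build_and_push(data_dir, repo_id):
    """Build the dataset from the generator and upload it to the Hub."""
    # Imported here so iter_examples can be tested without `datasets` installed.
    from datasets import Audio, Dataset

    ds = Dataset.from_generator(iter_examples, gen_kwargs={"data_dir": str(data_dir)})
    # Treat the stored paths as audio files rather than plain strings.
    ds = ds.cast_column("audio", Audio())
    # Uploads the data as Parquet files; requires being logged in to the Hub.
    ds.push_to_hub(repo_id)
    return ds
```

You would call it as something like `build_and_push("data", "your-username/your-dataset")`, where the repo id is a placeholder for your own.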
Let me know if that works!