I am new to Hugging Face, so I am probably missing something obvious. I have audio data along with a transcription for each audio file.
The path to each audio file is in the audio_path array, and the individual transcriptions are in the transcriptions array.
This is what I did:
from datasets import Dataset, Audio

# Example array data
audio_path = ['path/to/audio1', 'path/to/audio2', ...]
transcriptions = ['audio1 transcription', 'audio2 transcription', ...]

# Build the dataset and mark the "audio" column as an Audio feature
audio_dataset = Dataset.from_dict({
    "audio": audio_path,
    "sentence": transcriptions,
}).cast_column("audio", Audio())

# Split into train/test sets with the dataset's own method, then upload
audio_dataset = audio_dataset.train_test_split(test_size=0.2, seed=42)
audio_dataset.push_to_hub("username/dataset")
When I run this, it first starts uploading and says "Creating parquet from Arrow format:",
and then it switches over to "Uploading the dataset shards: 4%".
It stops there, the output switches to "illed 0%",
and the program exits. There is no error message or anything. What am I missing?
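One possible way to narrow this down is a pre-flight check on the audio files before building the dataset. The sketch below uses only the standard library. It assumes that the entries in audio_path are local file paths, and that one plausible (unconfirmed) cause of a silent exit is the amount of audio data being packed into the shards. The helper name summarize_audio_files is hypothetical and not part of the datasets library.

```python
import os


def summarize_audio_files(paths):
    """Return (missing_paths, total_bytes) for a list of local audio file paths."""
    missing = []
    total_bytes = 0
    for path in paths:
        if os.path.isfile(path):
            # Count the on-disk size of every file that exists
            total_bytes += os.path.getsize(path)
        else:
            # Remember paths that do not point to a regular file
            missing.append(path)
    return missing, total_bytes


if __name__ == "__main__":
    # Hypothetical paths for illustration; substitute the real audio_path list.
    example_paths = ["path/to/audio1", "path/to/audio2"]
    missing, total_bytes = summarize_audio_files(example_paths)
    print(f"{len(missing)} missing file(s), {total_bytes / 1e6:.1f} MB found on disk")
```

If any paths are reported missing, or the total size is large compared with the available memory, that would be worth mentioning alongside the terminal output.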
This is a snapshot of the terminal running the program: