Uploading an audio dataset keeps failing at "Uploading the dataset shards"

thepiratebay · March 14, 2024, 6:21pm

I am new to Hugging Face, so I am probably missing something obvious. I have audio data along with a transcription of each audio file.

The path to each audio file is in the audio_path array, and the corresponding transcriptions are in the transcriptions array.

What I did is this:

from datasets import Dataset, Audio

# Example array data (paths and transcriptions are index-aligned)
audio_path = ['path/to/audio1', 'path/to/audio2', ...]
transcriptions = ['audio1 transcription', 'audio2 transcription', ...]

# Build the dataset from file paths, then mark the "audio" column as an Audio feature
audio_dataset = Dataset.from_dict({
    "audio": audio_path,
    "sentence": transcriptions
}).cast_column("audio", Audio())

# 80/20 train/test split via Dataset.train_test_split (the sklearn import was unused)
audio_dataset = audio_dataset.train_test_split(test_size=0.2, seed=42)
audio_dataset.push_to_hub("username/dataset")

When I run this, it first starts uploading and says "Creating parquet from Arrow format:", then switches over to "Uploading the dataset shards: 4%". It stops there, switches to "Killed 0%", and exits the program. There is no error message or anything. What am I missing?

This is a snapshot of the terminal running the program:

KeelyPowers · March 14, 2024, 10:42pm

It sounds like you’re encountering an issue with uploading your audio dataset. Are you receiving any specific error messages during the upload process? It might help to check the size and format of your dataset files to ensure they meet the platform’s requirements. Additionally, you could try uploading smaller batches of data or using a different browser to see if that resolves the issue. If the problem persists, reaching out to the platform’s support team for assistance could be beneficial. Good luck!
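For the "smaller batches" suggestion, here is a minimal sketch of what that could look like with the arrays from the original post. The helper names (chunk_bounds, push_in_chunks), the chunk size of 500, and the "-part{i}" repository naming are all assumptions for illustration, not something the library prescribes. Whether this avoids the failure depends on the actual cause, which the thread does not establish.

```python
def chunk_bounds(num_rows, chunk_size):
    """Yield (start, stop) index pairs that cover range(num_rows) in order."""
    for start in range(0, num_rows, chunk_size):
        yield start, min(start + chunk_size, num_rows)


def push_in_chunks(audio_path, transcriptions, repo_prefix, chunk_size=500):
    """Push the data as several smaller datasets instead of one large one."""
    # Imported here so chunk_bounds can be used without the library installed.
    from datasets import Audio, Dataset

    for i, (start, stop) in enumerate(chunk_bounds(len(audio_path), chunk_size)):
        chunk = Dataset.from_dict({
            "audio": audio_path[start:stop],
            "sentence": transcriptions[start:stop],
        }).cast_column("audio", Audio())
        # Hypothetical naming scheme: <repo_prefix>-part0, <repo_prefix>-part1, ...
        chunk.push_to_hub(f"{repo_prefix}-part{i}")
```

Calling push_in_chunks(audio_path, transcriptions, "username/dataset") would then create one repository per chunk. The train/test split is left out here on purpose, since it can be applied after the chunks are combined again.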

thepiratebay · March 15, 2024, 8:58am

I got it to work when I uploaded them in chunks. It sucks that I have to use multiple datasets and then combine them after loading each one on my project though!
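For anyone in the same situation, combining the chunks after loading can be kept to a few lines with concatenate_datasets. This is only a sketch: it assumes the chunks were pushed as separate repositories named "username/dataset-part0", "username/dataset-part1", and so on (a made-up naming scheme), that each has a single "train" split, and that all chunks share the same columns. The helper names are hypothetical.

```python
def part_repo_ids(repo_prefix, num_parts):
    """Build the list of chunk repository names (hypothetical naming scheme)."""
    return [f"{repo_prefix}-part{i}" for i in range(num_parts)]


def load_combined(repo_prefix, num_parts, split="train"):
    """Load every chunk from the Hub and concatenate them into one Dataset."""
    # Imported here so part_repo_ids can be used without the library installed.
    from datasets import concatenate_datasets, load_dataset

    parts = [
        load_dataset(repo_id, split=split)
        for repo_id in part_repo_ids(repo_prefix, num_parts)
    ]
    # All parts must have identical features for concatenation to succeed.
    return concatenate_datasets(parts)
```

The combined dataset can then be split as before, for example load_combined("username/dataset", 3).train_test_split(test_size=0.2, seed=42).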
