Loading custom audio dataset and fine-tuning model

Hi all. I’m very new to HuggingFace and I have a question that I hope someone can help with.

I was recommended the XLSR-53 (Wav2Vec) model for my use case, which is speech-to-text. However, the languages I need aren’t supported, so I was told I need to fine-tune the model for my requirements. I’ve seen several pieces of documentation, but they all use Common Voice, which also doesn’t cover what I need.

I have ~4 hours of audio files and TSV files (annotations of the audio), but I am not sure how to load them and fine-tune the model with them. I can’t find much info online either. Is there any reference I can follow?

Any help would be appreciated.

@patrickvonplaten I am also trying this for a similar use case, but so far I couldn’t find any example script for audio datasets other than Common Voice. I have several datasets that aren’t available on Hugging Face Datasets, and because almost all the scripts rely so heavily on Hugging Face Datasets, it’s hard to work out how to adapt them to my use case. If you could suggest any resources or changes that would let me use my own dataset instead of Common Voice or any other dataset available on Hugging Face Datasets, it would be a great help.

Hi. I’m trying to do the same thing. I loaded my data into a DataFrame containing “file” and “text” columns, similar to the available datasets like Common Voice, but I’m not sure what to do with the audio so that it can be processed with the Audio feature of Hugging Face Datasets. Did you find a solution?

Hi @weirdguitarist! You can do the following to adjust the dataset format:

from datasets import Dataset, Audio, Value, Features

dset = Dataset.from_pandas(df)
features = Features({"text": Value("string"), "file": Audio(sampling_rate=...)})
dset = dset.cast(features)

Hi, I kinda figured out how to load a custom dataset having different splits (train, test, valid)

Step 1: Create CSV files for your dataset (a separate one each for train, test and valid). The columns will be “text”, “path” and “audio”. Put the transcript in the “text” column and the audio file path in both the “path” and “audio” columns (the same value in both).

Step 2: save the csv files with appropriate names like train_data.csv, test_data.csv and valid_data.csv
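For anyone starting from TSV annotations like the original poster, steps 1 and 2 could be scripted roughly like this. It is only a sketch: it assumes each TSV row is `<audio file name><TAB><transcript>` with no header row, which may not match your files, and `tsv_to_manifest` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def tsv_to_manifest(tsv_path, audio_dir, csv_path):
    """Write a CSV with "text", "path" and "audio" columns from a TSV annotation file.

    Assumption: every TSV row is <audio file name> TAB <transcript>, no header.
    Returns the number of rows written.
    """
    audio_dir = Path(audio_dir)
    rows_written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE keeps quotation marks inside transcripts untouched.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.DictWriter(dst, fieldnames=["text", "path", "audio"])
        writer.writeheader()
        for record in reader:
            if len(record) < 2:
                continue  # skip blank or malformed lines
            file_name, transcript = record[0], record[1]
            audio_path = str(audio_dir / file_name)
            # The same path goes into both "path" and "audio", as in step 1.
            writer.writerow(
                {"text": transcript, "path": audio_path, "audio": audio_path}
            )
            rows_written += 1
    return rows_written
```

Calling it once per split, e.g. `tsv_to_manifest("train.tsv", "clips", "train_data.csv")` (file names here are placeholders), gives the CSV files used in the next steps.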

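Step 3: Define the features as below:

from datasets import Audio, Features, Value

# Column names must match the CSV header from step 1.
features = Features(
    {
        "text": Value("string"),
        "path": Value("string"),
        "audio": Audio(sampling_rate=16000),
    }
)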

Step 4: Load the dataset using the code below:

from datasets import load_dataset

# One CSV file per split; the keys become the split names.
sample_data = load_dataset(
    "csv",
    data_files={
        "train": "train_data.csv",
        "test": "test_data.csv",
        "valid": "valid_data.csv",
    },
)

You will get something like this when you print sample_data:

DatasetDict({
    train: Dataset({
        features: ['text', 'path', 'audio'],
        num_rows: 10
    })
    test: Dataset({
        features: ['text', 'path', 'audio'],
        num_rows: 10
    })
    valid: Dataset({
        features: ['text', 'path', 'audio'],
        num_rows: 10
    })
})

Step 5: Cast the columns to the types specified in features using cast:

sample_data = sample_data.cast(features)

And you are done. After the cast, the audio files are loaded from the given paths and converted into NumPy arrays at the given sampling rate when an example is accessed.


You can also pass the features directly to load_dataset now to perform the cast, which avoids an extra transformation (leading to less space used for caching).

My local training dataset contains long audio files (1-2 hours each) with a timestamp for each sentence. What’s the proper approach to load them?