Wav2vec2 pretraining on own wav files

franp9am · April 22, 2022, 9:41am

I would like to run wav2vec2 (self-supervised) pretraining on custom wav files. Is there an easy way to do this? I’m aware of the run_wav2vec2_pretraining_no_trainer script, but I’m not sure how to pass (or create) the custom dataset.

Thanks a lot for any hints.

mfox · April 22, 2022, 11:36am

Hello,

I am currently running into this issue. What I did was use the following script to load the data from paths on disk.

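import datasets
import pandas as pd
from datasets import Audio, Dataset

def create_dataset(audio_paths: str, out_path: str):
    path_df = pd.DataFrame.from_dict(audio_paths)
    my_audio_dataset = Dataset.from_pandas(
        df=path_df,
        split=datasets.NamedSplit(name='train')
    )
    # Treat the "audio" column of file paths as audio data
    my_audio_dataset = my_audio_dataset.cast_column("audio", Audio())
    # Write the dataset to out_path on disk
    my_audio_dataset.save_to_disk(out_path)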

Then I passed the output path as the “dataset_name” argument, and it worked.

franp9am · April 24, 2022, 9:06am

Thanks a lot for your reply, @mfox. I assume that audio_paths here is not a string but something like {"audio": [list of wav_paths]}, is that right?
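For reference, here is a minimal sketch of how such a dictionary could be built from a folder of wav files. It assumes the files sit somewhere under a single directory. The helper name collect_wav_paths and the wav_dir parameter are made up for illustration. The "audio" key matches the column name that cast_column("audio", Audio()) refers to in the snippet above.

```python
from pathlib import Path


def collect_wav_paths(wav_dir: str) -> dict:
    """Return {"audio": [paths]} for every .wav file found under wav_dir.

    Hypothetical helper: the name and the single-directory layout are
    assumptions, not something taken from the pretraining script.
    """
    # rglob searches subdirectories too; sorting keeps the order reproducible
    wav_files = sorted(str(p) for p in Path(wav_dir).rglob("*.wav"))
    return {"audio": wav_files}
```

The result could then be passed as audio_paths to create_dataset.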
I created and saved the dataset as you suggested, but I am still getting an error:

ValueError: You are trying to load a dataset that was saved using save_to_disk. Please use load_from_disk instead.

It seems that I could change this line of the script to load_from_disk, and also edit the definition of the train and validation splits somehow. Is that the recommended solution?
