Wav2vec2 pretraining on own wav files

I would like to run wav2vec2 (self-supervised) pretraining on custom wav files. Is there an easy way to do this? I’m aware of the run_wav2vec2_pretraining_no_trainer script, but I’m not sure how to pass (or create) the custom dataset.

Thanks a lot for any hints.

Hello,

I am currently running into this issue. What I did was use the following script to load the data from paths on disk.

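import pandas as pd
import datasets
from datasets import Audio, Dataset

def create_dataset(audio_paths: str, out_path: str):
    path_df = pd.DataFrame.from_dict(audio_paths)
    my_audio_dataset = Dataset.from_pandas(
        df=path_df,
        split=datasets.NamedSplit(name='train')
    )
    # Treat the "audio" column as audio files rather than plain path strings
    my_audio_dataset = my_audio_dataset.cast_column("audio", Audio())
    # out_path is the directory the dataset is written to
    my_audio_dataset.save_to_disk(out_path)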

Then I passed that output path as the “dataset_name”, and it worked.
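
In case it is useful, here is a minimal sketch of how the audio_paths argument might be put together from a folder of wav files. It assumes the function expects a dict with a single "audio" key holding a list of file paths; the helper name collect_wav_paths and the recursive search are illustrative choices, not something taken from the script.

```python
from pathlib import Path


def collect_wav_paths(wav_dir: str) -> dict:
    """Return {"audio": [...]} listing every .wav file found under wav_dir."""
    # rglob searches subdirectories too; sorting keeps the order reproducible.
    wav_paths = sorted(str(path) for path in Path(wav_dir).rglob("*.wav"))
    return {"audio": wav_paths}
```

The returned dict can then be handed to create_dataset as audio_paths.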


Thanks a lot for your reply, @mfox. I assume that audio_paths here is not a string but something like {"audio": [list of wav_paths]}, is that right?
I created and saved the dataset as you suggested, but I am still getting an error:

ValueError: You are trying to load a dataset that was saved using save_to_disk. Please use load_from_disk instead.

It seems that I could change this line of the script to use load_from_disk, and also edit the definition of the train and validation splits somehow. Is that the recommended solution?