How to use load_dataset to load my own local dataset?

  • I have a folder with .wav files (/home/user/myDataset).
  • I have a dataframe (/home/user/myDF.csv) with 2 columns:
    file - file name
    text - text of the speech of the file (ground truth)

How can I use Hugging Face datasets to load and split the dataset above, and then train a model on it?

You can load and split the dataset as follows:

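import pandas as pd
from datasets import Dataset, Audio

df = pd.read_csv("/home/user/myDF.csv")
# Assumption: "file" holds bare file names, so prepend the audio folder
df["file"] = "/home/user/myDataset/" + df["file"]

ds = Dataset.from_pandas(df)
ds = ds.rename_column("file", "audio")  # returns a new dataset, so reassign
ds = ds.cast_column("audio", Audio())   # paths are decoded as audio on access

# 70% train / 30% test, returned as a DatasetDict
ds_dict_with_splits = ds.train_test_split(test_size=0.3)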
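
If the file column holds bare file names rather than full paths, the Audio feature will probably not be able to locate the .wav files. Below is a minimal sketch of one way to resolve the names against the audio folder and skip rows whose file is missing before building the dataset. The helper names resolve_audio_paths and build_splits are hypothetical (not part of datasets), and the sketch assumes the CSV has exactly the "file" and "text" columns described in the question.

```python
import csv
from pathlib import Path


def resolve_audio_paths(csv_path, audio_dir):
    """Read the CSV and return (columns, missing_file_names).

    Hypothetical helper: assumes the CSV has "file" and "text" columns
    and that "file" is a name relative to audio_dir.
    """
    audio_dir = Path(audio_dir)
    records = {"audio": [], "text": []}
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            path = audio_dir / row["file"]
            if path.is_file():
                records["audio"].append(str(path))
                records["text"].append(row["text"])
            else:
                # Keep track of rows we dropped so they can be inspected
                missing.append(row["file"])
    return records, missing


def build_splits(records, test_size=0.3, seed=0):
    """Turn the resolved columns into train/test splits (a DatasetDict)."""
    # Imported lazily so the path check above works without datasets installed
    from datasets import Audio, Dataset

    ds = Dataset.from_dict(records).cast_column("audio", Audio())
    return ds.train_test_split(test_size=test_size, seed=seed)
```

With the paths from the question, this would be called as records, missing = resolve_audio_paths("/home/user/myDF.csv", "/home/user/myDataset") followed by splits = build_splits(records). The two splits are then available as splits["train"] and splits["test"]. Passing a fixed seed makes the split reproducible across runs.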

I assume you want to train a speech recognition model. You can find a guide here and the transformers example scripts here (replace their dataset initialization code with your own dataset).