How to use load_dataset to load my own local dataset?

laro1 · May 24, 2023, 12:29pm

I have a folder with .wav files (/home/user/myDataset).
I have a dataframe (/home/user/myDF.csv) with 2 columns:
file - file name
text - text of the speech of the file (ground truth)

How can I use Hugging Face datasets to load and split my dataset above, and then train a model with it?

mariosasko · May 24, 2023, 7:06pm

You can load and split the dataset as follows:

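import os
import pandas as pd
from datasets import Dataset, Audio

df = pd.read_csv("/home/user/myDF.csv")
# Assumes "file" holds bare file names, so prepend the folder to get full paths
df["file"] = df["file"].map(lambda name: os.path.join("/home/user/myDataset", name))

ds = Dataset.from_pandas(df)
ds = ds.cast_column("file", Audio())
# rename_column returns a new dataset, so the result must be reassigned
ds = ds.rename_column("file", "audio")

ds_dict_with_splits = ds.train_test_split(test_size=0.3)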

I assume you want to train a speech recognition model - you can find a guide here and the transformers example scripts here (replace their dataset initialization code with your dataset)
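Since the question asks about load_dataset specifically, the same steps could also be sketched with the built-in "csv" loader instead of pandas. This is only a rough sketch: the paths are the ones from the question, the helper names add_full_path and load_local_audio_dataset are made up for illustration, and it assumes the "file" column holds bare file names relative to the audio folder.

```python
import os

# Paths taken from the question; adjust to your setup
AUDIO_DIR = "/home/user/myDataset"
CSV_PATH = "/home/user/myDF.csv"


def add_full_path(example, audio_dir=AUDIO_DIR):
    """Map function (hypothetical helper): turn the bare file name into a full path."""
    example["file"] = os.path.join(audio_dir, example["file"])
    return example


def load_local_audio_dataset(csv_path=CSV_PATH, audio_dir=AUDIO_DIR, test_size=0.3):
    """Hypothetical helper: load the CSV, attach the audio files, return train/test splits."""
    # Imported here so the path helper above can be reused or tested on its own
    from datasets import load_dataset, Audio

    # The "csv" builder puts all rows into a single "train" split by default
    ds = load_dataset("csv", data_files=csv_path, split="train")
    # Assumption: "file" contains bare names, so prepend the audio folder
    ds = ds.map(add_full_path, fn_kwargs={"audio_dir": audio_dir})
    # Mark the column as audio; files are decoded only when a row is accessed
    ds = ds.cast_column("file", Audio())
    ds = ds.rename_column("file", "audio")
    # Returns a DatasetDict with "train" and "test" keys
    return ds.train_test_split(test_size=test_size)
```

Calling load_local_audio_dataset() should give you the same kind of train/test DatasetDict as the pandas version above. Note that actually reading a row's "audio" entry needs an audio decoding backend installed alongside datasets.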
