Hi, I kinda figured out how to load a custom dataset with different splits (train, test, valid):
Step 1: Create CSV files for your dataset (a separate one each for train, test and valid). The columns will be “text”, “path” and “audio”. Keep the transcript in the “text” column and the audio file path in both the “path” and “audio” columns (same value in both).
Step 2: Save the CSV files with appropriate names like train_data.csv, test_data.csv and valid_data.csv.
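In case it helps, steps 1 and 2 can be sketched with Python's built-in csv module. This is only a minimal sketch: write_split_csv is a made-up helper name, and the clip paths and transcripts are placeholder data, so swap in your own.

```python
import csv


def write_split_csv(csv_path, rows):
    """Write (audio_path, transcript) pairs to a CSV with text, path and audio columns."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "path", "audio"])
        writer.writeheader()
        for audio_path, transcript in rows:
            # The same file path goes into both "path" and "audio".
            writer.writerow(
                {"text": transcript, "path": audio_path, "audio": audio_path}
            )


# Hypothetical example data; replace with your own files and transcripts.
train_rows = [
    ("clips/train_0001.wav", "hello world"),
    ("clips/train_0002.wav", "good morning"),
]
write_split_csv("train_data.csv", train_rows)
```

Call it once more for test_data.csv and valid_data.csv with the rows of those splits.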
Step 3: Define the features like below:
from datasets import Audio, Features, Value

features = Features(
    {
        "text": Value("string"),
        "path": Value("string"),
        "audio": Audio(sampling_rate=16000),
    }
)
Step 4: Load the dataset using the piece of code below:
from datasets import load_dataset

sample_data = load_dataset(
    "csv",
    data_files={
        "train": "train_data.csv",
        "test": "test_data.csv",
        "valid": "valid_data.csv",
    },
)
You will get something like this when you print sample_data:
DatasetDict({
    train: Dataset({
        features: ['text', 'path', 'audio'],
        num_rows: 10
    })
    test: Dataset({
        features: ['text', 'path', 'audio'],
        num_rows: 10
    })
    valid: Dataset({
        features: ['text', 'path', 'audio'],
        num_rows: 10
    })
})
Step 5: Cast the columns to the types specified in features using cast:
sample_data = sample_data.cast(features)
And you are done. The cast makes the “audio” column load the audio files from the given paths and convert them into numpy arrays at the given sampling rate (the decoding happens when you access a sample).
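Since the audio is read from the paths listed in the CSVs, it can be worth checking that those paths actually exist before relying on the cast. A rough sketch of such a check, using only the standard library; find_missing_audio is a hypothetical helper of mine, not something from the datasets library:

```python
import csv
import os


def find_missing_audio(csv_path, column="audio"):
    """Return the paths listed in `column` of the CSV that are not files on disk."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Relative paths are resolved against the current working directory.
        return [
            row[column]
            for row in csv.DictReader(f)
            if not os.path.isfile(row[column])
        ]


for name in ("train_data.csv", "test_data.csv", "valid_data.csv"):
    # Skip splits whose CSV has not been created yet.
    if os.path.isfile(name):
        print(name, "missing:", find_missing_audio(name))
```

An empty list for each split means every row points at a file that is there.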