Misunderstanding around creating audio datasets from local files

Hi, the documentation only explains how to add audio files, but I want to add audio files together with their transcriptions.

How can I do that, so I can build a dataset of snippet/transcription pairs that I can train on?

Also, if I want to have two separate datasets, one for testing and one for training, what’s the approach to follow? Upload everything and tag the split in metadata.csv, or create two folders and upload the snippets/transcriptions into each?

Hi! Here is an example in Python:

from datasets import Audio, Dataset

ds = Dataset.from_dict({
    "audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"],
    "transcription": ["First transcript", "Second transcript", ..., "Last transcript"],
}).cast_column("audio", Audio())
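
Once the audio column is cast to Audio(), accessing an example decodes the file for you. A quick sketch of what that looks like (assuming the placeholder paths above point to real files):

print(ds[0]["audio"])          # {'path': ..., 'array': array([...]), 'sampling_rate': ...}
print(ds[0]["transcription"])  # 'First transcript'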

Alternatively, you can also define an AudioFolder (see the docs):

my_dataset/
├── README.md
├── metadata.csv
└── data/
    ├── audio_0.wav
    ...
    └── audio_n.wav
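
Here, metadata.csv links each audio file to its transcription. As a minimal sketch (AudioFolder requires a file_name column, with paths relative to the location of metadata.csv; the transcription values are placeholders):

file_name,transcription
data/audio_0.wav,First transcript
data/audio_n.wav,Last transcript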

Also, if I want to have two separate datasets, one for testing and one for training, what’s the approach to follow? Upload everything and tag the split in metadata.csv, or create two folders and upload the snippets/transcriptions into each?

You can structure your AudioFolder like this:

my_dataset/
├── README.md
├── metadata.csv
├── test/
│   ├── audio_0.wav
│   ...
│   └── audio_n.wav
└── train/
    ├── audio_0.wav
    ...
    └── audio_n.wav

It’s also possible to have one metadata.csv in train/ and one in test/ if you want.
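
With that layout, the audiofolder loader picks up the splits from the folder names. A minimal sketch, assuming the directory structure above:

from datasets import load_dataset

# the train/ and test/ folder names are inferred as dataset splits
ds = load_dataset("audiofolder", data_dir="my_dataset")
print(ds)  # DatasetDict with "train" and "test" splits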

@lhoestq,

Thanks for posting the answer. One related question: will the created dataset already be available to the public, or do we need to upload it manually?

You can upload the directory described above to a dataset repository on https://huggingface.co to make it available to the public :slight_smile:

See Share a dataset to the Hub.
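
For example, here is a minimal sketch of pushing the AudioFolder from Python (assuming you are logged in, e.g. via huggingface-cli login; the repo name is a placeholder):

from datasets import load_dataset

ds = load_dataset("audiofolder", data_dir="my_dataset")
ds.push_to_hub("your-username/my_dataset")  # pass private=True to keep it private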

@lhoestq Thanks.
I currently have a local dataset, and I created a Dataset instance to use it with trainer.train.
I use a batch size of 4. However, I see that my GPU memory is almost full. I was wondering whether trainer.train loads all the data into memory even though I have specified the batch size?

Just to give some idea: my training set is 84K samples, each of shape (16000,) (1 second of audio). My eval set is also 84K samples.

Any leads to solve this issue?

Use streaming mode.

In my case I have a dataset locally on my computer, and it’s confidential. So can I use streaming mode for local data?

Yes, you can create a private dataset on the Hugging Face Hub and stream from it.
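
A minimal sketch of streaming from the Hub (the repo name is a placeholder; a private repo requires you to be authenticated):

from datasets import load_dataset

# streaming=True iterates over examples without downloading the whole dataset
ds = load_dataset("your-username/my_dataset", split="train", streaming=True)
for example in ds:
    ...  # process one example at a time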

Thanks.
Could you please guide me on how to handle the out-of-memory issue when executing Trainer.evaluate?

Here is my setup:

from datasets import Audio, Dataset

# feature_extractor and df_test are defined earlier in my setup
# getting the encoded dataset
# preprocess_function(audio_dataset[:5])
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=128,
        truncation=True,
    )
    return inputs

# build the evaluation set from local wav files listed in df_test
audio_dict2 = {}
num_samples = 84000
audio_dict2["audio"] = [
    "../data/20230319_audioinput_1s_wav/test/" + str(int(y)) + "/audio_" + str(int(x)) + ".wav"
    for x, y in zip(df_test.head(num_samples).index, df_test.head(num_samples).label)
]
audio_dict2["label"] = [x for x in df_test.head(num_samples).label]
audio_dict2["split"] = len(df_test.head(num_samples)) * ["train"]

audio_dataset2 = Dataset.from_dict(audio_dict2).cast_column("audio", Audio())
encoded_dataset2 = audio_dataset2.map(preprocess_function, remove_columns=["audio"], batched=True)

# training args

from datasets import load_metric
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
batch_size = 4
metric = load_metric("accuracy")
args = TrainingArguments(
    f"{model_name}-finetuned-ks1",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=25,
    warmup_ratio=0.1,
    logging_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset2,
    eval_dataset=encoded_dataset2,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics
)

## I get the OOM error here!

outputs = trainer.evaluate(encoded_dataset2)


In my case the dataset is available locally. Is there a way to batch-process this evaluation?

You can try saving some RAM by caching your dataset on disk, by passing cache_file_name= to map().

Indeed, creating it using from_dict() keeps it in RAM.
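
A minimal sketch of that suggestion, applied to the setup above (the cache file name is just an example):

# cache the mapped dataset in an Arrow file on disk instead of keeping it all in RAM
encoded_dataset2 = audio_dataset2.map(
    preprocess_function,
    remove_columns=["audio"],
    batched=True,
    cache_file_name="encoded_dataset2.arrow",
)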

Hi @lhoestq, will an audio dataset that is created through Dataset.from_dict or through AudioFolder and then pushed to the Hub support streaming automatically? If not, is there a way to add this when creating the AudioFolder / audio dataset from local paths, rather than through the script method?

Yes, it will support streaming :slight_smile:

Thanks a lot! :smiley: