Misunderstanding around creating audio datasets from local files

Hi, the documentation only explains how to add audio files, but I want to add audio files together with their transcriptions.

How can I do that, so I can build a dataset of snippet/transcription pairs that I can train on?

Also, if I want to have two separate datasets, one for testing and one for training, what’s the approach to follow? Upload everything and tag the split in metadata.csv, or create two folders and upload the snippets/transcriptions into each?

Hi! Here is an example in Python:

from datasets import Audio, Dataset

ds = Dataset.from_dict({
    "audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"],
    "transcription": ["First transcript", "Second transcript", ..., "Last transcript"],
}).cast_column("audio", Audio())
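
Once the audio column is cast to Audio(), accessing an example decodes the file for you. A quick sketch of what that looks like (assuming the placeholder paths above point to real files):

print(ds[0]["audio"])          # {'path': ..., 'array': array([...]), 'sampling_rate': ...}
print(ds[0]["transcription"])  # 'First transcript'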

Alternatively, you can also define an AudioFolder (see the docs):

my_dataset/
├── README.md
├── metadata.csv
└── data/
    ├── audio_0.wav
    ...
    └── audio_n.wav
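
Here, metadata.csv links each audio file to its transcription. As a minimal sketch (AudioFolder requires a file_name column, with paths relative to the location of metadata.csv; the transcription values are placeholders):

file_name,transcription
data/audio_0.wav,First transcript
data/audio_n.wav,Last transcript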

Also, if I want to have two separate datasets, one for testing and one for training, what’s the approach to follow? Upload everything and tag the split in metadata.csv, or create two folders and upload the snippets/transcriptions into each?

You can structure your AudioFolder like this:

my_dataset/
├── README.md
├── metadata.csv
├── test/
│   ├── audio_0.wav
│   ...
│   └── audio_n.wav
└── train/
    ├── audio_0.wav
    ...
    └── audio_n.wav

It’s also possible to have one metadata.csv in train/ and one in test/ if you want.
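
With that layout, the audiofolder loader picks up the splits from the folder names. A minimal sketch, assuming the directory structure above:

from datasets import load_dataset

# the train/ and test/ folder names are inferred as dataset splits
ds = load_dataset("audiofolder", data_dir="my_dataset")
print(ds)  # DatasetDict with "train" and "test" splits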

@lhoestq,

Thanks for posting the answer. One related question: will the created dataset already be available to the public, or do we need to upload it manually?

You can upload the directory described above to a dataset repository on https://huggingface.co to make it available to the public :slight_smile:

See Share a dataset to the Hub.
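
For example, here is a minimal sketch of pushing the AudioFolder from Python (assuming you are logged in, e.g. via huggingface-cli login; the repo name is a placeholder):

from datasets import load_dataset

ds = load_dataset("audiofolder", data_dir="my_dataset")
ds.push_to_hub("your-username/my_dataset")  # pass private=True to keep it private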

@lhoestq Thanks.
I currently have a local dataset, and I created a Dataset instance to use it with trainer.train.
I use a batch size of 4. However, I see that my GPU memory is almost full. I was wondering whether trainer.train loads all the data into memory even though I have specified the batch size?

Just to give some idea: my training set is 84K samples, each of shape (16000,) (1 second of audio). My eval set is also 84K samples.

Any leads to solve this issue?

Use streaming mode.

In my case I have a dataset locally on my computer, and it’s confidential. So can I use streaming mode for local data?

Yes, you can create a private dataset on the Hugging Face Hub and stream from it.
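
A minimal sketch of streaming from the Hub (the repo name is a placeholder; a private repo requires you to be authenticated):

from datasets import load_dataset

# streaming=True iterates over examples without downloading the whole dataset
ds = load_dataset("your-username/my_dataset", split="train", streaming=True)
for example in ds:
    ...  # process one example at a time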

Thanks.
Could you please guide me on how to handle the out-of-memory issue when executing Trainer.evaluate?

Here is my setup:

from datasets import Audio, Dataset

# feature_extractor and df_test are defined earlier in my setup
# getting the encoded dataset
# preprocess_function(audio_dataset[:5])
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=128,
        truncation=True,
    )
    return inputs

# build the evaluation set from local wav files listed in df_test
audio_dict2 = {}
num_samples = 84000
audio_dict2["audio"] = [
    "../data/20230319_audioinput_1s_wav/test/" + str(int(y)) + "/audio_" + str(int(x)) + ".wav"
    for x, y in zip(df_test.head(num_samples).index, df_test.head(num_samples).label)
]
audio_dict2["label"] = [x for x in df_test.head(num_samples).label]
audio_dict2["split"] = len(df_test.head(num_samples)) * ["train"]

audio_dataset2 = Dataset.from_dict(audio_dict2).cast_column("audio", Audio())
encoded_dataset2 = audio_dataset2.map(preprocess_function, remove_columns=["audio"], batched=True)

# training args

from datasets import load_metric
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
batch_size = 4
metric = load_metric("accuracy")
args = TrainingArguments(
    f"{model_name}-finetuned-ks1",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=25,
    warmup_ratio=0.1,
    logging_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset2,
    eval_dataset=encoded_dataset2,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics
)

## I get the OOM error here!

outputs = trainer.evaluate(encoded_dataset2)


In my case the dataset is available locally. Is there a way to batch-process this evaluation?

You can try saving some RAM by caching your dataset on disk, by passing cache_file_name= to map().

Indeed, creating it using from_dict() keeps it in RAM.
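
A minimal sketch of that suggestion, applied to the setup above (the cache file name is just an example):

# cache the mapped dataset in an Arrow file on disk instead of keeping it all in RAM
encoded_dataset2 = audio_dataset2.map(
    preprocess_function,
    remove_columns=["audio"],
    batched=True,
    cache_file_name="encoded_dataset2.arrow",
)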

Hi @lhoestq, will an audio dataset that is created through Dataset.from_dict or through AudioFolder and then pushed to the Hub support streaming automatically? If not, is there a way to add this when creating the AudioFolder / audio dataset from local paths, rather than through the script method?

Yes, it will support streaming :slight_smile:

Thanks a lot! :smiley: