How to get file paths when iterating over a custom dataset with KeyDataset?

almirb · April 5, 2023, 4:51pm

Hi!
I created a dataset from a folder with some mp3 files and tried to iterate over them with KeyDataset:

from datasets import load_dataset, Audio
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

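dataset_mp3 = load_dataset("audiofolder", data_dir="/content/drive/MyDrive/Temp/Mp3", drop_metadata=True).cast_column("audio", Audio(sampling_rate=16000))

# Assumed: the post does not show how `pipe` was created; a speech-recognition pipeline is a plausible guess
pipe = pipeline("automatic-speech-recognition")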

# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset

for out in tqdm(pipe(KeyDataset(dataset_mp3["train"], "audio"))):
    print(out)
    # {"text": "My audio transcription"}
    # {"text": ....}
    # ....

So far I’m getting the output I want (“text”), but how do I get the file path corresponding to each output?

P.S: I’m using Google Colab.

Thanks!

jblazek · October 6, 2023, 1:22pm

I am not sure if this is a nice solution, but it works…

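First of all, create a metadata.csv in your data_dir that lists the file_name and some id for each file.
Load the dataset as before, but without drop_metadata=True (as far as I can tell, that option would drop the id column again). Now an item looks like this: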

dataset_mp3["train"][1] == {
   "audio": "path/to/audio/file",
   "id": "id from the metadata.csv"
}

Now change the for loop into:

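from transformers.pipelines.pt_utils import KeyPairDataset

# KeyPairDataset yields {"text": row["audio"], "text_pair": row["id"]}
for item in tqdm(KeyPairDataset(dataset_mp3["train"], "audio", "id")):
    out = pipe(item["text"])       # "text" holds the audio entry
    print(out, item["text_pair"])  # "text_pair" holds the id from metadata.csv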

The key names look strange, but take a look at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/pt_utils.py and you will see how it works.
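
To make the key renaming concrete, here is a minimal sketch of what the pair wrapper does when you index it. PairView is a hypothetical stand-in written for illustration, not the library class, and the rows are made-up sample data; it only mirrors the "text" / "text_pair" mapping used in the loop above.

```python
class PairView:
    """Hypothetical stand-in that mimics how a key-pair wrapper renames columns."""

    def __init__(self, dataset, key1, key2):
        self.dataset = dataset
        self.key1 = key1
        self.key2 = key2

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        row = self.dataset[i]
        # The first key is always exposed as "text" and the second as
        # "text_pair", whatever the original column names were.
        return {"text": row[self.key1], "text_pair": row[self.key2]}


# Made-up rows standing in for dataset_mp3["train"].
rows = [
    {"audio": "path/to/first.mp3", "id": "rec-001"},
    {"audio": "path/to/second.mp3", "id": "rec-002"},
]

view = PairView(rows, "audio", "id")
items = [view[i] for i in range(len(view))]
print(items[0])
```

So "text" is just whatever sits in the first column you name, which is why the audio ends up there.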

It can be simplified: if you need just the file_name, just iterate over KeyDataset and run pipe inside the loop.
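
A rough sketch of that simpler variant, under one assumption: each decoded audio entry is a dict that carries a "path" key next to "array" and "sampling_rate" (this may depend on the installed datasets version, so check one item first). transcribe_with_paths, fake_rows and fake_pipe are hypothetical names used only so the example runs without audio files or a model.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Tuple


def transcribe_with_paths(
    rows: Iterable[Dict[str, Any]],
    pipe: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Iterator[Tuple[Optional[str], Dict[str, Any]]]:
    """Yield (file path, pipeline output) for every row that has an "audio" dict."""
    for row in rows:
        audio = row["audio"]
        # Assumption: the audio dict exposes the source file under "path".
        path = audio.get("path")
        # Hand the pipeline a shallow copy so it cannot remove keys from the row.
        yield path, pipe(dict(audio))


# Stand-ins for dataset_mp3["train"] and the real pipeline.
fake_rows = [
    {"audio": {"path": "a.mp3", "array": [0.0, 0.1], "sampling_rate": 16000}},
    {"audio": {"path": "b.mp3", "array": [0.2], "sampling_rate": 16000}},
]


def fake_pipe(audio: Dict[str, Any]) -> Dict[str, Any]:
    return {"text": f"{len(audio['array'])} samples"}


results = list(transcribe_with_paths(fake_rows, fake_pipe))
for path, out in results:
    print(path, out)
```

With real data you would pass dataset_mp3["train"] and your own pipe instead of the stand-ins. Calling pipe once per item gives up the pipeline's own batching over the dataset, which is the price for keeping the path next to each output.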

Hope this helps.
