How to get file paths when iterating over a custom dataset with KeyDataset?

Hi!
I created a dataset from a folder with some mp3 files and tried to iterate over them with KeyDataset:

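from datasets import load_dataset, Audio
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

# Speech recognition pipeline; the model is left at the task default here
pipe = pipeline("automatic-speech-recognition")

dataset_mp3 = load_dataset("audiofolder", data_dir="/content/drive/MyDrive/Temp/Mp3", drop_metadata=True).cast_column("audio", Audio(sampling_rate=16000))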

# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset

for out in tqdm(pipe(KeyDataset(dataset_mp3["train"], "audio"))):
    print(out)
    # {"text": "My audio transcription"}
    # {"text": ....}
    # ....

So far I’m getting the output I want (“text”), but how do I get the file path corresponding to each output?

P.S: I’m using Google Colab.

Thanks!

I am not sure if this is a nice solution, but it works…

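  1. First, create a metadata.csv in your data_dir with a file_name column and an extra column such as id.
  2. Load the dataset as before, but without drop_metadata=True, otherwise the extra columns from metadata.csv are dropped. Each item then looks like this:
dataset_mp3["train"][1] == {
   "audio": <audio data for path/to/audio/file>,
   "id": "id from the metadata.csv"
}
  3. Now change the for loop into: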
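from transformers.pipelines.pt_utils import KeyPairDataset

for item in tqdm(KeyPairDataset(dataset_mp3["train"], "audio", "id")):
    # item["text"] holds the "audio" value, item["text_pair"] holds the "id" value
    out = pipe(item["text"])
    print(out, item["text_pair"])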
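
For step 1, the metadata.csv can be generated with a few lines of standard-library Python. This is only a sketch: write_metadata is a hypothetical helper name, it assumes the mp3 files sit directly inside data_dir, and it uses the file name without its extension as the id, which is an arbitrary choice.

```python
import csv
from pathlib import Path


def write_metadata(data_dir: str) -> Path:
    """Write a metadata.csv listing every .mp3 file directly inside data_dir."""
    root = Path(data_dir)
    metadata_path = root / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # "file_name" links each row to a file; "id" is an extra column
        # picked for illustration (any column name would do).
        writer.writerow(["file_name", "id"])
        for mp3 in sorted(root.glob("*.mp3")):
            writer.writerow([mp3.name, mp3.stem])
    return metadata_path
```

Calling write_metadata("/content/drive/MyDrive/Temp/Mp3") once before load_dataset should be enough; the id column can hold anything that lets you map an output back to its file.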

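The key names look strange because KeyPairDataset was written for sentence pairs: whatever two keys you pass, it returns each item as a dict with "text" and "text_pair". Take a look at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/pt_utils.py and you will see how it works.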
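
To make the renaming concrete, here is a simplified stand-in that mimics that behaviour on a plain list of dicts. KeyPairSketch is a hypothetical name and not the library class; it only illustrates where "text" and "text_pair" come from, as I understand the linked source.

```python
class KeyPairSketch:
    """Simplified stand-in for KeyPairDataset (illustration only)."""

    def __init__(self, dataset, key1, key2):
        self.dataset = dataset
        self.key1 = key1
        self.key2 = key2

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        row = self.dataset[i]
        # The first key is always exposed as "text", the second as "text_pair".
        return {"text": row[self.key1], "text_pair": row[self.key2]}


# Made-up rows standing in for dataset_mp3["train"]
rows = [
    {"audio": "audio for a.mp3", "id": "a"},
    {"audio": "audio for b.mp3", "id": "b"},
]
pairs = KeyPairSketch(rows, "audio", "id")
first = pairs[0]
```

So item["text"] is what goes into pipe, and item["text_pair"] is the id that travels alongside it.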

It can be simplified: if you only need the file name, skip the metadata.csv, iterate over KeyDataset yourself and run pipe inside the loop.
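
A minimal sketch of that simpler variant follows. It assumes each audio item is a dict that includes a "path" key, which depends on your datasets version, so check one item first. transcribe_with_paths is a hypothetical helper name, and pipe is whatever callable you already use.

```python
def transcribe_with_paths(audio_items, pipe):
    """Yield (path, output) pairs, one per audio item.

    Assumes every item is a dict with a "path" key (an assumption,
    check one item of your dataset before relying on it).
    """
    for audio in audio_items:
        # Read the path first and pass a shallow copy, in case the
        # pipeline removes keys from the dict it receives.
        path = audio["path"]
        out = pipe(dict(audio))
        yield path, out


# Intended usage with the objects from the question (not run here):
# for path, out in transcribe_with_paths(KeyDataset(dataset_mp3["train"], "audio"), pipe):
#     print(path, out["text"])
```

The trade-off is that pipe is called once per file, so you lose whatever batching the pipeline does when it is handed the whole dataset.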

Hope this helps.