How to get file paths when iterating over a custom dataset with KeyDataset?

I am not sure this is the nicest solution, but it works.

  1. First, create a metadata.csv in your data_dir with a file_name column and an id column.
  2. Load the dataset as before… Each item now looks like this:
dataset_mp3["train"][1] == {
   "audio": "path/to/audio/file",
   "id": "id from the metadata.csv"
}
  3. Now change the for loop to:
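from transformers.pipelines.pt_utils import KeyPairDataset
# KeyPairDataset yields {"text": item["audio"], "text_pair": item["id"]}
for item in tqdm(KeyPairDataset(dataset_mp3["train"], "audio", "id")):
    out = pipe(item["text"])          # "text" holds the audio entry
    print(out, item["text_pair"])     # "text_pair" holds the id from metadata.csv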
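
For step 1, here is a minimal sketch of writing the metadata.csv with Python's csv module. The helper name write_metadata and the example file names and ids are hypothetical; only the file_name and id columns come from the steps above.

```python
import csv
from pathlib import Path


def write_metadata(data_dir, rows):
    """Write metadata.csv with a file_name column and an id column."""
    path = Path(data_dir) / "metadata.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "id"])
        writer.writeheader()
        writer.writerows(rows)
    return path


# Hypothetical file names and ids, for illustration only.
example_rows = [
    {"file_name": "clip_0001.mp3", "id": "a1"},
    {"file_name": "clip_0002.mp3", "id": "b2"},
]
```

Calling write_metadata("path/to/data_dir", example_rows) before loading the dataset should make the id column show up next to each audio entry.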

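The key names look strange, but KeyPairDataset always returns each item as {"text": ..., "text_pair": ...}, whatever the original column names were. Take a look at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/pt_utils.py and you will see how it works.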
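
To make the renaming concrete, here is a simplified stand-in I wrote for illustration. It is not the real class from pt_utils.py, just a sketch of the behaviour as I understand it, and the sample rows are made up.

```python
class KeyPairDatasetSketch:
    """Simplified stand-in for transformers' KeyPairDataset (not the real class)."""

    def __init__(self, dataset, key1, key2):
        self.dataset = dataset
        self.key1 = key1
        self.key2 = key2

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        # The two chosen columns come back under the fixed keys
        # "text" and "text_pair", regardless of their original names.
        row = self.dataset[i]
        return {"text": row[self.key1], "text_pair": row[self.key2]}


# Made-up rows standing in for dataset_mp3["train"].
rows = [
    {"audio": "path/to/a.mp3", "id": "a1"},
    {"audio": "path/to/b.mp3", "id": "b2"},
]
pairs = KeyPairDatasetSketch(rows, "audio", "id")
```

So item["text"] is the audio entry and item["text_pair"] is the id, which is why the loop reads the id from "text_pair".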

This can be simplified: if you only need the file_name, just iterate over KeyDataset and run pipe inside the loop.
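
A rough sketch of that simpler loop, with stand-ins so it runs without transformers installed. KeyDatasetSketch and fake_pipe are hypothetical placeholders; in real code you would import KeyDataset from transformers.pipelines.pt_utils and call your own pipe.

```python
class KeyDatasetSketch:
    """Simplified stand-in for transformers' KeyDataset (not the real class)."""

    def __init__(self, dataset, key):
        self.dataset = dataset
        self.key = key

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        # Only the selected column is returned for each row.
        return self.dataset[i][self.key]


def fake_pipe(audio_path):
    """Hypothetical placeholder for the real pipeline call."""
    return {"text": f"transcript of {audio_path}"}


# Made-up rows standing in for dataset_mp3["train"].
rows = [{"audio": "path/to/a.mp3"}, {"audio": "path/to/b.mp3"}]

results = []
for audio_path in KeyDatasetSketch(rows, "audio"):
    out = fake_pipe(audio_path)  # run the pipeline one item at a time
    results.append((audio_path, out["text"]))
```

Because the pipeline is called inside the loop, the path is still in scope when the output comes back, at the cost of processing one file per call.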

Hope this helps.