Create datasets object from multiple remote audio paths residing in Google Cloud Storage

I am trying to create a datasets object for audio files residing in Google Cloud Storage. This is what I have in mind:
from datasets import Dataset, Audio
import pandas as pd

my_audio_paths_df = pd.DataFrame({'audio_path': [
'gs://<my_gs_bucket>/<audio_path>/<audio_name>.wav',
'gs://<my_gs_bucket>/<audio_path>/<audio_name>.wav']})

my_audio_dataset = Dataset.from_pandas(my_audio_paths_df)
my_audio_dataset = my_audio_dataset.cast_column("audio_path", Audio(sampling_rate=16_000))

This would work if my audio files had local paths, but is there a way to do the same for Google Cloud Storage paths?

Hi! This is not currently supported. Only paths to local files and HTTP URLs are supported right now, though we’ll probably explore adding support for cloud storage soon :wink:

@lhoestq Thank you for your swift reply. I will try to find a workaround!
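
For anyone landing here later, one possible workaround, given that only local paths and HTTP URLs are supported, is to copy the files out of the bucket first and then build the dataset from the local copies. The following is only a rough sketch, not an official approach: it assumes the gcsfs package is installed and that credentials are already configured, and the helper names (gcs_to_local_path, download_audio, build_audio_dataset) and the cache_dir default are made up for illustration.

```python
import os
from urllib.parse import urlparse


def gcs_to_local_path(gs_uri, cache_dir):
    """Map gs://<bucket>/<key> to <cache_dir>/<bucket>/<key> (no network access)."""
    parsed = urlparse(gs_uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    key_parts = parsed.path.lstrip("/").split("/")
    return os.path.join(cache_dir, parsed.netloc, *key_parts)


def download_audio(gs_uris, cache_dir):
    """Copy each remote file into cache_dir and return the local paths."""
    # Assumption: gcsfs is installed and can find credentials on its own.
    # Imported here so the path helper above works without it.
    import gcsfs

    fs = gcsfs.GCSFileSystem()
    local_paths = []
    for uri in gs_uris:
        local = gcs_to_local_path(uri, cache_dir)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        if not os.path.exists(local):  # skip files that were already fetched
            fs.get(uri, local)
        local_paths.append(local)
    return local_paths


def build_audio_dataset(gs_uris, cache_dir="gcs_audio_cache"):
    """Download the files, then cast the local paths to the Audio feature."""
    from datasets import Audio, Dataset

    local_paths = download_audio(gs_uris, cache_dir)
    dataset = Dataset.from_dict({"audio_path": local_paths})
    return dataset.cast_column("audio_path", Audio(sampling_rate=16_000))
```

Calling build_audio_dataset with the list of gs:// paths from the question should then behave like the local-path case, at the cost of keeping a full copy of the audio on disk. For a large bucket that may not be practical, so treat this as a stopgap.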