I am trying to understand the download mechanism in streamable datasets. I have a gigaspeech dataset, from which I am removing the “audio” column/feature.
Yes, unless this dataset is stored as Parquet, in which case it’s possible to select which columns to download, like so: load_dataset(..., columns=list_of_columns_to_read).
No, I meant that your map still downloads the audio column, and it’s only possible to avoid this if the dataset is in Parquet (by specifying columns to read in load_dataset).
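For reference, here is a minimal sketch of the Parquet route described above. It only applies if the dataset repository is actually stored as Parquet. The helper names (columns_to_read, load_parquet_without) and the column names in the example are hypothetical, chosen for illustration; only load_dataset and its columns argument come from the replies above.

```python
def columns_to_read(all_columns, drop):
    """Return `all_columns` in their original order, minus those in `drop`."""
    dropped = set(drop)
    return [name for name in all_columns if name not in dropped]


def load_parquet_without(repo_id, all_columns, drop, split="train"):
    """Stream a Parquet-backed dataset, requesting only the kept columns.

    Assumption: `repo_id` points at a dataset stored as Parquet; for other
    formats the `columns` argument is not available.
    """
    # Imported lazily so columns_to_read can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        repo_id,
        split=split,
        streaming=True,
        columns=columns_to_read(all_columns, drop),
    )


# Hypothetical column names, for illustration only.
wanted = columns_to_read(["segment_id", "text", "audio"], drop=["audio"])
print(wanted)
```

By contrast, dropping the column after loading (with map or remove_columns) happens after the data has been read, so the audio is still downloaded.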