Remove_columns in map - what is the download footprint?

Hi all, first question here so forgive me -

I am trying to understand the download mechanism for streaming datasets. I have the GigaSpeech dataset, from which I am removing the “audio” column/feature.

>>> gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", streaming=True)
>>> gigaspeech_subset = gigaspeech["train"].map(lambda example: example, remove_columns=['audio'])
>>> next(iter(gigaspeech_subset))
{'segment_id': 'YOU0000000315_S0000660', 'speaker': 'N/A', ....}

In this case, does it still download the audio? And if I download the audio later, will it be cached?

Yes, it still downloads the audio, unless this dataset is stored as Parquet, in which case it’s possible to select the columns to download like so: load_dataset(..., columns=list_of_columns_to_read).

In the streaming mode, we don’t cache any data.

@mariosasko Just to clarify, in streaming mode, once I have mapped a subset of the original dataset, only the subset data is used?

No, I meant that your map still downloads the audio column, and it’s only possible to avoid this if the dataset is in Parquet (by specifying columns to read in load_dataset).
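
To make the point above concrete, here is a toy model in plain Python. It is not the datasets library's actual code: stream_examples, lazy_map and fetched_bytes are hypothetical names, and the "download" is simulated with a counter. It only illustrates why dropping a column inside map happens after the full example has already been fetched.

```python
# Toy model only: this is not how the datasets library is implemented.
fetched_bytes = 0  # simulated download counter


def stream_examples():
    """Stand-in for a streaming source that yields whole examples."""
    global fetched_bytes
    for i in range(3):
        example = {
            "segment_id": f"seg{i}",
            "speaker": "N/A",
            "audio": b"\x00" * 1000,
        }
        # The audio bytes are "downloaded" as part of producing the example.
        fetched_bytes += len(example["audio"])
        yield example


def lazy_map(examples, function, remove_columns):
    """Apply function to each example, then drop the unwanted columns."""
    for example in examples:
        out = function(example)
        yield {k: v for k, v in out.items() if k not in remove_columns}


subset = lazy_map(stream_examples(), lambda ex: ex, remove_columns=["audio"])
first = next(iter(subset))
print(first)
print(fetched_bytes)
```

The first example comes back without "audio", but the counter shows the audio bytes were still fetched before the column was removed.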