Remove_columns in map - what is the download footprint?

prassanna-ravishanka · September 1, 2023, 6:38pm

Hi all, first question here so forgive me -

I am trying to understand the download mechanism in streamable datasets. I have a gigaspeech dataset, from which I am removing the “audio” column/feature.

>>> from datasets import load_dataset
>>> gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", streaming=True)
>>> gigaspeech_subset = gigaspeech["train"].map(lambda example: example, remove_columns=['audio'])
>>> next(iter(gigaspeech_subset))
{'segment_id': 'YOU0000000315_S0000660', 'speaker': 'N/A', ....}

In this case, does it still download the audio? If I download the audio the next time, will it be cached?

mariosasko · September 4, 2023, 7:01pm

Yes, the audio is still downloaded, unless the dataset is stored as Parquet, in which case it’s possible to select which columns to download like so: load_dataset(..., columns=list_of_columns_to_read).

In streaming mode, we don’t cache any data.

prassanna-ravishanka · September 6, 2023, 5:44pm

@mariosasko Just to clarify, in streaming mode, once I have mapped a subset of the original dataset, only the subset data is used?

mariosasko · September 6, 2023, 6:32pm

No, I meant that your map still downloads the audio column, and it’s only possible to avoid this if the dataset is in Parquet (by specifying columns to read in load_dataset).
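
For readers who want to try the Parquet route described above, here is a minimal sketch. It assumes the data is available as Parquet files and that the `columns` argument is forwarded to the Parquet builder, as mentioned in the reply. The helper names `columns_to_read` and `stream_without_columns` are hypothetical and not part of the library, and the `parquet_files` argument is a placeholder for whatever paths or URLs hold the data.

```python
from typing import Iterable, List


def columns_to_read(all_columns: Iterable[str], unwanted: Iterable[str]) -> List[str]:
    """Return the columns to request, keeping the original order and skipping unwanted ones."""
    skip = set(unwanted)
    return [name for name in all_columns if name not in skip]


def stream_without_columns(parquet_files, all_columns, unwanted):
    """Stream a Parquet dataset while reading only the wanted columns (hypothetical helper)."""
    # Imported lazily so columns_to_read can be used without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the dataset is stored as Parquet, so `columns` limits what is read.
    # Unlike map(..., remove_columns=...), the skipped columns are not fetched first.
    return load_dataset(
        "parquet",
        data_files=parquet_files,
        columns=columns_to_read(all_columns, unwanted),
        split="train",
        streaming=True,
    )
```

For example, `columns_to_read(["segment_id", "speaker", "audio"], ["audio"])` gives the list to pass as `columns`. The column list here is illustrative and only uses the names visible in the question.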
