Common Voice 8.0.0 en using all available RAM

Ollie · April 6, 2022, 3:00pm

Hello. In the past I’ve been able to download prior Common Voice releases using common_voice and the version number, but mozilla-foundation/common_voice_8_0 seems to give me memory issues.

The code snippet below downloads the dataset just fine but uses up all available memory in a Colab High-RAM environment when preparing the train set:

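import datasets

# Download and prepare the English train split (requires a Hugging Face auth token).
dataset = datasets.load_dataset("mozilla-foundation/common_voice_8_0",
                                "en",
                                use_auth_token="my_auth_token",
                                split="train")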

It’s the English dataset, so it takes a while to download. I reduced writer_batch_size in the hope that it would help, but no luck.

Any help would be greatly appreciated!

Ollie · April 7, 2022, 6:50am

Pinging @lhoestq because this could be related to this issue maybe?

lhoestq · April 7, 2022, 9:19am

Hi! Which version of datasets do you have this issue with? And which version of pyarrow?

Ollie · April 7, 2022, 10:00am

Sorry, should have provided that first.
datasets was installed from master, so the version is 2.0.1.dev0; the pyarrow version is 6.0.1.
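
For anyone trying to reproduce this, here is a minimal sketch of one way to report the two versions asked about above. It uses only the standard library, and the helper name installed_version is hypothetical, not something from datasets or pyarrow:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the two packages relevant to this thread.
for name in ("datasets", "pyarrow"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Pasting the printed lines into a report like this one saves a round trip when someone asks for the environment details.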

lhoestq · April 7, 2022, 10:22am

Thanks! I was able to reproduce this with datasets 2.0.0 and pyarrow 6.0.1 on Colab. I’ll investigate.

Ollie · April 19, 2022, 9:44am

Hi @lhoestq, don’t mean to be a pain but were you able to find the cause of this? Thanks!

lhoestq · April 19, 2022, 9:56am

Hi, not yet, but investigating this is indeed on my very short-term todo list. Will keep you posted!

Ollie · August 5, 2022, 1:18pm

Fixed
