Common Voice 8.0.0 en using all available RAM

Hello. In the past I’ve been able to download prior Common Voice releases using common_voice and the version number, but mozilla-foundation/common_voice_8_0 seems to give me memory issues.

The code snippet below downloads the dataset just fine but uses up all available memory in a Colab High-RAM environment when preparing the train set:

import datasets

# Downloads fine, but memory fills up while the train split is being prepared.
dataset = datasets.load_dataset("mozilla-foundation/common_voice_8_0",
                                "en",
                                use_auth_token="my_auth_token",
                                split="train")

It’s the English dataset, so it takes a while to download. I reduced the writer_batch_size in the hope that it would help, but no luck :frowning:

Any help would be greatly appreciated!

Pinging @lhoestq because this could be related to this issue maybe?

Hi! Which version of datasets are you seeing this issue with? And which version of pyarrow?

Sorry, should have provided that first.
datasets was installed from master, so the version is 2.0.1.dev0. The pyarrow version is 6.0.1.
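
For anyone else trying to reproduce this, here is a minimal sketch for reporting both versions in one go. It only uses the standard library (Python 3.8+), and the helper name installed_version is just something made up for this example, not part of datasets or pyarrow:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or "not installed"."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in the current environment.
        return "not installed"


# Report the two packages asked about in this thread.
for name in ("datasets", "pyarrow"):
    print(f"{name}: {installed_version(name)}")
```

Running this in the same Colab session that shows the memory issue should confirm which versions are actually in use, which matters when datasets is installed from master.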

Thanks! I was able to reproduce this with datasets 2.0.0 and pyarrow 6.0.1 on Colab. I’ll investigate.


Hi @lhoestq, don’t mean to be a pain but were you able to find the cause of this? Thanks!

Hi, not yet, but investigating this is indeed on my very short-term to-do list :wink: Will keep you posted!
