I am using load_dataset to load the new Wikipedia dataset by:
load_dataset("wikipedia", language="en", date="20230301", beam_runner="DirectRunner")
I have successfully downloaded the whole dataset; however, the procedure gets stuck after the download finishes and I have to interrupt it.
Any solutions?
Hi, just a hint from my experience preprocessing the es dataset:
I successfully preprocessed the es dump from 20230320, and that dataset is roughly 4 GB, which is much smaller than the en dataset you chose. The preprocessing took about 3 hours and consumed up to 35 GB of RAM, but it worked. Currently, there is one annoyance: there is no indication of progress, except if you monitor the size of the temporary file being created in the folder ~/.cache/huggingface/datasets/wikipedia/en/2.0.0/... . You can see that file growing about every 2 minutes.
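If you want to watch that growth without checking by hand, a small polling script works. This is just a sketch I'd use, not part of the datasets library; the cache directory below is the generic Wikipedia cache root, since the exact versioned subfolder varies:

```python
import os
import time

def dir_size_bytes(path):
    """Total size, in bytes, of all files under `path` (0 if it doesn't exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            try:
                total += os.path.getsize(fp)
            except OSError:
                pass  # temp files may vanish between walk and stat
    return total

def monitor(path, interval=120):
    """Print the cache size every `interval` seconds until interrupted."""
    while True:
        print(f"{dir_size_bytes(path) / 1e9:.2f} GB written so far")
        time.sleep(interval)

# Example (blocks until Ctrl-C):
# monitor(os.path.expanduser("~/.cache/huggingface/datasets/wikipedia"))
```

Since the temp file grows roughly every 2 minutes, a 120-second polling interval is enough to see steady progress.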
So I'm guessing that in your case the processing will take at least 12 hours and consume much more than 35 GB of RAM.
A good alternative to preprocessing locally would be to use DataflowRunner, but I don't know precisely how to craft the beam_options arguments to do so.
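For what it's worth, here is a rough, untested sketch of what those arguments might look like, assuming the standard Dataflow pipeline options. The project, region, and bucket names are placeholders, and running this requires a GCP account with Dataflow enabled:

```python
# Untested sketch: pass Dataflow pipeline options through load_dataset.
# All GCP identifiers below are placeholders -- replace them with your own.
from apache_beam.options.pipeline_options import PipelineOptions
from datasets import load_dataset

beam_options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",                   # placeholder project id
    "--region=us-central1",                       # placeholder region
    "--temp_location=gs://my-bucket/tmp",         # placeholder bucket
    "--staging_location=gs://my-bucket/staging",  # placeholder bucket
])

ds = load_dataset(
    "wikipedia",
    language="es",
    date="20230320",
    beam_runner="DataflowRunner",
    beam_options=beam_options,
)
```

Someone who has actually run this on Dataflow should confirm which options are required.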
My 2 cts!
Cheers and take care