How to preprocess a wikipedia dataset using DataflowRunner?

graelo · June 3, 2023, 9:01pm

Hi,

I was able to preprocess a recent (20230320) wikipedia dataset for `es` using the DirectRunner:

# install the mwparserfromhell from the main branch
# install "apache-beam[dataframe]"
wikipedia_es = load_dataset("wikipedia", language="es", date="20230320", beam_runner="DirectRunner")

but it was slow and memory-intensive. I would like to use DataflowRunner instead, but I’m missing a configuration example (essentially the beam_options).

Could someone provide such an example for load_dataset(...)? Searching for beam and DataflowRunner returns no results.

Thanks!

PS: I saw the datasets-cli example provided in the docs (Beam Datasets) for doing so, but it looks like we can’t provide a date to that tool.
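
For reference, here is a minimal sketch of what such a configuration might look like. It assumes the Beam-based wikipedia builder accepts beam_options alongside beam_runner, as the question implies. The project, region, bucket and job names are placeholders, the helper names are hypothetical, the flag set mirrors the datasets-cli example in the Beam Datasets docs, and none of this was run in this thread.

```python
def dataflow_flags(project, region, bucket, job_name, requirements_file=None):
    """Build command-line style Beam flags for a Dataflow job (all values are placeholders)."""
    flags = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--job_name={job_name}",
        f"--staging_location=gs://{bucket}/binaries",
        f"--temp_location=gs://{bucket}/temp",
    ]
    if requirements_file:
        # Workers need the same extra packages (e.g. mwparserfromhell) as the local machine.
        flags.append(f"--requirements_file={requirements_file}")
    return flags


def load_wikipedia_on_dataflow(flags, cache_dir):
    """Hypothetical wrapper: pass the flags to load_dataset as beam_options."""
    # Imported here so the flag helper above works without these packages installed.
    from apache_beam.options.pipeline_options import PipelineOptions
    from datasets import load_dataset

    return load_dataset(
        "wikipedia",
        language="es",
        date="20230320",
        beam_runner="DataflowRunner",
        beam_options=PipelineOptions(flags),
        # Assumption: a gs:// path the Dataflow workers can write to,
        # as in the datasets-cli example from the docs.
        cache_dir=cache_dir,
    )


flags = dataflow_flags("my-project", "europe-west1", "my-bucket", "wikipedia-es-gen")
```

The flag list is kept separate from the load_dataset call so it can be inspected or reused with datasets-cli before launching a paid Dataflow job.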

lhoestq · June 5, 2023, 9:39am

Hi! Indeed, datasets/run_beam.py (at main in huggingface/datasets on GitHub) doesn’t seem to support passing builder kwargs like date or language when instantiating the DatasetBuilder (builder_cls in the code).

Feel free to modify this script to your needs, and if you want to open a PR to support passing builder kwargs, that could also benefit other people.
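
One way such a modification could work is to turn the command-line arguments the script does not recognise into builder kwargs. The sketch below only illustrates that idea: the function name is hypothetical and this is not the actual run_beam.py code.

```python
def parse_builder_kwargs(extra_args):
    """Turn leftover CLI tokens like ['--date', '20230320', '--language=es'] into a dict."""
    kwargs = {}
    tokens = iter(extra_args)
    for token in tokens:
        if not token.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got {token!r}")
        key = token[2:]
        if "=" in key:
            # Form: --key=value
            key, value = key.split("=", 1)
        else:
            # Form: --key value
            value = next(tokens, None)
            if value is None:
                raise ValueError(f"Missing value for option --{key}")
        kwargs[key.replace("-", "_")] = value
    return kwargs


builder_kwargs = parse_builder_kwargs(["--date", "20230320", "--language=es"])
# The resulting dict could then be unpacked when instantiating the builder,
# e.g. builder_cls(..., **builder_kwargs).
```

All values stay strings in this sketch. A real implementation would need to decide how to handle typed arguments.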

graelo · June 7, 2023, 9:46am

Thanks Quentin! I’ll give it a shot, probably over the weekend.

graelo · June 12, 2023, 6:52am

Here is my PR: Pass datasets-cli additional args as kwargs to DatasetBuilder in `run_beam.py` by graelo · Pull Request #5942 · huggingface/datasets · GitHub

Cheers and thanks in advance for your input.
