How to preprocess a wikipedia dataset using DataflowRunner?

Hi,

I was able to preprocess a recent (20230320) wikipedia dataset for `es` using the DirectRunner:

# install the mwparserfromhell from the main branch
# install "apache-beam[dataframe]"
wikipedia_es = load_dataset("wikipedia", language="es", date="20230320", beam_runner="DirectRunner")

but it was slow and memory-intensive. I would like to use DataflowRunner instead, but I’m missing a configuration example (essentially the beam_options).

Could someone provide such an example for load_dataset(...)? Searching for beam and DataflowRunner returns no results.

Thanks!

PS: I saw the datasets-cli example provided in the docs (Beam Datasets) for doing so, but it looks like we can’t provide a date to that tool.

Hi! Indeed, datasets/run_beam.py (on main in huggingface/datasets on GitHub) doesn’t seem to support passing builder kwargs like date or language when instantiating the DatasetBuilder (builder_cls in the code).

Feel free to adapt this script to your needs. If you want to open a PR to support passing builder kwargs, that could also benefit other people 🙂
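In the meantime, here is a minimal, untested sketch of how the Dataflow options might be passed from Python instead of through datasets-cli. It assumes that Beam-based builders accept `beam_runner` and `beam_options` as keyword arguments (as the `load_dataset` call above suggests) and that `PipelineOptions(flags=[...])` is an acceptable value for `beam_options`. The project, region, bucket, job name and the helper names `dataflow_flags` and `prepare_wikipedia_on_dataflow` are all placeholders, not part of the library.

```python
def dataflow_flags(project, region, bucket, job_name, requirements_file=None):
    """Build Beam command-line flags for a Dataflow job.

    All values are placeholders to replace with your own GCP settings.
    """
    flags = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--job_name={job_name}",
        f"--staging_location=gs://{bucket}/binaries",
        f"--temp_location=gs://{bucket}/temp",
    ]
    if requirements_file:
        # File listing the packages the remote workers need
        # (e.g. datasets, mwparserfromhell).
        flags.append(f"--requirements_file={requirements_file}")
    return flags


def prepare_wikipedia_on_dataflow(project, region, bucket,
                                  language="es", date="20230320"):
    """Hypothetical helper: prepare the dataset with DataflowRunner."""
    # Imported lazily so dataflow_flags() works without these packages.
    from apache_beam.options.pipeline_options import PipelineOptions
    from datasets import load_dataset_builder

    options = PipelineOptions(
        flags=dataflow_flags(
            project, region, bucket,
            job_name=f"wikipedia-{language}-{date}",
        )
    )
    # Assumption: the builder forwards these kwargs the same way
    # load_dataset(...) does, and a gs:// cache_dir is writable by the workers.
    builder = load_dataset_builder(
        "wikipedia",
        language=language,
        date=date,
        cache_dir=f"gs://{bucket}/cache/datasets",
        beam_runner="DataflowRunner",
        beam_options=options,
    )
    builder.download_and_prepare()
    return builder
```

The sketch uses `load_dataset_builder` plus `download_and_prepare()` rather than `load_dataset`, because the prepared files end up in a remote bucket and loading them back is a separate step.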


Thanks Quentin! I’ll give it a shot, probably over the weekend.

Here is my PR: Pass datasets-cli additional args as kwargs to DatasetBuilder in `run_beam.py` by graelo · Pull Request #5942 · huggingface/datasets · GitHub

Cheers and thanks in advance for your input.
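For readers who want the general idea without reading the PR, the snippet below is a rough illustration of how extra command-line arguments could be collected into builder kwargs. It is not the code from the PR: the function names `parse_builder_kwargs` and `split_cli_args` and the reduced argument list are made up for this sketch, which only relies on `argparse.parse_known_args` from the standard library.

```python
import argparse


def parse_builder_kwargs(unknown_args):
    """Turn ['--date', '20230320', '--language', 'es'] into a kwargs dict."""
    if len(unknown_args) % 2 != 0:
        raise ValueError("extra arguments must come in '--key value' pairs")
    kwargs = {}
    for flag, value in zip(unknown_args[::2], unknown_args[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected an option like --key, got {flag!r}")
        kwargs[flag[2:]] = value
    return kwargs


def split_cli_args(argv):
    """Separate the options the script knows from extra builder kwargs."""
    parser = argparse.ArgumentParser(prog="run_beam_sketch")
    parser.add_argument("dataset")
    parser.add_argument("--beam_pipeline_options", default="")
    # Anything the parser does not recognise is returned in `unknown`.
    known, unknown = parser.parse_known_args(argv)
    return known, parse_builder_kwargs(unknown)


if __name__ == "__main__":
    known, builder_kwargs = split_cli_args(
        ["wikipedia", "--date", "20230320", "--language", "es"]
    )
    print(known.dataset, builder_kwargs)
```

The resulting dict could then be unpacked into the builder constructor, e.g. `builder_cls(**builder_kwargs)`, alongside the arguments the script already passes.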