Hi,
I was able to preprocess a recent (20230320) Wikipedia dataset for `es` using the DirectRunner:
# install the mwparserfromhell from the main branch
# install "apache-beam[dataframe]"
from datasets import load_dataset
wikipedia_es = load_dataset("wikipedia", language="es", date="20230320", beam_runner="DirectRunner")
but it was slow and memory-intensive. I would like to use DataflowRunner instead, but I’m missing a configuration example (essentially the `beam_options`).
Could someone provide such an example for `load_dataset(...)`? Searching for `beam` and `DataflowRunner` returns no results.
Thanks!
PS: I saw the `datasets-cli` example provided in the docs (Beam Datasets) for doing so, but it looks like we can’t provide a date to that tool.