I was able to preprocess a recent (`20230320`) Wikipedia dataset for `es` using the `DirectRunner`,
    # Install mwparserfromhell from the main branch
    # Install "apache-beam[dataframe]"
    from datasets import load_dataset

    wikipedia_es = load_dataset("wikipedia", language="es", date="20230320", beam_runner="DirectRunner")
but it was slow and memory-intensive. I would like to use `DataflowRunner` instead, but I’m missing a configuration example (essentially the
Could someone provide such an example for `load_dataset(...)`? Searching for `DataflowRunner` returns no results.
PS: I saw the `datasets-cli` example provided in the docs (Beam Datasets) for doing this, but it looks like we can’t provide a date to that tool.
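
To make the request concrete, here is a rough, untested sketch of the kind of configuration I have in mind. It assumes that `load_dataset` forwards a `beam_options` argument (an Apache Beam `PipelineOptions`) to the builder the same way it forwards `beam_runner`, which I have not confirmed. The project ID, region, bucket, and requirements file path are placeholders, and the helper names `dataflow_flags` and `load_wikipedia_es_on_dataflow` are made up for illustration.

```python
from typing import List


def dataflow_flags(project: str, region: str, bucket: str,
                   requirements_file: str) -> List[str]:
    """Build Apache Beam command-line style flags for the Dataflow runner.

    All argument values are placeholders supplied by the caller.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location=gs://{bucket}/temp",
        f"--staging_location=gs://{bucket}/binaries",
        f"--requirements_file={requirements_file}",
    ]


def load_wikipedia_es_on_dataflow(flags: List[str]):
    """Untested sketch: assumes load_dataset accepts beam_options."""
    # Imported lazily so the flag helper above works without these packages.
    from apache_beam.options.pipeline_options import PipelineOptions
    from datasets import load_dataset

    return load_dataset(
        "wikipedia",
        language="es",
        date="20230320",
        beam_runner="DataflowRunner",
        beam_options=PipelineOptions(flags),  # assumption, not confirmed
    )


flags = dataflow_flags(
    project="my-gcp-project",  # hypothetical project ID
    region="europe-west1",     # hypothetical region
    bucket="my-bucket",        # hypothetical GCS bucket
    requirements_file="/tmp/beam_requirements.txt",
)
```

Calling `load_wikipedia_es_on_dataflow(flags)` would then, if the assumption holds, submit the job to Dataflow. The requirements file would presumably need to list `mwparserfromhell` and the other dependencies so that the workers can install them. Whether the cache directory also has to point to a GCS location in this setup is something I don’t know.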