What does the wikipedia dataset with the specific language and date mean?

StephennFernandes · May 4, 2022, 11:19am

Hey there, I've been downloading the Wikipedia corpus and came across this script, which downloads the Wikipedia dump for a specific language at a given date:
import datasets
lang_dataset = datasets.load_dataset("wikipedia", "20220301.hi", beam_runner="DirectRunner")

My question is: does this download all the text that's available on Wikipedia for the given language, or is it limited to the text that was updated on Wikipedia on that specific date?

I actually need to download all the data Wikipedia has for the given language. How do I do that?

lhoestq · May 5, 2022, 7:43pm

Hi! It downloads all the text that's available on Wikipedia as of the given date.

More specifically, it downloads the Wikipedia dump taken at that date for the specified language.
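
To make the naming concrete, here is a minimal sketch of how a config name like "20220301.hi" from the question breaks down into a dump date and a language code. Note that parse_wikipedia_config is a hypothetical helper written for illustration, not part of the datasets library, and it assumes the "<YYYYMMDD>.<language>" layout seen in the question.

```python
from datetime import date, datetime
from typing import Tuple


def parse_wikipedia_config(config_name: str) -> Tuple[date, str]:
    """Split a config name such as "20220301.hi" into (dump date, language code).

    Hypothetical helper for illustration only; assumes "<YYYYMMDD>.<language>".
    """
    dump_date, _, language = config_name.partition(".")
    if not language:
        raise ValueError(f"expected '<YYYYMMDD>.<language>', got {config_name!r}")
    return datetime.strptime(dump_date, "%Y%m%d").date(), language


dump_date, language = parse_wikipedia_config("20220301.hi")
print(dump_date.isoformat(), language)  # 2022-03-01 hi
```

So "20220301.hi" reads as "the Hindi Wikipedia dump taken on 2022-03-01", i.e. a snapshot of everything in that language at that date, not just that day's edits.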
