Where to find the "wiki-big.train.raw" data as mentioned in the snippet for tokenizers 0.9?

I came across this short snippet of code on LinkedIn by HuggingFace, introducing tokenizers 0.9.

LinkedIn URL: snippet for tokenizers 0.9

How do I get the following dataset so I can run the code snippet? Is it available in the Hugging Face datasets library?

files = ["../../data/wiki-big.train.raw"]


This dataset can probably get you started. This gist by @thomwolf may also prove useful.
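
In case it helps, here is a minimal sketch of how a raw text file like the one in the snippet could be produced. It assumes the raw WikiText-103 corpus from the datasets library is an acceptable stand-in; I don't know whether that is the same data as the original "wiki-big.train.raw". `write_raw_text` is a hypothetical helper written for this example, and the output path is only an illustration.

```python
from pathlib import Path
from typing import Iterable


def write_raw_text(lines: Iterable[str], out_path: str) -> int:
    """Write non-empty lines to a plain-text file, one per line; return the count."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with path.open("w", encoding="utf-8") as f:
        for line in lines:
            text = line.strip()
            if text:  # skip blank lines
                f.write(text + "\n")
                written += 1
    return written


def main() -> None:
    # Assumption: WikiText-103 (raw) stands in for the corpus in the snippet.
    # Requires `pip install datasets` and a network connection.
    from datasets import load_dataset

    dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    count = write_raw_text((row["text"] for row in dataset), "data/wiki-big.train.raw")
    print(f"wrote {count} lines")


if __name__ == "__main__":
    main()
```

After that, the `files` list from the snippet can point at whatever path you wrote to, for example `files = ["data/wiki-big.train.raw"]`.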


Thank you. :+1: