Where to find the "wiki-big.train.raw" data as mentioned in the snippet for tokenizers 0.9?

sugatoray · October 29, 2020, 9:49am

I came across this short code snippet, posted by Hugging Face on LinkedIn, introducing tokenizers 0.9.

How do I get the following dataset so that I can run the code snippet? Is it available through the Hugging Face datasets library?

files = ["../../data/wiki-big.train.raw"]

BramVanroy · October 29, 2020, 9:53am

This dataset can probably get you started. This gist by @thomwolf may also prove useful.
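
As far as I can tell, files in that snippet is just a list of paths to plain-text files, so any raw text corpus should work as a stand-in for wiki-big.train.raw. Here is a minimal sketch, assuming you already have your texts as Python strings (for example, pulled from whichever dataset you pick). The helper name write_raw_corpus and the file name my-corpus.train.raw are made up for illustration; they are not part of tokenizers or the original snippet.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_raw_corpus(texts: Iterable[str], path: str) -> int:
    """Write one non-empty text per line to `path`.

    Hypothetical helper (not a tokenizers API). Returns the number of
    lines written.
    """
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out.open("w", encoding="utf-8") as f:
        for text in texts:
            # Collapse internal newlines and repeated whitespace so that
            # each text ends up on exactly one line.
            line = " ".join(text.split())
            if line:  # skip empty or whitespace-only entries
                f.write(line + "\n")
                count += 1
    return count


# Stand-in texts for the demo; replace with the texts of your own corpus.
sample_texts = ["Hello world.", "   ", "Tokenizers train on\nplain text."]

# Written to a temporary directory here; any writable path would do.
corpus_path = os.path.join(tempfile.mkdtemp(), "my-corpus.train.raw")
written = write_raw_corpus(sample_texts, corpus_path)

# This list can then take the place of the `files` list in the snippet.
files = [corpus_path]
```

Writing one text per line keeps the file easy to inspect and to split later, but that layout is my own choice here, not something the snippet requires.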

sugatoray · October 29, 2020, 10:04am

Thank you.
