Counting the number of training tokens in a scalable way

Hi all,

I wanted a project that shows how to compute the total number of training tokens in a large text dataset from :hugs: datasets, using Apache Beam and Cloud Dataflow.

In NLP, the number of training tokens dictates model scaling behaviour. However, counting tokens is non-trivial for large-scale datasets. Hence this project.

These are the steps I have followed:

  • Load the wikitext dataset using datasets. It has over a million training samples, which makes it a good candidate for demonstration purposes.
  • Generate .jsonl shards of the dataset and upload them to a Google Cloud Storage (GCS) bucket. Sharding is needed because Apache Beam reads data shard by shard, which lets it parallelize processing across many workers.
  • Train a tokenizer on the wikitext dataset using the :hugs: tokenizers library. The tokenizer I trained is available on the Hub: sayakpaul/unigram-tokenizer-wikitext.
  • Execute the Apache Beam pipeline on Dataflow to compute the total number of training tokens.

Here’s the code:

Thanks to @lhoestq for all the help!