Counting the number of training tokens in a scalable way

Hi all,

I wanted a project that shows how to compute the total number of training tokens in a large text dataset from :hugs: datasets, using Apache Beam and Cloud Dataflow.

In NLP, the number of training tokens dictates model scaling behaviour. However, counting tokens is non-trivial for large-scale datasets. Hence this project.

These are the steps I have followed:

  • Load the wikitext dataset using datasets. It has over a million training samples, which makes it a good candidate for demonstration purposes.
  • Generate .jsonl shards of the dataset and upload them to a Google Cloud Storage (GCS) bucket. Sharding is needed because Apache Beam reads data shard by shard, which lets it parallelize processing across many workers.
  • Train a tokenizer on the wikitext dataset using the :hugs: tokenizers library. The tokenizer I trained is available on the Hub: sayakpaul/unigram-tokenizer-wikitext.
  • Execute the Apache Beam pipeline on Dataflow to compute the total number of training tokens.

Here’s the code:

Thanks to @lhoestq for all the help!