Counting the number of training tokens in a scalable way

Hi all,

I wanted a project that shows how to compute the total number of training tokens in a large text dataset from :hugs: datasets, using Apache Beam and Cloud Dataflow.

In NLP, the number of training tokens dictates model scaling behaviour. However, counting tokens is non-trivial for large-scale datasets. Hence this project.

These are the steps I have followed:

  • Load the wikitext dataset using datasets. It has over a million training samples, which makes it a good candidate for demonstration purposes.
  • Generate .jsonl shards of the dataset and upload them to a Google Cloud Storage (GCS) bucket. Sharding is needed because Apache Beam reads data shard by shard, which lets it parallelize processing across many workers.
  • Train a tokenizer on the wikitext dataset using the :hugs: tokenizers library. The tokenizer I trained is available on the Hub: sayakpaul/unigram-tokenizer-wikitext.
  • Execute the Apache Beam pipeline on Dataflow to compute the total number of training tokens.

Here’s the code:

Thanks to @lhoestq for all the help!