How much data to train a language model from scratch?

This is more of an open question. Are there any heuristics for knowing in advance how much data is required to train a language model from scratch in English (on a specific subfield)?

I imagine there are some minimum thresholds in terms of the number of tokens required so that grammar, syntax and other basic language rules are somehow learned? How much is enough? 1 GB of text? 10 GB?

Any suggestion is welcome.
Thanks!

It’s a difficult question to answer, but not as mysterious as some people would have you believe. You could say “a lot” or “how long is a piece of string”, but there’s a better answer:

Jared Kaplan and colleagues proposed the “Scaling Laws for Neural Language Models”, which estimate the model size, training dataset size and compute required to achieve a given amount of “loss”. The tolerance for loss is what determines the accuracy and usability of a model. Kaplan’s theoretical laws matched experiment, so we can now predict reasonably accurately how big our training set (of clean data) needs to be.
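
To make that concrete, here is a minimal Python sketch of the data-limited law from that paper, using the approximate constants Kaplan et al. report (alpha_D ≈ 0.095, D_c ≈ 5.4 × 10^13 tokens). The “~4 bytes of text per token” conversion is my own rough assumption, not something from the paper:

```python
# Minimal sketch of the data-limited scaling law from Kaplan et al. (2020),
# "Scaling Laws for Neural Language Models". Constants are the approximate
# values reported in the paper; treat the output as a rough guide, not a guarantee.

def loss_from_data(num_tokens: float, alpha_D: float = 0.095, D_c: float = 5.4e13) -> float:
    """Predicted cross-entropy loss (nats/token) when dataset size is the bottleneck."""
    return (D_c / num_tokens) ** alpha_D

# Rough conversion: ~4 bytes of English text per token (an assumption, varies by tokenizer).
BYTES_PER_TOKEN = 4

for gigabytes in (1, 10, 100):
    tokens = gigabytes * 1e9 / BYTES_PER_TOKEN
    print(f"{gigabytes:>3} GB ≈ {tokens:.1e} tokens -> predicted loss ≈ {loss_from_data(tokens):.2f}")
```

The point of the sketch is the shape of the curve: loss keeps falling as you add data, but slowly, so each extra point of quality costs roughly an order of magnitude more text.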

GPT (2018) had 117 million parameters. GPT-2 (2019) scaled that by roughly 10x to 1.5 billion parameters, with a large jump in quality. Guided by Kaplan’s scaling laws, OpenAI scaled the model and training set up again and, with a bit of transformer and attention model magic, GPT-3 gave us a fully working foundation model: 175 billion parameters, trained on hundreds of billions of tokens.

So the rough, rough answer, depending on a whole bunch of factors, is approximately 12 billion+ words, or 50-100 GB of text.
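
As a sanity check on that word-to-gigabyte conversion, a quick back-of-envelope calculation (the bytes-per-word figure is just a rough assumption):

```python
# Back-of-envelope check on the "12 billion words ≈ 50-100 GB" figure.
# Assumes an average English word of ~5 characters plus a space, i.e. ~6 bytes
# per word; the exact figure varies by corpus and encoding.
words = 12e9
bytes_per_word = 6
gigabytes = words * bytes_per_word / 1e9
print(f"{words:.0e} words ≈ {gigabytes:.0f} GB")  # ≈ 72 GB, inside the 50-100 GB range
```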