How much data to train a language model from scratch?

This is more of an open question. Are there any heuristics for knowing in advance how much data is required to train a language model from scratch in English (on a specific subfield)?

I imagine there are some minimum thresholds in terms of the number of tokens required so that grammar, syntax and other basic language rules are somehow learned? How much is enough? 1 GB of text? 10 GB?

Any suggestion is welcome.
Thanks!

It’s a difficult question to answer, but not as mysterious as some people would have you believe. You could say “a lot” or “how long is a piece of string”, but there’s a better answer:

Jared Kaplan and colleagues proposed the “Scaling Laws for Neural Language Models”, which estimate the model size, training dataset size and compute required to achieve a given amount of “loss”. The tolerance for loss is what determines the accuracy and usability of a model. Kaplan’s theoretical laws matched experiment, so we can now predict reasonably accurately how big our training set (of clean data) needs to be.
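
To make that concrete, here is a minimal Python sketch of the data-limited law from that paper, using the approximate constants Kaplan et al. report (alpha_D ≈ 0.095, D_c ≈ 5.4 × 10^13 tokens). The “~4 bytes of text per token” conversion is my own rough assumption, not something from the paper:

```python
# Minimal sketch of the data-limited scaling law from Kaplan et al. (2020),
# "Scaling Laws for Neural Language Models". Constants are the approximate
# values reported in the paper; treat the output as a rough guide, not a guarantee.

def loss_from_data(num_tokens: float, alpha_D: float = 0.095, D_c: float = 5.4e13) -> float:
    """Predicted cross-entropy loss (nats/token) when dataset size is the bottleneck."""
    return (D_c / num_tokens) ** alpha_D

# Rough conversion: ~4 bytes of English text per token (an assumption, varies by tokenizer).
BYTES_PER_TOKEN = 4

for gigabytes in (1, 10, 100):
    tokens = gigabytes * 1e9 / BYTES_PER_TOKEN
    print(f"{gigabytes:>3} GB ≈ {tokens:.1e} tokens -> predicted loss ≈ {loss_from_data(tokens):.2f}")
```

The point of the sketch is the shape of the curve: loss keeps falling as you add data, but slowly, so each extra point of quality costs roughly an order of magnitude more text.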

GPT (2018) had 117 million parameters. GPT-2 (2019) scaled that by roughly 10x to 1.5 billion parameters, with a large jump in quality. Guided by Kaplan’s scaling laws, OpenAI scaled the model and training set up again and, with a bit of transformer and attention model magic, GPT-3 gave us a fully working foundation model: 175 billion parameters, trained on hundreds of billions of tokens.

So the rough, rough answer, depending on a whole bunch of factors, is approximately 12 billion+ words, or 50-100 GB of text.
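
As a sanity check on that word-to-gigabyte conversion, a quick back-of-envelope calculation (the bytes-per-word figure is just a rough assumption):

```python
# Back-of-envelope check on the "12 billion words ≈ 50-100 GB" figure.
# Assumes an average English word of ~5 characters plus a space, i.e. ~6 bytes
# per word; the exact figure varies by corpus and encoding.
words = 12e9
bytes_per_word = 6
gigabytes = words * bytes_per_word / 1e9
print(f"{words:.0e} words ≈ {gigabytes:.0f} GB")  # ≈ 72 GB, inside the 50-100 GB range
```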