This is more of an open question. Are there any heuristics for estimating in advance how much data is required to train a language model from scratch in English (on a specific subfield)?
I imagine there is some minimum number of tokens required for the model to learn the grammar, syntax and other basic language rules. How much is enough? 1 GB of text? 10 GB?
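To make the GB figures comparable with token counts, here is a rough conversion sketch. The ratio of about 4 bytes of plain English text per subword token is an assumption (a common rule of thumb, not a measured value for any particular tokenizer), and `estimate_tokens` is just a hypothetical helper name:

```python
# Rough conversion from corpus size on disk to an approximate token count.
# ASSUMPTION: about 4 bytes of plain English text per subword token; the real
# ratio depends on the tokenizer and the domain, so measure it on a sample.
BYTES_PER_GB = 10**9  # decimal gigabyte


def estimate_tokens(size_gb: float, bytes_per_token: float = 4.0) -> int:
    """Return an approximate token count for a plain-text corpus of size_gb."""
    if size_gb < 0 or bytes_per_token <= 0:
        raise ValueError("size_gb must be >= 0 and bytes_per_token must be > 0")
    return int(size_gb * BYTES_PER_GB / bytes_per_token)


# Print the estimate for the two corpus sizes mentioned in the question.
for size in (1, 10):
    print(f"{size} GB ~ {estimate_tokens(size):,} tokens")
```

A better estimate would come from running the actual tokenizer on a sample of the subfield corpus and passing the measured bytes-per-token ratio as `bytes_per_token`.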
Any suggestions are welcome.