This is a follow-up question regarding this Hugging Face blog post on training an LM (and a tokenizer) from scratch:
I’d like to try this approach, but I’m wondering about cost: how much data do I really need to train a (small) LM from scratch on my domain-specific dataset, and which LM might be a decent starting point for a proof of concept? I’m quite new to the field and haven’t read many papers on this subject yet, so I was hoping someone could provide some ballpark estimates of the computing resources required to train a small LM from scratch.

My goal is a domain-specific LM to serve as a backbone for various downstream NLP tasks on my domain-specific text data. I have been experimenting with the fine-tuning approach (i.e. fine-tuning BERT-based models on MLM before performing task-specific fine-tuning), but I’m curious about the training-from-scratch option if I can get a rough idea of the required compute resources and cost.
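For concreteness, this is roughly the from-scratch setup I have in mind, based on my reading of the blog post. The corpus filename, vocabulary size, model dimensions, and training arguments below are placeholders I picked myself, not values from the post, so please treat it as a sketch rather than a final plan:

```python
import os

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Hypothetical domain corpus: one plain-text document per line.
corpus_file = "domain_corpus.txt"

# 1. Train a byte-level BPE tokenizer on the domain text.
os.makedirs("domain-tokenizer", exist_ok=True)
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=[corpus_file],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bpe.save_model("domain-tokenizer")
tokenizer = RobertaTokenizerFast.from_pretrained("domain-tokenizer")

# 2. Define a scaled-down RoBERTa-style model (much smaller than roberta-base).
config = RobertaConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# 3. Tokenize the corpus and train with the MLM objective.
dataset = load_dataset("text", data_files={"train": corpus_file})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domain-lm",
        per_device_train_batch_size=32,
        num_train_epochs=1,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```

What I’d really like to know is how much data and GPU time something in this ballpark would need before it becomes a useful backbone.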
Thanks very much in advance for any help / tips on unpacking this question.