Training a language model from scratch with TensorFlow (not PyTorch)?

Hello there,

I am interested in training a language model from scratch: not fine-tuning the usual DistilBERT, but running the whole thing on my GPU instead :sunglasses:!

I found this interesting notebook, How to train a new language model from scratch using Transformers and Tokenizers, and I would like to know whether there is a version that uses TensorFlow instead. Unfortunately, I cannot have PyTorch on my machine.

Is there a Hugging Face example notebook that would help me do that?

Thanks!

Hi there!

This might be what you’re after: transformers/examples/tensorflow/language-modeling at master · huggingface/transformers · GitHub

You'd use the `run_clm.py` script for a GPT-2-like model, and the `run_mlm.py` script for a BERT-like model.
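
If you'd rather stay in a notebook, the core of what the MLM script does can also be sketched directly with the TF classes in the library. To be clear, this is a minimal sketch of mine rather than the official script: it assumes a fairly recent `transformers`/`datasets` install (for `prepare_tf_dataset` and `return_tensors="tf"`), a plain-text corpus in `corpus.txt` (a placeholder path), placeholder hyperparameters, and it reuses the `bert-base-uncased` tokenizer only to keep things short; for a genuinely new language model you would train your own tokenizer on the corpus, as in the notebook you linked.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    DataCollatorForLanguageModeling,
    TFAutoModelForMaskedLM,
    create_optimizer,
)

# Tokenizer reused from bert-base-uncased purely to keep the sketch short.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Fresh, randomly initialised BERT-style model (no pretrained weights).
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = TFAutoModelForMaskedLM.from_config(config)

# Plain-text corpus, one document per line (the path is a placeholder).
raw = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking for the MLM objective, returning TF tensors.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf"
)
tf_train = model.prepare_tf_dataset(
    tokenized, shuffle=True, batch_size=32, collate_fn=collator
)

# AdamW with a linear warmup/decay schedule; the step counts are arbitrary.
optimizer, _ = create_optimizer(
    init_lr=5e-5, num_train_steps=10_000, num_warmup_steps=500
)

# With no explicit loss, the model falls back to its built-in MLM loss.
model.compile(optimizer=optimizer)
model.fit(tf_train, epochs=3)

model.save_pretrained("my-new-lm")
tokenizer.save_pretrained("my-new-lm")
```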

EDIT: If you're able to use Docker on your machine, you could also use a Hugging Face image to run that notebook you linked to.

Thanks @cakiki aka aznavour :smiley:

But these examples are much, much more complex than the code in the notebook I linked to (or the blog post). I'm looking for something more streamlined…

@cakiki Conceptually, I wonder whether training my own language model and then fine-tuning it for text classification would work better than fine-tuning the same old DistilBERT model that everybody is using. The corpus I am working on is highly specialized (medicine, for instance), so a dedicated language model makes sense.

What do you think?

I think it would depend on how much specialized data you have and how different it is from general text (perhaps compare it to the size of the dataset the model was originally pretrained on). If it's a considerable amount, it might make sense to continue pre-training from the checkpoint of the model you're interested in.
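
If you go the continued pre-training route, the TF code looks almost the same as training from scratch, except that you load the existing checkpoint instead of a fresh config, and you can then fine-tune the adapted model for classification. This is only a rough sketch under my own assumptions: the corpus path `domain_corpus.txt`, the `distilbert-base-uncased` starting checkpoint, the output names, `num_labels`, and the hyperparameters are all placeholders, and it assumes a recent `transformers`/`datasets` install.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TFAutoModelForMaskedLM,
    TFAutoModelForSequenceClassification,
    create_optimizer,
)

checkpoint = "distilbert-base-uncased"  # placeholder starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Step 1: continue masked-language-model pre-training on the specialized corpus,
# starting from the existing pretrained weights.
mlm_model = TFAutoModelForMaskedLM.from_pretrained(checkpoint)

corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf"
)
tf_corpus = mlm_model.prepare_tf_dataset(
    corpus, shuffle=True, batch_size=32, collate_fn=collator
)

# Lower learning rate than from-scratch training, since we start from good weights.
optimizer, _ = create_optimizer(init_lr=2e-5, num_train_steps=5_000, num_warmup_steps=250)
mlm_model.compile(optimizer=optimizer)  # uses the model's built-in MLM loss
mlm_model.fit(tf_corpus, epochs=1)

mlm_model.save_pretrained("distilbert-domain-adapted")
tokenizer.save_pretrained("distilbert-domain-adapted")

# Step 2: fine-tune the domain-adapted checkpoint for text classification.
clf_model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-domain-adapted", num_labels=2  # num_labels is a placeholder
)
# From here, tokenize the labelled dataset and call clf_model.compile()/fit()
# as in any standard TF sequence-classification example.
```

The only conceptual difference from training from scratch is `from_pretrained` versus `from_config`; the tokenization, collator, and `compile`/`fit` steps stay the same.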