Further pre-training language models like BERT with transformers

Hi all,
yesterday, at a workshop, I learned about this forum. I have the following question: is it possible to further pre-train transformer models (e.g. BERT, DistilBERT) on my own corpus? I mean not for a downstream task, but the language model itself (e.g. BERT's MLM and NSP objectives)? Is this possible in general, and with Hugging Face specifically?

Thank you, best regards.

Hi @lizzzi111, nice to see you here :slight_smile:

Yes, it’s possible.

Examples and readme to do so are here: https://github.com/huggingface/transformers/tree/master/examples/language-modeling
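For the core idea, here is a minimal sketch of one continued masked-language-model (MLM) pre-training step on your own corpus. To keep it self-contained it builds a tiny, randomly initialized BERT and a toy vocabulary; in practice you would instead load a released checkpoint, e.g. `BertForMaskedLM.from_pretrained("bert-base-uncased")` and its matching tokenizer, and wrap the loop with the `Trainer` API as in the linked examples. The file name `vocab.txt` and the toy sentences are placeholders.

```python
import torch
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizer,
    DataCollatorForLanguageModeling,
)

torch.manual_seed(0)

# Toy vocabulary and tokenizer (stand-ins for a real checkpoint's files).
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "my", "own", "corpus", "one", "sentence", "per", "line"]
with open("vocab.txt", "w") as f:
    f.write("\n".join(vocab))
tokenizer = BertTokenizer("vocab.txt")

# Tiny random model; swap in a pre-trained checkpoint to *further* pre-train.
config = BertConfig(vocab_size=len(vocab), hidden_size=32,
                    num_hidden_layers=2, num_attention_heads=2,
                    intermediate_size=64, max_position_embeddings=32)
model = BertForMaskedLM(config)

# The collator randomly masks 15% of tokens and fills in the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

lines = ["my own corpus", "one sentence per line"] * 16  # your corpus here
batch = collator([tokenizer(line) for line in lines])

# One standard pre-training step: the forward pass returns the MLM loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**batch).loss
loss.backward()
optimizer.step()
print(f"MLM loss: {loss.item():.3f}")
```

Note that this sketch only covers the MLM objective; the NSP objective would additionally require sentence-pair inputs and `BertForPreTraining`.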

Thank you a lot, @thomwolf!

Hi @thomwolf ,

The link has expired; could you please send it again?

Thank you