SpanBERT, ELECTRA, MARGE from scratch?

Hey everyone! I am incredibly grateful for this tutorial on training a language model from scratch: "How to train a new language model from scratch using Transformers and Tokenizers".

I really want to extend this to contiguous masking of longer token spans (e.g. [mask-5], [mask-8]). I have started looking into writing a custom DataCollator for this, but I suspect I will also need to make some changes to the model.
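To make the collator part concrete, here is a rough sketch of the masking step in plain Python, so it runs without any library. The function names, the geometric span-length distribution (p = 0.2, capped at 10) and the 15% masking budget are my assumptions, loosely following the SpanBERT paper; none of this is an existing Transformers API. It works on raw token ids rather than whole words and always substitutes the mask id (no 80/10/10 replacement). Labels use -100 for unmasked positions, which is the default ignore index of PyTorch's cross-entropy loss.

```python
import random


def sample_span_length(rng, p=0.2, max_len=10):
    """Draw a span length from a geometric distribution, clipped to max_len."""
    length = 1
    while length < max_len and rng.random() > p:
        length += 1
    return length


def mask_contiguous_spans(token_ids, mask_id, mask_ratio=0.15,
                          special_ids=(), seed=None):
    """Return (inputs, labels) with contiguous spans replaced by mask_id.

    labels holds the original id at masked positions and -100 elsewhere.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)

    # Positions that may be masked (everything except special tokens).
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    budget = max(1, round(len(candidates) * mask_ratio)) if candidates else 0

    masked = set()
    attempts = 0
    # Bounded retry loop so the function always terminates.
    while len(masked) < budget and attempts < 10 * len(inputs):
        attempts += 1
        # Never exceed the remaining budget with a single span.
        length = min(sample_span_length(rng), budget - len(masked))
        start = rng.choice(candidates)
        span = range(start, start + length)
        # Reject spans that run off the end, touch special tokens, or overlap.
        if any(i >= len(inputs) or token_ids[i] in special_ids or i in masked
               for i in span):
            continue
        for i in span:
            labels[i] = token_ids[i]
            inputs[i] = mask_id
            masked.add(i)
    return inputs, labels


if __name__ == "__main__":
    # Hypothetical ids: 101/102 stand in for [CLS]/[SEP], 103 for [MASK].
    ids = [101] + list(range(1000, 1038)) + [102]
    masked_inputs, masked_labels = mask_contiguous_spans(
        ids, mask_id=103, special_ids=(101, 102), seed=0)
    print(masked_inputs)
    print(masked_labels)
```

A real DataCollator would apply this per example and then pad and stack the results into tensors. This sketch only covers the input side and leaves out any extra training objective or model change.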

Has anyone looked into this and can point me to any resources?

Thank you!

Found something useful on StackOverflow for this:

This tutorial on Keras Code Examples is the most useful thing I have found so far on this:

Hello Connor :hugs: Great to see you here! I won't be able to help you myself, so I'm going to ping @nielsr here; maybe he can help. Sorry for the delay!


I’ve seen that SpanBERT models are on the hub, but we haven’t yet added the model itself to the library.

This would be a great project actually:

  • contribute SpanBERT to HuggingFace Transformers, based on the modeling file. This will be relatively easy, as the authors already used HuggingFace’s implementation of BERT and tweaked it a little bit. The only difference is this class. We could then call the model SpanBertModel in the library, and add a SpanBertForPreTraining similar to BertForPreTraining that includes the heads necessary for pre-training.
  • add a script to the examples directory, which could be called [...] (similar to [...]). This can be based on the files defined here (Facebook open-sourced everything!).

If anyone is interested in contributing, let me know!
