Hey everyone! I am incredibly grateful for this tutorial on training a language model from scratch: How to train a new language model from scratch using Transformers and Tokenizers
I really want to expand this to contiguous masking of longer token sequences (e.g. [mask-5], [mask-8]). I have begun looking into how to write a custom DataCollator for this, but I suspect I will also need to make some changes to the model.
Has anyone looked into this and can point me to any resources?
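In case it helps anyone thinking about the DataCollator side: here is a minimal, framework-free sketch of contiguous span masking. All names (`mask_contiguous_spans`, `MASK_ID`, `mask_ratio`, `max_span_len`) are my own illustrative choices, not an existing API; a real collator would operate on batched tensors and plug into the `Trainer`, but the core idea is just this.

```python
import random

MASK_ID = 103        # assumption: BERT's [MASK] token id
IGNORE_INDEX = -100  # positions ignored by the MLM cross-entropy loss

def mask_contiguous_spans(token_ids, mask_ratio=0.15, max_span_len=8, rng=None):
    """Replace contiguous spans of tokens with MASK_ID until roughly
    mask_ratio of the sequence is masked.

    Returns (masked_ids, labels): labels hold the original ids at the
    masked positions and IGNORE_INDEX everywhere else, following the
    usual masked-LM convention.
    """
    rng = rng or random.Random()
    masked = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    budget = max(1, int(len(token_ids) * mask_ratio))
    while budget > 0:
        # pick a span length, capped by the remaining masking budget
        span = min(rng.randint(1, max_span_len), budget)
        start = rng.randrange(0, len(token_ids) - span + 1)
        for j in range(start, start + span):
            labels[j] = token_ids[j]
            masked[j] = MASK_ID
        budget -= span
    return masked, labels
```

With standard BERT-style heads the model itself would not strictly need changes for this to train; the loss simply sees more structured masks. SpanBERT's extra span-boundary objective is where model changes come in.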
Found something useful on StackOverflow for this:
This tutorial on Keras Code Examples is the most useful thing I have found so far on this:
Hello Connor, great to see you here! I won't be able to help you myself, so I'm going to ping @nielsr here; maybe he can help. Sorry for the delay!
I’ve seen that SpanBERT models are on the hub, but we haven’t added the model itself yet to the library.
This would be a great project actually:
- contribute SpanBERT to HuggingFace Transformers, based on the modeling file. This will be relatively easy, as the authors already used HuggingFace’s implementation of BERT and tweaked it a little bit. The only difference is this class. We could then call the model `SpanBertModel` in the library, and add a `SpanBertForPreTraining` similar to `BertForPreTraining` that includes the heads necessary for pre-training.
- add a script to the examples directory, which could be called run_span_mlm.py (similar to run_mlm.py). This can be based on the files defined here (Facebook open-sourced everything!).
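For reference, the SpanBERT paper samples each span length from a geometric distribution Geo(p=0.2), truncated at a maximum length of 10, giving a mean span length of about 3.8. A small sketch of that sampling step (function name and defaults are mine; the official repo's implementation may differ in details):

```python
import random

def sample_span_length(p=0.2, max_len=10, rng=None):
    """Sample a span length from Geo(p), resampling any draw above
    max_len (truncated geometric, as described in the SpanBERT paper)."""
    rng = rng or random.Random()
    while True:
        length = 1
        # each extra token is added with probability (1 - p)
        while rng.random() > p:
            length += 1
        if length <= max_len:
            return length
```

A `run_span_mlm.py` collator could call something like this per span, then pick a random start position and mask `length` consecutive tokens, repeating until the target fraction of the sequence is masked.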
If anyone is interested in contributing, let me know!