I want to adjust the BERT model and pretrain it with my data. Specifically, I want to:
- Use relative position embeddings instead of absolute position embeddings
- Use SwiGLU instead of GELU in the feed-forward layers
- Use flash attention instead of regular attention
- Predict at every token position instead of just the 15% that are masked (a tip from the ELECTRA paper)
- I saw whole-word masking is supported somewhere in the codebase; I'd like to use that when masking tokens too.
Any suggestions/resources on how to implement any of these points? Thanks!
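For the first point, one common approach is a T5-style learned relative position bias that is added to the attention logits instead of adding absolute position embeddings to the token embeddings. Here's a minimal sketch in PyTorch (simplified: it clamps offsets rather than using T5's log-spaced buckets; the class and parameter names are my own, not from any particular codebase):

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learned relative position bias added to attention logits.

    Replaces absolute position embeddings: each (query, key) offset
    q - k gets a learned per-head scalar. Offsets beyond max_distance
    are clamped (T5 instead buckets them logarithmically).
    """
    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        # One embedding row per clamped offset in [-max_distance, max_distance]
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]  # (q, k) signed offsets
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        # (seq, seq, heads) -> (heads, seq, seq), ready to add to attn logits
        return self.bias(rel).permute(2, 0, 1)
```

The returned `(heads, seq, seq)` tensor is added to the raw attention scores before the softmax, in every layer (the bias can be shared across layers or per-layer).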
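For the SwiGLU swap, the feed-forward block becomes a gated unit: `W_down(SiLU(W_gate x) * W_up x)` (Shazeer, "GLU Variants Improve Transformer", 2020). A minimal drop-in sketch for BERT's FFN (names are illustrative; note people often scale the hidden size by ~2/3 to keep the parameter count comparable to the original 4·d GELU FFN, since there are now three weight matrices):

```python
import torch
import torch.nn as nn

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: SiLU(W_gate x) * (W_up x), then W_down."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))
```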
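For flash attention, assuming PyTorch >= 2.0 you may not need a separate library: `torch.nn.functional.scaled_dot_product_attention` dispatches to a FlashAttention kernel on supported GPUs (and falls back to a memory-efficient or reference implementation otherwise), so the change is just replacing the explicit `softmax(QK^T/sqrt(d))V` computation:

```python
import torch
import torch.nn.functional as F

# Inputs are (batch, heads, seq, head_dim); no attention matrix is
# materialized explicitly by the fused kernels.
q = torch.randn(2, 12, 128, 64)
k = torch.randn(2, 12, 128, 64)
v = torch.randn(2, 12, 128, 64)
out = F.scaled_dot_product_attention(q, k, v)
```

One caveat to check for your setup: combining this with an additive relative position bias forces the general (non-flash) path on some backends, since FlashAttention kernels don't accept arbitrary bias tensors in all versions.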
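For predicting at every position, the loss change itself is small: keep the original (uncorrupted) token ids as labels for all positions and compute cross-entropy over the whole sequence instead of only at masked positions. A sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

vocab_size = 100
logits = torch.randn(2, 16, vocab_size)          # LM-head output for every position
labels = torch.randint(0, vocab_size, (2, 16))   # original token ids, all positions

# Standard MLM would set labels to -100 (ignore_index) at unmasked
# positions; here every position contributes to the loss.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
```

Note that ELECTRA's full recipe is a generator/discriminator setup (replaced-token detection), which is more involved than this dense-MLM variant; the snippet only shows the "loss over all positions" part.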
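For whole-word masking, the idea is to sample at the word level: if any sub-token of a word is selected, all of its sub-tokens get masked together. A self-contained sketch (the 15% rate and the `word_ids` input, which maps each sub-token to its word index with `None` for special tokens as Hugging Face fast tokenizers' `word_ids()` does, are my assumptions):

```python
import random

def whole_word_mask(word_ids, mask_prob=0.15, seed=0):
    """Return a boolean mask over sub-tokens, masking whole words.

    word_ids: per-sub-token word index, None for special tokens,
              e.g. [None, 0, 0, 1, 2, 2, None].
    """
    rng = random.Random(seed)
    words = sorted({w for w in word_ids if w is not None})
    chosen = {w for w in words if rng.random() < mask_prob}
    # A sub-token is masked iff its whole word was chosen; specials never are.
    return [w is not None and w in chosen for w in word_ids]
```

Whatever is producing the mask, the downstream 80/10/10 corruption (mask token / random token / keep) can stay as in standard BERT.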