Pretraining my own model

I want to modify the BERT architecture and pretrain it on my own data. Specifically, I want to:

  1. Use relative position embeddings instead of absolute position embeddings (sketch 1 below)
  2. Use SwiGLU instead of GELU in the feed-forward layers (sketch 2 below)
  3. Use FlashAttention instead of regular attention (sketch 3 below)
  4. Predict at every token position instead of just the ~15% masked ones (a tip from the ELECTRA paper; sketch 4 below).
  5. I saw whole word masking is supported somewhere in the :hugs: codebase. I’d like to use that to mask tokens too (sketch 5 below).
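To make the question more concrete, here is roughly what I have in mind for each point. These are just sketches of my current understanding, so please correct anything that is off.

For the relative position embeddings (point 1), it looks like `BertConfig` already supports the scheme from Shaw et al. (2018) through `position_embedding_type`, so I'm hoping this one is just a config change:

```python
from transformers import BertConfig, BertForMaskedLM

# BertConfig exposes relative position embeddings (Shaw et al., 2018) via
# `position_embedding_type` ("relative_key" / "relative_key_query"),
# so this point should be a config change rather than new code.
config = BertConfig(
    position_embedding_type="relative_key_query",  # default is "absolute"
)
model = BertForMaskedLM(config)
```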
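For SwiGLU (point 2), I was thinking of a small drop-in replacement for `BertIntermediate`. The class name `SwiGLUIntermediate` is mine, and the attribute path `model.bert.encoder.layer[i].intermediate` is just what I see in the current `modeling_bert`, so it may need adjusting:

```python
import torch
import torch.nn as nn

class SwiGLUIntermediate(nn.Module):
    """SwiGLU(x) = SiLU(x W_gate) * (x W_up), as in Shazeer (2020)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate_size)
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.act = nn.SiLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.act(self.gate(hidden_states)) * self.up(hidden_states)

# Patch it into each encoder layer of the `model` / `config` from sketch 1.
# BertOutput still projects intermediate_size back to hidden_size, so the
# rest of the layer shouldn't need changes (please correct me if that's wrong).
for layer in model.bert.encoder.layer:
    layer.intermediate = SwiGLUIntermediate(config.hidden_size, config.intermediate_size)
```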
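For flash attention (point 3), my current plan is to call PyTorch's fused `scaled_dot_product_attention` (available since 2.0) inside the self-attention module instead of the explicit softmax(QK^T / sqrt(d)) V; it dispatches to a FlashAttention kernel when the inputs allow it. I also saw that recent `transformers` releases accept an `attn_implementation` argument in `from_pretrained`, but I'm not sure the BERT classes support it in my version. I'm also unsure how this combines with the relative position variant from point 1, since that adds bias terms inside the attention scores; I assume they would have to be passed in as an additive `attn_mask`.

```python
import torch.nn.functional as F

def fused_attention(q, k, v, attn_mask=None, dropout_p=0.0):
    """What I'd call inside the self-attention module instead of the manual
    matmul + softmax + matmul. Shapes: (batch, num_heads, seq_len, head_dim).
    PyTorch picks a FlashAttention / memory-efficient kernel when it can."""
    return F.scaled_dot_product_attention(
        q, k, v, attn_mask=attn_mask, dropout_p=dropout_p
    )
```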
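For predicting at every position (point 4), my reading of the "all-tokens MLM" variant discussed in the ELECTRA paper is that the inputs are corrupted as usual but the loss is computed at every non-padding position. Here is the collator subclass I'd try; `AllTokenMLMCollator` is my own name, and `torch_mask_tokens` is the hook I see in the current `DataCollatorForLanguageModeling`, so this may vary by version:

```python
from transformers import DataCollatorForLanguageModeling

class AllTokenMLMCollator(DataCollatorForLanguageModeling):
    """Corrupt ~15% of tokens as usual, but keep labels at every position so
    the MLM loss is computed on all non-padding tokens ("all-tokens MLM")."""

    def torch_mask_tokens(self, inputs, special_tokens_mask=None):
        labels = inputs.clone()  # the original ids become the targets everywhere
        # Reuse the parent's 80/10/10 corruption on a copy of the inputs.
        corrupted, _ = super().torch_mask_tokens(
            inputs.clone(), special_tokens_mask=special_tokens_mask
        )
        # Only padding is ignored by the loss (-100); I'd probably also want
        # to ignore [CLS] / [SEP], which I've left out here for brevity.
        if self.tokenizer.pad_token_id is not None:
            labels[labels == self.tokenizer.pad_token_id] = -100
        return corrupted, labels
```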
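And for point 5, I believe the support I saw is `DataCollatorForWholeWordMask`, which masks all the WordPiece pieces of a sampled word together:

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Masks every WordPiece piece of a sampled word together instead of
# independent sub-tokens.
wwm_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
```

If I understand the Trainer API correctly, whichever collator I end up with would just be passed as `data_collator`; whether the whole-word collator and the all-tokens idea combine cleanly is one of the things I'm hoping to get pointers on.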
Any suggestions/resources on how to implement any of these points? Thanks!