Is there any BERT/RoBERTa-style model with RoPE embeddings on Hugging Face (for English)?
Or can someone point me to a GitHub repo or help with building one from scratch?
Thanks in advance.
Hmm… I don’t know anything about it, but it seems like there aren’t many English versions of ReFormer…
Hello,
I want to train a Reformer for a sequence classification task. The sequences are protein sequences, so I thought of training a new tokenizer and then loading it as a Reformer tokenizer, as defined below.
import sentencepiece as spm
from transformers import ReformerTokenizer

spm.SentencePieceTrainer.train(input='./sequences_scope.txt', model_prefix='REFORM', max_sentence_length=2000, vocab_size=25)
tokenizer = ReformerTokenizer("REFORM.model")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
The dataset was created as below -
with open("seque…
From a related GitHub issue (opened 26 Jun 2023, closed 14 Jul 2023):
### Feature request
Hello,
I would like, if possible, for Rotary Position Embedding scaling factors to be usable in the library. Currently this can only be done by monkey-patching the library.
Namely, it requires modifying the:
- `max_position_embeddings`: This can already be done via the model's config class or `config.json`
- `position_scale`: This variable doesn't exist currently, and there is no way to incorporate this effect at the moment without monkey-patching the existing `LlamaRotaryEmbedding` class. (I'd also like not to step on the toes of a possible future XPos implementation, which also uses its own scale for different purposes.)
### Motivation
Recently I demonstrated that it is possible to drastically reduce training compute when fine-tuning pre-trained RoPE models with an adjusted scaling factor, for the purpose of extending the model's context length. This has the effect of interpolating the position embeddings, making it easier to fine-tune the model on in-distribution positions as opposed to the out-of-distribution positions used by pure extrapolation. There is an extended write-up with motivations here: https://kaiokendev.github.io/context, and the code I used (for the 8K example) can be found here: https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test/blob/main/llama_rope_scaled_monkey_patch.py
Some existing discussions and benchmarks can be found here: https://github.com/ggerganov/llama.cpp/discussions/1965
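The core idea is small: positions are divided by a scale factor before the rotary angles are computed, so an extended context (e.g. 8192 tokens) is mapped back into the position range the model saw during pre-training (e.g. 2048 tokens). Below is a minimal sketch of that linear interpolation, not the exact monkey patch linked above; the function names are made up for illustration:

```python
import torch

def rope_cos_sin(dim: int, positions: torch.Tensor,
                 base: float = 10000.0, scale: float = 1.0):
    """Build RoPE cos/sin tables, optionally with position interpolation.

    scale < 1.0 compresses positions: scale = 2048 / 8192 maps position 8191
    back into the 0..2047 range used during pre-training.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(positions.float() * scale, inv_freq)  # interpolation happens here
    emb = torch.cat((angles, angles), dim=-1)
    return emb.cos(), emb.sin()

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, head_dim); cos/sin: (seq_len, head_dim)
    return x * cos + rotate_half(x) * sin

# Example: a model pre-trained on 2048 positions, extended to 8192.
positions = torch.arange(8192)
cos, sin = rope_cos_sin(dim=128, positions=positions, scale=2048 / 8192)
q = torch.randn(1, 8192, 128)
print(apply_rope(q, cos, sin).shape)  # torch.Size([1, 8192, 128])
```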
Several models currently use this scaling feature, but they will not produce coherent output unless the scale is applied correctly during inference (scale is a hyperparameter):
- https://huggingface.co/kaiokendev/superhot-30b-8k-no-rlhf-test
- https://huggingface.co/Peeepy/Airoboros-13b-SuperHOT-8k
- https://huggingface.co/emozilla/open_llama_7b-scaled
EDIT: Meta has recently written a paper about it: https://arxiv.org/abs/2306.15595
### Your contribution
I would love to help in any way possible. While the basic implementation would be easy, I'm not sure what the best way would be to add this modification (for example, whether users want to use a fixed scale versus having it applied dynamically based on the input sequence length).
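(Note for anyone finding this thread later: as far as I know, this was eventually exposed in transformers as a `rope_scaling` entry in the model config. A minimal sketch, assuming transformers >= 4.31 and using openlm-research/open_llama_7b purely as an example checkpoint:)

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "openlm-research/open_llama_7b"  # example checkpoint only

# Fixed linear scaling (position interpolation), as discussed above:
# factor=4.0 stretches a 2048-token pre-training range to roughly 8192 tokens.
config = AutoConfig.from_pretrained(
    model_id,
    rope_scaling={"type": "linear", "factor": 4.0},
)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)

# Alternatively, rope_scaling={"type": "dynamic", "factor": 2.0} applies
# NTK-aware scaling based on the input length instead of a fixed factor.
```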
So if I want to create a BERT model with RoPE from scratch, is there any help available?
Any GitHub repo, code, or other resources?
Maybe ModernBERT? It has RoPE built in.
Bringing BERT into modernity via both architecture changes and scaling
Implementation of BERT from scratch (both pre-training and fine-tuning) using RoPE embeddings.
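If ModernBERT covers the use case, here is a minimal loading sketch (assuming the answerdotai/ModernBERT-base checkpoint and a transformers release recent enough to include the ModernBERT architecture):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"  # RoPE-based BERT-style encoder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-in-the-blank example to confirm the checkpoint loads and runs.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)

# Find the [MASK] position and decode the top prediction for it.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = outputs.logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))
```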