Is there any BERT/RoBERTa-style model with RoPE embeddings on Hugging Face (for English)?
Or can someone point me to a GitHub repo or help with building one from scratch?
Thanks in advance.
Hmm… I don’t know anything about it, but it seems like there aren’t many English versions of ReFormer…
Hello,
I want to train a Reformer for a sequence classification task. The sequences are protein sequences, so I thought of training a new tokenizer and then loading it as a Reformer tokenizer, as defined below.
import sentencepiece as spm
from transformers import ReformerTokenizer

spm.SentencePieceTrainer.train(input='./sequences_scope.txt', model_prefix='REFORM', max_sentence_length=2000, vocab_size=25)
tokenizer = ReformerTokenizer("REFORM.model")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
The dataset was created as below -
with open("seque…
From a related GitHub issue (opened 26 Jun 2023, closed 14 Jul 2023):
### Feature request
Hello,
I would like, if possible, for Rotary Position Embedding scaling factors to be usable in the library. Currently this can only be done by monkey-patching the library.
Namely, it requires modifying the:
- `max_position_embeddings`: This can already be done via the model's config class or `config.json`
- `position_scale`: This variable doesn't exist currently, and there is no way to incorporate this effect at the moment without monkey-patching the existing `LlamaRotaryEmbedding` class. (I'd also like not to step on the toes of a possible future XPos implementation, which also uses its own scale for different purposes.)
### Motivation
Recently I demonstrated that it is possible to drastically reduce training compute when fine-tuning pre-trained RoPE models with an adjusted scaling factor, for the purpose of extending the model's context length. This has the effect of interpolating the position embeddings, making it easier to fine-tune the model on in-distribution positions as opposed to the out-of-distribution positions used by pure extrapolation. There is an extended write-up with motivations here: https://kaiokendev.github.io/context, and the code I used (for the 8K example) can be found here: https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test/blob/main/llama_rope_scaled_monkey_patch.py
Some existing discussions and benchmarks can be found here: https://github.com/ggerganov/llama.cpp/discussions/1965
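The core idea is small: positions are divided by a scale factor before the rotary angles are computed, so an extended context (e.g. 8192 tokens) is mapped back into the position range the model saw during pre-training (e.g. 2048 tokens). Below is a minimal sketch of that linear interpolation, not the exact monkey patch linked above; the function names are made up for illustration:

```python
import torch

def rope_cos_sin(dim: int, positions: torch.Tensor,
                 base: float = 10000.0, scale: float = 1.0):
    """Build RoPE cos/sin tables, optionally with position interpolation.

    scale < 1.0 compresses positions: scale = 2048 / 8192 maps position 8191
    back into the 0..2047 range used during pre-training.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(positions.float() * scale, inv_freq)  # interpolation happens here
    emb = torch.cat((angles, angles), dim=-1)
    return emb.cos(), emb.sin()

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, head_dim); cos/sin: (seq_len, head_dim)
    return x * cos + rotate_half(x) * sin

# Example: a model pre-trained on 2048 positions, extended to 8192.
positions = torch.arange(8192)
cos, sin = rope_cos_sin(dim=128, positions=positions, scale=2048 / 8192)
q = torch.randn(1, 8192, 128)
print(apply_rope(q, cos, sin).shape)  # torch.Size([1, 8192, 128])
```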
Several models currently use this scaling feature, but they will not produce coherent output unless the scale is applied correctly during inference (scale is a hyperparameter):
- https://huggingface.co/kaiokendev/superhot-30b-8k-no-rlhf-test
- https://huggingface.co/Peeepy/Airoboros-13b-SuperHOT-8k
- https://huggingface.co/emozilla/open_llama_7b-scaled
EDIT: Meta has recently written a paper about it: https://arxiv.org/abs/2306.15595
### Your contribution
I would love to help in any way possible. While the basic implementation would be easy, I'm not sure what the best way would be to add this modification (for example, whether users want to use a fixed scale versus having it applied dynamically based on the input sequence length).
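(Note for anyone finding this thread later: as far as I know, this was eventually exposed in transformers as a `rope_scaling` entry in the model config. A minimal sketch, assuming transformers >= 4.31 and using openlm-research/open_llama_7b purely as an example checkpoint:)

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "openlm-research/open_llama_7b"  # example checkpoint only

# Fixed linear scaling (position interpolation), as discussed above:
# factor=4.0 stretches a 2048-token pre-training range to roughly 8192 tokens.
config = AutoConfig.from_pretrained(
    model_id,
    rope_scaling={"type": "linear", "factor": 4.0},
)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)

# Alternatively, rope_scaling={"type": "dynamic", "factor": 2.0} applies
# NTK-aware scaling based on the input length instead of a fixed factor.
```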
So if I want to create a BERT model with RoPE from scratch, is there any help available?
Any GitHub repo, code, or other resources?
Maybe ModernBERT? It has RoPE built in.
Bringing BERT into modernity via both architecture changes and scaling
Implementation of BERT from scratch (both pre-training and fine-tuning) using RoPE embeddings.
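If ModernBERT covers the use case, here is a minimal loading sketch (assuming the answerdotai/ModernBERT-base checkpoint and a transformers release recent enough to include the ModernBERT architecture):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"  # RoPE-based BERT-style encoder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-in-the-blank example to confirm the checkpoint loads and runs.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)

# Find the [MASK] position and decode the top prediction for it.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = outputs.logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))
```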