I want to implement different types of attention architectures and run experiments with them on pretrained LMs, ideally small ones under 500M parameters.
I could start with GPT-2, T5, and Pythia, but I’m looking for more options. I tried looking into SmolLM, but I could not find its implementation on GitHub, so replicating it and loading in the weights looks impossible.
Could you guys please recommend some other small LMs I could use for this purpose? Or maybe provide a URL to the place I could find the SmolLM implementation in torch?
Hi there!
If you’re looking to experiment with different attention architectures on small pretrained language models under 500M parameters, there are several great options available. GPT-2 is a solid starting point: its checkpoints range from 124M up to 1.5B parameters, and the 124M and 355M variants fit your budget. It’s also well-supported in the Hugging Face transformers library, making it easy to customize. Similarly, T5 has smaller variants like T5-small (60M parameters) and T5-base (220M parameters) that are perfect for experimenting with encoder-decoder architectures. Pythia is another excellent choice, especially its smaller variants such as the 70M or 160M models, which are designed with reproducibility in mind.
Beyond these, you might also consider DistilGPT-2, a distilled version of GPT-2 with only 82M parameters, which is faster and smaller while still retaining much of its capability. DistilBERT is another lightweight model at 66M parameters and works well if you’re interested in encoder-based architectures. Other good options include smaller variants of Meta’s OPT, such as the 125M or 350M models, and EleutherAI’s GPT-Neo 125M, which is an open-source GPT-style model.
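If it helps, here’s a minimal sketch of how these checkpoints load through transformers and how to confirm their sizes. The model IDs below are the usual Hugging Face Hub names (e.g. EleutherAI/pythia-160m, facebook/opt-125m), so double-check them in case any have been renamed:

```python
# Minimal sketch: load a few of the sub-500M checkpoints mentioned above
# and print their parameter counts. Hub IDs are assumptions; verify them.
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

decoder_only = [
    "gpt2",                      # GPT-2 small
    "distilgpt2",                # distilled GPT-2
    "EleutherAI/pythia-160m",    # Pythia
    "facebook/opt-125m",         # OPT
    "EleutherAI/gpt-neo-125m",   # GPT-Neo
]

for name in decoder_only:
    model = AutoModelForCausalLM.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")

# T5 is an encoder-decoder, so it goes through the seq2seq auto class instead.
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
print(f"t5-small: {sum(p.numel() for p in t5.parameters()) / 1e6:.0f}M parameters")
```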
Regarding SmolLM, it seems like the implementation isn’t currently available on GitHub, and without a reference implementation to load the published weights into, replicating it would be quite difficult. Unless the authors release one officially, the models mentioned above are excellent alternatives.
To modify attention mechanisms, the Hugging Face transformers library would be ideal. It lets you load a pretrained model and then override its attention layers or other parts of the architecture. Base classes like PreTrainedModel make it straightforward to build custom model classes, and libraries like Accelerate or PyTorch Lightning can help you manage experiments efficiently.
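As a rough illustration of that pattern (just one way to do it, assuming GPT-2’s usual module layout in transformers, where each decoder block lives in model.transformer.h and exposes an attn submodule), you can wrap or replace each block’s attention with your own nn.Module and reuse the pretrained projection weights:

```python
# Sketch of swapping GPT-2's attention modules for a custom one.
# AttentionWrapper is a hypothetical name; here it only delegates to the
# original attention, which is where a custom computation would go.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, GPT2LMHeadModel

class AttentionWrapper(nn.Module):
    def __init__(self, original_attn):
        super().__init__()
        self.original_attn = original_attn  # keeps the pretrained c_attn / c_proj weights

    def forward(self, *args, **kwargs):
        # For a real experiment, replace this delegation with your own
        # attention computation, reusing self.original_attn.c_attn and
        # self.original_attn.c_proj so the pretrained weights still apply.
        return self.original_attn(*args, **kwargs)

model = GPT2LMHeadModel.from_pretrained("gpt2")
for block in model.transformer.h:
    block.attn = AttentionWrapper(block.attn)

# Quick sanity check that the patched model still runs.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Attention is all you need", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
print(out.logits.shape)  # (1, seq_len, vocab_size)
```

The *args/**kwargs delegation keeps the wrapper agnostic to the exact forward signature, which has shifted across transformers versions; the same swap-the-submodule idea carries over to Pythia, OPT, or GPT-Neo with their respective block layouts.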
If you come across more details on SmolLM or need help with specific modifications to these models, feel free to share! Hope this helps!
Thank you so much for the comprehensive reply! I totally forgot about OPT and GPT-Neo; they’re also well cited in the literature, so I will definitely be testing on them too. I also found this paper that does a survey of small LMs, so there are bound to be some hidden nuggets in there: https://arxiv.org/pdf/2501.05465
Kinda disappointing that the SmolLM code isn’t there. There’s a similar situation with IBM’s Granite models, where there are open weights but the implementations aren’t available online, which is really weird.
Anyways I think that’s enough for me to work off.
Again, thank you very much!