“Don’t repeat yourself”, or DRY, is a well-known principle of software development. The principle originates from “The Pragmatic Programmer”, one of the most widely read books on code design. The principle’s simple message makes obvious sense: don’t rewrite logic that already exists somewhere else. This keeps the code in sync, making it easier to maintain and more robust. Any change to this shared logic uniformly affects all of its dependencies.
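As a toy illustration of DRY (the file and function names here are made up for the example), the shared logic lives in exactly one helper, so a fix in that helper reaches every caller:

# text_utils.py - the one and only place where the normalization logic lives
def normalize_text(text: str) -> str:
    """Lowercase and strip whitespace; every caller shares this single definition."""
    return text.strip().lower()

# Both the tokenizer code and the dataset code call the same helper instead of
# re-implementing it, so a bug fix in normalize_text() propagates to all dependencies.
def tokenize(text: str) -> list[str]:
    return normalize_text(text).split()

def dataset_key(text: str) -> str:
    return normalize_text(text)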
At first glance, the design of Hugging Face’s Transformers library couldn’t be more contrary to the DRY principle. Code for the attention mechanism is copied more or less verbatim into over 50 different model files. Sometimes the code of the whole BERT model is copied into other model files. We often force new model contributions that are identical to existing models - apart from a small logical tweak - to copy all of the existing code. Why do we do this? Are we just too lazy or overwhelmed to centralize all logical pieces in one place?
No, we are not lazy - it’s a very conscious decision not to apply the DRY design principle to the Transformers library. Instead, we decided to adopt a different design principle, which we like to call the single model file policy. The single model file policy states that all code necessary for the forward pass of a model is in one and only one file - called the model file. If a reader wants to understand how BERT works for inference, she should only have to look into BERT’s modeling_bert.py file. We usually reject any attempt to abstract identical sub-components of different models into a new centralized place. We don’t want an attention_layer.py that includes all possible attention mechanisms. Again, why do we do this?
In short the reasons are:
1. Transformers is built for and by the open-source community.
2. Our product is models and our customers are users reading or tweaking model code.
3. The field of machine learning evolves extremely fast.
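To make the single model file policy concrete, here is a simplified sketch of the same attention layer living in two model files rather than in a shared attention_layer.py (the real classes are much longer than this):

import torch.nn as nn

# modeling_bert.py - everything needed for BERT's forward pass lives in this one file
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.query = nn.Linear(config.hidden_size, config.hidden_size)
        self.key = nn.Linear(config.hidden_size, config.hidden_size)
        self.value = nn.Linear(config.hidden_size, config.hidden_size)

# modeling_roberta.py - the same layer is copied, not imported, so a reader of the
# RoBERTa file never has to open another file or a centralized attention_layer.py
class RobertaSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.query = nn.Linear(config.hidden_size, config.hidden_size)
        self.key = nn.Linear(config.hidden_size, config.hidden_size)
        self.value = nn.Linear(config.hidden_size, config.hidden_size)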
I really like this idea, especially as someone who has struggled to understand all the layers of abstraction in other libraries.
I think all the DRY use cases can be left to the PyTorch/TF libraries, where the level of abstraction is consistent enough that it will apply to ML research for years to come (e.g. I don’t see autograd going away anytime soon).
I wonder if some combination of composition and codegen might achieve some of the accessibility goals while improving readability.
Something like:
# LMHead = make_lm_head(model, position_embeddings=Fancy)
class MyNewModelLMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = MyNewModelTransformer(config)
        self.lm_head = nn.Linear(...)
        # [Fancy position embedding init code]

    def forward(self, all_the_args):
        # [Fancy position embedding input prep]
        transformer_output = self.transformer(all_the_args)
        if labels: ...
        return ...
Then the same sort of mechanism as “copied from” runs to generate the actual class by composing templates. So when the templates change, a PR gets opened to apply those changes to the generated code.
Major challenges include getting the abstractions right and keeping the generated code readable. Both of those are seriously hard problems, so maybe this idea just can’t fly.
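To flesh out the suggestion, here is a purely hypothetical sketch of what such a template-composition step could look like - none of these names (make_lm_head, PositionEmbeddingTemplate, Fancy) exist in Transformers, and the point is generating readable source code rather than adding runtime abstraction:

from dataclasses import dataclass

@dataclass
class PositionEmbeddingTemplate:
    init_code: str     # lines spliced into the generated __init__
    forward_code: str  # lines spliced into the generated forward

Fancy = PositionEmbeddingTemplate(
    init_code="        # [Fancy position embedding init code]",
    forward_code="        # [Fancy position embedding input prep]",
)

CLASS_TEMPLATE = """\
class {model}LMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = {model}Transformer(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
{position_init}

    def forward(self, input_ids, labels=None):
{position_prep}
        transformer_output = self.transformer(input_ids)
        logits = self.lm_head(transformer_output)
        return logits
"""

def make_lm_head(model: str, position_embeddings: PositionEmbeddingTemplate) -> str:
    """Render the full, human-readable source for <model>LMHead from the templates."""
    return CLASS_TEMPLATE.format(
        model=model,
        position_init=position_embeddings.init_code,
        position_prep=position_embeddings.forward_code,
    )

# The generated source would be committed to the model file; when a template changes,
# an automated PR re-runs make_lm_head and applies the diff to the generated code.
print(make_lm_head("MyNewModel", position_embeddings=Fancy))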
“Getting the abstractions right” - in a fast-moving field it never happens. At best it is transitory, unless you abstract something that is truly canonical, the one and only. One thing I hate about abstraction is that the implementation can silently change under the hood, especially when there are 100 models that subscribe to the abstraction. As a model owner, how can I rest assured that my model is set in stone? And the “beautiful” feeling is really a bubble. A single model file, on the other hand, is much more immutable and holds far fewer surprises - and the ugly is given upfront.
That’s a fair point! The reason for 1000+ lines of code is usually the multiple heads that are supported by the model. The most important code to understand is always the base ...Model class.
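For example, the BERT model file bundles several task-specific head classes that all wrap the same backbone (the exact set of heads varies per model), which is where most of the line count comes from:

from transformers import (
    BertModel,                      # the core architecture - the class worth reading first
    BertForMaskedLM,                # BertModel + a masked-LM head
    BertForSequenceClassification,  # BertModel + a classification head
    BertForQuestionAnswering,       # BertModel + a span-prediction head
)

# Each head class is a thin wrapper that instantiates BertModel and adds a few layers,
# but all of them live together in modeling_bert.py, which inflates the file's length.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
print(type(model.bert).__name__)  # BertModel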
Cool idea! We do think it’s important that all the necessary code is already generated in each file since we don’t expect readers to know about mechanisms like the # Copied from ...
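For readers who haven’t come across it, # Copied from is a comment convention enforced by the repository’s consistency check; a simplified illustration of how it appears in a model file (the real classes and comments may differ slightly):

import torch.nn as nn

# In modeling_roberta.py: the comment tells the consistency check that this class must stay
# identical to the BERT version, so the copy cannot silently drift from its source - yet the
# reader still sees all of the code in this one file.

# Copied from transformers.models.bert.modeling_bert.BertSelfOutput
class RobertaSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states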
This point is the hardest one for me to wrap my head around. Not that I’m disputing it - it’s just so different from the fields I’ve worked in. There, very few things are static, and the product lifecycles look nothing like that of a journal article.
The pace of change is faster in some parts of code than others. Some parts get more cobwebs and neglect. But it’s so rare that I can ever say “That one’s done! We’ll never need to change it again.” about anything I’ve worked on.
I think sometimes there is no difference between your RY approach with copied methods (without options) and plain default imports; maybe the point is just about having a single file, but that is a small benefit with the current GitHub UI. On the other hand, copies with options can help avoid abstractions or an overflow of parameters. But in exchange you get a problem when you need to find the difference between two models: in the DRY way you have very small model files with an easily visible diff, while in the RY way the files are too big to find anything in.
I think you can isolate models without putting everything in one file, like in FSD (feature-sliced design), where every transformer is a feature and you can divide parts of your transformer into separate files. Then, when something like “Rethinking Attention with Performers” comes along, you can just create a new file with different logic for one of the functions in the transformer and choose what you need in the main file.
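A rough sketch of that idea (a purely hypothetical layout, not how Transformers is organized) - each attention variant is its own feature file and the main file just picks one:

import torch.nn as nn

# my_transformer/attention.py - the default attention lives in its own feature file
class SoftmaxAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.qkv = nn.Linear(config.hidden_size, 3 * config.hidden_size)

# my_transformer/attention_performer.py - added later, after “Rethinking Attention with Performers”
class PerformerAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.qkv = nn.Linear(config.hidden_size, 3 * config.hidden_size)  # plus the kernelized attention trick

# my_transformer/model.py - the main file chooses which attention feature to wire in
class MyTransformerLayer(nn.Module):
    def __init__(self, config, attention_cls=SoftmaxAttention):  # or attention_cls=PerformerAttention
        super().__init__()
        self.attention = attention_cls(config)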