“Don’t repeat yourself”, or DRY, is a well-known principle of software development. The principle originates from “The Pragmatic Programmer”, one of the most widely read books on code design. The principle’s simple message makes obvious sense: don’t rewrite logic that already exists somewhere else. This keeps the code in sync, making it easier to maintain and more robust. Any change to this shared logic uniformly affects all of its dependencies.
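As a toy illustration of DRY (the file and function names here are made up for the example), the shared logic lives in exactly one helper, so a fix in that helper reaches every caller:

# text_utils.py - the one and only place where the normalization logic lives
def normalize_text(text: str) -> str:
    """Lowercase and strip whitespace; every caller shares this single definition."""
    return text.strip().lower()

# Both the tokenizer code and the dataset code call the same helper instead of
# re-implementing it, so a bug fix in normalize_text() propagates to all dependencies.
def tokenize(text: str) -> list[str]:
    return normalize_text(text).split()

def dataset_key(text: str) -> str:
    return normalize_text(text)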
At first glance, the design of Hugging Face’s Transformers library couldn’t be more contrary to the DRY principle. Code for the attention mechanism is copied more or less verbatim into over 50 different model files. Sometimes the code of the whole BERT model is copied into other model files. We often force new model contributions that are identical to existing models - apart from a small logical tweak - to copy all of the existing code. Why do we do this? Are we just too lazy or overwhelmed to centralize all logical pieces in one place?
No, we are not lazy - it’s a very conscious decision not to apply the DRY design principle to the Transformers library. Instead, we decided to adopt a different design principle, which we like to call the single model file policy. The single model file policy states that all code necessary for the forward pass of a model is in one and only one file - called the model file. If a reader wants to understand how BERT works for inference, she should only have to look into BERT’s modeling_bert.py file. We usually reject any attempt to abstract identical sub-components of different models into a new centralized place. We don’t want an attention_layer.py that includes all possible attention mechanisms. Again, why do we do this?
In short the reasons are:
1. Transformers is built for and by the open-source community.
2. Our product is models and our customers are users reading or tweaking model code.
3. The field of machine learning evolves extremely fast.
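To make the single model file policy concrete, here is a simplified sketch of the same attention layer living in two model files rather than in a shared attention_layer.py (the real classes are much longer than this):

import torch.nn as nn

# modeling_bert.py - everything needed for BERT's forward pass lives in this one file
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.query = nn.Linear(config.hidden_size, config.hidden_size)
        self.key = nn.Linear(config.hidden_size, config.hidden_size)
        self.value = nn.Linear(config.hidden_size, config.hidden_size)

# modeling_roberta.py - the same layer is copied, not imported, so a reader of the
# RoBERTa file never has to open another file or a centralized attention_layer.py
class RobertaSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.query = nn.Linear(config.hidden_size, config.hidden_size)
        self.key = nn.Linear(config.hidden_size, config.hidden_size)
        self.value = nn.Linear(config.hidden_size, config.hidden_size)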
I really like this idea, especially as someone who has struggled to understand all the layers of abstraction in other libraries.
I think all the DRY use cases can be left to the PyTorch/TF libraries, where the level of abstraction is consistent enough that it will apply to ML research for years to come (e.g. I don’t see autograd going away anytime soon).
I wonder if some combination of composition and codegen might achieve some of the accessibility goals while improving readability.
Something like:
# LMHead = make_lm_head(model, position_embeddings=Fancy)
class MyNewModelLMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = MyNewModelTransformer(config)
        self.lm_head = nn.Linear(...)
        # [Fancy position embedding init code]

    def forward(self, all_the_args):
        # [Fancy position embedding input prep]
        transformer_output = self.transformer(all_the_args)
        if labels: ...
        return ...
Then the same sort of mechanism as “copied from” runs to generate the actual class by composing templates. So when the templates change, a PR gets opened to apply those changes to the generated code.
Major challenges include getting the abstractions right and keeping the generated code readable. Both of those are seriously hard problems, so maybe this idea just can’t fly.
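To flesh out the suggestion, here is a purely hypothetical sketch of what such a template-composition step could look like - none of these names (make_lm_head, PositionEmbeddingTemplate, Fancy) exist in Transformers, and the point is generating readable source code rather than adding runtime abstraction:

from dataclasses import dataclass

@dataclass
class PositionEmbeddingTemplate:
    init_code: str     # lines spliced into the generated __init__
    forward_code: str  # lines spliced into the generated forward

Fancy = PositionEmbeddingTemplate(
    init_code="        # [Fancy position embedding init code]",
    forward_code="        # [Fancy position embedding input prep]",
)

CLASS_TEMPLATE = """\
class {model}LMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = {model}Transformer(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
{position_init}

    def forward(self, input_ids, labels=None):
{position_prep}
        transformer_output = self.transformer(input_ids)
        logits = self.lm_head(transformer_output)
        return logits
"""

def make_lm_head(model: str, position_embeddings: PositionEmbeddingTemplate) -> str:
    """Render the full, human-readable source for <model>LMHead from the templates."""
    return CLASS_TEMPLATE.format(
        model=model,
        position_init=position_embeddings.init_code,
        position_prep=position_embeddings.forward_code,
    )

# The generated source would be committed to the model file; when a template changes,
# an automated PR re-runs make_lm_head and applies the diff to the generated code.
print(make_lm_head("MyNewModel", position_embeddings=Fancy))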
“Getting the abstractions right” - in a fast-moving field it never happens. At best it is transitory, unless you abstract something that is truly canonical, the one and only. One thing I hate about abstraction is that the implementation can silently change under the hood, especially when there are 100 models that subscribe to the abstraction. As a model owner, how can I rest assured that my model is set in stone? And the “beautiful” feeling is really a bubble. A single model file, on the other hand, is much more immutable and holds far fewer surprises - and the ugly is given upfront.
That’s a fair point! The reason for 1000+ lines of code is usually the multiple heads that are supported by the model. The most important code to understand is always the base ...Model class.
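For example, the BERT model file bundles several task-specific head classes that all wrap the same backbone (the exact set of heads varies per model), which is where most of the line count comes from:

from transformers import (
    BertModel,                      # the core architecture - the class worth reading first
    BertForMaskedLM,                # BertModel + a masked-LM head
    BertForSequenceClassification,  # BertModel + a classification head
    BertForQuestionAnswering,       # BertModel + a span-prediction head
)

# Each head class is a thin wrapper that instantiates BertModel and adds a few layers,
# but all of them live together in modeling_bert.py, which inflates the file's length.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
print(type(model.bert).__name__)  # BertModel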
Cool idea! We do think it’s important that all the necessary code is already generated in each file since we don’t expect readers to know about mechanisms like the # Copied from ...
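For readers who haven’t come across it, # Copied from is a comment convention enforced by the repository’s consistency check; a simplified illustration of how it appears in a model file (the real classes and comments may differ slightly):

import torch.nn as nn

# In modeling_roberta.py: the comment tells the consistency check that this class must stay
# identical to the BERT version, so the copy cannot silently drift from its source - yet the
# reader still sees all of the code in this one file.

# Copied from transformers.models.bert.modeling_bert.BertSelfOutput
class RobertaSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states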
This point is the hardest one for me to wrap my head around. Not that I’m disputing it - it’s just so different from the fields I’ve worked in. There, very few things are static, and the product lifecycles look nothing like that of a journal article.
The pace of change is faster in some parts of code than others. Some parts get more cobwebs and neglect. But it’s so rare that I can ever say “That one’s done! We’ll never need to change it again.” about anything I’ve worked on.
I think sometimes there is no difference between your RY approach with copied methods (without options) and plain default imports; maybe the point is just about having a single file, but that is a small benefit with the current GitHub UI. On the other hand, copies with options can help avoid abstractions or an overflow of parameters. But in exchange you get a problem when you need to find the difference between two models: in the DRY way you have very small model files with an easily visible diff, while in the RY way the files are too big to find anything in.
I think you can isolate models without putting everything in one file, like in FSD (feature-sliced design), where every transformer is a feature and you can divide parts of your transformer into separate files. Then, when something like “Rethinking Attention with Performers” comes along, you can just create a new file with different logic for one of the functions in the transformer and choose what you need in the main file.
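A rough sketch of that idea (a purely hypothetical layout, not how Transformers is organized) - each attention variant is its own feature file and the main file just picks one:

import torch.nn as nn

# my_transformer/attention.py - the default attention lives in its own feature file
class SoftmaxAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.qkv = nn.Linear(config.hidden_size, 3 * config.hidden_size)

# my_transformer/attention_performer.py - added later, after “Rethinking Attention with Performers”
class PerformerAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.qkv = nn.Linear(config.hidden_size, 3 * config.hidden_size)  # plus the kernelized attention trick

# my_transformer/model.py - the main file chooses which attention feature to wire in
class MyTransformerLayer(nn.Module):
    def __init__(self, config, attention_cls=SoftmaxAttention):  # or attention_cls=PerformerAttention
        super().__init__()
        self.attention = attention_cls(config)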