[Proposal] Copy Pasting modeling_bart.py

I really like keeping all four models combined in one file (as is the case right now). I believe this is much better than copy-pasting. In addition to the software engineering advantages around maintainability mentioned above (no small thing for a library growing as fast as yours), I want to add two points from personal experience:

  • Easier to understand and compare models: Here’s a recent use case. I have a working model that uses BART and wanted to try Pegasus. Running the same training code produces NaNs with fp16. With a quick diff of the two configs (see the sketch after this list), I can spot the design changes that might be causing the fp16 issues. Doing the same comparison with T5 (implemented in a separate file) takes many hours of code reading to find the differences, versus one minute to diff the configs.

  • Easier to try new ideas: I was recently experimenting with adding Longformer self-attention to various models in the HF/transformers repo. Once I implemented it for BART, I got it for free for Pegasus and the other models sharing the same file. By contrast, check the Longformer GitHub repo for the number of issues requesting Longformer self-attention for BERT, ALBERT, ELECTRA, etc.; each one requires its own implementation, and each one is considerably more work. Many similar use cases exist: experimenting with a new activation function, a new position embedding, a new type of layer normalization, etc.
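
Here is a minimal sketch of the kind of config diff I mean, using the default BartConfig and PegasusConfig from transformers (for a real comparison you would load the configs of specific checkpoints with from_pretrained):

from transformers import BartConfig, PegasusConfig

# Compare the default configurations of two models that share modeling_bart.py.
bart = BartConfig().to_dict()
pegasus = PegasusConfig().to_dict()

for key in sorted(set(bart) | set(pegasus)):
    if bart.get(key) != pegasus.get(key):
        print(f"{key}: bart={bart.get(key)!r}  pegasus={pegasus.get(key)!r}")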

I also want to comment on some of the arguments for the copy-paste mentioned above:

"delete 6 configuration parameters and delete 6 corresponding if statements"

I think the benefits mentioned above outweigh the gains from deleting these lines. Also, remember that deleting these lines comes at the cost of a massive amount of duplicated code.

"if you want to know how a model works you can read it in one file."

It is still one file of code, but with an accompanying config file. If the additional config file is undesirable, the distinctions can be clarified with in-code comments so that the user doesn’t even need to open the config. Something like:

if config.position_embedding == "learned":       # BART, mBART
    self.embedding = LearnedPositionalEmbedding()
elif config.position_embedding == "sinusoidal":  # Pegasus, Marian
    self.embedding = SinusoidalPositionalEmbedding()
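
To make the idea concrete, here is a minimal, self-contained sketch of that pattern. The position_embedding, max_position_embeddings, and d_model config fields and the class bodies below are illustrative, not the actual transformers internals:

import math
import torch
from torch import nn

class LearnedPositionalEmbedding(nn.Embedding):
    """Positions learned with the rest of the model (BART, mBART)."""
    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return super().forward(positions)

class SinusoidalPositionalEmbedding(nn.Module):
    """Fixed sin/cos positions, never updated during training (Pegasus, Marian).
    Assumes an even embedding dimension for simplicity."""
    def __init__(self, num_positions, embedding_dim):
        super().__init__()
        position = torch.arange(num_positions).unsqueeze(1)
        freq = torch.exp(torch.arange(0, embedding_dim, 2) * (-math.log(10000.0) / embedding_dim))
        weight = torch.zeros(num_positions, embedding_dim)
        weight[:, 0::2] = torch.sin(position * freq)
        weight[:, 1::2] = torch.cos(position * freq)
        self.register_buffer("weight", weight)

    def forward(self, input_ids):
        return self.weight[: input_ids.size(1)]

def build_position_embedding(config):
    # One flag, two branches, everything still in one file.
    if config.position_embedding == "learned":        # BART, mBART
        return LearnedPositionalEmbedding(config.max_position_embeddings, config.d_model)
    elif config.position_embedding == "sinusoidal":   # Pegasus, Marian
        return SinusoidalPositionalEmbedding(config.max_position_embeddings, config.d_model)
    raise ValueError(f"unknown position_embedding: {config.position_embedding!r}")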

The code design that genuinely makes model code difficult to read is introducing higher-level abstractions and moving them to their own files. Having a separate file for self-attention, another for the embedding layer, and a third for objective functions means you are constantly jumping between files and between levels of abstraction, which makes it challenging to grasp the model details. That is not the case here.

"in the survey, 65% of respondents said they wanted more 'self-contained' code. This is what users want!"

This code design satisfies the self-containment requirement; the few if-else statements carrying code for other models won’t be a deal-breaker. As mentioned above, I think the design users were voting against is something like fairseq or allennlp, with many layers of abstraction and lots of magic happening behind the scenes. The current design strikes a nice balance between fairseq/allennlp at one extreme and absolute self-containment with tons of duplicated code at the other.
