[Proposal] Copy Pasting modeling_bart.py

We are considering copy-pasting the contents of modeling_bart.py into modeling_pegasus.py, modeling_mbart.py, and modeling_marian.py, and wanted to see what the community thinks.

Currently, here is modeling_pegasus.py:
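In essence (a minimal runnable sketch with stand-in stubs, not the real transformers code):

```python
# Stand-in stub for the full Bart implementation (in reality, thousands of
# lines in modeling_bart.py).
class BartForConditionalGeneration:
    def generate(self, text):
        return f"summary of {text!r}"

# modeling_pegasus.py, in essence: an empty subclass whose only job is to
# give AutoConfig/AutoTokenizer a distinct model type to resolve.
class PegasusForConditionalGeneration(BartForConditionalGeneration):
    pass  # inherits the complete implementation from bart

print(PegasusForConditionalGeneration().generate("long article"))
```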

marian and mbart are similar: they do nothing and inherit the complete implementation from bart. They are only their own classes so that AutoConfig/AutoTokenizer work properly.

Arguments for copy paste

  • if you want to know how a model works you can read it in one file.
  • This is how the rest of the repo works and there is no good reason to introduce a stylistic inconsistency here.
  • we can delete 6 configuration parameters that would otherwise always be True or always False:
        # params needed for mbart
        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True
        self.normalize_embedding = normalize_embedding  # True for mbart, False otherwise
        self.normalize_before = normalize_before  # combo of fairseq's encoder_ and decoder_normalize_before
        self.add_final_layer_norm = add_final_layer_norm

        # Params introduced for Marian
        self.add_bias_logits = add_bias_logits
        self.static_position_embeddings = static_position_embeddings
  • we could also delete 6 corresponding if statements
  • it might be easier to make one-off fixes or modify one model without unintended consequences.
  • in the survey, 65% of respondents said they wanted more “Self-contained” code. This is what users want!

Arguments against copy paste

  • we would have to add and maintain about 4,500 lines of code, which could slow us down on adding new features/models.
  • this might make the code harder to maintain. Some bugs/features that require 1 change will require 4 changes.
  • the implementations could drift, making it more difficult for the same finetuning code to work with multiple models.
  • Integration tests enforce that you can’t have big unintended consequences when you are refactoring, so the “unintended side effects” point is moot.
  • It is actually harder to understand what is going on in pegasus/mbart/marian if the code is duplicated. Seeing that pegasus state dict is identical to bart makes reading the code easier. “It is annoying to have to read so many similar implementations and have no easy way to see the delta.”
  • Do not repeat yourself/rule of 3 are best practice and there is no good reason for this lib to deviate.
  • In the survey, people said they wanted better examples/more models faster, having to maintain more code would distract the team from those objectives.
  • Maintaining equivalent test coverage after copy paste would require adding 111 tests!

Would love to hear other people’s opinions as there is some disagreement within the team.
You can vote here and comment below.
I am probably missing alternative solutions, pros, and cons. I would love to hear them!


I really like keeping all four models combined into one file (as is the case right now). I believe this is much better than copy-pasting. In addition to the software-engineering advantages around maintainability mentioned above (which are not small things for a library that's growing as fast as yours), I want to add two points from personal experience:

  • Easier to understand and compare models: Here's a recent use case. I have a working model that uses BART and I wanted to try Pegasus. Running the same training code resulted in NaNs with fp16. With a quick diff of the two configs, I could spot the model-design changes that might be the reason for the fp16 issues. To compare with T5 (implemented in a separate file), it would take many hours of code reading to find the differences (compared with one minute to diff the configs).

  • Easier to try new ideas: I was recently experimenting with adding Longformer self-attention to various models in the HF/transformers repo. Once I had it implemented for BART, I got it for free for Pegasus and the other models sharing the same file. On the other hand, check the Longformer GitHub repo for the number of issues requesting Longformer self-attention for BERT, ALBERT, ELECTRA, etc. Each one requires its own implementation, and each one is a ton more work. Many similar use cases exist: experimenting with a new activation function, a new position embedding, a new type of layer normalization, etc.
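The one-minute config diff from the first point can be sketched in a few lines (the keys below come from the parameter list in the proposal; these particular values are made up for illustration):

```python
# Hypothetical config dicts; the keys are from the proposal's parameter
# list, but the values here are illustrative, not the real checkpoints.
bart_config = {"scale_embedding": False, "normalize_before": False,
               "static_position_embeddings": False}
pegasus_config = {"scale_embedding": True, "normalize_before": True,
                  "static_position_embeddings": True}

# The "one-minute diff": only the design deltas between the models remain.
delta = {k: (bart_config[k], pegasus_config[k])
         for k in bart_config if bart_config[k] != pegasus_config[k]}
print(delta)
```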

I also want to comment on some of the arguments for the copy-paste mentioned above:

delete 6 configuration parameters and delete 6 corresponding if statements

I think the benefits mentioned above outweigh the gains from deleting these lines. Also, remember that removing these lines comes at the expense of a massive amount of added code.

if you want to know how a model works you can read it in one file.

It is still one file of code, but with an accompanying config file. If the additional config file is undesirable, maybe it can be clarified in code via comments so that the user doesn’t even need the config. Something like:

if config.position_embedding == "learned":  # BART, mBART
    self.embedding = LearnedPositionalEmbedding()
elif config.position_embedding == "sinusoidal":  # Pegasus, Marian, T5
    self.embedding = SinusoidalPositionalEmbedding()

The annoying code design that makes reading model code difficult is when you introduce higher-level abstractions and move them to their own files. Having a separate file for self-attention, another file for the embedding layer, and a third for objective functions means you keep moving from one abstraction to another and jumping between files, which makes it challenging to grasp the model details. This is not the case here.

in the survey, 65% of respondents said they wanted more “Self-contained” code. This is what users want!

This code design satisfies the self-containment requirement. The few if-else statements with code for other models won't be a deal-breaker. As mentioned above, I think the design the users were voting against is something like fairseq and allennlp, which have lots of layers of abstraction and lots of magic happening behind the scenes. I think the current design strikes a nice balance between fairseq/allennlp at one extreme and absolute self-containment with tons of duplicated code at the other.


Wow, @sshleifer - that's an amazing, very balanced list of pros and cons! You made it hard to choose sides. To add to your list: FSMT could also re-use about half of Bart's components as-is.

@beltagy’s comment is very insightful - I agree with a lot of what you said, thank you!

Here are just a few ideas from me:

1. unfolding tool

In the most recent discussion that I'm aware of where we tried to introduce some refactoring, it was mentioned that a tool is in the works (or being discussed?) that would be able to unfold the code for those who want it. If that is possible, my vote would go there.

Such a tool would permit much easier development and maintenance, while still letting those who want to read a single file with all the code in it do so by unfolding the abstractions.

2. generic component library

If such a tool is not possible, I'd vote for creating more generic components: a set of Encoders and Decoders, where we create as many as needed and then plug them in. One doesn't go and read the code of torch.nn.Linear most of the time; you quickly learn what it does. Perhaps the same can be done for more complex components: just create a variety of them so that they can be easily plugged in. I.e., if BartEncoder is a thing (and it is), how is it different from torch.nn.Linear when there are many who want to use it?

So what I’m proposing is not to copy BartEncoder, but to put it into transformers_encoders.py with all the other encoders, thereby creating an encoders library, so that even modeling_bart.py imports it as a generic component.
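A minimal sketch of the idea, with hypothetical class bodies (only the names BartEncoder, transformers_encoders.py, and modeling_bart.py come from the discussion; everything else is a stand-in):

```python
# transformers_encoders.py (hypothetical module): a small library of
# pluggable encoders that model files import as generic components.
class BartEncoder:
    def __call__(self, tokens):
        return [f"enc({t})" for t in tokens]

class T5Encoder:
    def __call__(self, tokens):
        return [f"t5enc({t})" for t in tokens]

# modeling_bart.py would then do:
#   from transformers_encoders import BartEncoder
class BartModel:
    def __init__(self, encoder=None):
        # the encoder is plugged in, much like one plugs in torch.nn.Linear
        self.encoder = encoder or BartEncoder()

print(BartModel().encoder(["hello", "world"]))
```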

Maintaining equivalent test coverage after copy paste would require adding 111 tests!

Not so if you copy all the tests :wink: Which, btw, is another problem to add to your list: the test suite's run-time will be much, much longer, and that's a problem.

p.s. I’m not sure your voting tool isn’t game-able: once I voted, it asked me if I wanted to vote again. But in any case, I’m not sure how valid the results would be: how do we know which groups of people were reached by this post, and whether the voting process isn’t gamed by an unbalanced sample?


I agree with Iz here.
As long as the code the model inherits is in a single file, it’s readable, and personally copy-pasting doesn’t make much sense to me.

And if the models share the code, then adding new functionality (gradient checkpointing, new heads) to the base model gives me the same functionality for free in the other subclassed models. This actually helps with experimentation. For example, I wanted to try MBart for sequence classification, and as it inherits from BART completely, all I had to do was subclass BartForSequenceClassification.
Without this, I would have to copy-paste the head and test it again, which slows things down.
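A sketch of how little that subclass amounts to (stub classes here stand in for the real transformers implementations; the actual forward pass is far more involved):

```python
# Stand-in for the real Bart classification class (sketch, not actual code).
class BartForSequenceClassification:
    def forward(self, text):
        # the classification head + forward pass live here, tested once
        return {"label": "positive", "model": type(self).__name__}

# The entire MBart change: subclass and inherit everything for free,
# including the head and the tests exercising it.
class MBartForSequenceClassification(BartForSequenceClassification):
    pass

print(MBartForSequenceClassification().forward("great movie"))
```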

Another example: I wanted to experiment with Camembert in EncoderDecoder, and as it inherits from Roberta, which was already supported in EncoderDecoder, it was a very simple and fast change that didn’t require much extra code or tests, as the tests also came free with the base model.

And IMO, in some cases such refactoring might even introduce unintended effects. IIRC, after refactoring Longformer to remove the Roberta abstraction, a major slowdown was introduced; @beltagy might remember this.


I am a big huggingface fan!

I want to voice more support for not copy pasting. I think it makes the code much easier to develop and maintain. I also think it makes it much easier to read and understand, and would disagree with the claim that copy-pasting code makes it easier to read.

  1. When we read code, we implicitly delegate understanding to other portions of a code base. When we see something like self.calc_mse_loss, we understand what’s going on, even if the function lives in another file. The same is true when we see nn.Linear: we don’t need to copy all of the torch code, and instead rely on torch to have implemented it correctly. By copying lots of code between models, we have it all in one place, but we make it hard for readers to understand the code at a higher level of abstraction.

  2. It is helpful when reading code to understand exactly where a model differs from similar models, but if all the code is copy-pasted into a model class, it is hard to figure this out. For example, when we see class RobertaEmbeddings(BertEmbeddings), we know that RobertaEmbeddings is largely similar to BertEmbeddings, and the code that follows in the definition of RobertaEmbeddings covers only the ways in which Roberta differs from Bert. Instead, if we copy-paste a lot of code from BERT into Roberta, we don’t know how they differ without carefully reading both, and it is difficult even to figure out which classes are most similar to Roberta. I think this puts more cognitive burden on someone reading the code, not less.

  3. (really 2a) Requiring that code be identical leads to cases where naming conventions are confusing. For example T5 calls things blocks while other models refer to them as layers. This leads to cases where we need nearly identical methods just because things are named differently (e.g. copy_to_student). As a reader, this further obfuscates the similarities and differences between T5 and other models.