Remove PE from the encoder of a BartModel

I want to generate sequences from keywords. I’m trying to remove the positional embeddings (PE) from the encoder of a BartForConditionalGeneration.

The idea is that, without PE on the encoder, a seq2seq model can do set2seq (unordered tokens in, autoregressive output sequence out).

Basically I thought that training the model like this would be enough:

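import torch
import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM

model_name = "sshleifer/distilbart-xsum-12-3"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Replace the encoder's positional embedding with a module that always returns zero.
class UseLessModule(nn.Module):
    def __init__(self):
        super().__init__()  # required, otherwise calling the module fails
        # A buffer moves with the model on .to(device) and is never trained.
        self.register_buffer("_constant_zero", torch.tensor(0.0))

    def forward(self, *args, **kwargs) -> torch.Tensor:
        # Ignore the inputs; the scalar zero broadcasts against the token embeddings.
        return self._constant_zero

model.model.encoder.embed_positions = UseLessModule()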

# train as usual

The model trains well, and I get a good loss. But for some reason it seems to remember the positions of the keywords.

When I run inference with it (the PE is also removed at test time), the sentence always begins with the first keyword. Here is an example where I feed the same keywords in different shuffled orders:

Keywords: ['dominant', 'sequence', 'transduction', 'models', 'based', 'complex', 'recurrent', 'convolutional', 'neural', 'networks', 'encoder', 'decoder', 'configuration', 'best', 'performing', 'models', 'connect', 'encoder', 'decoder', 'attention', 'mechanism', 'propose', 'new', 'simple', 'network', 'architecture', 'based', 'attention', 'mechanisms', 'dispensing', 'recurrence', 'convolutions', 'Experiments', 'machine', 'translation', 'tasks', 'show', 'models', 'superior', 'quality', 'parallelizable', 'requiring', 'less', 'time', 'train', 'model', 'achieves', 'German', 'translation', 'task', 'improving', 'existing', 'best', 'results', 'including', 'ensembles', 'French', 'translation', 'task', 'model', 'establishes', 'new', 'single', 'model', 'state', 'art', 'score', 'training', 'days', 'GPUs', 'small', 'fraction', 'training', 'costs', 'best', 'models', 'literature', 'show', 'generalizes', 'other', 'tasks', 'applying', 'constituency', 'parsing', 'large', 'limited', 'training', 'data']
Sequence: ['one of the dominant mechanisms for improving the quality of translation tasks is attention - based attention transduction. \n we propose a new recurrent neural network architecture that generalizes existing encoder - decoder models to a single task, and show that it achieves the best results on a sequence of tasks,']

Keywords: ['neural', 'dominant', 'based', 'connect', 'other', 'propose', 'decoder', 'complex', 'convolutions', 'art', 'including', 'single', 'score', 'GPUs', 'parsing', 'convolutional', 'translation', 'model', 'models', 'dispensing', 'training', 'days', 'show', 'new', 'requiring', 'models', 'model', 'results', 'encoder', 'architecture', 'constituency', 'simple', 'less', 'establishes', 'superior', 'task', 'networks', 'models', 'time', 'mechanisms', 'tasks', 'models', 'improving', 'parallelizable', 'data', 'Experiments', 'French', 'translation', 'recurrence', 'recurrent', 'large', 'translation', 'fraction', 'attention', 'costs', 'small', 'new', 'state', 'best', 'decoder', 'German', 'machine', 'training', 'achieves', 'network', 'task', 'sequence', 'existing', 'performing', 'tasks', 'literature', 'mechanism', 'attention', 'show', 'applying', 'best', 'generalizes', 'based', 'limited', 'quality', 'ensembles', 'best', 'configuration', 'model', 'train', 'training', 'transduction', 'encoder']
Sequence: ['neural networks show superior results for improving the quality of translation and other tasks, including parsing, recurrence, and translation. \n we propose a new recurrent network architecture, based on a simple encoder - decoder mechanism, which generalizes existing models by dispensing attention to a single task']

Keywords: ['dispensing', 'convolutions', 'decoder', 'limited', 'achieves', 'show', 'improving', 'best', 'best', 'decoder', 'small', 'single', 'translation', 'applying', 'network', 'convolutional', 'existing', 'state', 'models', 'training', 'models', 'recurrence', 'configuration', 'French', 'generalizes', 'transduction', 'dominant', 'models', 'sequence', 'fraction', 'parallelizable', 'performing', 'GPUs', 'train', 'mechanism', 'quality', 'task', 'show', 'mechanisms', 'literature', 'time', 'machine', 'model', 'other', 'large', 'simple', 'encoder', 'results', 'networks', 'attention', 'parsing', 'based', 'requiring', 'connect', 'data', 'ensembles', 'score', 'translation', 'task', 'costs', 'neural', 'models', 'tasks', 'establishes', 'model', 'tasks', 'including', 'less', 'propose', 'training', 'attention', 'model', 'architecture', 'days', 'art', 'translation', 'superior', 'new', 'training', 'complex', 'constituency', 'new', 'encoder', 'Experiments', 'recurrent', 'based', 'German', 'best']
Sequence: ['dispensing is one of the most dominant mechanisms for improving machine translation. \n existing models in the literature show superior results for tasks including parsing, translation, recurrence, and translation. here \n, we propose a new recurrent neural network architecture based on a simple encoder - decoder']

I don't understand how the model can preserve the order of the keywords when the PE is always 0 in the encoder. Is there something I am missing?
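
To narrow this down, it may help to check whether the encoder is really order-invariant once no positional information is added. Below is a minimal sketch of such a check. It is an assumption-laden stand-in, not BART itself: it uses a generic torch.nn.TransformerEncoderLayer with made-up sizes and random embeddings. Without PE, shuffling the input tokens should only shuffle the output rows in the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in encoder layer (hypothetical sizes, NOT the BART encoder).
# Dropout is disabled so that two forward passes are comparable.
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)
layer.eval()

# Token embeddings only, no positional embeddings: (batch, tokens, hidden).
x = torch.randn(1, 5, 16)
perm = torch.randperm(5)  # a random reordering of the 5 tokens

with torch.no_grad():
    out = layer(x)                 # original order
    out_shuffled = layer(x[:, perm])  # shuffled order

# Permutation equivariance: shuffling the input just shuffles the output rows.
is_equivariant = torch.allclose(out[:, perm], out_shuffled, atol=1e-5)
print(is_equivariant)
```

The same comparison could be run on the real model by passing a shuffled and an unshuffled batch of input_ids through model.model.encoder and comparing last_hidden_state row by row. If the check fails there, position information is reaching the encoder from somewhere other than embed_positions (for example from how the input text is tokenized), which would be the next thing to inspect.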