EncoderDecoderModel converts classifier layer of decoder

I am trying to do named entity recognition using a sequence-to-sequence model. My outputs are simple IOB tags, so I only want to predict probabilities for 3 labels (I, O, B) for each token.

I am trying an EncoderDecoderModel from the HuggingFace implementation, with DistilBert as my encoder and BertForTokenClassification as my decoder.

First, I import my encoder and decoder:

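from transformers import AutoModelForSequenceClassification, BertForTokenClassification, EncoderDecoderModel

encoder = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")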

decoder = BertForTokenClassification.from_pretrained('bert-base-uncased',

When I print my decoder model, I can clearly see the linear classification layer with out_features=3:

## sample of output:
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=3, bias=True)

However, when I combine the two models in my EncoderDecoderModel, the decoder seems to be converted into a different kind of classifier, now with out_features equal to the size of my vocabulary:

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("./Encoder","./Decoder")

## sample of output:
(cls): BertOnlyMLMHead(
      (predictions): BertLMPredictionHead(
        (transform): BertPredictionHeadTransform(
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (decoder): Linear(in_features=768, out_features=30522, bias=True)

Why is that? And how can I keep out_features = 3 in my model?

The EncoderDecoderModel class is not meant for token classification. It is meant for text generation, such as summarization or translation. Hence, the head on top of the decoder is a language modeling head, whose output size is the vocabulary size (the out_features=30522 you see) rather than your number of labels.
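To illustrate the point, here is a minimal sketch that builds a tiny, randomly initialised encoder-decoder directly from configs. The small sizes, the vocabulary of 100 and the use of BERT on both sides are assumptions chosen only so that nothing has to be downloaded; they are not the setup from the question. Setting num_labels=3 on the decoder config should make no difference to the head that ends up on the decoder.

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Tiny illustrative sizes (assumptions, not the real bert-base-uncased values)
enc_cfg = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                     num_attention_heads=2, intermediate_size=64)
# num_labels=3 is set on purpose, to show that it does not change the decoder head
dec_cfg = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                     num_attention_heads=2, intermediate_size=64, num_labels=3)

# This marks the decoder config as a decoder with cross-attention
config = EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = EncoderDecoderModel(config=config)

# The decoder is instantiated as a causal language model, so its output
# layer projects onto the vocabulary, not onto the 3 labels
decoder_class = type(model.decoder).__name__
lm_head = model.decoder.get_output_embeddings()
print(decoder_class)
print(lm_head.in_features, lm_head.out_features)
```

With the real checkpoints the same thing happens, only with out_features=30522 for the bert-base-uncased vocabulary.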

To do token classification, you can use any xxxForTokenClassification model in the library, such as BertForTokenClassification or RobertaForTokenClassification.
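As a rough sketch of what that looks like for the 3 IOB labels, the snippet below uses a small, randomly initialised BertForTokenClassification. The tiny config and the random input ids are assumptions so the example runs without downloading weights; in practice you would call from_pretrained with num_labels=3 and feed tokenizer output instead.

```python
import torch
from transformers import BertConfig, BertForTokenClassification

# Tiny illustrative config (assumption); num_labels=3 covers the I, O and B tags
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64, num_labels=3)
model = BertForTokenClassification(config)
model.eval()

# One dummy sequence of 8 token ids standing in for tokenizer output
input_ids = torch.randint(0, config.vocab_size, (1, 8))

with torch.no_grad():
    logits = model(input_ids=input_ids).logits  # (batch, seq_len, num_labels)

# Each token gets its own label from its own logits
predicted_label_ids = logits.argmax(dim=-1)
print(logits.shape, predicted_label_ids.shape)
```

Here the classifier layer keeps out_features=3, and every token receives one score per label.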

Thanks for the advice. However, am I correct to assume that with the TokenClassification structure, the predictions would not depend on each other, and a beam search would then not make sense?