Why is there no pooler representation for XLNet or a consistent use of sequence_summary()?

I'm trying to create sentence embeddings using different Transformer models. I've created my own class where I pass in a Transformer model, and I want to call the model to get a sentence embedding.

Both BertModel and RobertaModel return a pooler output (the sentence embedding).

pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

Why does XLNetModel not produce a similar pooler_output?

When I look at the source code for XLNetForSequenceClassification, I see that there actually exists code for getting a sentence embedding using a function called sequence_summary().

    def forward(self, ...):  # excerpt from XLNetForSequenceClassification
        transformer_outputs = self.transformer(...)
        output = transformer_outputs[0]         # last hidden states for all tokens

        output = self.sequence_summary(output)  # pool them into a single vector

Why is this sequence_summary() function not used consistently in the other Transformers models, such as BertForSequenceClassification and RobertaForSequenceClassification?

To be rigorous, I compared (1) BertModel's pooler output to (2) the output of SequenceSummary for the same sentence. Aren't these two approaches supposed to produce the same sentence embedding? That's what I'm led to believe from the comments of SequenceSummary:

class transformers.modeling_utils.SequenceSummary
Compute a single vector summary of a sequence hidden states.

summary_type (str) – The method to use to make this summary. Accepted values are:

  • "last" – Take the last token hidden state (like XLNet)
  • "first" – Take the first token hidden state (like Bert)
  • "mean" – Take the mean of all tokens hidden states
  • "cls_index" – Supply a Tensor of classification token position (GPT/GPT-2)
  • "attn" – Not implemented now, use multi-head attention

Returns

The summary of the sequence hidden states.

I've created a Google Colab notebook that computes both (1) the pooler output and (2) the SequenceSummary output for the same sentence. The last line in the notebook shows that the two approaches to sentence embeddings are not the same.
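
Roughly, the notebook does something like the following (a simplified sketch; the SequenceSummary import path and the output indexing can differ across transformers versions):

    import torch
    from transformers import BertModel, BertTokenizer
    from transformers.modeling_utils import SequenceSummary

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("This is a test sentence.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # (1) BERT's own pooler output: [CLS] hidden state -> Linear -> Tanh,
    # with weights trained on the next sentence prediction objective.
    pooler_output = outputs[1]

    # (2) SequenceSummary configured to take the first token's hidden state.
    # Its projection (if any) is freshly initialized, not the pretrained pooler.
    model.config.summary_type = "first"
    sequence_summary = SequenceSummary(model.config)
    summary_output = sequence_summary(outputs[0])  # outputs[0] = last hidden states

    print(torch.allclose(pooler_output, summary_output))  # prints False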

Is this a bug, or was my assumption wrong? If it's a bug (in code or documentation), I'll open a GitHub issue.

Any help would be appreciated.

Tagging @thomwolf because he answered this tangentially related GitHub issue.

The answer to the "why" is in your own question.

BERT implements a pooler output because it was pretrained with an additional objective (next sentence prediction) that uses that output. RoBERTa does not use that objective anymore, but its implementation subclasses BERT, so it also has that functionality. XLNet, however, is trained with permutation language modeling (PLM) and is different altogether.

The BERT pooler is slightly different from what SequenceSummary does. You can compare the code if you want to. The pooler takes the hidden state of the [CLS] token and puts it through a Linear layer and a Tanh activation.

The SequenceSummary has more options: you can choose the activation function, add dropout, and pick a summary type. This can explain the differences that you encounter.
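
For reference, the BERT pooler is roughly the following (paraphrased from the transformers source; details may vary by version):

    import torch.nn as nn

    class BertPooler(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.dense = nn.Linear(config.hidden_size, config.hidden_size)
            self.activation = nn.Tanh()

        def forward(self, hidden_states):
            # Take the hidden state of the first token ([CLS]) and run it
            # through a trained Linear layer followed by a Tanh activation.
            first_token_tensor = hidden_states[:, 0]
            pooled_output = self.dense(first_token_tensor)
            pooled_output = self.activation(pooled_output)
            return pooled_output

A freshly constructed SequenceSummary, even with summary_type="first", does not carry those pretrained pooler weights, so the two outputs will not match.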

Note that the SequenceSummary object will disappear soon-ish: it goes against our philosophy of having each model fully defined in one self-contained file (except for obvious subclasses) and adds config arguments to some models that are never actually used. We will use an approach like in the BERT or ELECTRA modeling files, where all the layers for the head are defined in the modeling.py file.

BERT implements a pooler output because it was pretrained with an additional objective (next sentence prediction) that uses that output. RoBERTa does not use that objective anymore, but its implementation subclasses BERT, so it also has that functionality. XLNet, however, is trained with permutation language modeling (PLM) and is different altogether.

Thanks, but I don't think that fully explains why XLNet doesn't have a pooler_output. You are right that BERT is trained on next-sentence prediction, but that only affects the classifier top layer (the linear layer that takes the pooler_output and produces logits). I'm interested in only the pooler_output, the summarization of all the token embeddings. To be consistent with BERT and RoBERTa, XLNet should produce a pooler_output of its own. I will use that sentence embedding to train my own classifier top layer on a downstream task.

Note that the SequenceSummary object will disappear soon-ish: it goes against our philosophy of having each model fully defined in one self-contained file (except for obvious subclasses) and adds config arguments to some models that are never actually used. We will use an approach like in the BERT or ELECTRA modeling files, where all the layers for the head are defined in the modeling.py file.

Thank you. When SequenceSummary is removed, will all models then produce their own pooler_output?

No, this was just a BERT thing: it was in the original BERT code and is part of its pretraining objective. There is no reason to add it to other models that did not use it in their pretraining objective.

Users (like me) still need a way to generate sentence embeddings from a token sequence, right? I'm referring to the actual construction of the sentence embedding from the token embeddings, such as:

  • First token embedding
  • Mean of the token embeddings
  • Max of the token embeddings
  • Etc.

Isn't that what SequenceSummary and/or pooler_output provide?
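
For illustration, the kind of thing I mean, as a rough sketch (last_hidden_state here just stands in for any model's token outputs, and padding is ignored for the moment):

    import torch

    # Stand-in for a model's last hidden states: (batch_size, seq_len, hidden_size)
    last_hidden_state = torch.randn(2, 10, 768)

    first_token = last_hidden_state[:, 0]             # first-token ("CLS") embedding
    mean_pooled = last_hidden_state.mean(dim=1)       # mean of the token embeddings
    max_pooled = last_hidden_state.max(dim=1).values  # max of the token embeddings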

I'm all for usability and have made many suggestions that focus on the user side of things, but decisions need to be made about what this library should provide, what should be considered user-specific scenarios, and how intricate it is for users to implement such functionality themselves. Saying that this library should provide X is a subjective opinion and does not sit well with the philosophy of the library. The library cannot implement all small code snippets that individual users want, especially not if those are easy to do yourself.

The library does provide some useful, general help such as the Trainer class, but a sequence summary is more subjective: it can be seen as a final classification module on top of a model, but then the decision needs to be made how to implement it. Dropout? Activation? An MLP? How many layers? Additional skip connections? Perhaps even just ensemble first? Etc. There are so many options, and the 'right' option will depend on the model and your use case. So rather than implementing something that is just half-and-half, we can leave that implementation up to the user. As you can see from the SequenceSummary code, it isn't hard to get the position of the CLS token or the mean of the tokens, which works well as a basic sentence embedding.

You have to remember that some models are basically extensions of others (RoBERTa and BERT) while others are completely different (BERT and GPT-2). It doesn't make sense to force them all into the same mold, especially when it is easy for users to do this themselves.

If you really need this functionality to get sentence embeddings, and you do not have the skill to implement it yourself, you can use the sentence-transformers library, which is intended exactly for this purpose.
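
For completeness, a minimal sentence-transformers sketch (the model name is only an example; pick whichever pretrained sentence-embedding model suits your task):

    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("bert-base-nli-mean-tokens")  # example model name
    embeddings = model.encode(["This is a test sentence.", "And another one."])
    print(embeddings.shape)  # (2, 768) for this particular model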

The library cannot implement all small code snippets that individual users want, especially not if those are easy to do yourself.

I'm not asking about some random code snippet. I'm asking about functionality to construct a sentence embedding from the token embeddings, which is a core requirement for NLU applications. Furthermore, I'm not asking anyone to implement anything new. I'm asking why the SequenceSummary class, which already constructs multiple types of sentence embeddings, would be removed.

a sequence summary is more subjective: it can be seen as a final classification module on top of a model, but then the decision needs to be made how to implement it. Dropout? Activation? An MLP? How many layers? Additional skip connections? Perhaps even just ensemble first? Etc.

I'm not asking about the classification. I'm asking about how to construct the sentence embedding from the token embeddings, which is universal across all these neural language models because their core competency is producing token embeddings.

The reason I'm asking about all this is that I don't want to reinvent the wheel. There already seems to be code to get sentence embeddings from BERT, XLNet, and others, but the code paths are not consistent.

@sgugger already explained why it is being removed.

"A core requirement for NLU applications" is a stretch. SOTA applications typically fine-tune full models on their downstream task. Sentence embeddings as such (obtained by averaging or otherwise pooling without any tuning) are an approach, but hardly the best one out there. Averaging token embeddings is as simple as torch.mean if that is what you want, and getting a CLS token's output (if it even exists for that LM) isn't hard either (cf. SequenceSummary).

I think this topic can conclude here. I would advise you to either implement this yourself, perhaps starting from SequenceSummary and building on that if you are able, and otherwise use the sentence-transformers library or similar, which provides fine-tuned models to get better sentence representations.

SOTA applications typically fine-tune full models on their downstream task.

We're in agreement here.

Averaging token embeddings is as simple as torch.mean if that is what you want

It's not so simple when taking attention masks into account. That's why I was hoping the HF library could provide pooling functionality appropriate for every model.

otherwise use the sentence-transformers library or similar, which provides fine-tuned models to get better sentence representations.

Again, I'm not interested in pretrained parameters for a classifier. I just want to generate sentence embeddings so I can fine-tune the parameters on my own downstream task. My observations are that:

  1. BERT has a pooler_output.
  2. XLNet does not have a pooler_output but instead uses SequenceSummary.
  3. sgugger says that SequenceSummary will be removed in the future, and there is no plan to have XLNet provide its own pooler_output.
  4. Folks like me doing NLU need to produce a sentence embedding so we can fine-tune a downstream classifier.

Have you looked at sentence-transformers? That's what they do, outputting a single embedding for a single sentence. You can then use that with your own downstream classifier. They do not provide many different models though, so that can be an issue for you. But again, this shouldn't be so hard to implement yourself.

I am not sure what you mean by "taking into account attention masks"; why do you need those?

Have you looked at sentence-transformers?

Yes, I'm doing research in the same area.

I am not sure what you mean by "taking into account attention masks"; why do you need those?

BERT will insert [PAD] tokens as needed. The attention mask marks where those tokens are located in the input sequence. I don't want to include the embeddings for those tokens when I compute the max or the mean.

You don't need the attention mask for that explicitly, as you can create it yourself; you just need the input token IDs. You can find the positions where input_ids == tokenizer.pad_token_id and exclude those positions from the output before pooling. If you cannot figure out how to do this, I can post a snippet here.
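
Something along these lines, as a sketch with hypothetical names (it assumes you have the model's last hidden states and the tokenizer's pad_token_id):

    import torch

    def masked_mean_pool(last_hidden_state, input_ids, pad_token_id):
        """Mean-pool token embeddings while ignoring padding positions.

        last_hidden_state: (batch_size, seq_len, hidden_size)
        input_ids:         (batch_size, seq_len)
        """
        # Build the mask from the token ids: 1.0 for real tokens, 0.0 for [PAD].
        mask = (input_ids != pad_token_id).unsqueeze(-1).type_as(last_hidden_state)
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts

    # e.g. sentence_emb = masked_mean_pool(outputs[0], inputs["input_ids"], tokenizer.pad_token_id)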

Thanks. I'm trying to plug in other encoders like XLNet and RoBERTa, so I don't think I can necessarily rely on creating a mask with input_ids == tokenizer.pad_token_id. I'll figure it out.