Hello, I’m reading a paper where BERT (`TFBertModel`) and RoBERTa (`TFRobertaModel`) are used to solve a text classification task.
Going through the implementation, I noticed that each text sample is tokenized without adding the special tokens.
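For context, the tokenization step looks roughly like the following (a minimal sketch of my own, assuming the standard Hugging Face tokenizer with `add_special_tokens=False`; the checkpoint and variable names are mine, not the paper's):

```python
from transformers import BertTokenizer

# hypothetical checkpoint; the paper may use a different one
tokenizer_BERT = BertTokenizer.from_pretrained("bert-base-uncased")

text = "This movie was surprisingly good."

# Tokenize WITHOUT special tokens, as described in the implementation
enc = tokenizer_BERT(text, add_special_tokens=False)

print(tokenizer_BERT.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['this', 'movie', 'was', 'surprisingly', 'good', '.']
# -> no [CLS] at position 0 and no [SEP] at the end
```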
Later on, the outputs of the tokenizers are passed to the respective models and the pooled output is retrieved, as follows:
```python
embedding_BERT = encoder_BERT(
    input_ids_BERT,
    token_type_ids=token_type_ids_BERT,
    attention_mask=attention_mask_BERT,
)['pooler_output']
```
- The authors claim to be using the `[CLS]` tokens produced by both models. However, how can this be the case if the tokenizers encoded the text samples without including the special tokens? And if the special tokens were indeed left out, does the first token of each text sample still encode knowledge about the whole sequence, as is usually the case with the `[CLS]` token?
- The authors actually use the pooled output, which is produced by `BertPooler`. Can its output still be considered as the `[CLS]` token's representation? (See the sketch below.)
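To show what I mean, here is a minimal sketch of my understanding of the pooler: Hugging Face's `BertPooler` applies a learned dense layer plus a `tanh` to the hidden state of the *first* token, so `pooler_output` differs from `last_hidden_state[:, 0]`. The checkpoint and variable names below are my own assumptions, not taken from the paper:

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

# hypothetical checkpoint; the paper may use a different one
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder_BERT = TFBertModel.from_pretrained("bert-base-uncased")

enc = tokenizer(
    "This movie was surprisingly good.",
    add_special_tokens=False,   # as in the paper's preprocessing
    return_tensors="tf",
)

outputs = encoder_BERT(
    enc["input_ids"],
    token_type_ids=enc["token_type_ids"],
    attention_mask=enc["attention_mask"],
)

first_token_state = outputs["last_hidden_state"][:, 0]  # hidden state of the first token
pooled = outputs["pooler_output"]                       # dense + tanh applied to that state

print(first_token_state.shape, pooled.shape)  # (1, 768) (1, 768)
# The two tensors differ, since the pooler transforms the first token's state:
print(tf.reduce_max(tf.abs(first_token_state - pooled)).numpy())
```

If I read the source correctly, `pooler_output` is essentially `tanh(dense(last_hidden_state[:, 0]))`, which is why I am unsure whether it can still be read as a `[CLS]` representation when no `[CLS]` token is present at position 0.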