I am a beginner and would really appreciate your help, thank you.
During pretraining, the classic BERT model's input and output dimensions are both 768, and cross entropy is then computed as the loss. So why is BertForMaskedLM's output dimension the vocabulary size (the number of tokens)?
BertForMaskedLM's output layer:
(decoder): Linear(in_features=768, out_features=20000, bias=True)
BERT's output layer:
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
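For context, here is a minimal pure-PyTorch sketch (not the actual Hugging Face implementation) of why the MLM head has out_features equal to the vocabulary size: cross entropy needs one logit per vocabulary token at each masked position, while the base model's pooler only re-projects the 768-d [CLS] state. The sizes (hidden=768, vocab_size=20000) are taken from the printouts above; everything else is made up for illustration.

import torch
import torch.nn as nn

vocab_size, hidden, batch, seq_len = 20000, 768, 2, 8

# Base BERT: every position gets a 768-d hidden state (no token predictions yet).
hidden_states = torch.randn(batch, seq_len, hidden)

# MLM "decoder" head: Linear(768 -> 20000), one score per vocabulary token.
decoder = nn.Linear(hidden, vocab_size)
logits = decoder(hidden_states)                 # (batch, seq_len, vocab_size)

# Labels: the original token ids at masked positions, -100 elsewhere (ignored).
labels = torch.full((batch, seq_len), -100)
labels[:, 3] = torch.randint(0, vocab_size, (batch,))  # pretend position 3 was masked

# Cross entropy compares the 20000-way distribution at each masked position
# with the true token id, which is why the head must project to vocab_size.
loss = nn.CrossEntropyLoss(ignore_index=-100)(
    logits.view(-1, vocab_size), labels.view(-1)
)
print(logits.shape, loss.item())

# The base model's pooler, by contrast, only maps the [CLS] hidden state
# 768 -> 768 with a Tanh; it never produces per-token vocabulary scores.
pooler = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())
pooled = pooler(hidden_states[:, 0])            # (batch, 768)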