Hugging Face BERT performs worse than another implementation?

I was previously using the BERT implementation from here:

However, that implementation did not seem to follow the standard BERT procedure exactly when it came to labels/attention masks, which made me try out Hugging Face's version.

Now I have both of them running on the same data and, as far as I can tell, in an identical way. However, their learning speed is very different!
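
For reference, this is roughly how I'm driving the Hugging Face model in the training loop (a minimal sketch; input_ids, attention_mask and labels come out of my own masking code, and the optimizer and learning rate are just what I happened to pick):

import torch
from transformers import BertConfig, BertForMaskedLM

# Toy 1-layer config just for this sketch; my real config has more layers.
model = BertForMaskedLM(BertConfig(vocab_size=30, num_hidden_layers=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # hypothetical choice for the sketch

def train_step(input_ids, attention_mask, labels):
    # labels hold the original token ids at masked positions and -100 elsewhere,
    # so only the masked tokens contribute to the loss (Hugging Face convention)
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    optimizer.zero_grad()
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item(), outputs.logits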

With the Hugging Face implementation my model reaches about 20% masked-token prediction accuracy after roughly 7000 iterations, while the other implementation is already at 75% accuracy at that point. Their wall-clock training time is comparable.
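
For context, this is roughly how I compute the masked-LM accuracy quoted above (a minimal sketch, assuming labels are -100 at positions that were not masked, as in the Hugging Face convention; I score the other implementation over the same masked positions):

import torch

def masked_lm_accuracy(logits, labels, ignore_index=-100):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    preds = logits.argmax(dim=-1)
    mask = labels != ignore_index          # only the masked positions count
    correct = (preds[mask] == labels[mask]).sum().item()
    total = int(mask.sum().item())
    return correct / max(total, 1)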

I can see that the two BERT models are slightly different (below I'm printing both networks with a single layer, just to get something that is easier to compare):

HuggingfaceBERT has 8097792 parameters
BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=30, bias=True)
    )
  )
)
The other BERT has 7136286 trainable parameters.

BERTseq(
  (bert): BERT(
    (embedding): BERTEmbedding(
      (token): TokenEmbedding(30, 768, padding_idx=0)
      (position): PositionalEmbedding()
      (segment): SegmentEmbedding(3, 768, padding_idx=0)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer_blocks): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadedAttention(
          (linear_layers): ModuleList(
            (0): Linear(in_features=768, out_features=768, bias=True)
            (1): Linear(in_features=768, out_features=768, bias=True)
            (2): Linear(in_features=768, out_features=768, bias=True)
          )
          (output_linear): Linear(in_features=768, out_features=768, bias=True)
          (attention): Attention()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=768, out_features=3072, bias=True)
          (w_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation): GELU()
        )
        (input_sublayer): SublayerConnection(
          (norm): LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (output_sublayer): SublayerConnection(
          (norm): LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (mask_lm): MaskedLanguageModel(
    (linear): Linear(in_features=768, out_features=30, bias=True)
  )
)
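
For completeness, this is roughly how I produce the Hugging Face printout and parameter count above (a sketch; depending on the transformers version and weight tying of the MLM decoder, the exact count can differ slightly):

from transformers import BertConfig, BertForMaskedLM

# 1-layer, vocab-size-30 config purely for the side-by-side comparison;
# all other hyperparameters are left at the BertConfig defaults.
config = BertConfig(vocab_size=30, num_hidden_layers=1)
model = BertForMaskedLM(config)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"HuggingfaceBERT has {n_params} parameters")
print(model)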

Does anyone have an idea what could be going on here? I'm not an expert in BERT, so I can't really tell whether any of these differences are crucial.