Converting TF BERT to PyTorch with the conversion script works, but the converted model produces constant outputs

Hi there - we are using this BERT architecture from Google:

* attention_probs_dropout_prob: 0.1
* hidden_act: "gelu"
* hidden_dropout_prob: 0.1
* hidden_size: 768
* initializer_range: 0.02
* intermediate_size: 3072
* max_position_embeddings: 512
* num_attention_heads: 12
* num_hidden_layers: 12
* type_vocab_size: 2
* vocab_size: 32000
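
For reference, this is the same architecture expressed as a Hugging Face transformers BertConfig (a minimal sketch that only sets the values listed above; the class and argument names come from the transformers library):

from transformers import BertConfig

config = BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
)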

We trained it from scratch on 10 million documents from our very specific domain and also changed the optimizer and the sentence tokenizer. Now our BERT works wonderfully: we evaluate it on masked-token prediction and next-sentence prediction, and we can also fine-tune it for downstream tasks such as classification. All of that works, as long as we stay in the TensorFlow/NVIDIA world.

We would, however, love to open up all the possibilities of PyTorch as well, so we applied the following script: convert_bert_original_tf_checkpoint_to_pytorch.py
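
(For context, the script is thin: it essentially builds a BertForPreTraining from the config file and copies the TF variables over. Roughly the following, with placeholder paths, using the functions exported by transformers:)

import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Rough sketch of what convert_bert_original_tf_checkpoint_to_pytorch.py does.
config = BertConfig.from_json_file("bert_config.json")   # placeholder path
model = BertForPreTraining(config)

# Walks the TF checkpoint variable by variable and copies each array into the
# matching PyTorch parameter - this is where name mismatches can bite.
load_tf_weights_in_bert(model, config, "model.ckpt")      # placeholder path

torch.save(model.state_dict(), "pytorch_model.bin")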

This script initially fails because our optimizer variables are not recognized, so we added them to the list of skipped attributes:

if any(n in ["adam_v", "adam_m", "AdamWeightDecayOptimizer", "AdamWeightDecayOptimizer_1", "global_step"] for n in name):
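
(That check sits inside the loop that walks the TF checkpoint variables. A rough, self-contained sketch of the skip logic; the checkpoint path is a placeholder and "lamb_m"/"lamb_v" only stand in for whatever slot names your own optimizer writes into the checkpoint:)

import tensorflow as tf

SKIP_NAMES = {
    "adam_v", "adam_m",
    "AdamWeightDecayOptimizer", "AdamWeightDecayOptimizer_1",
    "global_step",
    "lamb_m", "lamb_v",  # placeholders for custom optimizer slot variables
}

reader = tf.train.load_checkpoint("model.ckpt")  # placeholder path
for name in reader.get_variable_to_shape_map():
    parts = name.split("/")
    # Optimizer state and bookkeeping variables have no counterpart in the
    # PyTorch model, so they are skipped.
    if any(part in SKIP_NAMES for part in parts):
        print("skipping", name)
        continue
    array = reader.get_tensor(name)
    # ... the original script then maps `parts` onto the PyTorch module tree ...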

With that change the script runs through fine and we get a converted Torch model. And here things become very strange. Initially we tried to fine-tune the Torch version of our BERT and nothing happened. Inspecting the attention values and the outputs for the input tokens, we realized that for any given input sequence ALL attention values are identical (they do change when the input changes). The output values in the hidden states of the converted base model are also identical for every token, irrespective of the input.

Nevertheless, the conversion does seem to work in the sense that the model can be loaded and is a fully valid PyTorch model file.

I understand that all of this is quite vague, but maybe it sounds familiar to somebody, or someone can give us a hint where to look next.
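
(In case it helps with debugging: a sketch, with placeholder paths, that prints the TF variable names next to the PyTorch parameter names; naming mismatches between the two are usually where a silent conversion failure hides.)

import tensorflow as tf
from transformers import BertConfig, BertModel

# What the TF checkpoint actually contains ...
tf_names = [name for name, shape in tf.train.list_variables("model.ckpt")]

# ... versus what the PyTorch model expects.
config = BertConfig.from_json_file("bert_config.json")
pt_names = BertModel(config).state_dict().keys()

print("TF variables:")
for n in sorted(tf_names):
    print(" ", n)

print("PyTorch parameters:")
for n in sorted(pt_names):
    print(" ", n)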

This is how the effect manifests:

outputs = torch_bert_model(**example_input)
print(outputs[0])  # the last hidden state; irrespective of the input it is always this:

tensor([[[ 1.5048e-04, -4.1375e-06,  1.4493e-04,  ...,  2.1820e-04,
          -6.6411e-05,  2.1333e-04],
         [ 1.5048e-04, -4.1375e-06,  1.4493e-04,  ...,  2.1820e-04,
          -6.6411e-05,  2.1333e-04],
         [ 1.5048e-04, -4.1375e-06,  1.4493e-04,  ...,  2.1820e-04,
          -6.6411e-05,  2.1333e-04],
         ...,
         [ 1.5048e-04, -4.1375e-06,  1.4493e-04,  ...,  2.1820e-04,
          -6.6411e-05,  2.1333e-04],
         [ 1.5048e-04, -4.1375e-06,  1.4493e-04,  ...,  2.1820e-04,
          -6.6411e-05,  2.1333e-04],
         [ 1.5048e-04, -4.1375e-06,  1.4493e-04,  ...,  2.1820e-04,
          -6.6411e-05,  2.1333e-04]]], grad_fn=<NativeLayerNormBackward>)

And the attention weights (in the input below, token ID 4 = CLS and 5 = SEP): they differ between inputs, but within any given input every head and every position gets the identical value. Note that 0.0909 ≈ 1/11, i.e. uniform attention over the 11 input tokens.

tensor([[   4,   13,    8, 6060,    5,   13, 2840,  350,    8, 6060,    5]])
(tensor([[[[0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          ...,
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909]],

         [[0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          ...,
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909]],

         [[0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          ...,
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909]],

         ...,

         [[0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          ...,
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909]],

         [[0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          ...,
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909]],

         [[0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          ...,
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909]]]],
       grad_fn=<SoftmaxBackward>), tensor([[[[0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
          [0.0909, 0.0909, 0.0909,  ..., 0.0909, 0.0909, 0.0909],
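
(For completeness, the same inspection can be reproduced roughly like this with a recent transformers version; the model path is a placeholder and our real tokenizer is a custom one:)

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("path/to/converted-model")  # placeholder
model = BertModel.from_pretrained("path/to/converted-model")              # placeholder
model.eval()

example_input = tokenizer("an example sentence from our domain", return_tensors="pt")
with torch.no_grad():
    outputs = model(**example_input, output_attentions=True)

print(outputs.last_hidden_state)    # identical rows for every token (see above)
print(outputs.attentions[0][0, 0])  # layer 0, head 0: uniform 1/sequence_length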

Thanks a lot for any help! Much appreciated.

Just in case someone runs into the same issue: it was a problem with the naming of the layers. Our naming came from an NVIDIA TF package and differed from the standard naming. We did the mapping ourselves, and now the model works and produces the same output as the TF model for identical input. The script was still useful to see how the conversion is done in principle.
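
(A sketch of the kind of manual mapping we mean; the paths and the TF names on the left are only illustrative placeholders. Take the real names from tf.train.list_variables on your own checkpoint, and note that dense kernels have to be transposed for torch.nn.Linear:)

import tensorflow as tf
import torch
from transformers import BertConfig, BertForPreTraining

# Hypothetical name map: non-standard TF variable name -> PyTorch parameter name.
NAME_MAP = {
    "model/embeddings/word_embeddings": "bert.embeddings.word_embeddings.weight",
    "model/encoder/layer_0/attention/query/kernel": "bert.encoder.layer.0.attention.self.query.weight",
    # ... one entry per checkpoint variable ...
}

config = BertConfig.from_json_file("bert_config.json")   # placeholder path
model = BertForPreTraining(config)
state_dict = model.state_dict()

reader = tf.train.load_checkpoint("model.ckpt")           # placeholder path
for tf_name, pt_name in NAME_MAP.items():
    array = reader.get_tensor(tf_name)
    # TF stores dense kernels as (in, out); torch.nn.Linear weights are (out, in).
    if tf_name.endswith("kernel"):
        array = array.T
    state_dict[pt_name] = torch.from_numpy(array)

model.load_state_dict(state_dict)
torch.save(model.state_dict(), "pytorch_model.bin")       # placeholder path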

Hi Taber,

Do you know if the same script can be used for a RoBERTa model? I tried it, but it doesn't seem to work for RoBERTa. My understanding is that BERT and RoBERTa are very similar apart from the token type vocabulary, hyperparameters, etc., so ideally the same code should work for converting a RoBERTa TF checkpoint to PyTorch. I am now looking into this in detail to see whether anything has to be changed in the original script.

I would appreciate it if you have tried this and have some insights. I also tried just loading the TF checkpoint directly, but that throws an error as well: Error while converting a RoBERTa TF checkpoint to Pytorch · Issue #12798 · huggingface/transformers · GitHub

Hi Leena,

Sorry, I can't help; the script has not been maintained on our side and we haven't worked with it for almost a year.

Taber

No problem. Thank you!

I plan on reading the script in detail and probably customizing it as needed.