Hi there - we are using this BERT architecture from Google:
* attention_probs_dropout_prob: 0.1
* hidden_act: "gelu"
* hidden_dropout_prob: 0.1
* hidden_size: 768
* initializer_range: 0.02
* intermediate_size: 3072
* max_position_embeddings: 512
* num_attention_heads: 12
* num_hidden_layers: 12
* type_vocab_size: 2
* vocab_size: 32000
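For reference, the same architecture expressed as a Hugging Face `BertConfig` would look roughly like this (just a sketch on our side; we assume the stock `transformers.BertConfig`, whose field names mirror the JSON above one-to-one):

```python
from transformers import BertConfig

# The bert_config.json above, spelled out explicitly.
# (Sketch only - in practice the conversion script reads the JSON file directly.)
config = BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
)
```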
We trained it from scratch on 10 million documents from our very specific domain and also swapped out the optimizer and the sentence tokenizer. Our BERT now works wonderfully: we evaluate it on masked-token and next-sentence prediction, and we can also fine-tune it for downstream tasks such as classification - all of that works, as long as we stay in the TensorFlow/Nvidia world.
We would, however, love to open up all the possibilities of PyTorch as well, so we applied the following script: "convert_bert_original_tf_checkpoint_to_pytorch.py".
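In case the details matter: as far as we understand it, the script essentially boils down to the following (a sketch with placeholder paths; the exact function signatures depend on the transformers version):

```python
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholder paths - our real checkpoint prefix, config and output file differ.
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)

# Copies the TF checkpoint variables into the freshly initialized PyTorch model;
# this is where unrecognized variable names (e.g. our custom optimizer slots) show up.
load_tf_weights_in_bert(model, config, "model.ckpt")

torch.save(model.state_dict(), "pytorch_model.bin")
```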
This script initially fails because our optimizer variables are not recognized. After adding them to the list of skipped attributes,
if any(n in ["adam_v", "adam_m", "AdamWeightDecayOptimizer", "AdamWeightDecayOptimizer_1", "global_step"] for n in name):
the script runs through fine and we get a working torch model. And here things become very strange. Initially we tried to fine-tune the torch version of our BERT and nothing happened - inspecting the attention values and the outputs for the input tokens, we realized that for any given input sequence ALL attention values are identical (they do change when the input changes). Likewise, the output values in the hidden states of the converted base model are identical for each token, irrespective of the input.
Nevertheless, the conversion appears to succeed in the sense that the result can be loaded and is a fully valid PyTorch model file.
I understand that all of this is quite vague, but maybe it sounds familiar to somebody, or someone can give us a hint on where to look next.
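One concrete check we are considering (a sketch below, assuming the standard tensorflow and transformers APIs; the paths and the inspected variable name are placeholders) is to compare a single weight tensor from the original checkpoint with its counterpart in the converted model, to see whether the encoder weights were actually copied or silently left at their random initialization:

```python
import numpy as np
import tensorflow as tf
from transformers import BertModel

CKPT = "model.ckpt"                 # placeholder: prefix of the original TF checkpoint
PT_DIR = "converted_pytorch_model"  # placeholder: directory with the converted model

# Which variables does the original checkpoint actually contain?
tf_names = [name for name, _ in tf.train.list_variables(CKPT)]
print([n for n in tf_names if "layer_0/attention/self/query" in n])

# Does the converted model carry the same numbers?
# (TF stores dense kernels as [in, out]; the conversion transposes them.)
tf_query = tf.train.load_variable(CKPT, "bert/encoder/layer_0/attention/self/query/kernel")
pt_query = BertModel.from_pretrained(PT_DIR).encoder.layer[0].attention.self.query.weight
print(np.allclose(tf_query.T, pt_query.detach().numpy(), atol=1e-5))
```

If the two tensors do not match, the uniform attention pattern below would at least be consistent with the query/key weights never having been loaded.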
Here is how the effect manifests:
outputs = torch_bert_model(**example_input)
print(outputs[0])  # last hidden state - irrespective of the input, it is always this:
tensor([[[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
...,
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04]]], grad_fn=<NativeLayerNormBackward>)
And the attention heads:
(In the input ids below, 4 = [CLS] and 5 = [SEP].)
The attention weights differ from one input to the next, but for a given input every head holds the identical value at every position - here 0.0909 ≈ 1/11, i.e. uniform attention over the 11 input tokens.
tensor([[ 4, 13, 8, 6060, 5, 13, 2840, 350, 8, 6060, 5]])
(tensor([[[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
...,
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]]]],
grad_fn=<SoftmaxBackward>), tensor([[[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
Thanks a lot for any help! Much appreciated.