How to convert TF checkpoints to sentence embeddings

Hello

I trained a model with the original BERT code (https://github.com/google-research/bert), but I don’t know:

  1. How to convert the checkpoint files into a model? (My training output files: bert_files)

I tried to understand the documentation, but I’m definitely doing something wrong. (I don’t know what to do with the .index and .meta files, why the file model.ckpt-500000.data-00000-of-00001 is much bigger than a typical model.ckpt file for standard BERT, and why I don’t have a model.ckpt file as output.)
https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel

  2. How to convert a whole sentence into an embedding vector? (I know how to load the tokenizer from the vocab.txt file.)
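Roughly what I’m trying to end up with, assuming point 1 works and the converted model lands in a local folder (the folder name and the mean-pooling step are only my guesses):

import torch
from transformers import BertModel, BertTokenizer

# model_dir is a placeholder for the folder holding the converted
# pytorch_model.bin plus config.json and vocab.txt
model_dir = "model_uncased_L-12_H-512_A-12"
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir)
model.eval()

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# mean-pool the token vectors (ignoring padding) into one sentence vector
last_hidden = outputs[0]                         # (batch, seq_len, hidden_size)
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                  # e.g. torch.Size([1, 512]) for H-512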

I’m afraid I’ve misunderstood some general concept behind the Transformers models, and that this is the cause of my problems. Is there an online course for the Transformers library?

Regards, Peter

To convert a model trained with the original repository, you should use the conversion script here: https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py
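The call looks roughly like this (the paths are placeholders for your own files). Note that --tf_checkpoint_path takes the checkpoint prefix, e.g. model.ckpt-500000; TensorFlow then picks up the matching .index, .meta and .data files on its own:

python convert_bert_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=/path/to/model.ckpt-500000 --bert_config_file=/path/to/bert_config.json --pytorch_dump_path=/path/to/pytorch_model.bin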

Thank you. I tried the conversion script, and the result was this error:
" File “C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\training\py_checkpoint_reader.py”, line 44, in error_translator
raise errors_impl.DataLossError(None, None, error_message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file G:\PycharmProject\Ancient_Greek_BERT\model_uncased_L-12_H-512_A-12\model.ckpt-500000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
"
I have no idea how to modify my code, because the error description isn’t clear to me.
Peter

I found a solution :) It’s working.
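In case anyone hits the same DataLossError: "not an sstable (bad magic number)" usually means that --tf_checkpoint_path pointed at the .data-00000-of-00001 shard itself rather than at the checkpoint prefix. A call along these lines should work (the paths are taken from the traceback above; the bert_config.json location is an assumption):

python convert_bert_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=G:\PycharmProject\Ancient_Greek_BERT\model_uncased_L-12_H-512_A-12\model.ckpt-500000 --bert_config_file=G:\PycharmProject\Ancient_Greek_BERT\model_uncased_L-12_H-512_A-12\bert_config.json --pytorch_dump_path=G:\PycharmProject\Ancient_Greek_BERT\model_uncased_L-12_H-512_A-12\pytorch_model.bin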

I’m completely new to Transformers and BERT, and I’m struggling to pull together a functional workflow.

Like the OP, I’m trying to use convert_pytorch_checkpoint_to_tf2.py to convert a pre-trained model from the Google TensorFlow BERT format (trained on a TPU) into the PyTorch version. I want to go on to fine-tune this model.

But, crucially, my pre-trained model is based on a custom vocabulary - I trained BERT from scratch on my own data. It isn’t like any of the Huggingface pre-trained models.

So I’m trying the Huggingface conversion script, as below:

python convert_pytorch_checkpoint_to_tf2.py --model_type=bert --pytorch_checkpoint_path=model.ckpt-20 --config_file=bert_config.json --tf_dump_path=pytorch_model.bin

(The checkpoint is a ‘hello world’ example to help me understand the training workflow).

But I get an error from the conversion script:

File "convert_pytorch_checkpoint_to_tf2.py", line 402, in convert_all_pt_checkpoints_to_tf
config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]
ValueError: not enough values to unpack (expected 5, got 4)

When I put a diagnostic print(MODEL_CLASSES[model_type]) into the script, I see that the model_type I gave ('bert') is being matched against many existing Huggingface pre-trained models. The simple text match finds the string 'bert' in the MODEL_CLASSES list, but it isn’t fully matched against any Huggingface pre-trained model. So now I think I must specify a Huggingface pre-trained model definition - but how is that used? As some sort of template for further training?

I am obviously missing the point here. What I want is to pre-train a model from scratch with TensorFlow on a Google TPU (or the equivalent with PyTorch) and then fine-tune it. Is that possible in the Huggingface ecosystem? Is there an HF script that does the conversion to PyTorch format without the constraint of specifying a pre-trained model? I think I may be at saturation point, continually hitting dependency mismatches and the like as I try to learn to work with BERT.

I’d be very grateful for your advice.