Upload a TF model to Hugging Face

Hi,
I am pre-training a BERT model from scratch using TensorFlow.
I’ve seen the method for pushing PyTorch models, but I don’t know how to do it with my TF model.

Here is what I imagine I have to do:
1- Convert my checkpoint from TF to PyTorch
2- Push to HF

Is this correct?

Another question:
I don’t know how to push the tokenizer. All I have is:

  • my vocab.txt
  • tokenizer.vocab
  • tokenizer.model

How can I do this, please?

Thanks

If you pre-trained BERT from scratch in TF using the run_mlm.py script, you can easily convert the model from TF to PyTorch, like so:

from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("name of directory where the run_mlm.py script saved all files", from_tf=True)
model.save_pretrained("name of directory where you'd like to save all model files")

Next, you can easily push it to the hub as follows (I’m assuming you’re in a Colab notebook):

First, install Git LFS and set your Git identity:

!sudo apt-get install git-lfs
!git lfs install
!git config --global user.email "<your email>"
!git config --global user.name "<your name>"

Next, create a repo on the hub, then git clone it:

git clone <URL of your repository on the hub>

Next, copy your model and tokenizer files into the cloned repository, then add and upload them from inside that directory:

git add .
git commit -m "First commit"
git push
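
Note that git add only picks up files that sit inside the cloned repository, so the saved model and tokenizer files have to be copied there first. Here is a minimal sketch of that copy step with a quick sanity check. The helper name stage_files and the list of expected file names are my own assumptions (based on what save_pretrained typically writes for BERT), not something from the script:

```python
import shutil
from pathlib import Path

# Assumed file names: newer transformers versions may write model.safetensors
# instead of pytorch_model.bin, so either one counts as the weights file.
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")
REQUIRED_FILES = ("config.json", "vocab.txt")


def stage_files(export_dir, repo_dir):
    """Copy every file from export_dir into repo_dir.

    Returns (copied, missing): the copied file names and the expected
    file names that were not found in export_dir.
    """
    export_dir, repo_dir = Path(export_dir), Path(repo_dir)
    repo_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for path in sorted(export_dir.iterdir()):
        if path.is_file():
            shutil.copy2(path, repo_dir / path.name)
            copied.append(path.name)

    # Report expected files that are absent so nothing is forgotten before pushing.
    missing = [name for name in REQUIRED_FILES if name not in copied]
    if not any(name in copied for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return copied, missing
```

If missing comes back non-empty (for example because vocab.txt was not copied), fix that before running git add, commit and push.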

Thanks a lot.
Is this enough even for the tokenizer? I heard I also have to provide the tokenizer with the model.

Yes, you should also include the tokenizer files. As these are framework-independent, you can use the ones that were saved by the run_mlm.py script.

Thanks, one more question about it:
How can PyTorch users use my tokenizer with AutoTokenizer?
Is providing my vocab.txt enough, or is tokenizer.model the one they need?

I'm not sure what tokenizer.model is; normally the vocab.txt is enough.
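
To make the vocab.txt usable through AutoTokenizer, one option is to load it once with BertTokenizer and save it again, which also writes a tokenizer_config.json next to the vocabulary. This is only a rough sketch: it assumes vocab.txt is a standard BERT WordPiece vocabulary (one token per line, including [PAD], [UNK], [CLS], [SEP] and [MASK]), and the function name export_tokenizer and both directory arguments are hypothetical. If the tokenizer was actually trained with another tool and tokenizer.model is its output, this would not apply as-is.

```python
from transformers import BertTokenizer


def export_tokenizer(vocab_dir, out_dir):
    """Load a BERT tokenizer from vocab_dir/vocab.txt and re-save it to out_dir."""
    # Assumption: vocab_dir contains a standard WordPiece vocab.txt.
    tokenizer = BertTokenizer.from_pretrained(str(vocab_dir))
    # Writes the tokenizer files (including tokenizer_config.json) to out_dir,
    # ready to be committed alongside the model files.
    tokenizer.save_pretrained(str(out_dir))
    return tokenizer
```

Once the files written to out_dir are pushed together with the model, users should be able to load the tokenizer with AutoTokenizer.from_pretrained("<your repository name>").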

Thanks a lot for your complete answer, Nielsr.