Warm-started encoder-decoder models (Bert2Gpt2 and Bert2Bert)

Hi,

Looking at the files of Ayham/roberta_gpt2_summarization_cnn_dailymail (main branch):

It indeed looks like only the weights (pytorch_model.bin) and model configuration (config.json) are uploaded, but not the tokenizer files.

You can upload the tokenizer files programmatically using the huggingface_hub library. First, make sure you have git-lfs installed and that you are logged into your Hugging Face account. In Colab, this can be done as follows:

!sudo apt-get install git-lfs
!git config --global user.email "your email"
!git config --global user.name "your username"
!huggingface-cli login

Next, you can do the following:

from transformers import RobertaTokenizer
from huggingface_hub import Repository

# clone the existing model repo into a local directory
repo_url = "https://huggingface.co/Ayham/roberta_gpt2_summarization_cnn_dailymail"
repo = Repository(local_dir="tokenizer_files", # note that this directory must not exist already
                  clone_from=repo_url,
                  git_user="Niels Rogge",
                  git_email="niels.rogge1@gmail.com",
                  use_auth_token=True,
)

# save the roberta-base tokenizer files into the cloned repo
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
tokenizer.save_pretrained("tokenizer_files")

# commit and push the tokenizer files to the Hub
repo.push_to_hub(commit_message="Upload tokenizer files")
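
Once the push has finished, you can verify the upload by loading the tokenizer straight from the Hub, for example:

from transformers import RobertaTokenizer

# should now find the tokenizer files in the model repo
tokenizer = RobertaTokenizer.from_pretrained("Ayham/roberta_gpt2_summarization_cnn_dailymail")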

Note that the Trainer can actually push all files (weights, config and tokenizer) to the hub for you automatically during/after training, as seen here.
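
For example, here is a minimal sketch of that setup. It assumes you already have a preprocessed train_dataset for CNN/DailyMail; the warm-starting and token settings are only illustrative:

from transformers import EncoderDecoderModel, RobertaTokenizer, TrainingArguments, Trainer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# warm-start an encoder-decoder model from roberta-base (encoder) and gpt2 (decoder)
model = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "gpt2")
# training an EncoderDecoderModel needs these set; the exact choice depends on your setup
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

training_args = TrainingArguments(
    output_dir="roberta_gpt2_summarization_cnn_dailymail",  # also used as the Hub repo name
    push_to_hub=True,  # upload checkpoints (weights, config, tokenizer files) to the Hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: your preprocessed CNN/DailyMail dataset
    tokenizer=tokenizer,          # passing the tokenizer makes the Trainer upload its files too
)

trainer.train()
trainer.push_to_hub()  # final push of the trained model and tokenizer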
