[Announcement] Model Versioning: Upcoming changes to the model hub

Update: migration is now completed.

TL;DR: early next week, we will migrate the models stored on the huggingface.co model hub. Accessing models from the library will be transparent and backward-compatible; however, the process to upload models is going to change. Please share your feedback!

We host more and more of the community’s models which is awesome :heart:. To scale this sharing, we need to change the infra to both support more models, and unlock new powerful features.

To that effect, we have rebuilt our storage backend for models, moving from plain S3 to our own git repos (with S3 as a git-lfs endpoint for large files), with one model = one repo.

The benefits of this switch are:

  • built-in versioning (I mean… it’s git. It’s pretty much what you use for versioning. Versioning in S3 has a ton of limitations)
  • access control (will unlock private models, private datasets, etc)
  • scalability (our usage of S3 to maintain lists of models was starting to bottleneck)

Let’s dive into the actual changes:

I. On the website

You’ll now see a “Browse files and versions” tab or button on each model page. (design is not final, we’ll make it more prominent/streamlined in the near future)

This is what this page looks like:

Here’s a link to check it out directly in a staging env: https://moon-preprod.huggingface.co/julien-c/EsperBERTo-small/tree/main (disabled now that migration is completed)

The UX should look familiar and self-explanatory, but we’ll add more ML-specific features in the future (what cool feature ideas do you have for version control for Machine learning :exploding_head:?)

You can:

  • see commit histories and diffs of changes made to any text file, like config.json:
    • changes made by the HuggingFace team will be much clearer, and we can perform updates to the models to ensure they work well with the library(ies) (you’ll be able to opt out of those changes)
  • store large binary files using https://git-lfs.github.com/, which is pretty standard now and interoperable out of the box with git
  • update your text files, like your README.md model card, directly on the website!
    • with instant preview :fire:

II. In the transformers library

We are soliciting feedback on the PR to enable this new storage mode in the transformers library: https://github.com/huggingface/transformers/pull/8324

This PR has two parts:

1. changes to the file downloading code used in from_pretrained() methods to use the new file URLs.
Large files are stored in an S3 bucket and served by Cloudfront so downloads should be as fast as they are right now.

In addition, you now have a way to pin a model to a specific version: a commit hash, tag, or branch.

For instance:

tokenizer = AutoTokenizer.from_pretrained(
  "julien-c/EsperBERTo-small",
  revision="v2.0.1" # tag name, or branch name, or commit hash
)

Finally, the networking code is more robust and doesn’t gobble up errors anymore, so in case you have trouble downloading a specific file you’ll know exactly why.

2. changes to the model upload CLI: you now create a model repo, then git clone and git push to it.
We are intentionally not wrapping git too much, because we expect most model authors to be familiar with git (and possibly git-lfs); let us know if that’s not the case.

To create a repo:

transformers-cli repo create your-model-name

Then you’ll get a repo url that you’ll be able to clone:

git clone https://huggingface.co/username/your-model-name

# Then commit and push as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
git push

A nice side effect of the new system on the upload side is that file uploading should be more robust for very large files (hello T5!) as git-lfs handles the networking code.
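Under the hood, git-lfs decides which files to store as LFS objects via patterns in a plain .gitattributes file (repos created through the CLI should already come with one; the `*.bin` pattern below is just an illustration). The line written by `git lfs track` looks like this:

```shell
cd your-model-name
# Equivalent to running `git lfs track "*.bin"`: files matching the pattern
# are committed as small LFS pointers, and their real contents are uploaded
# by git-lfs during `git push`
echo '*.bin filter=lfs diff=lfs merge=lfs -text' >> .gitattributes
git add .gitattributes
```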

By the way, again, every model is its own repo. So you can git clone any public model if you’d like:

git clone https://huggingface.co/gpt2

But you won’t be able to push unless it’s one of your models (or one of your orgs’).
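And once cloned, inspecting a model’s history is just regular git; for instance (output will obviously differ per model):

```shell
cd gpt2
# List the most recent commits, one line each
git log --oneline -5
# Show every change ever made to config.json, as diffs
git log -p -- config.json
```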

Again, please review this PR if possible :pray:: https://github.com/huggingface/transformers/pull/8324

III. Backward compatibility

We intend to merge the PR in transformers next Tuesday morning (November 10). :scream:

  • Backward compatibility on model downloads is expected, because even though the new models will be stored in huggingface.co-hosted git repos, we will backport all file changes to S3 automatically.
  • :warning: Model uploads using the current system won’t work anymore: you’ll need to upgrade your transformers installation to the next release, v3.5.0, or to build from master.
    Alternatively, in the next week or so we’ll add the ability to create a repo from the website directly so you’ll be able to push even without the transformers library.

Please let us know of your feedback! We are super excited about this change, because it’s going to unlock really powerful features in the future.


Awesome new feature!

Can’t wait to test it; versioning is really great, especially for fine-tuned models (that can be improved over time).

I would love to see an example of how a “tagged” version can be used. E.g. how can a “v2” tag of a model be used in Transformers then - with something like:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("julien-c/EsperBERTo-small@v2") # specify commit/tag

I really like the versioning concept. But how do you “sync” changes between the model card README.md and a specific tagged version of the model?

E.g. I would normally open a PR for a model card README.md in the Transformers library. Later, I would update the model to version 2, tag the old model as v1, and update the model card in the Transformers library for version 2. How can I then switch back, UX-wise, to the version 1 model card that belongs to the model tagged v1? :thinking:

It would be awesome to have a kind of version switcher (for tags) in a more prominent way :hugs:


Good point about an example of the syntax.

With the changes in the PR, you can do:

tokenizer = AutoTokenizer.from_pretrained(
  "julien-c/EsperBERTo-small",
  revision="v2" # tag name, or branch name, or commit hash
)

On the model card question: you can push your model card to the model repo itself (that actually is not new; we’ve supported loading model cards from S3 for a while, even if most users still push to the transformers repo).

See this model for instance: https://huggingface.co/Helsinki-NLP/opus-mt-en-de/blob/main/README.md

A specific version of the README.md then lives on a branch (you can push to a branch named v1 or fine-tune-experiment or whatever), or is pinned at a specific tag (v1.0.1). Does this solve the use case you’re describing?
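For the record, that flow in plain git commands might look like this (the repo name and tag names are just placeholders):

```shell
cd your-model-name

# Freeze the current state (weights + README.md) under a tag
git tag v1
git push origin v1

# Later: update the model and its card on main, then tag again
git add . && git commit -m "v2: new fine-tuning run"
git tag v2
git push origin main v2

# Read the model card exactly as it was at v1, without checking it out
git show v1:README.md
```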


This looks great!

This last comment made it clearer to me.
So basically a model is a repo and a version is a branch.
I guess by default everything is on the “main” branch.

So:

  • a new model requires a new repo (are they auto-created?)
  • then we just push files to a specific branch

A few questions:

  • Who is the repo owner?
  • How is the repo named? Just the model name?

Thanks for the clarification :hugs:


@stefan-it @boris I added examples of code and commands directly to the initial post, let me know if it’s clearer.

In short:

  • you need to create a repo, then clone it
  • the repo’s ids are the same as today, i.e. user_name/model_name or org/model_name depending on whether you pass an organization flag. Access control rights are the same as now.

This is great! As said by the others, this is an excellent addition for keeping track of model versions, particularly when you are fine-tuning your own. Now we can use commit messages to keep track of all of a model’s changes, too.

Thinking about how this will work once people start using it, I’d like to raise two possible concerns. It seems that you can clone other people’s models because they are all public anyway. That is great, but possible concerns are:

  • Will the branching be visually clear? In other words, will it be clear to users which is the original repo?
  • In the search results, the original repo (or most downloaded repo?) should be shown at the top in case of forks, or at least there should be some clarity here. (Imagine that everyone clones bert-base to play with it or to finetune it, and suddenly there are hundreds of bert-bases.)

Because of this possible increase in duplication, it might also be fruitful to bring back the discussion about required fields in model cards. I don’t know if that has been addressed yet, but perhaps this is a good time to introduce required fields like license and training data used. I’m aware that that’s not what this PR is about, but it seems related enough.

This sounds and looks great! Definitely looking forward to private and versioned models.

Curious why you chose to go with git-lfs over dvc? And, will this change mean git-lfs is an added requirement for publishing and downloading models?

@setu4993 There will be no additional requirement for downloading models, it’s still downloading individual files like it’s working now. To publish models, yes – git-lfs is now a requirement (hopefully very widespread and easy to install)

Regarding DVC: we did investigate dvc a little bit, and discussed with some in the community. To my understanding, dvc has several “layers”, with the versioning layer being pretty close to (a slightly more customizable) LFS in terms of features/concepts. The implementation of a hosting server for the file versioning part of dvc would probably be very similar to that of lfs.

For the v1 we figured it was simpler to only do LFS, but we could support DVC-enabled “repos” at some point. What do you think?

Looks really good! I’m excited to try it out.

As someone who uses more models than he shares (I’ve uploaded… none :pensive:), one of the things I’ve been thinking about is how I could discover the “best” or “most appropriate” model from the growing family of models available via Hugging Face. To be honest, I tend to fall back onto BERT very often and don’t feel I have a great way to discover other models that may be better for my needs. I’m sure there are things I have missed!

Would be awesome to have a way to search & compare models against each other. E.g. all models trained on the same dataset, or all models trained on the same downstream task - sorted by a metric that I pick. Just an idea for the future :rocket:


Worked like a charm!

Just uploaded a new model with this new approach:

Awesome new feature :heart: Thanks for implementing it :hugs:


Yes! Everything looks pretty straightforward! Congrats @julien-c and team!!


Thanks for the clarification on requirements, makes sense.

I haven’t used git-lfs directly before, but I do use dvc actively. I definitely see what you mean by versioning being only one part of dvc. I’ve come to appreciate versioning with it a lot more than I thought I would, as well as the simplicity of using it with S3 / external data stores, which would be the case here as well. On the negative side, it wouldn’t have as much visibility into the files that are already uploaded or their sizes (as in the EsperBERTo example image above).

Good to know it is released, looking forward to trying it out soon :slight_smile:!


I am trying to push from Colab but getting this error:

fatal: could not read Username for ‘https://huggingface.co’: No such device or address

any ideas?

Yes: looks like a limitation of Colab. You can add your token when cloning, like this: git clone https://:$TOKEN@huggingface.co/user/model (or update your remote to this)
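If the repo is already cloned, updating the remote would look something like this (user/model and the token are placeholders):

```shell
# Swap the remote URL for one that embeds your token
git remote set-url origin "https://user:$TOKEN@huggingface.co/user/model"
# Verify the new URL
git remote -v
```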

Let me know if this works!


I am getting an error like this:

remote: Unauthorized
fatal: Authentication failed for ‘https://:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX@huggingface.co/manishiitg/longformer-recruit-qa-v2/

I have replaced the token with “XXX”.

Can you try adding your username in front of the :? Otherwise you can also use username:password instead of the token.

Let me know if this works


OK, it finally worked :slight_smile: