Update: migration is now completed.
TL;DR: early next week, we will migrate the models stored on the huggingface.co model hub. Accessing models from the library will be transparent and backward-compatible; however, the process to upload models is going to be different. Please share your feedback!
We host more and more of the community’s models, which is awesome. To scale this sharing, we need to change the infra to both support more models and unlock powerful new features.
To that effect, we have rebuilt the storage backend that we use for models, moving from plain S3 (the current setup) to our own git repos (using S3 as a git-lfs endpoint for large files), with one model = one repo.
The benefits of this switch are:
- built-in versioning (I mean… it’s git. It’s pretty much what you use for versioning. Versioning in S3 has a ton of limitations)
- access control (will unlock private models, private datasets, etc)
- scalability (our usage of S3 to maintain lists of models was starting to bottleneck)
Let’s dive into the actual changes:
I. On the website
You’ll now see a “Browse files and versions” tab or button on each model page. (design is not final, we’ll make it more prominent/streamlined in the near future)
This is what this page looks like:
Here’s a link to check it out directly in a staging env: https://moon-preprod.huggingface.co/julien-c/EsperBERTo-small/tree/main
(disabled now that migration is completed)
The UX should look familiar and self-explanatory, but we’ll add more ML-specific features in the future (what cool feature ideas do you have for version control for machine learning?)
You can:
- see commit histories and diffs of changes made to any text file, like config.json:
- changes made by the HuggingFace team will be way clearer – we can perform updates to the models to ensure they work well with the library(ies) (you’ll be able to opt out from those changes)
- Large binary files are stored using https://git-lfs.github.com/ which is pretty standard now, and interoperable out of the box with git
- Ability to update your text files, like your README.md model card, directly on the website!
- with instant preview
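For readers unfamiliar with git-lfs: in the git repo itself, a large file is replaced by a small text "pointer" that records the hash and size of the real content, which lives on the LFS endpoint. The sketch below parses such a pointer in Python. The helper name `parse_lfs_pointer` is hypothetical, and the oid and size are placeholder values, not taken from any real model.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a git-lfs pointer file (spec v1)."""
    # Each line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    # The oid is written as "<hash algorithm>:<hex digest>".
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


# Placeholder pointer contents, for illustration only.
example_pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
"""

pointer = parse_lfs_pointer(example_pointer)
print(pointer["hash_algo"], pointer["size"])
```

This is why diffs and histories stay light for text files like config.json, while weights are fetched separately only when needed.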
II. In the transformers library
We are soliciting feedback on the PR to enable this new storage mode in the transformers library: https://github.com/huggingface/transformers/pull/8324
This PR has two parts:
1. changes to the file downloading code used in from_pretrained() methods to use the new file URLs.
Large files are stored in an S3 bucket and served by Cloudfront so downloads should be as fast as they are right now.
In addition, you now have a way to pin a specific version of a model, to a commit hash, tag or branch.
For instance:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "julien-c/EsperBERTo-small",
    revision="v2.0.1",  # tag name, or branch name, or commit hash
)
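To make the idea of a pinned revision more concrete, here is a rough sketch of how a per-file URL could be assembled from a model id, a revision and a filename. The helper name `build_file_url` and the `resolve` path layout are assumptions for illustration; the PR is the authoritative source for what the library actually does.

```python
from urllib.parse import quote

HUB_ENDPOINT = "https://huggingface.co"  # assumed base URL


def build_file_url(model_id: str, filename: str, revision: str = "main") -> str:
    """Hypothetical helper: URL of one file of a model repo at a given revision.

    `revision` can be a branch name, a tag name, or a commit hash.
    """
    return (
        f"{HUB_ENDPOINT}/{quote(model_id)}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


print(build_file_url("julien-c/EsperBERTo-small", "config.json", revision="v2.0.1"))
```

Because the revision is part of the URL, pinning to a tag or commit hash means a later push to the main branch does not change what you download.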
Finally, the networking code is more robust and doesn’t gobble up errors anymore, so in case you have trouble downloading a specific file you’ll know exactly why.
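As an illustration of what "not gobbling up errors" can look like, here is a minimal standard-library sketch that re-raises download failures with the URL and the reason attached. It is not the code from the PR; `DownloadError` and `fetch` are hypothetical names.

```python
import urllib.error
import urllib.request


class DownloadError(RuntimeError):
    """Hypothetical exception carrying the URL and the reason a download failed."""


def fetch(url: str, opener=urllib.request.urlopen) -> bytes:
    """Return the body at `url`, or raise DownloadError with a precise message."""
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404, 401, 500...).
        # HTTPError is a subclass of URLError, so it must be caught first.
        raise DownloadError(f"{url}: HTTP {err.code} {err.reason}") from err
    except urllib.error.URLError as err:
        # No usable answer at all (DNS failure, refused connection...).
        raise DownloadError(f"{url}: {err.reason}") from err
```

The `opener` parameter only exists so the function can be exercised without network access.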
2. changes to the model upload CLI to create a model repo then be able to git clone and git push to it.
We are intentionally not wrapping git too much because we expect most model authors to be familiar with git (and possibly git-lfs). Let us know if that is not the case.
To create a repo:
transformers-cli repo create your-model-name
Then you’ll get a repo url that you’ll be able to clone:
git clone https://huggingface.co/username/your-model-name
# Then commit as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
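If you would rather script these steps from Python (for example at the end of a training run), a thin wrapper around plain git is enough. The sketch below is only one way to do it: the function names are hypothetical, it assumes git (and git-lfs for large files) is installed, and it deliberately does nothing beyond the clone / add / commit / push commands described in this post.

```python
import subprocess

HUB_URL = "https://huggingface.co"


def clone_command(username: str, model_name: str) -> list:
    """argv for cloning a model repo created with `transformers-cli repo create`."""
    return ["git", "clone", f"{HUB_URL}/{username}/{model_name}"]


def publish_commands(message: str) -> list:
    """argv lists that stage everything, commit, and push from inside the repo."""
    return [
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def run(commands: list, cwd: str = ".") -> None:
    """Run each command in `cwd`, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)


# Usage sketch (placeholder names, not executed here):
#   run([clone_command("username", "your-model-name")])
#   ... save your model and tokenizer files into ./your-model-name ...
#   run(publish_commands("Add model files"), cwd="your-model-name")
```

Passing `check=True` makes a failed git command raise immediately instead of being silently ignored.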
A nice side effect of the new system on the upload side is that file uploading should be more robust for very large files (hello T5!) as git-lfs handles the networking code.
By the way, again, every model is its own repo. So you can git clone any public model if you’d like:
git clone https://huggingface.co/gpt2
But you won’t be able to push unless it’s one of your models (or one of your orgs’).
Again, please review this PR if possible: https://github.com/huggingface/transformers/pull/8324
III. Backward compatibility
We intend to merge the PR in transformers next Tuesday morning (November 10).
- Backward compatibility on model downloads is expected, because even though the new models will be stored in huggingface.co-hosted git repos, we will backport all file changes to S3 automatically.
- Model uploads using the current system won’t work anymore: you’ll need to upgrade your transformers installation to the next release, v3.5.0, or build from master.
Alternatively, in the next week or so we’ll add the ability to create a repo from the website directly so you’ll be able to push even without the transformers library.
Please let us know what you think! We are super excited about this change, because it’s going to unlock really powerful features in the future.