Currently we can only access the LFS list/delete functionality through the web interface, which is very inconvenient when I need to upload and delete files frequently.
Are there any plans to add these LFS management capabilities to the Hugging Face Python API (hf_api)? This would be extremely helpful for users who need to programmatically manage large file storage.
I think it would be faster to ask the developer. @Wauplin
Thanks for the ping
@larryvrh what are you exactly trying to achieve? For context, the `upload_file`/`upload_folder`/`create_commit` methods already work correctly with LFS files (i.e. if a file is too large or matches the gitattributes rules, it will automatically be uploaded as an LFS pointer). You can also use `list_repo_tree` to list files from the repo along with their LFS status (i.e. whether the file is LFS or not, and if so, what its pointer file is). Finally, you can delete files from the repo using `delete_file`/`create_commit`, which works seamlessly for both regular and LFS files.
In general, the LFS protocol is mostly hidden from the end user when dealing with the `HfApi` client. HTTP requests are made to work seamlessly with any type or size of file. Here is a short explanation about it: Git vs HTTP paradigm.
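For instance, listing files with their LFS status and then deleting one could look roughly like this (the repo id and file name are placeholders, and the `lfs` attribute layout follows recent `huggingface_hub` releases):

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "user/my-model"  # placeholder repo id

# List every entry in the repo; LFS files expose pointer metadata via `.lfs`.
for entry in api.list_repo_tree(repo_id, recursive=True):
    lfs = getattr(entry, "lfs", None)  # RepoFolder entries have no `lfs` attribute
    if lfs is not None:
        print(f"{entry.path}: LFS file, {lfs.size} bytes (sha256={lfs.sha256})")
    else:
        print(f"{entry.path}: regular entry")

# Deleting works the same way for regular and LFS files.
api.delete_file(path_in_repo="pytorch_model.bin", repo_id=repo_id)
```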
Let me know if you have any specific questions regarding LFS support in `HfApi`.
Thanks Wauplin!
Hi Wauplin, thanks for replying! My problem is that the LFS storage won't release properly even after we use the high-level API to delete files. For example, I currently store different checkpoints in different branches of a repo, each created from the initial revision:
```python
import huggingface_hub

huggingface_hub.create_branch(
    repo_id=repo_id,
    repo_type=repo_type,
    branch=branch,
    revision=huggingface_hub.list_repo_commits(repo_id=repo_id, repo_type=repo_type, token=token)[-1].commit_id,
    token=token,
    exist_ok=False,
)
```
However, when I want to delete some of the branches with the following code:
```python
api.delete_files(repo_id=repo_id, revision=branch, delete_patterns='*')
api.super_squash_history(repo_id=repo_id, branch=branch)
api.delete_branch(repo_id=repo_id, branch=branch)
```
The branch and files get successfully deleted, and I'm sure those files aren't referenced from any other branch, but the LFS storage won't always release. I've observed that the release is sometimes delayed, but most of the time it simply never happens at all.
OK, so if I understand correctly, what you're trying to achieve is to delete the actual files stored on S3, but that doesn't happen when you delete all the commits pointing to those files, am I right? Untracked LFS files are indeed garbage-collected from time to time, but it is neither instant nor guaranteed. Can you tell us more about why this is a problem on your side, and how you came to realize that some files are garbage-collected and others are not? I'd like to better understand your needs in order to point you in the right direction.
Yes, this issue centers on S3 storage management. I can monitor which files are being garbage-collected by checking the "Storage Usage" section in each repository's settings page. The problem arises because private storage is now a paid service. While I'm comfortable with paying, I frequently upload and delete temporary checkpoints on Hugging Face, so my storage usage grows indefinitely since I lack an effective way to clean up the accumulated storage.
Right, I hadn't spotted this issue indeed. I'll ask around internally what can be done in this case. Note that repositories on the Hub are meant to version data and keep its history, and `super_squash_history` is meant to be a power-user method to reduce the number of commits, not a way to delete previously uploaded data. If you do not need versioning (i.e. if you do not need past checkpoints to be stored), I would advise storing checkpoints in a temporary repository and deleting it once the "final checkpoints" are ready. Instead of
```python
api.delete_files(repo_id=repo_id, revision=branch, delete_patterns='*')
api.super_squash_history(repo_id=repo_id, branch=branch)
api.delete_branch(repo_id=repo_id, branch=branch)
```
you could even do something like
```python
api.delete_repo(repo_id=repo_id)
api.create_repo(repo_id=repo_id)
api.upload_file(...)
```
Of course, this comes with some drawbacks (the full history is lost, the Community tab is lost, links to collections are lost, etc.), but depending on your use case and workflow it can be a good workaround.
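As a minimal sketch of that temporary-repository workflow (the repo ids and local paths below are placeholders, not prescribed names):

```python
from huggingface_hub import HfApi

api = HfApi()
tmp_repo = "user/checkpoints-tmp"  # throwaway repo, placeholder id
final_repo = "user/model-final"    # long-lived repo, placeholder id

# During training: push intermediate checkpoints to the throwaway repo.
api.create_repo(tmp_repo, private=True, exist_ok=True)
api.upload_folder(repo_id=tmp_repo, folder_path="./checkpoints/step-1000")

# Once training is done: upload the final weights to the long-lived repo...
api.create_repo(final_repo, private=True, exist_ok=True)
api.upload_folder(repo_id=final_repo, folder_path="./checkpoints/final")

# ...then drop the throwaway repo entirely, which frees its LFS storage.
api.delete_repo(tmp_repo)
```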
To complete my answer above, here is some documentation about how to free up space: Storage limits. There is a UI in the repo settings to manually delete some LFS files. We will also add support for this in the Python client in the near future.
PR: Support permanently deleting LFS files by Wauplin · Pull Request #2954 · huggingface/huggingface_hub · GitHub. Expect it to land in the next huggingface_hub release.
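Once that release is out, usage should look roughly like this (method names are taken from the linked PR; double-check the final signatures in the release notes):

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "user/my-model"  # placeholder repo id

# List all LFS files stored for the repo, including unreferenced ones...
lfs_files = api.list_lfs_files(repo_id)

# ...and permanently delete the ones matching some criterion.
# WARNING: this is irreversible and breaks any commit still pointing to them.
to_delete = (f for f in lfs_files if f.filename.endswith(".bin"))
api.permanently_delete_lfs_files(repo_id, to_delete)
```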
Got it, thanks a lot for helping!