Will LFS-related functionality come to hf_api?

Currently we can only access the LFS list/delete functionality through the web interface, which is very inconvenient when I need to upload and delete files frequently.
Are there any plans to add these LFS management capabilities to the Hugging Face Python API (hf_api)? This would be extremely helpful for users who need to manage large file storage programmatically.

1 Like

I think it would be faster to ask the developer. :sweat_smile: @Wauplin

Thanks for the ping :slight_smile:
@larryvrh what exactly are you trying to achieve? For context, the upload_file/upload_folder/create_commit methods already work correctly with LFS files (i.e. if a file is too large or matches the gitattributes rules, it is automatically uploaded as an LFS pointer). You can also use list_repo_tree to list the files in a repo along with their LFS status (i.e. whether a file is LFS or not, and if so, what its pointer file is). Finally, you can delete files from the repo using delete_file/create_commit, which works seamlessly for both regular and LFS files.
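
For illustration, a minimal sketch of those calls (repo_id is a placeholder; list_repo_tree yields RepoFile entries whose lfs attribute is set when the file is stored via LFS):

from huggingface_hub import HfApi

api = HfApi()
repo_id = "username/my-model"  # placeholder

# Large or binary files are transparently uploaded as LFS pointers.
api.upload_file(path_or_fileobj="model.safetensors",
                path_in_repo="model.safetensors",
                repo_id=repo_id)

# List files together with their LFS status.
for entry in api.list_repo_tree(repo_id, recursive=True):
    lfs = getattr(entry, "lfs", None)  # folders and regular files carry no LFS info
    if lfs is not None:
        print(f"{entry.path}: LFS blob {lfs.sha256} ({lfs.size} bytes)")
    else:
        print(f"{entry.path}: regular file or folder")

# Deleting works the same for regular and LFS files.
api.delete_file(path_in_repo="model.safetensors", repo_id=repo_id)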

In general, the LFS protocol is kinda hidden from the end user when dealing with the HfApi client. HTTP requests are made so that any file type or size works seamlessly. Here is a short explanation about it: Git vs HTTP paradigm.

Let me know if you have any specific questions regarding LFS support in HfApi :hugs:

2 Likes

Thanks Wauplin!

1 Like

Hi Wauplin, thanks for replying! My problem is that the LFS storage isn’t released properly even after we use the high-level API to delete files. For example, I currently store my different checkpoints in different branches of a repo, each created from the initial revision:

import huggingface_hub

# Find the repo's initial (oldest) commit and branch off it.
initial_commit = huggingface_hub.list_repo_commits(repo_id=repo_id, repo_type=repo_type, token=token)[-1]
huggingface_hub.create_branch(repo_id=repo_id,
                              repo_type=repo_type,
                              branch=branch,
                              revision=initial_commit.commit_id,
                              token=token,
                              exist_ok=False)

However, when I want to delete some of the branches with the following code:

api = huggingface_hub.HfApi(token=token)
api.delete_files(repo_id=repo_id, revision=branch, delete_patterns='*')  # delete every file on the branch
api.super_squash_history(repo_id=repo_id, branch=branch)                 # squash the branch history into one commit
api.delete_branch(repo_id=repo_id, branch=branch)                        # remove the branch itself

The branch and files get deleted successfully, and I’m sure those files aren’t referenced from any other branch, but the LFS storage isn’t always released. I’ve observed that the release is sometimes just delayed, but most of the time it isn’t released at all.

1 Like

Ok so if I understand correctly, what you’re trying to achieve is to delete the actual files stored on S3, but deleting all the commits that point to those files doesn’t do it, am I right? Untracked LFS files are indeed garbage collected from time to time, but this is neither instant nor guaranteed. Can you tell us more about why this is a problem on your side, and how you came to realize that some files are garbage collected and others are not? I’d like to better understand your needs in order to point you in the right direction.

1 Like

Yes, this issue centers on S3 storage management. I can monitor which files are being garbage collected by checking the ‘Storage Usage’ section in each repository’s settings page. The problem arises because private storage is now a paid service. While I’m comfortable with paying, I frequently upload and delete temporary checkpoints to Hugging Face, causing my storage usage to increase indefinitely since I lack an effective method to clean up the accumulated storage.

1 Like

Right, I hadn’t spotted this issue indeed. I’ll ask around internally about what can be done in this case. Note that repositories on the Hub are meant to version data and keep their history, and super_squash_history is meant to be a power-user method to reduce the number of commits, not something designed for “deleting previously uploaded data”. If you do not need versioning (i.e. if you do not need past checkpoints to be stored), I’d advise storing checkpoints in a temporary repository and then deleting it once the “final checkpoints” are ready. Instead of:

api.delete_files(repo_id=repo_id, revision=branch, delete_patterns='*')
api.super_squash_history(repo_id=repo_id, branch=branch)
api.delete_branch(repo_id=repo_id, branch=branch)

you could even do something like:

api.delete_repo(repo_id=repo_id)  # delete the repo and all of its storage
api.create_repo(repo_id=repo_id)  # recreate it empty
api.upload_file(...)              # re-upload only what you want to keep

Of course this comes with some drawbacks (the full history is lost, the community tab is lost, links to collections are lost, etc.), but depending on your use case and workflow it can be a good workaround.
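
For illustration, a minimal sketch of that temporary-repository workflow (the repo names and checkpoint paths below are placeholders, not part of the suggestion above):

from huggingface_hub import HfApi

api = HfApi()
tmp_repo = "username/checkpoints-tmp"  # placeholder: scratch repo for intermediate checkpoints
final_repo = "username/model-final"    # placeholder: long-lived repo for final weights

# Push intermediate checkpoints to the scratch repo during training.
api.create_repo(repo_id=tmp_repo, private=True, exist_ok=True)
api.upload_file(path_or_fileobj="ckpt-1000.safetensors",
                path_in_repo="ckpt-1000.safetensors",
                repo_id=tmp_repo)

# Once training is done, keep only the final checkpoint in the long-lived repo...
api.create_repo(repo_id=final_repo, private=True, exist_ok=True)
api.upload_file(path_or_fileobj="ckpt-final.safetensors",
                path_in_repo="model.safetensors",
                repo_id=final_repo)

# ...and drop the scratch repo, which frees its LFS storage.
api.delete_repo(repo_id=tmp_repo)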

2 Likes

To complete my answer above, here is some documentation about how to free up space: Storage limits. There is a UI in the repo settings to manually delete some LFS files.

We will also add support for this method in the Python client in the near future.

1 Like

PR: Support permanently deleting LFS files by Wauplin · Pull Request #2954 · huggingface/huggingface_hub · GitHub. Expect it to land in the next huggingface_hub release.
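
For reference, a rough sketch of what that looks like, assuming the list_lfs_files / permanently_delete_lfs_files method names from that PR (check the release notes for the final signatures):

from huggingface_hub import HfApi

api = HfApi()
repo_id = "username/my-model"  # placeholder

# List the LFS files stored for the repo and permanently delete the
# unwanted ones. Warning: this is irreversible, and any revision still
# pointing to a deleted file will be broken.
lfs_files = api.list_lfs_files(repo_id=repo_id)
to_delete = (f for f in lfs_files if f.filename.endswith(".bin"))
api.permanently_delete_lfs_files(repo_id=repo_id, lfs_files=to_delete)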

2 Likes

Got it, thanks a lot for helping! :+1: :blush:

1 Like

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.