How to sync to the latest version with `snapshot_download` and remove old files

By default, snapshot_download(repo_id=rid, resume_download=True) downloads a whole snapshot of a repo’s files at its latest version.

I was wondering: if there are new commits after the first execution, how can I sync the downloaded files to the latest state and remove the obsolete files (to save disk space)?

It seems that executing the function again syncs the files to the latest state (it adds new files to blobs/ and updates the reference links in snapshots/<commit-id> to point to the new files). However, the old files remain (the obsolete files in blobs are kept).

I could not find a relevant parameter in snapshot_download’s documentation.

That’s a great point! Under the hood, huggingface_hub utilizes Git for data fetching. Are you attempting to run a git clean -df command to remove old untracked files? cc @Wauplin, who maintains the library, in case they can provide additional insight.

Thanks for the ping @radames :slight_smile:

Actually no, huggingface_hub doesn’t use git under the hood, except when you use the legacy class Repository, which is not the case here. The main download methods hf_hub_download (single file) and snapshot_download (entire repo) are HTTP-based. Here is a short explanation of the difference between the git-based and HTTP-based approaches.

Regarding the initial question about how the cache works, here is a guide that explains the cache directory structure, how it gets updated, how to scan what’s inside and, finally, how to clean outdated files. I think it’s a great read for understanding caching in the HF ecosystem. Of course, the cache uses the git hash/revision, but it is not related to or compatible with git commands.
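As a rough illustration of the scanning step, the sketch below lists every cached revision with the refs pointing at it. It assumes the `scan_cache_dir` function and the `repos`, `revisions`, `refs`, `commit_hash`, `size_on_disk` and `last_modified` attributes exposed by huggingface_hub; the helper names `describe_cache` and `print_cache_report` are hypothetical and only used for this example.

```python
def describe_cache(repos):
    """Return one line per cached revision: repo id, short hash, refs, size."""
    lines = []
    for repo in sorted(repos, key=lambda r: r.repo_id):
        # Oldest revision first, so the most recently modified one comes last.
        for rev in sorted(repo.revisions, key=lambda r: r.last_modified):
            refs = ",".join(sorted(rev.refs)) or "(detached)"
            size_mb = rev.size_on_disk / 1e6
            lines.append(f"{repo.repo_id} {rev.commit_hash[:8]} {refs} {size_mb:.1f}MB")
    return lines


def print_cache_report():
    # Imported lazily so describe_cache can be tried without the library installed.
    from huggingface_hub import scan_cache_dir

    for line in describe_cache(scan_cache_dir().repos):
        print(line)
```

Calling `print_cache_report()` only reads the cache; nothing is deleted. Revisions shown as `(detached)` are no longer referenced by any branch or tag in the local cache, which usually makes them the first candidates for cleanup.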

However, the old files remain (the obsolete files in blobs are kept).

This is intentional, in case users want to have different versions of a model in their cache. We believe that it’s better to let the users clean their cache themselves rather than deleting a blob file, potentially leading to a re-download.

Hope this will help you :slight_smile:


Thank you for the info. :smiley: @radames,

@Wauplin
Thank you for the update.
I have been through the points you mentioned.

Considering that one may use Python (without git) for the whole process, from preparing the datasets to deploying the project, providing an optional parameter in the function would ease some needs around managing dataset storage, for instance whether or not to delete obsolete files in blobs to reduce disk usage.

Presumably, anyone who sets this optional parameter knows that it leads to deletions.

Currently, to my understanding, one would need to do it manually (cross-check the links in snapshots against the files in blobs) or write a (bash) script to do that.

Currently, to my understanding, one would need to do it manually (cross-check the links in snapshots against the files in blobs) or write a (bash) script to do that.

Not exactly. When scanning the cache, there is a delete_revisions method to clean the cache given a list of revisions you want to remove. If you want to delete all files except the last modified one for each repo, you can do it like this:

from huggingface_hub import scan_cache_dir


def delete_old_files():
    # Scan cache
    scan = scan_cache_dir()

    # Select revisions to delete
    to_delete = []
    for repo in scan.repos:
        latest_revision = max(repo.revisions, key=lambda x: x.last_modified)
        to_delete.extend([revision.commit_hash for revision in repo.revisions if revision != latest_revision])
    strategy = scan.delete_revisions(*to_delete)

    # Delete them
    print(f"Will delete {len(to_delete)} old revisions and save {strategy.expected_freed_size_str}")
    strategy.execute()

All the logic lies in the # Select revisions to delete section. Here I chose to delete all revisions except the last modified one. But one could also argue that the way to go would be to delete all revisions except the latest revision from the main branch.
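For the alternative selection rule mentioned here, a minimal sketch could keep only the revisions that a chosen set of refs points to. It assumes each cached revision exposes a `refs` set of branch/tag names, as huggingface_hub's cache scan does; the function names and the `keep_refs` and `dry_run` parameters are hypothetical.

```python
def select_unreferenced(repos, keep_refs=("main",)):
    """Return commit hashes of revisions that no ref in keep_refs points to."""
    keep = set(keep_refs)
    to_delete = []
    for repo in repos:
        for revision in repo.revisions:
            # revision.refs holds the branch/tag names pointing at this commit.
            if keep.isdisjoint(revision.refs):
                to_delete.append(revision.commit_hash)
    return to_delete


def delete_all_but_refs(keep_refs=("main",), dry_run=True):
    # Imported lazily so the selection logic can be tried without the library.
    from huggingface_hub import scan_cache_dir

    scan = scan_cache_dir()
    to_delete = select_unreferenced(scan.repos, keep_refs)
    strategy = scan.delete_revisions(*to_delete)
    print(f"{len(to_delete)} revisions selected, {strategy.expected_freed_size_str} to free")
    if not dry_run:
        strategy.execute()
```

Note that with the default `keep_refs`, a repo that was only ever downloaded at a tag or a specific commit would lose all of its cached revisions, so running with `dry_run=True` first is a sensible precaution. The refs in the cache also reflect the last time each branch was fetched, not necessarily the current state on the Hub.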

The reason why I’m reluctant to implement a new parameter in snapshot_download is that we would have to define its exact expected behavior, and that might change depending on the user. For example, if I download a model for revision tag 2.0, does it mean that I want to delete the 1.0 revision tag? Or that I want to delete revisions that are not referenced by a specific ref (main, 1.0, 2.0, another_branch,…)? Or, as another example, if a user downloads a previous revision of the repo, should we delete the latest revision from the main branch or not? These seem like corner cases, but they are questions we need to answer.

To avoid debates and unexpected behaviors, I would prefer to enhance the delete-cache CLI and the delete_revisions helper rather than mixing it with the download methods. An issue has been created for it but, TBH, it has not been prioritized yet. We might reevaluate this in the future :slight_smile:
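Until then, the behavior requested at the top of the thread can be approximated in user code by chaining the two steps for a single repo. The sketch below is one possible policy, not the library's: it keeps only the revision that was just downloaded and deletes every other cached revision of that repo. It assumes the default cache layout (no `local_dir`), where the folder returned by `snapshot_download` is named after the commit hash; `stale_revisions` and `sync_and_prune` are hypothetical names.

```python
import os


def stale_revisions(repos, repo_id, keep_commit, repo_type="model"):
    """Commit hashes cached for repo_id other than keep_commit."""
    return [
        rev.commit_hash
        for repo in repos
        if repo.repo_id == repo_id and repo.repo_type == repo_type
        for rev in repo.revisions
        if rev.commit_hash != keep_commit
    ]


def sync_and_prune(repo_id, repo_type="model"):
    # Imported lazily so stale_revisions can be tried without the library.
    from huggingface_hub import scan_cache_dir, snapshot_download

    # Assumption: in the default cache layout, the returned snapshot folder
    # is named after the commit hash that was downloaded.
    snapshot_path = snapshot_download(repo_id=repo_id, repo_type=repo_type)
    keep_commit = os.path.basename(os.path.normpath(snapshot_path))

    scan = scan_cache_dir()
    hashes = stale_revisions(scan.repos, repo_id, keep_commit, repo_type)
    scan.delete_revisions(*hashes).execute()
    return snapshot_path
```

This deliberately sidesteps the corner cases discussed above by making the choice explicit in the caller's code: anyone who also wants to keep a tagged revision would need to adjust `stale_revisions` accordingly.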


Thank you for the explanation.