Git clone often gives 403 or 429 errors

We download data from Hugging Face in our CI (GitHub Actions) like this:

+ git lfs install
Updated Git hooks.
Git LFS initialized.
+ git clone https://huggingface.co/datasets/certik/fastGPT
Cloning into 'fastGPT'...
fatal: unable to access 'https://huggingface.co/datasets/certik/fastGPT/': The requested URL returned error: 403

Over the last month we have frequently encountered “403 Forbidden” and “429 Too Many Requests” errors, and we get similar errors when using curl over HTTPS. We do not want to overload your servers, so I have a few questions to figure out how to move forward:

  1. Is it OK to use Hugging Face in our CI to download data?
  2. If so, what is the rate limit?
  3. How should we handle the 403/429 errors: should we retry with exponential backoff (delays of 1, 2, 4, 8, … seconds)?
  4. Should we use some authentication?
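
The retry scheme from question 3 could look roughly like the sketch below. It is only an illustration: the function names `backoff_delay` and `retry_with_backoff` and the `MAX_ATTEMPTS` and `SLEEP_CMD` variables are hypothetical, and nothing here says what rate limit Hugging Face actually applies or whether retrying is the recommended response.

```shell
# Hypothetical helpers: retry a command with exponential backoff.
# MAX_ATTEMPTS and SLEEP_CMD are illustrative names, not git or curl settings.

# Print the delay in seconds before retry number $1 (1-based): 1, 2, 4, 8, ...
backoff_delay() {
    echo $(( 1 << ($1 - 1) ))
}

# Run "$@" until it succeeds or MAX_ATTEMPTS (default 5) attempts have failed.
retry_with_backoff() {
    local max_attempts="${MAX_ATTEMPTS:-5}"
    local sleep_cmd="${SLEEP_CMD:-sleep}"
    local attempt=1
    local delay
    while true; do
        # Success: stop retrying immediately.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        delay="$(backoff_delay "$attempt")"
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        "$sleep_cmd" "$delay"
        attempt=$(( attempt + 1 ))
    done
}

# Example usage (commented out so that sourcing this file has no side effects):
# retry_with_backoff git clone https://huggingface.co/datasets/certik/fastGPT
```

With the default of 5 attempts this waits 1 + 2 + 4 + 8 = 15 seconds in total before giving up, so it only helps if the errors clear up quickly.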

Thanks for maintaining this service, it has been very useful to us.

Common causes are SSL configuration errors and token configuration errors. Also, on Windows, git and git-lfs tend to be more stable after reinstalling both with their installers, since the default versions are rather old.

Thanks @John6666. Note that the repository I am trying to download is public. I nevertheless created a Hugging Face token, and I can download the file using curl and the token like this:

curl -f -L -H "Authorization: Bearer ${HUGGINGFACE_TOKEN}" -o model.dat https://huggingface.co/datasets/certik/fastGPT/resolve/main/model_fastgpt_124M_v1.dat

I tested that the download fails if I use an invalid token, so I think it works when I do it by hand in a terminal.

However, to use this in our CI, I first tried creating a secret variable on GitHub, but it is not exposed to PRs from forks. I could publish the token I created, but that does not seem like good practice even if the token has no permissions, since it is still tied to my account and there might be other consequences. Also, from Hugging Face’s perspective: if the token granted special privileges, such as not getting download errors, then anyone could use a publicly exposed token, which would defeat the measure. So that does not seem like a solution.
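
One possible middle ground, sketched below, is to send the token only when it is available and fall back to an anonymous request otherwise. The function name `hf_download` is hypothetical; `HUGGINGFACE_TOKEN` is the same variable as in the curl command above. Whether authenticated requests are actually treated differently by the rate limiter is an assumption I have not verified.

```shell
# Hypothetical helper: download URL $1 to file $2, adding the Authorization
# header only when HUGGINGFACE_TOKEN is set and non-empty.
hf_download() {
    local url="$1"
    local out="$2"
    if [ -n "${HUGGINGFACE_TOKEN:-}" ]; then
        # Token available (e.g. a CI run on the main repository).
        curl -f -L -H "Authorization: Bearer ${HUGGINGFACE_TOKEN}" -o "$out" "$url"
    else
        # No token (e.g. a PR from a fork): anonymous request.
        curl -f -L -o "$out" "$url"
    fi
}

# Example usage (commented out so that sourcing this file has no side effects):
# hf_download https://huggingface.co/datasets/certik/fastGPT/resolve/main/model_fastgpt_124M_v1.dat model.dat
```

In GitHub Actions a secret that is not available to a fork PR expands to an empty string, so the same step could run in both cases. PRs from forks would still download anonymously, so this does not remove the errors for them.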

Hmm… huggingface-cli login?

The huggingface-cli still requires a token, so in our CI the token would have to be public, which I think is not a good idea.

However, I found this page that shows how to download files: Download files from the Hub. Maybe it doesn’t fail the way a direct curl invocation does.

We also tried the following curl options: --retry-all-errors --retry 10 --retry-delay 20, but unfortunately that still sometimes fails. Here is an example:

That is over 3 minutes of retrying every 20 seconds without success, so I think this is not a transient error but more likely a rate limit. I therefore think we should not be downloading from Hugging Face using curl in our CI.

For a moment I thought it might be a Windows git problem, but if it’s happening with curl, it’s a different error that’s unrelated…

Hmm, I wonder if something is being blocked in the connection path…

For Windows users