We download data from Hugging Face in our CI (GitHub Actions) like this:
```
+ git lfs install
Updated Git hooks.
Git LFS initialized.
+ git clone https://huggingface.co/datasets/certik/fastGPT
Cloning into 'fastGPT'...
fatal: unable to access 'https://huggingface.co/datasets/certik/fastGPT/': The requested URL returned error: 403
```
Over the last month we have quite often been running into “403 Forbidden” and “429 Too Many Requests” errors. We were getting similar errors when using curl over HTTPS. We do not want to overload your servers. I have a few questions to figure out how to move forward:
- Is it OK to download data from Hugging Face in our CI?
- If so, what is the rate limit?
- How should we handle the 403/429 errors? Should we retry with exponential backoff (1, 2, 4, 8, … seconds), along the lines of the sketch below?
- Should we use some form of authentication?
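To illustrate the kind of retry loop I have in mind, here is a minimal sketch (the file name is just a placeholder for whatever we actually download):

```bash
# Retry with exponential backoff: wait 1, 2, 4, 8, ... seconds between attempts.
# -f makes curl return a non-zero exit code on HTTP errors such as 403/429,
# so the loop retries. The file name is a placeholder.
url="https://huggingface.co/datasets/certik/fastGPT/resolve/main/model.dat"
delay=1
for attempt in 1 2 3 4 5 6; do
    if curl -fL "$url" -o model.dat; then
        break
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s..."
    sleep "$delay"
    delay=$((delay * 2))
done
```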
Thanks for maintaining this service; it has been very useful to us.
SSL configuration errors and token configuration errors are common causes. Also, in Windows environments it is often easier to stabilize both git and git-lfs by reinstalling them with the installer; the default versions are fairly old.
Thanks @John6666. Note that the repository I am trying to download is public. I nevertheless created a Hugging Face token, and I can download the data using curl and the token, roughly like this (the file name in the sketch is a placeholder):
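```bash
# Download one file from the public dataset using the token
# (the file name is a placeholder for the actual file we fetch).
curl -fL \
  -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/datasets/certik/fastGPT/resolve/main/model.dat \
  -o model.dat
```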
I verified that the download fails if I use an invalid token, so I think authentication is working when I run it by hand in a terminal.
However, in order to use this in CI, I first tried creating a secret in GitHub, but secrets are not exposed to PRs from forks. I could publish the token I created, but that does not seem like good practice even though the token has no permissions, since it is still tied to my account and there might be other consequences. Also, from Hugging Face’s perspective: if the token carried some special privilege, such as not getting download errors, then exposing it publicly would let anyone use the same token, which would defeat the measure. So that does not seem like a solution.
We also tried the following curl options: --retry-all-errors --retry 10 --retry-delay 20. Roughly, the full invocation is like the following sketch (the file name is again a placeholder):
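```bash
# Same download, but with curl's built-in retries: up to 10 retries, 20 s apart,
# retrying on all errors. With -f, HTTP errors such as 403/429 also count as
# failures to retry. The file name is a placeholder.
curl -fL \
  --retry-all-errors --retry 10 --retry-delay 20 \
  https://huggingface.co/datasets/certik/fastGPT/resolve/main/model.dat \
  -o model.dat
```

Unfortunately that still sometimes fails; here is an example: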
That is over 3 minutes of retrying every 20 seconds, and it still did not succeed. So I do not think this is a transient error; it is likely some rate limit. I think we should not be downloading from Hugging Face using curl in our CI.