Which URLs should be reachable to work with the Huggingface Hub?

Hi, I am configuring a server that should be able to reach the Huggingface Hub. In particular, I would like to be able to use the Datasets library to download public datasets, as well as retrieve pretrained models and tokenizers.

I need to specify the URLs I will have to reach so that they can be whitelisted in our proxies. Is there a list somewhere? If I try to download a dataset without an internet connection, I see the process fail while trying to reach https://raw.githubusercontent.com, but I am not sure whether everything is hosted there or whether I should enable more domains.

Oh no, I see that every dataset connects to a custom URL :frowning: I had hoped everything was actually hosted centrally by HuggingFace; that would have made everything so much simpler.

Hi! Indeed, in general we don’t host the data files of research datasets; we simply provide a Python script that downloads the files from their original source and processes them. This is because we’re often not allowed to redistribute the data for licensing/copyright reasons.
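
In practice this means the set of domains to whitelist depends on which datasets you plan to load. As a rough sketch, assuming you have collected the download URLs by reading the loading scripts of those datasets, something like the following could turn them into a list of hostnames for the proxy. The helper name `hosts_to_allowlist` and the example URLs are hypothetical placeholders, not part of the Datasets library.

```python
from urllib.parse import urlparse


def hosts_to_allowlist(urls):
    """Return the sorted, de-duplicated hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        # urlparse(...).hostname is None when the string has no network location
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return sorted(hosts)


# Placeholder URLs for illustration only; replace them with the ones
# found in the loading scripts of the datasets you actually need.
example_urls = [
    "https://raw.githubusercontent.com/some-org/some-repo/main/dataset.py",
    "https://data.example.org/corpus/train.zip",
    "https://data.example.org/corpus/test.zip",
]

print(hosts_to_allowlist(example_urls))
```

The list would need to be updated whenever a new dataset is added, since each script may point to a different host.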