Downloading data from GitHub LFS is slow

I download my data from Hugging Face with DownloadManager, and it is very fast.

from datasets.download.download_manager import DownloadManager
dl_manager = DownloadManager()
dl_manager.download("https://huggingface.co/datasets/ljw20180420/CRISPR_data/resolve/main/dataset.json.gz")

However, when I download the same file from GitHub LFS, it is very slow.

downloaded_files = dl_manager.download("https://github.com/ljw20180420/CRISPRdata/raw/refs/heads/main/dataset.json.gz")

Does anyone know the reason? Thank you.


I found the reason. I had only set the proxy for huggingface_hub, like this:

from huggingface_hub import configure_http_backend
import requests

url="socks5h://127.0.0.1:1080"
# Create a factory function that returns a Session with configured proxies
def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies.update({
        "http": url,
        "https": url
    })
    return session
# Set it as the default session factory
configure_http_backend(backend_factory=backend_factory)

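This only affects requests made through huggingface_hub, not those made by datasets, so the GitHub download was bypassing the proxy. The proxy also has to be set on the DownloadManager itself, through a DownloadConfig: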

from datasets.download.download_manager import DownloadManager, DownloadConfig

url = "https://github.com/ljw20180420/CRISPRdata/raw/refs/heads/main/dataset.json.gz"
proxies = {"http": "socks5h://127.0.0.1:1080", "https": "socks5h://127.0.0.1:1080"}
dl_manager = DownloadManager()
df = dl_manager.download(url)  # slow: no proxy configured
dl_manager_proxy = DownloadManager(download_config=DownloadConfig(proxies=proxies))
df_proxy = dl_manager_proxy.download(url)  # fast: goes through the proxy
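
In case it helps anyone else, here is a minimal sketch of keeping the proxy address in one place so the huggingface_hub session and the DownloadManager cannot drift apart. The names PROXY_URL, make_proxies and make_proxied_download_manager are my own and not part of either library; the only library calls are the DownloadManager and DownloadConfig(proxies=...) ones already used above.

```python
from typing import Dict

# Same local SOCKS proxy as in the posts above (adjust to your own setup).
PROXY_URL = "socks5h://127.0.0.1:1080"


def make_proxies(proxy_url: str) -> Dict[str, str]:
    """Return a requests-style mapping that sends http and https through proxy_url."""
    return {"http": proxy_url, "https": proxy_url}


def make_proxied_download_manager(proxy_url: str = PROXY_URL):
    """Build a DownloadManager whose DownloadConfig carries the proxy mapping.

    Hypothetical helper: datasets is imported lazily so that make_proxies
    can be reused without datasets being installed.
    """
    from datasets.download.download_manager import DownloadManager, DownloadConfig

    config = DownloadConfig(proxies=make_proxies(proxy_url))
    return DownloadManager(download_config=config)
```

With this, make_proxied_download_manager().download(...) replaces the dl_manager_proxy lines, and the same make_proxies(PROXY_URL) result can be passed to session.proxies.update(...) inside backend_factory. Note that socks5h:// URLs need the SOCKS extra for requests (pip install "requests[socks]"), and the "h" means hostnames are resolved by the proxy rather than locally.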