I download my data from Hugging Face with DownloadManager, and it is very fast.
from datasets.download.download_manager import DownloadManager
dl_manager = DownloadManager()
dl_manager.download("https://huggingface.co/datasets/ljw20180420/CRISPR_data/resolve/main/dataset.json.gz")
However, when I download the same file from GitHub LFS, it is very slow.
downloaded_files = dl_manager.download("https://github.com/ljw20180420/CRISPRdata/raw/refs/heads/main/dataset.json.gz")
Does anyone know the reason? Thank you.
I found the reason. I had only configured the proxy for huggingface_hub, like this:
from huggingface_hub import configure_http_backend
import requests
url = "socks5h://127.0.0.1:1080"
# Create a factory function that returns a Session with configured proxies
def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies.update({
        "http": url,
        "https": url
    })
    return session
# Set it as the default session factory
configure_http_backend(backend_factory=backend_factory)
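This configures the proxy only for huggingface_hub, not for datasets. The fast download from Hugging Face went through the proxy, while the GitHub download did not. To fix it, I need to pass the proxy to DownloadManager through a DownloadConfig: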
from datasets.download.download_manager import DownloadManager, DownloadConfig
dl_manager = DownloadManager()
df = dl_manager.download("https://github.com/ljw20180420/CRISPRdata/raw/refs/heads/main/dataset.json.gz") # slow
proxies = {"http": "socks5h://127.0.0.1:1080", "https": "socks5h://127.0.0.1:1080"}
dl_manager_proxy = DownloadManager(download_config=DownloadConfig(proxies=proxies))
df_proxy = dl_manager_proxy.download("https://github.com/ljw20180420/CRISPRdata/raw/refs/heads/main/dataset.json.gz") # fast
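To avoid repeating the proxy URL, the setup above could be wrapped in a small helper. This is only a sketch: `build_proxies` and `make_proxy_download_manager` are names I made up for illustration, and it assumes the same local SOCKS5 proxy address as above and that `datasets` is installed.

```python
def build_proxies(proxy_url: str) -> dict[str, str]:
    """Map both URL schemes to the same proxy (hypothetical helper)."""
    return {"http": proxy_url, "https": proxy_url}


def make_proxy_download_manager(proxy_url: str):
    """Return a DownloadManager that sends its requests through proxy_url.

    Hypothetical helper. The import is kept inside the function so that
    build_proxies can be used without datasets installed.
    """
    from datasets import DownloadConfig, DownloadManager

    config = DownloadConfig(proxies=build_proxies(proxy_url))
    return DownloadManager(download_config=config)


# Assumed proxy address, same as in the posts above.
PROXY_URL = "socks5h://127.0.0.1:1080"
proxies = build_proxies(PROXY_URL)
```

Then `make_proxy_download_manager(PROXY_URL).download(...)` can be used for any URL outside Hugging Face, and the same `proxies` dict can be passed to `session.proxies.update(...)` in the huggingface_hub backend factory so both libraries share one setting.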