I’ve created a Datasets script that works only with local files. However, load_dataset() seems to start by opening an HTTPS connection:
Starting new HTTPS connection (1): storage.googleapis.com:443
Currently, perhaps because of our VPN, I also get the message:
WARNING:HF google storage unreachable. Downloading and preparing it from source
What is that connection for, and is it possible to disable it?
Hi! The datasets lib checks the HF Google storage for datasets that have already been processed, so that you don’t have to run the data processing over the raw files yourself, which saves you time (e.g. for the
You can set the library to work offline by setting the environment variable HF_DATASETS_OFFLINE=1.
Hi @lhoestq , thanks, exactly what I needed!
Actually, @lhoestq, HF_DATASETS_OFFLINE doesn’t seem to work, despite what the documentation says. I set it up like this:
os.environ['HF_DATASETS_OFFLINE'] = '1'
ds = datasets.load_dataset(path='../path/to/my_dataset',
and the HTTPS request still happens:
2023-03-29 17:25:27,347:DEBUG:Starting new HTTPS connection (1): storage.googleapis.com:443
2023-03-29 17:25:27,519:DEBUG:https://storage.googleapis.com:443 "HEAD /huggingface-nlp/cache/datasets/my_dataset/debug-HASH/1.0.0/dataset_info.json HTTP/1.1" 404 0
You should try setting HF_DATASETS_OFFLINE before importing datasets, since the variable is read when the library is imported.
Or set it at runtime with
datasets.config.HF_DATASETS_OFFLINE = True
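To make the ordering concrete, here is a minimal sketch of both options in one place. The helper name enable_offline_mode is hypothetical (it is not part of the datasets API); the only library-specific pieces are the HF_DATASETS_OFFLINE environment variable and the datasets.config.HF_DATASETS_OFFLINE flag mentioned above. It assumes the variable is only read when datasets is first imported, which matches the behaviour reported in this thread.

```python
import os
import sys


def enable_offline_mode() -> bool:
    """Hypothetical helper: switch the datasets library to offline mode.

    Returns True if the environment variable was set before `datasets`
    was imported, False if the runtime flag had to be patched instead.
    """
    already_imported = "datasets" in sys.modules

    # Option 1: set the environment variable. This only has an effect
    # if it happens before `datasets` is imported for the first time.
    os.environ["HF_DATASETS_OFFLINE"] = "1"

    if already_imported:
        # Option 2: the library is already loaded, so the environment
        # variable has been read already. Patch the config flag instead.
        import datasets

        datasets.config.HF_DATASETS_OFFLINE = True

    return not already_imported


set_before_import = enable_offline_mode()

# Only import the library after the call above, for example:
# import datasets
# ds = datasets.load_dataset(...)
```

Calling the helper at the very top of the script, before any module that itself imports datasets, should keep the first option effective.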
This worked, thanks @lhoestq. It gives the warning
WARNING: HF google storage unreachable. Downloading and preparing it from source
and avoids checking the online database.