Hi,
I’ve created a Datasets script that works only with local files. However, load_dataset()
seems to start by opening a connection:
Starting new HTTPS connection (1): storage.googleapis.com:443
Currently, perhaps because of our VPN, I also get the message
WARNING:HF google storage unreachable. Downloading and preparing it from source
What is that connection for, and is it possible to disable it?
Kind regards,
Ramon.
Hi! The datasets library checks whether an already-processed copy of the dataset exists on the HF Google storage, so that you don’t have to run the data processing over the raw files yourself, which saves you time (e.g. for the wikipedia dataset).
You can set the library to work offline by setting the environment variable HF_DATASETS_OFFLINE=1
Hi @lhoestq , thanks, exactly what I needed!
Actually, @lhoestq, HF_DATASETS_OFFLINE doesn’t seem to work, despite what the documentation says. I set it up like this:
import os
import datasets

os.environ['HF_DATASETS_OFFLINE'] = '1'
ds = datasets.load_dataset(path='../path/to/my_dataset',
                           name='debug',
                           data_dir='/path/to/dataset',
                           cache_dir='./path/to/cache')
and the HTTPS request still happens
2023-03-29 17:25:27,347:DEBUG:Starting new HTTPS connection (1): storage.googleapis.com:443
2023-03-29 17:25:27,519:DEBUG:https://storage.googleapis.com:443 "HEAD /huggingface-nlp/cache/datasets/my_dataset/debug-HASH/1.0.0/dataset_info.json HTTP/1.1" 404 0
You should try setting HF_DATASETS_OFFLINE before importing datasets, or set it at runtime with datasets.config.HF_DATASETS_OFFLINE = True
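To make the ordering concrete, here is a minimal sketch of both options in one place. The helper name enable_offline_mode is hypothetical (it is not part of the datasets library). The code only assumes the two things mentioned in this thread: the HF_DATASETS_OFFLINE environment variable and the datasets.config.HF_DATASETS_OFFLINE flag.

```python
import os
import sys


def enable_offline_mode() -> bool:
    """Hypothetical helper: switch the datasets library to offline mode.

    Returns True if the environment variable was set before `datasets`
    was imported, False if the runtime flag had to be used instead.
    """
    already_imported = "datasets" in sys.modules

    # Environment variable named in this thread; setting it before the
    # first `import datasets` is the approach suggested above.
    os.environ["HF_DATASETS_OFFLINE"] = "1"

    if already_imported:
        # Runtime flag suggested in this thread for the case where the
        # library has already been imported.
        sys.modules["datasets"].config.HF_DATASETS_OFFLINE = True

    return not already_imported


# Call this at the very top of the script, before `import datasets`.
set_before_import = enable_offline_mode()
```

After this runs, `import datasets` followed by the usual load_dataset(...) call on local files should no longer need to be preceded by anything else; the return value just tells you which of the two paths was taken.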
This worked, thanks @lhoestq . It gives the warning
WARNING: HF google storage unreachable. Downloading and preparing it from source
and avoids checking the online database.