load_dataset(): how to skip "Starting new HTTPS connection (1): storage.googleapis.com:443"

Hi,

I’ve created a Datasets script that works only with local files. However, load_dataset() seems to start with

Starting new HTTPS connection (1): storage.googleapis.com:443

Currently, perhaps because of our VPN, I also get the message

WARNING:HF google storage unreachable. Downloading and preparing it from source

What is that connection for, and is it possible to disable it?

Kind regards,

Ramon.

Hi! The datasets library checks the HF Google storage for datasets that have already been processed, so that you don’t have to run the data processing over the raw files yourself, which saves you time (e.g. for the wikipedia dataset).

You can set the library to work offline by setting the environment variable HF_DATASETS_OFFLINE=1


Hi @lhoestq , thanks, exactly what I needed!

Actually, @lhoestq , HF_DATASETS_OFFLINE doesn’t seem to work, despite what the documentation says. I set it like this

os.environ['HF_DATASETS_OFFLINE'] = '1'
ds = datasets.load_dataset(path='../path/to/my_dataset',
                           name='debug',
                           data_dir='/path/to/dataset',
                           cache_dir='./path/to/cache')

and the HTTPS request still happens

2023-03-29 17:25:27,347:DEBUG:Starting new HTTPS connection (1): storage.googleapis.com:443
2023-03-29 17:25:27,519:DEBUG:https://storage.googleapis.com:443 "HEAD /huggingface-nlp/cache/datasets/my_dataset/debug-HASH/1.0.0/dataset_info.json HTTP/1.1" 404 0

This seems related to issue #3447 in huggingface/datasets on GitHub: "HF_DATASETS_OFFLINE=1 didn't stop datasets.builder from downloading".

You should try setting HF_DATASETS_OFFLINE before importing datasets, since the library reads the variable when it is imported.

Or set it at runtime with datasets.config.HF_DATASETS_OFFLINE = True


This worked, thanks @lhoestq . It gives the warning

WARNING: HF google storage unreachable. Downloading and preparing it from source

and avoids checking the online database.