Can't use datasets offline, even if I have uploaded the datasets to .cache dir

I want to use the SST-2 dataset on my school server.
My dataset loading code is: raw_dataset = datasets.load_dataset('glue', 'sst2')

I have uploaded my locally downloaded dataset to the \.cache\huggingface\datasets dir.

I also set os.environ['HF_DATASETS_OFFLINE ']= "1" to keep the program from trying to reach the internet.

But I still got:

ConnectionError: Couldn't reach

Could anyone help me figure this out?

the dataset dir on my server

Seems like you have a trailing space at the end there. Remove it.
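To illustrate why the space matters, here is a minimal sketch. It assumes nothing beyond the standard library, and the helper name keys_with_stray_whitespace is hypothetical. A key with a trailing space is a different environment variable from the one the library reads, so the setting is silently ignored.

```python
import os


def keys_with_stray_whitespace(env):
    """Return the keys that have leading or trailing whitespace."""
    return [key for key in env if key != key.strip()]


# A plain dict stands in for os.environ so the sketch has no side effects.
example_env = {"HF_DATASETS_OFFLINE ": "1"}

# The intended variable is absent: only the misspelled key was set.
intended_is_set = "HF_DATASETS_OFFLINE" in example_env
suspicious_keys = keys_with_stray_whitespace(example_env)
```

Running keys_with_stray_whitespace(os.environ) on the server is one quick way to spot this kind of typo.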

Thanks for pointing that out. But it still doesn't work after I remove the space.

@sgugger @pierric Could you please help me?

More information:

I first downloaded the sst2 dataset on my local Windows computer, then uploaded the datasets folder to the .cache/huggingface/ folder on my Ubuntu server, which cannot connect to the internet.

Is it because of the different OS?


Make sure to have the line os.environ['HF_DATASETS_OFFLINE'] = "1" (with no trailing space in the key) before import datasets in your script running on the Ubuntu server. If this is not enough, you can bypass the checks enforced by load_dataset and directly load the dataset arrow files. To do that, first, get the list of cache files on your local machine:

cache_files = your_dataset.cache_files

Then recompute the paths which these files will have once you upload them to the server. Next, upload the cache files to the server. Finally, in the script running on the server create the datasets from the cache files using Dataset.from_file (one dataset per file; you can concatenate them with datasets.concatenate_datasets if the dataset consists of more than one cache file). However, with this approach, you’ll lose some metadata by default such as .info, so let us know if you need those.
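The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names remap_cache_path and load_from_cache_files are hypothetical, the local machine is assumed to be Windows and the server POSIX (as in this thread), and the root directories are placeholders you would replace with your own.

```python
from pathlib import PurePosixPath, PureWindowsPath


def remap_cache_path(local_path, local_root, server_root):
    """Translate a Windows cache file path under local_root to a POSIX path under server_root."""
    relative = PureWindowsPath(local_path).relative_to(PureWindowsPath(local_root))
    return str(PurePosixPath(server_root).joinpath(*relative.parts))


def load_from_cache_files(paths):
    """Build a single Dataset from one or more Arrow cache files on the server."""
    # Imported lazily so the path helper above can be used without the library.
    from datasets import Dataset, concatenate_datasets

    parts = [Dataset.from_file(path) for path in paths]
    # One dataset per file; concatenate only when there is more than one.
    return parts[0] if len(parts) == 1 else concatenate_datasets(parts)
```

On the local machine, each entry of your_dataset.cache_files is a dict with a "filename" key, so you could pass entry["filename"] to remap_cache_path, upload the files, and then call load_from_cache_files on the server with the remapped paths.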

Hi! Can you double-check that you uploaded your cache directory to the right location? If it's in the right location, your offline machine will use this cache instead of throwing an error.
By default the location is ~/.cache/huggingface/datasets

But if you have uploaded your cache directory somewhere else, you can try to specify your new cache directory with:

raw_dataset = datasets.load_dataset('glue', 'sst2', cache_dir="path/to/.cache/huggingface/datasets")
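Putting the suggestions in this thread together, a minimal sketch for the server-side script might look like the following. The helper names resolve_cache_dir and load_sst2_offline are hypothetical; the sketch assumes the default cache location mentioned above and checks that the directory exists before calling load_dataset, so a wrong upload location fails with a clear message.

```python
import os
from pathlib import Path

# Set this before `import datasets`; note the key has no trailing space.
os.environ["HF_DATASETS_OFFLINE"] = "1"


def resolve_cache_dir(custom_dir=None):
    """Return the cache directory to use, or raise if it does not exist."""
    default_dir = Path.home() / ".cache" / "huggingface" / "datasets"
    cache_dir = Path(custom_dir).expanduser() if custom_dir else default_dir
    if not cache_dir.is_dir():
        raise FileNotFoundError(f"Cache directory not found: {cache_dir}")
    return cache_dir


def load_sst2_offline(custom_dir=None):
    """Load GLUE SST-2 from the local cache only."""
    # Imported after the environment variable is set above.
    import datasets

    cache_dir = resolve_cache_dir(custom_dir)
    return datasets.load_dataset("glue", "sst2", cache_dir=str(cache_dir))
```

Calling load_sst2_offline() uses the default location; pass a path such as "path/to/.cache/huggingface/datasets" if the cache was uploaded elsewhere.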