Can't use datasets offline, even if I have uploaded the datasets to .cache dir

beyond · October 15, 2021, 9:15am

I want to use the SST dataset on my school server.
My dataset loading code is: raw_dataset = datasets.load_dataset('glue', 'sst2')

I have uploaded my locally downloaded dataset to the .cache/huggingface/datasets dir.

I also use os.environ['HF_DATASETS_OFFLINE ']= "1" to force the program not to try to reach the internet.

But I still got:

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.13.0/datasets/glue/glue.py

Could anyone help me to figure it out?

beyond · October 15, 2021, 9:16am

the dataset dir on my server

BramVanroy · October 15, 2021, 9:43am

Seems like you have a trailing space at the end there. Remove it.
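To see why the trailing space matters: a key with a trailing space is a different environment variable, so a lookup of the exact name HF_DATASETS_OFFLINE finds nothing. Here is a minimal sketch using only the standard library. It does not import datasets, and it assumes the library looks the variable up by its exact name.

```python
import os

# Start from a clean state so the demonstration is reproducible.
os.environ.pop("HF_DATASETS_OFFLINE", None)
os.environ.pop("HF_DATASETS_OFFLINE ", None)

# Key WITH a trailing space, as in the original post.
os.environ["HF_DATASETS_OFFLINE "] = "1"
value_seen_with_typo = os.environ.get("HF_DATASETS_OFFLINE")  # exact name is unset

# Key WITHOUT the trailing space.
os.environ.pop("HF_DATASETS_OFFLINE ", None)
os.environ["HF_DATASETS_OFFLINE"] = "1"
value_seen_after_fix = os.environ.get("HF_DATASETS_OFFLINE")

print(repr(value_seen_with_typo), repr(value_seen_after_fix))
```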

beyond · October 19, 2021, 4:12am

Thanks for pointing that out. But it still doesn't work after I remove the space.

beyond · October 19, 2021, 11:29am

@sgugger @pierric Could you please help me?

beyond · October 19, 2021, 11:32am

More information:

I first downloaded the sst2 dataset on my local Windows computer, then I uploaded the datasets folder to the .cache/huggingface/ folder on my Ubuntu server, which cannot connect to the internet.

Is it because of the different OS?

mariosasko · October 19, 2021, 5:50pm

Hi,

make sure to have the line os.environ['HF_DATASETS_OFFLINE'] = "1" (with no trailing space in the key) before import datasets in your script running on the Ubuntu server. If this is not enough, you can bypass the checks enforced by load_dataset and directly load the dataset arrow files. To do that, first, get the list of cache files on your local machine:

cache_files = your_dataset.cache_files

Then recompute the paths which these files will have once you upload them to the server. Next, upload the cache files to the server. Finally, in the script running on the server create the datasets from the cache files using Dataset.from_file (one dataset per file; you can concatenate them with datasets.concatenate_datasets if the dataset consists of more than one cache file). However, with this approach, you’ll lose some metadata by default such as .info, so let us know if you need those.
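The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names remap_cache_path and load_from_cache_files are hypothetical, the local machine is assumed to be Windows and the server POSIX (as in this thread), and the paths in the usage comments are made up. For a single split, each entry of cache_files is assumed to be a dict with a "filename" key.

```python
from pathlib import PurePosixPath, PureWindowsPath


def remap_cache_path(local_file, local_cache_root, server_cache_root):
    """Map a cache file path on a Windows machine to its path on a POSIX server."""
    relative = PureWindowsPath(local_file).relative_to(
        PureWindowsPath(local_cache_root)
    )
    return str(PurePosixPath(server_cache_root).joinpath(*relative.parts))


def load_from_cache_files(server_files):
    """Build one Dataset from uploaded Arrow cache files, bypassing load_dataset."""
    # Imported lazily so the path helper above works without the library installed.
    from datasets import Dataset, concatenate_datasets

    parts = [Dataset.from_file(path) for path in server_files]
    return parts[0] if len(parts) == 1 else concatenate_datasets(parts)


# Hypothetical usage on the local machine (single split):
#   local_files = [f["filename"] for f in your_dataset.cache_files]
#   server_files = [
#       remap_cache_path(p, r"C:\Users\me\.cache\huggingface\datasets",
#                        "/home/me/.cache/huggingface/datasets")
#       for p in local_files
#   ]
# Then, on the server, after uploading the files:
#   dataset = load_from_cache_files(server_files)
```

As noted above, a dataset rebuilt this way does not carry metadata such as .info by default.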

lhoestq · October 21, 2021, 8:56am

Hi ! Can you double check that you uploaded your cache directory in the right location ? If it’s in the right location, your offline machine will use this cache instead of throwing an error.
By default the location is ~/.cache/huggingface/datasets

But if you have uploaded your cache directory to somewhere else, you can try to specify your new cache directory with

raw_dataset = datasets.load_dataset('glue', 'sst2', cache_dir="path/to/.cache/huggingface/datasets")
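Before passing cache_dir, it may help to confirm what is actually present at the expected location on the offline machine. Here is a minimal sketch using only the standard library. The helper name list_cache_entries is hypothetical, and its default path is the default location mentioned above.

```python
from pathlib import Path


def list_cache_entries(cache_dir="~/.cache/huggingface/datasets"):
    """Return the sorted entry names in cache_dir, or [] if it does not exist."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


# Hypothetical usage on the server:
#   print(list_cache_entries())                      # default location
#   print(list_cache_entries("path/to/.cache/huggingface/datasets"))
```

An empty result suggests the cache was uploaded somewhere else, in which case the cache_dir argument shown above is the thing to try.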

zaccharieramzi · September 20, 2022, 2:40pm

Hi !

I have the exact same problem with datasets==2.4.0.
The output of ls ~/.cache/huggingface/datasets is

downloads
_home_zramzi_.cache_huggingface_datasets_huggan___parquet_huggan--pokemon-fd0f3e14764c2001_0.0.0_2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete.lock
_home_zramzi_.cache_huggingface_datasets_huggan___parquet_huggan--pokemon-fd0f3e14764c2001_0.0.0_2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.lock
huggan___parquet

But then, when I call load_dataset offline (with the Jupyter notebook magic, for example):

%env HF_DATASETS_OFFLINE=1

from datasets import load_dataset

dataset = load_dataset(
    "huggan/pokemon",
    None,
    cache_dir=None,
    use_auth_token=None,
    split="train",
)

I get the error:

ConnectionError                           Traceback (most recent call last)
Cell In [4], line 1
----> 1 dataset = load_dataset(
      2     "huggan/pokemon",
      3     None,
      4     cache_dir=None,
      5     use_auth_token=None,
      6     split="train",
      7 )

File ~/workspace/diffusion-function-measures/venv/lib/python3.9/site-packages/datasets/load.py:1723, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1720 ignore_verifications = ignore_verifications or save_infos
   1722 # Create a dataset builder
-> 1723 builder_instance = load_dataset_builder(
   1724     path=path,
   1725     name=name,
   1726     data_dir=data_dir,
   1727     data_files=data_files,
   1728     cache_dir=cache_dir,
   1729     features=features,
   1730     download_config=download_config,
   1731     download_mode=download_mode,
   1732     revision=revision,
   1733     use_auth_token=use_auth_token,
...
   1244             f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
   1245             f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1246         ) from None

ConnectionError: Couln't reach the Hugging Face Hub for dataset 'huggan/pokemon': Offline mode is enabled.

EDIT

Upon further research, I think this is linked with Datasets created with `push_to_hub` can't be accessed in offline mode · Issue #3547 · huggingface/datasets · GitHub

MariosOreo · November 25, 2022, 1:43pm

Hello,

I have the same problem.
I downloaded a dataset from Hugging Face with load_dataset, then saved the cached dataset on my local machine with save_to_disk. After that, I transferred the saved folder to an Ubuntu server and loaded the dataset with load_from_disk. But when reading data, a No such file or directory error occurs. I found that the read path is still the path to the data on my local machine.

Could you please help me correct it?

lhoestq · December 1, 2022, 2:52pm

Hi ! There’s a PR open that improves save_to_disk and embeds the files inside the Arrow data in save_to_disk instead of links to local paths: #5268

In the meantime, you can check the documentation of save_to_disk (under "Main classes") if you want to replace the image or audio links with the actual data.
For example, for images you need to run:

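def read_image_file(example):
    # Replace the link to the local image file with the file's raw bytes.
    with open(example["image"].filename, "rb") as f:
        return {"image": {"bytes": f.read()}}

ds = ds.map(read_image_file)
# Save the dataset with the image bytes embedded instead of local paths.
ds.save_to_disk("path/to/dataset/dir")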
