Can't use datasets offline, even though I have uploaded the datasets to the .cache dir

I want to use the SST-2 dataset on my school server.
My dataset loading code is: raw_dataset = datasets.load_dataset('glue', 'sst2')

I have uploaded my locally downloaded dataset to the ~/.cache/huggingface/datasets dir.

I also use os.environ['HF_DATASETS_OFFLINE '] = "1" to force the program not to try to reach the internet.

But I still got:

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.13.0/datasets/glue/glue.py

Could anyone help me to figure it out?

the dataset dir on my server

Seems like you have a trailing space at the end there. Remove it.

Thanks for pointing that out. But it still doesn't work after I removed the space.

@sgugger @pierric Could you please help me?

More information:

I first downloaded the sst2 dataset on my local Windows computer, then I uploaded the datasets folder to the .cache/huggingface/ folder on my Ubuntu server, which is not able to connect to the internet.

Is it because of the different OS?

Hi,

make sure to have the line os.environ['HF_DATASETS_OFFLINE'] = "1" before import datasets in your script running on the Ubuntu server. If this is not enough, you can bypass the checks enforced by load_dataset and directly load the dataset's Arrow files. To do that, first get the list of cache files on your local machine:

cache_files = your_dataset.cache_files

Then recompute the paths these files will have once you upload them to the server. Next, upload the cache files to the server. Finally, in the script running on the server, create the datasets from the cache files using Dataset.from_file (one dataset per file; you can concatenate them with datasets.concatenate_datasets if the dataset consists of more than one cache file). However, with this approach you'll lose some metadata by default, such as .info, so let us know if you need it.
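
For example, here is a minimal sketch of that last step (the Arrow file paths below are placeholders for wherever the cache files end up on the server):

from datasets import Dataset, concatenate_datasets

# Placeholder paths of the uploaded Arrow cache files on the server
train_files = [
    "/path/on/server/glue-train-00000-of-00002.arrow",
    "/path/on/server/glue-train-00001-of-00002.arrow",
]

# One Dataset per Arrow file...
parts = [Dataset.from_file(f) for f in train_files]
# ...then concatenate them if the split is stored in several files
train_dataset = concatenate_datasets(parts)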

Hi ! Can you double check that you uploaded your cache directory to the right location ? If it's in the right location, your offline machine will use this cache instead of throwing an error.
By default the location is ~/.cache/huggingface/datasets

But if you have uploaded your cache directory somewhere else, you can try to specify the new cache directory with

raw_dataset = datasets.load_dataset('glue', 'sst2', cache_dir="path/to/.cache/huggingface/datasets")
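
Alternatively, you should be able to point the library at the new location with the HF_DATASETS_CACHE environment variable, set before datasets is imported (the path below is a placeholder):

import os

# Placeholder path: wherever the uploaded cache directory lives on the server
os.environ["HF_DATASETS_CACHE"] = "path/to/.cache/huggingface/datasets"

import datasets

raw_dataset = datasets.load_dataset('glue', 'sst2')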

Hi !

I have the exact same problem with datasets==2.4.0.
The output of ls ~/.cache/huggingface/datasets is

downloads
_home_zramzi_.cache_huggingface_datasets_huggan___parquet_huggan--pokemon-fd0f3e14764c2001_0.0.0_2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete.lock
_home_zramzi_.cache_huggingface_datasets_huggan___parquet_huggan--pokemon-fd0f3e14764c2001_0.0.0_2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.lock
huggan___parquet

But then, when I use load_dataset offline (setting the environment variable with the Jupyter notebook magic, for example):

%env HF_DATASETS_OFFLINE=1

from datasets import load_dataset

dataset = load_dataset(
    "huggan/pokemon",
    None,
    cache_dir=None,
    use_auth_token=None,
    split="train",
)

I get the error:

ConnectionError                           Traceback (most recent call last)
Cell In [4], line 1
----> 1 dataset = load_dataset(
      2     "huggan/pokemon",
      3     None,
      4     cache_dir=None,
      5     use_auth_token=None,
      6     split="train",
      7 )

File ~/workspace/diffusion-function-measures/venv/lib/python3.9/site-packages/datasets/load.py:1723, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1720 ignore_verifications = ignore_verifications or save_infos
   1722 # Create a dataset builder
-> 1723 builder_instance = load_dataset_builder(
   1724     path=path,
   1725     name=name,
   1726     data_dir=data_dir,
   1727     data_files=data_files,
   1728     cache_dir=cache_dir,
   1729     features=features,
   1730     download_config=download_config,
   1731     download_mode=download_mode,
   1732     revision=revision,
   1733     use_auth_token=use_auth_token,
...
   1244             f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
   1245             f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1246         ) from None

ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'huggan/pokemon': Offline mode is enabled.

EDIT

Upon further research, I think this is linked to Datasets created with `push_to_hub` can't be accessed in offline mode · Issue #3547 · huggingface/datasets · GitHub

Hello,

I have the same problem.
I download a dataset from the Hugging Face Hub with load_dataset, then save it on my local machine with save_to_disk. After that, I transfer the saved folder to an Ubuntu server and load the dataset with load_from_disk. But when reading the data, I get a No such file or directory error, and I found that the path being read still points to the data on my local machine.
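
Roughly, this is what I do (the dataset name and paths are just placeholders):

# On my local machine (with internet access)
from datasets import load_dataset

ds = load_dataset("some_image_dataset")  # placeholder name
ds.save_to_disk("saved_dataset")         # this folder is then copied to the server

# On the Ubuntu server (offline)
from datasets import load_from_disk

ds = load_from_disk("saved_dataset")
print(ds["train"][0])                    # fails with "No such file or directory"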

Could you please help me correct this?

Hi ! There’s an open PR that improves save_to_disk by embedding the files inside the Arrow data instead of storing links to local paths: #5268

In the meantime you can check the documentation of save_to_disk if you want to replace the image or audio links with the actual data: Main classes
For example, for images you need to run:

# Replace each image's local file reference with the raw bytes of the file
def read_image_file(example):
    with open(example["image"].filename, "rb") as f:
        return {"image": {"bytes": f.read()}}

ds = ds.map(read_image_file)
ds.save_to_disk("path/to/dataset/dir")
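
Then on the offline machine the saved dataset can be loaded with load_from_disk (same placeholder path as above), and the images are decoded from the embedded bytes instead of from local files:

from datasets import load_from_disk

ds = load_from_disk("path/to/dataset/dir")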