I first downloaded the sst2 dataset on my local Windows computer, then uploaded the datasets folder to the .cache/huggingface/ folder on my Ubuntu server, which cannot connect to the internet.
Make sure to have the line os.environ['HF_DATASETS_OFFLINE'] = "1" before import datasets in the script running on the Ubuntu server. If that is not enough, you can bypass the checks enforced by load_dataset and load the dataset's Arrow files directly. To do that, first get the list of cache files on your local machine:
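A minimal sketch of the offline setup described above. The key detail is ordering: the environment variable has to be set before datasets is imported, because the flag is read at import time.

```python
import os

# Must be set BEFORE `import datasets`: the library reads this
# flag when it is imported, so setting it later has no effect.
os.environ["HF_DATASETS_OFFLINE"] = "1"

# import datasets  # now runs in offline mode: no network calls,
# load_dataset() resolves everything from the local cache
```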
cache_files = your_dataset.cache_files
Then recompute the paths these files will have once you upload them to the server. Next, upload the cache files to the server. Finally, in the script running on the server, create the datasets from the cache files using Dataset.from_file (one dataset per file; if the dataset consists of more than one cache file, you can concatenate them with datasets.concatenate_datasets). Note that with this approach you lose some metadata by default, such as .info, so let us know if you need it.
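The steps above can be sketched as follows. The concrete paths are assumptions for illustration only; substitute your own local cache location and the server's home directory. The path-remapping step is shown executable, while the library calls are shown as comments since they only make sense on the respective machines:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Step 1 (local Windows machine): list the Arrow cache files.
#   cache_files = your_dataset.cache_files
#   # -> e.g. [{'filename': 'C:\\Users\\me\\.cache\\huggingface\\datasets\\...\\sst2-train.arrow'}]

# Step 2: recompute the path each file will have on the server
# (both roots below are placeholder paths, not the real ones).
local_file = PureWindowsPath(r"C:\Users\me\.cache\huggingface\datasets\sst2\default\0.0.0\sst2-train.arrow")
local_root = PureWindowsPath(r"C:\Users\me\.cache\huggingface\datasets")
server_root = PurePosixPath("/home/me/.cache/huggingface/datasets")
server_path = server_root.joinpath(*local_file.relative_to(local_root).parts)
print(server_path)  # /home/me/.cache/huggingface/datasets/sst2/default/0.0.0/sst2-train.arrow

# Step 3 (on the server): rebuild the dataset from the uploaded files.
#   from datasets import Dataset, concatenate_datasets
#   parts = [Dataset.from_file(str(p)) for p in server_paths]
#   ds = concatenate_datasets(parts)  # only needed for multiple cache files
```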
Hi! Can you double-check that you uploaded your cache directory to the right location? If it is in the right location, the offline machine will use this cache instead of throwing an error.
By default the location is ~/.cache/huggingface/datasets
But if you have uploaded your cache directory somewhere else, you can try to point the library at the new location instead.
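A hedged sketch of pointing the library at a non-default cache directory, either globally via the HF_DATASETS_CACHE environment variable (set before importing datasets) or per call via the cache_dir argument of load_dataset. The path used here is a placeholder:

```python
import os

# Placeholder path: replace with wherever you uploaded the cache.
# Like HF_DATASETS_OFFLINE, set this before `import datasets`.
os.environ["HF_DATASETS_CACHE"] = "/data/hf_cache/datasets"

# Equivalently, on a per-call basis:
#   from datasets import load_dataset
#   ds = load_dataset("sst2", cache_dir="/data/hf_cache/datasets")
print(os.environ["HF_DATASETS_CACHE"])
```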