Downloading a dataset's files locally

Due to proxies and various other restrictions and policies, I cannot download the data through the API like this:

from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")

I had the same problem when downloading pretrained models, but there is an alternative: download the model files and load the model locally, for example:

git lfs install
git clone https://huggingface.co/bert-base-uncased

Then I can use:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("path/to/locally/downloaded/model/files")

Can I download the dataset files directly in a similar fashion and then, for example, use the following? If yes, how?

raw_datasets = load_dataset("path/to/locally/downloaded/dataset/files")

You can use the wget command followed by the file’s URL, which should have the following format: <HUB_REPO_URL>/resolve/main/<FILE_NAME>. If you are unsure about the exact URL, you can just go to the “Files and versions” section and right-click the little arrow next to the file size to select the “Copy link address” option.

For instance, this would be a way to download the MRPC corpus that you mention:

wget https://huggingface.co/datasets/glue/resolve/main/dataset_infos.json
wget https://huggingface.co/datasets/glue/resolve/main/glue.py

Then you can open a Python session and do:

from datasets import load_dataset
mrpc = load_dataset("./glue.py", "mrpc")
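
If you prefer to stay in Python, a similar approach (just a sketch, assuming the huggingface_hub package is installed and reachable through your proxy, and that its API matches your installed version) is to fetch the individual files with hf_hub_download and then point load_dataset at the downloaded script:

from huggingface_hub import hf_hub_download
from datasets import load_dataset

# Fetch the GLUE loading script and its metadata file from the dataset repo;
# each call returns the local path of the cached copy.
script_path = hf_hub_download(repo_id="glue", filename="glue.py", repo_type="dataset")
hf_hub_download(repo_id="glue", filename="dataset_infos.json", repo_type="dataset")

# Point load_dataset at the locally downloaded loading script.
mrpc = load_dataset(script_path, "mrpc")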


@Mapama247

It does not work.

First of all, it does not download the dataset directly. Second, even the above code does not work.

For instance, after downloading xsum.py, I used the following code to try to load the XSUM dataset.

from datasets import load_dataset
raw_datasets = load_dataset("./xsum.py", split="train")

It raises the following error:

FileNotFoundError: Local file data/XSUM-EMNLP18-Summary-Data-Original.tar.gz doesn't exist

I found this line in xsum.py; because _URL_DATA is a relative path, load_dataset looks for the archive as a local file, hence the FileNotFoundError above:

# From https://github.com/EdinburghNLP/XSum/issues/12
_URL_DATA = "data/XSUM-EMNLP18-Summary-Data-Original.tar.gz"

Option 1:

It can download the dataset, but a ReadError occurs while "Generating train split" if I replace the above _URL_DATA line with the following:

_URL_DATA = "http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz"

Option 2:

After adding the SSL workaround below, the original code works as follows.

import ssl

# Fall back to an unverified SSL context so that certificate verification
# (which fails behind some proxies) does not block the download.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python versions do not verify HTTPS certificates by default
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# The xsum dataset is cached in ~/.cache/huggingface/datasets/xsum
from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")
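
If the goal is just to keep a reusable local copy once this cached download has succeeded, one option (a sketch on top of the workaround above; the directory name is only an example) is to persist the processed dataset with save_to_disk and reload it later with load_from_disk:

from datasets import load_dataset, load_from_disk

raw_datasets = load_dataset("xsum", split="train")

# Write the processed dataset to a directory you control...
raw_datasets.save_to_disk("./xsum_train_local")

# ...and reload it later without any network access.
raw_datasets = load_from_disk("./xsum_train_local")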

Anyway, this is not a direct method: the data files are not saved to a local path of my choice, they only end up in the cache.

The direct download method is as follows.

$ wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz --no-check-certificate

However, it is not easy to find such a direct download link for every dataset on the Hugging Face Hub.
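
For completeness, here is a hedged sketch of a fully local workflow that combines the pieces above: download xsum.py via the Hub URL pattern from the first reply, download the archive with the wget command shown, edit _URL_DATA in the local xsum.py so it points at the downloaded archive, and then load the script locally. This assumes the loading script accepts a local path for the archive and that the download completed without corruption:

# Assumes you have already run, in the same directory:
#   wget https://huggingface.co/datasets/xsum/resolve/main/xsum.py
#   wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz --no-check-certificate
# and edited xsum.py so that
#   _URL_DATA = "./XSUM-EMNLP18-Summary-Data-Original.tar.gz"
from datasets import load_dataset

# Load XSUM entirely from local files (the local script plus the local archive).
raw_datasets = load_dataset("./xsum.py", split="train")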