Downloading dataset files locally

Because of proxies and various other restrictions and policies, I cannot download the data through the API, for example:

from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")

I had the same problem when downloading pretrained models, but there is an alternative: download the model files and load the model locally. For example:

git lfs install
git clone https://huggingface.co/bert-base-uncased

Then I can use

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("path/to/locally/downloaded/model/files")

Can I download the dataset files directly in a similar fashion and then load them like this? If yes, how?

raw_datasets = load_dataset("path/to/locally/downloaded/dataset/files")

You can use the wget command followed by the file’s URL, which should have the following format: <HUB_REPO_URL>/resolve/main/<FILE_NAME>. If you are unsure about the exact URL, you can just go to the “Files and versions” section and right-click the little arrow next to the file size to select the “Copy link address” option.

For instance, this would be a way to download the MRPC corpus that you mention:

wget https://huggingface.co/datasets/glue/resolve/main/dataset_infos.json
wget https://huggingface.co/datasets/glue/resolve/main/glue.py
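
If wget is not available, the same downloads can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names resolve_url and download are hypothetical, and it assumes the URL pattern described above (<HUB_REPO_URL>/resolve/main/<FILE_NAME>) and that any proxy is configured through the usual http_proxy / https_proxy environment variables.

```python
import shutil
import urllib.request
from pathlib import Path


def resolve_url(repo_url, file_name, revision="main"):
    """Build <HUB_REPO_URL>/resolve/<revision>/<FILE_NAME> (hypothetical helper)."""
    return f"{repo_url.rstrip('/')}/resolve/{revision}/{file_name}"


def download(repo_url, file_name, dest_dir="."):
    """Fetch one file into dest_dir and return the local path (hypothetical helper)."""
    target = Path(dest_dir) / Path(file_name).name
    # urlopen reads proxy settings from the http_proxy / https_proxy
    # environment variables when they are set.
    with urllib.request.urlopen(resolve_url(repo_url, file_name)) as response:
        with open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

Calling download("https://huggingface.co/datasets/glue", "glue.py") would then mirror the second wget command above.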

Then you can start Python and run:

from datasets import load_dataset
mrpc = load_dataset("./glue.py", "mrpc")
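
If the data files themselves are already on disk, a related option is to point load_dataset at them through its data_files argument instead of a loading script. The sketch below is only an illustration under assumptions: the files are assumed to be CSV and named after their splits (train.csv, validation.csv, and so on), and the helper names collect_data_files and load_local_csv are hypothetical.

```python
from pathlib import Path


def collect_data_files(directory, suffix=".csv"):
    """Map split names (file stems) to local paths, e.g. train.csv -> "train"."""
    return {p.stem: str(p) for p in sorted(Path(directory).glob(f"*{suffix}"))}


def load_local_csv(directory):
    """Load every CSV file in `directory` as one split each (hypothetical helper)."""
    # Imported lazily so collect_data_files works without the library installed.
    from datasets import load_dataset

    data_files = collect_data_files(directory)
    # "csv" selects the generic CSV builder; data_files points it at local paths.
    return load_dataset("csv", data_files=data_files)
```

With a folder containing train.csv and validation.csv, load_local_csv("path/to/folder") should return a dataset with "train" and "validation" splits.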
