Huggingface-cli to load_dataset

I can’t understand how to go from huggingface-cli or git clone to load_dataset() using that cached location. Despite the 3 methods all being listed here: Downloading datasets, they seem incompatible. The only (poor) suggestion I found was to never use the other 2 methods: python - How to load a huggingface dataset from local path? - Stack Overflow

So the idea is that I want to use hf_transfer, since it is about 10x faster for downloads than load_dataset. So I do:

pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --repo-type dataset vietgpt/the_pile_openwebtext2
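
(As a side note, I believe huggingface-cli download prints the local snapshot path when it finishes, so it can be captured in a shell variable; just a sketch, I haven't relied on it:)

# supposedly the CLI prints the downloaded snapshot folder on stdout, so capture it
SNAPSHOT=$(huggingface-cli download --repo-type dataset vietgpt/the_pile_openwebtext2)
echo "$SNAPSHOT"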

I get much faster results than with load_dataset:

num_proc_load_dataset=32
dataset = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=num_proc_load_dataset, trust_remote_code=True)

But then I spent about 2 hours trying to figure out how to load the dang result using load_dataset, playing with cache_dir, data_dir, data_files, the dataset name, etc. I tried linking and copying all sorts of files everywhere, and no luck.

I can see all the files in:

/home/fsuser/.cache/huggingface/hub/datasets--vietgpt--the_pile_openwebtext2/snapshots/1de27c660aefd991700e5c13865a041591394834/data

and they are the same files as in the original repo, as if I had done a git clone into that same hash folder.

But load_dataset seems to go to a totally different location, and on a separate computer where I did use load_dataset, the file names are all different. Very confusing.

I was able to come up with some hacked setup where it stopped complaining about the files, but then it asked for an example of a feature or something like that, which I didn’t have and shouldn’t need.

Generally, suppose one uses huggingface-cli download or git clone: how does one go directly from that to load_dataset? It must be possible, since the downloaded files are exactly the same as in the origin repo.
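
For reference, the kind of thing I hoped would work against the cached snapshot is below (just a sketch; it assumes the files under data/ really are plain parquet shards, and the path is the snapshot folder from my machine above):

from datasets import load_dataset

# sketch: point the generic parquet builder at the data/ folder inside the hub snapshot
snapshot = "/home/fsuser/.cache/huggingface/hub/datasets--vietgpt--the_pile_openwebtext2/snapshots/1de27c660aefd991700e5c13865a041591394834"
dataset = load_dataset("parquet", data_dir=snapshot + "/data", num_proc=32)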

Thanks!
Jon

You need to log in to Hugging Face (Quickstart (huggingface.co)), then load the dataset:
dataset = load_dataset(dataset_name)

I don’t understand why logging in matters, but I tried it and it still re-downloads the data instead of using the result of the huggingface-cli download I already did.

Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from huggingface_hub import login
>>> login()

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /home/fsuser/.cache/huggingface/token
Login successful
>>> num_proc = 32
>>> num_proc_load_dataset = num_proc
>>> from datasets import load_dataset
>>> dataset = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=num_proc_load_dataset, trust_remote_code=True)
Resolving data files: 100%|██████████| 35/35 [00:02<00:00, 16.72it/s]
Downloading data:   3%|▎         | 29.4M/954M [00:01<00:54, 17.0MB/s]
Downloading data:   4%|▍
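
(One knob I have not verified: both libraries have an offline mode that is supposed to prevent any re-downloading and force them to use only what is already cached, e.g.:)

export HF_HUB_OFFLINE=1
export HF_DATASETS_OFFLINE=1
python -c "from datasets import load_dataset; load_dataset('vietgpt/the_pile_openwebtext2', trust_remote_code=True)"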

It seems to be stored in the cache, so you can save it to a folder you can see easily.
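
A minimal sketch of that idea (the target folder is just an example path):

from datasets import load_dataset, load_from_disk

# the first load populates the datasets cache (Arrow files)
ds = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=32, trust_remote_code=True)

# write an explicit copy to a folder of your choice
ds.save_to_disk('/data/openwebtext2')

# later sessions can reload it directly from that folder, without touching the Hub
ds = load_from_disk('/data/openwebtext2')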

I am also confused by this! :sob: I tried to download HuggingFaceH4/ultrachat_200k. When I use the CLI, the raw dataset .parquet files (organized as HuggingFaceH4/ultrachat_200k) are downloaded and saved in ~/.cache/huggingface/hub. But when I use load_dataset, it is saved in ~/.cache/huggingface/datasets and seems to be preprocessed into .arrow files. Here are the steps I followed to load my HuggingFaceH4/ultrachat_200k dataset downloaded with the CLI:

  1. Use the CLI download option --local-dir to save the files (this step can be skipped; calling load_dataset on the files under ~/.cache/huggingface/hub/<your dataset> works too):
huggingface-cli download --resume-download --repo-type dataset HuggingFaceH4/ultrachat_200k --local-dir ./modeling/datasets/HuggingFaceH4/ultrachat_200k
  2. Check the README in the downloaded folder to see how the dataset is organized:

configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  3. Process it!
from datasets import load_dataset, Features, Value

context_feat = Features({
    'prompt': Value(dtype='string', id=None),
    'prompt_id': Value(dtype='string', id=None),
    'messages': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}]
})

url = "./modeling/datasets/HuggingFaceH4/ultrachat_200k/data/"
data_files = {'train_sft': [url + "train_sft-00000-of-00003-a3ecf92756993583.parquet", url + "train_sft-00001-of-00003-0a1804bcb6ae68c6.parquet", url + "train_sft-00002-of-00003-ee46ed25cfae92c6.parquet"],
              'test_sft': url + "test_sft-00000-of-00001-f7dfac4afe5b93f4.parquet",
              'train_gen': [url + "train_gen-00000-of-00003-a6c9fb894be3e50b.parquet", url + "train_gen-00001-of-00003-d6a0402e417f35ca.parquet", url + "train_gen-00002-of-00003-c0db75b92a2f48fd.parquet"],
              'test_gen': url + "test_gen-00000-of-00001-3d4cd8309148a71f.parquet"}
  4. Load the dataset! :star_struck:
raw_datasets = load_dataset("parquet", data_files=data_files, features=context_feat)
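
To double-check what came back, you can print the splits and peek at one row (the field names follow the features declared above):

print(raw_datasets)                               # should show train_sft / test_sft / train_gen / test_gen
print(raw_datasets["train_sft"][0]["prompt"])     # first prompt of the SFT training split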

Maybe some official documentation on load_dataset and on how to handle the different types of data files would be helpful.
Hope that helps! :smiling_face_with_three_hearts:
