Huggingface-cli to load_dataset

I can’t figure out how to go from huggingface-cli download (or git clone) to load_dataset() using that cached location. The three download methods are all documented here: Downloading datasets, but they seem incompatible with each other. The only (poor) suggestion I found was to never use the other two methods: python - How to load a huggingface dataset from local path? - Stack Overflow

The idea is that I want to use hf_transfer, since it is about 10x faster for downloads than load_dataset. So I do:

pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --repo-type dataset vietgpt/the_pile_openwebtext2
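
As far as I can tell, the same download can also be driven from Python via snapshot_download, which writes to the same hub cache and returns the local snapshot path (a sketch, assuming hf_transfer is installed and the default cache location):

import os
# Assumption: setting this before importing huggingface_hub enables hf_transfer,
# mirroring the export above.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Downloads into (or reuses) the same cache that huggingface-cli writes to,
# and returns the local snapshot folder.
local_path = snapshot_download(
    repo_id="vietgpt/the_pile_openwebtext2",
    repo_type="dataset",
)
print(local_path)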

I get much faster results than with load_dataset:

num_proc_load_dataset = 32
dataset = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=num_proc_load_dataset, trust_remote_code=True)

But then I spent about 2 hours trying to figure out how to load the dang result using load_dataset, playing with cache_dir, data_dir, data_files, the dataset name, etc. I tried linking and copying all sorts of files everywhere, with no luck.

I can see all the files in:

/home/fsuser/.cache/huggingface/hub/datasets--vietgpt--the_pile_openwebtext2/snapshots/1de27c660aefd991700e5c13865a041591394834/data

and they are the same files as in the original repo, as if I had done a git clone into that same hash folder.

But load_dataset seems to use a totally different location, and on a separate computer where I ran it, the file names are all different. Very confusing.

I was able to hack together a setup where it stopped complaining about the files, but then it asked for an example of a feature or something related, which I didn’t have and shouldn’t need.

Generally, supposing one uses huggingface-cli download or git clone, how does one go directly from that to load_dataset? It must be possible, since the downloaded files are exactly the ones in the origin repo.
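
For concreteness, here is the kind of thing I would expect to work: pointing load_dataset at the snapshot’s data files directly. This is just a sketch; that the files are parquet shards under data/ is my reading of the snapshot folder above, not something I have verified.

from huggingface_hub import snapshot_download
from datasets import load_dataset

# This should be a near no-op: the CLI already populated the cache, so it
# just resolves the local snapshot path.
local_path = snapshot_download(
    repo_id="vietgpt/the_pile_openwebtext2",
    repo_type="dataset",
)

# Assumption: the repo stores its data as parquet shards under data/.
dataset = load_dataset(
    "parquet",
    data_files={"train": f"{local_path}/data/*.parquet"},
    num_proc=32,
)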

Thanks!
Jon

You need to log in to Hugging Face (see Quickstart (huggingface.co)), then load the dataset:
dataset = load_dataset(dataset_name)

I don’t understand why logging in matters, but I tried it, and it still re-downloads the data instead of using the result that huggingface-cli download already fetched:

Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from huggingface_hub import login
>>> login()

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /home/fsuser/.cache/huggingface/token
Login successful
>>> num_proc = 32
>>> num_proc_load_dataset = num_proc
>>> from datasets import load_dataset
>>> dataset = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=num_proc_load_dataset, trust_remote_code=True)
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:02<00:00, 16.72it/s]
Downloading data:   3%|███▊                                                                                                                       | 29.4M/954M [00:01<00:54, 17.0MB/s]
Downloading data:   4%|████▊
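
To check whether the cached copy is even being consulted, I suppose one could force offline mode, so that any attempted re-download fails loudly instead of silently starting over (a sketch; the environment variables must be set before the imports):

import os
# Force both the hub client and datasets to rely on local caches only.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# If this raises a connection/offline error, the hub cache populated by
# huggingface-cli download is not being picked up by load_dataset.
dataset = load_dataset("vietgpt/the_pile_openwebtext2", trust_remote_code=True)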

It seems to be stored in the cache, so you can save it to a folder you can find easily.
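
For example, once a load has succeeded, save_to_disk and load_from_disk give you a plain local copy that no longer depends on the hub cache layout (a sketch; the target folder here is arbitrary):

from datasets import load_dataset, load_from_disk

dataset = load_dataset("vietgpt/the_pile_openwebtext2", trust_remote_code=True)

# Save a plain copy to an easy-to-find folder (the path is arbitrary).
dataset.save_to_disk("/data/openwebtext2")

# Later, reload directly from that folder with no hub lookups involved.
dataset = load_from_disk("/data/openwebtext2")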