Huggingface-cli to load_dataset

I can’t understand how to go from huggingface-cli or git clone to load_dataset() using that cached location. Despite the 3 methods all being listed here: Downloading datasets, they seem incompatible. The only (poor) suggestion I found was to never use the other 2 methods: python - How to load a huggingface dataset from local path? - Stack Overflow

So the idea is that I want to use hf_transfer, since it is about 10x faster for downloads than load_dataset. So I do:

pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --repo-type dataset vietgpt/the_pile_openwebtext2
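
(As a side note, I believe huggingface-cli download prints the local snapshot path when it finishes, so it can be captured in a shell variable; just a sketch, I haven't relied on it:)

# supposedly the CLI prints the downloaded snapshot folder on stdout, so capture it
SNAPSHOT=$(huggingface-cli download --repo-type dataset vietgpt/the_pile_openwebtext2)
echo "$SNAPSHOT"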

I get much faster results than with load_dataset:

num_proc_load_dataset=32
dataset = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=num_proc_load_dataset, trust_remote_code=True)

But then I spent about 2 hours trying to figure out how to load the dang result using load_dataset, playing with cache_dir, data_dir, data_files, the dataset name, etc. I tried linking and copying all sorts of files everywhere, and no luck.

I can see all the files in:

/home/fsuser/.cache/huggingface/hub/datasets--vietgpt--the_pile_openwebtext2/snapshots/1de27c660aefd991700e5c13865a041591394834/data

and they are the same files as in the original repo, as if I had done a git clone into that same hash folder.

But load_dataset seems to go to a totally different location, and on a separate computer where I did use load_dataset, the file names are all different. Very confusing.

I was able to come up with some hacked setup where it stopped complaining about the files, but then it asked for an example of a feature or something like that, which I didn’t have and shouldn’t need.

Generally, suppose one uses huggingface-cli download or git clone: how does one go directly from that to load_dataset? It must be possible, since the downloaded files are exactly the same as in the origin repo.
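
For reference, the kind of thing I hoped would work against the cached snapshot is below (just a sketch; it assumes the files under data/ really are plain parquet shards, and the path is the snapshot folder from my machine above):

from datasets import load_dataset

# sketch: point the generic parquet builder at the data/ folder inside the hub snapshot
snapshot = "/home/fsuser/.cache/huggingface/hub/datasets--vietgpt--the_pile_openwebtext2/snapshots/1de27c660aefd991700e5c13865a041591394834"
dataset = load_dataset("parquet", data_dir=snapshot + "/data", num_proc=32)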

Thanks!
Jon

You need to log in to Hugging Face (Quickstart (huggingface.co)), then load the dataset:
dataset = load_dataset(dataset_name)

I don’t understand why logging in matters, but I tried it and it still re-downloads the data instead of using the result of the huggingface-cli download I already did.

Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from huggingface_hub import login
>>> login()

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /home/fsuser/.cache/huggingface/token
Login successful
>>> num_proc = 32
>>> num_proc_load_dataset = num_proc
>>> from datasets import load_dataset
>>> dataset = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=num_proc_load_dataset, trust_remote_code=True)
Resolving data files: 100%|██████████| 35/35 [00:02<00:00, 16.72it/s]
Downloading data:   3%|▎         | 29.4M/954M [00:01<00:54, 17.0MB/s]
Downloading data:   4%|▍
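
(One knob I have not verified: both libraries have an offline mode that is supposed to prevent any re-downloading and force them to use only what is already cached, e.g.:)

export HF_HUB_OFFLINE=1
export HF_DATASETS_OFFLINE=1
python -c "from datasets import load_dataset; load_dataset('vietgpt/the_pile_openwebtext2', trust_remote_code=True)"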

It seems to be stored in the cache, so you can save it to a folder you can see easily.
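
A minimal sketch of that idea (the target folder is just an example path):

from datasets import load_dataset, load_from_disk

# the first load populates the datasets cache (Arrow files)
ds = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=32, trust_remote_code=True)

# write an explicit copy to a folder of your choice
ds.save_to_disk('/data/openwebtext2')

# later sessions can reload it directly from that folder, without touching the Hub
ds = load_from_disk('/data/openwebtext2')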

I am also confused by this! :sob: I tried to download HuggingFaceH4/ultrachat_200k. When I use the CLI, the raw dataset .parquet files (organized as HuggingFaceH4/ultrachat_200k) are downloaded and saved in ~/.cache/huggingface/hub. But when I use load_dataset, it is saved in ~/.cache/huggingface/datasets and seems to be preprocessed into .arrow files. Here are the steps I followed to load my HuggingFaceH4/ultrachat_200k dataset downloaded with the CLI:

  1. Use the CLI download option --local-dir to save the files (this step can be skipped; calling load_dataset on the files under ~/.cache/huggingface/hub/<your dataset> works too):
huggingface-cli download --resume-download --repo-type dataset HuggingFaceH4/ultrachat_200k --local-dir ./modeling/datasets/HuggingFaceH4/ultrachat_200k
  2. Check the README in the downloaded folder to see how the dataset is organized:

configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  3. Process it!
from datasets import load_dataset, Features, Value

context_feat = Features({
    'prompt': Value(dtype='string', id=None),
    'prompt_id': Value(dtype='string', id=None),
    'messages': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}]
})

url = "./modeling/datasets/HuggingFaceH4/ultrachat_200k/data/"
data_files = {'train_sft': [url + "train_sft-00000-of-00003-a3ecf92756993583.parquet", url + "train_sft-00001-of-00003-0a1804bcb6ae68c6.parquet", url + "train_sft-00002-of-00003-ee46ed25cfae92c6.parquet"],
              'test_sft': url + "test_sft-00000-of-00001-f7dfac4afe5b93f4.parquet",
              'train_gen': [url + "train_gen-00000-of-00003-a6c9fb894be3e50b.parquet", url + "train_gen-00001-of-00003-d6a0402e417f35ca.parquet", url + "train_gen-00002-of-00003-c0db75b92a2f48fd.parquet"],
              'test_gen': url + "test_gen-00000-of-00001-3d4cd8309148a71f.parquet"}
  4. Load the dataset! :star_struck:
raw_datasets = load_dataset("parquet", data_files=data_files, features=context_feat)
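
To double-check what came back, you can print the splits and peek at one row (the field names follow the features declared above):

print(raw_datasets)                               # should show train_sft / test_sft / train_gen / test_gen
print(raw_datasets["train_sft"][0]["prompt"])     # first prompt of the SFT training split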

Maybe some official documentation on load_dataset and on how to handle the different types of data files would be helpful.
Hope that helps! :smiling_face_with_three_hearts:
