How to download data from Hugging Face that is visible on the data viewer but the files are not available?

I can see them:

[screenshot of the dataset viewer showing the data]

but no matter how I change the download URL I can't get the data. The files are not there, and the dataset's loading script doesn't work.

Does anyone know how to get the data and find out which splits are available?

Related conversations/discussions:

huggingface transformers - How to download data from hugging face that is visible on the data viewer but the files are not available? - Stack Overflow

Maybe @severo or @lhoestq or someone from your teams can help?

Hi @brando,

you can get the Parquet files for every config by clicking "Auto-converted to Parquet".

[screenshot: the "Auto-converted to Parquet" link on the dataset page]

For example, for the hacker_news train split, it takes you to EleutherAI/pile at refs/convert/parquet.
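
If you prefer to list those files programmatically, here is a minimal sketch using huggingface_hub (assuming the package is installed; filtering on the hacker_news/ prefix just mirrors the layout of that branch):

from huggingface_hub import HfApi

api = HfApi()
# list every file stored on the auto-conversion branch of the dataset repo
files = api.list_repo_files(
    "EleutherAI/pile",
    repo_type="dataset",
    revision="refs/convert/parquet",
)
# keep only the Parquet shards of the hacker_news config
hacker_news_files = [f for f in files if f.startswith("hacker_news/") and f.endswith(".parquet")]
print(hacker_news_files)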

Also note that if you click on API,

[screenshot: the "API" button in the dataset viewer]

you have access to the REST API endpoints.

So you can retrieve the Parquet download URLs from those endpoints as well.
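
For example, a minimal sketch with requests against the /parquet endpoint (the exact response fields are my assumption from the API at the time of writing, so double-check the docs):

import requests

# ask the dataset viewer API which Parquet files exist for this dataset
resp = requests.get(
    "https://datasets-server.huggingface.co/parquet",
    params={"dataset": "EleutherAI/pile"},
)
resp.raise_for_status()

# each entry should carry the config, split and a download URL
for f in resp.json()["parquet_files"]:
    if f["config"] == "hacker_news" and f["split"] == "train":
        print(f["url"])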


OK, this is close. I will leave my question up in case you already know how to do this, but I do need the data in a format compatible with HF's datasets load_dataset(...) function. Ideally all the rows, not just the first 100. I will try something now and report back; maybe you will beat me to it :slight_smile:

This seems to work, but it's rather annoying.

Summary of how to make it work:

  1. Get the URLs of the Parquet files into a list.
  2. Pass that list to load_dataset via load_dataset('parquet', data_files=urls) (note: HF's API names are really confusing sometimes).
  3. Then it should work; print a batch of text to check.

Pseudocode:

urls_hacker_news = [
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00000-of-00004.parquet",
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00001-of-00004.parquet",
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00002-of-00004.parquet",
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00003-of-00004.parquet"
]

...


    # streaming = False
    import datetime
    import wandb
    # note: mode, num_batches, streaming, probabilities, data_mixture_name and
    # data_files_prefix are assumed to be defined in the omitted part of the script
    from diversity.pile_subset_urls import urls_hacker_news
    path, name, data_files = 'parquet', 'hacker_news', urls_hacker_news
    # not changing
    batch_size = 512
    today = datetime.datetime.now().strftime('%Y-m%m-d%d-t%Hh_%Mm_%Ss')
    run_name = f'{path} div_coeff_{num_batches=} ({today=} ({name=}) {data_mixture_name=} {probabilities=})'
    print(f'{run_name=}')

    # - Init wandb
    debug: bool = mode == 'dryrun'
    run = wandb.init(mode=mode, project="beyond-scale", name=run_name, save_code=True)
    wandb.config.update({"num_batches": num_batches, "path": path, "name": name, "today": today, 'probabilities': probabilities, 'batch_size': batch_size, 'debug': debug, 'data_mixture_name': data_mixture_name, 'streaming': streaming, 'data_files': data_files})
    # run.notify_on_failure() # https://community.wandb.ai/t/how-do-i-set-the-wandb-alert-programatically-for-my-current-run/4891
    print(f'{debug=}')
    print(f'{wandb.config=}')

    # -- Get probe network
    from datasets import load_dataset, interleave_datasets
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    # -- Get data set
    def my_load_dataset(path, name):
        print(f'{path=} {name=} {streaming=}')
        if path == 'json' or path == 'bin' or path == 'csv':
            print(f'{data_files_prefix+name=}')
            return load_dataset(path, data_files=data_files_prefix+name, streaming=streaming, split="train").with_format("torch")
        elif path == 'parquet':
            print(f'{data_files=}')
            return load_dataset(path, data_files=data_files, streaming=streaming, split="train").with_format("torch")
        else:
            return load_dataset(path, name, streaming=streaming, split="train").with_format("torch")
    # - get data set for real now
    if isinstance(path, str):
        dataset = my_load_dataset(path, name)
    else:
        print('-- interleaving datasets')
        datasets = [my_load_dataset(path, name).with_format("torch") for path, name in zip(path, name)]
        [print(f'{dataset.description=}') for dataset in datasets]
        dataset = interleave_datasets(datasets, probabilities)
    print(f'{dataset=}')
    batch = dataset.take(batch_size)
    print(f'{next(iter(batch))=}')
    column_names = next(iter(batch)).keys()
    print(f'{column_names=}')

    # - Prepare functions to tokenize batch
    def preprocess(examples):
        return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
    remove_columns = column_names  # remove all keys that are not tensors to avoid bugs in collate function in task2vec's pytorch data loader
    def map(batch):  # note: shadows the built-in map, which is fine for this quick script
        return batch.map(preprocess, batched=True, remove_columns=remove_columns)
    tokenized_batch = map(batch)
    print(f'{next(iter(tokenized_batch))=}')
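
For anyone who just wants the loading step without the wandb and probe-network boilerplate, here is a stripped-down sketch of the same idea (the "text" column and the shard names come from the snippet above; streaming=True is just one possible choice):

from datasets import load_dataset
from transformers import GPT2Tokenizer

# the same four resolve URLs as in the list above, built with a loop
urls_hacker_news = [
    f"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-0000{i}-of-00004.parquet"
    for i in range(4)
]

# load the remote Parquet shards lazily so nothing is downloaded up front
dataset = load_dataset("parquet", data_files=urls_hacker_news, split="train", streaming=True)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# take a handful of rows and tokenize their text field
for example in dataset.take(4):
    tokens = tokenizer(example["text"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
    print(example["text"][:80].replace("\n", " "), tokens["input_ids"].shape)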

The Parquet files are under the git revision “refs/convert/parquet”. So you can try

ds = load_dataset("EleutherAI/pile", revision="refs/convert/parquet", data_dir="hacker_news")
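
If you go this route, ds should be a regular DatasetDict, so you can check what you got directly (a small sketch, assuming the conversion is picked up as a train split, which is what the shard names above suggest):

print(ds)              # shows the detected splits and row counts
print(ds["train"][0])  # first row as a plain dict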

Too late? I already tried a different solution and it seems to work. I printed some data and it looks right.

That does seem cleaner though!

Thank y’all! :smiley: