Maybe I don’t understand the structure you have on HuggingFace, but when I download all the files from cerebras/SlimPajama-627B at main I get only about 8GB of files. Where are the 895GB that are supposed to be in the dataset? Was I only downloading a preview?
All HF libraries use a shared cache. Read all about it in Manage huggingface_hub cache-system, and in Cache management for the Datasets library specifically.
Thank you, @nielsr, but I only need to download the dataset once, so I don’t need to use a cache.
I don’t even want to store the dataset locally on disk; I want to stream it directly from HF to my Dropbox account, using Java.
All I need to know is where all the dataset files are. The files on the SlimPajama page seem to be only a small preview, so I wonder where the actual dataset is.
hi @DR1912 ,
You could set a different cache dir when loading the dataset:
from datasets import load_dataset
# 'LOADING_SCRIPT' and the cache path are placeholders for your own values
dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")
ref1: https://huggingface.co/docs/datasets/v2.15.0/en/cache#cache-directory
ref2: https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/loading_methods#datasets.load_dataset.cache_dir
ref3: https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhome
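For completeness, here is a minimal sketch of the environment-variable route from ref3 (the path is a placeholder; this assumes HF_HOME is read when the libraries are imported, so it has to be set before the import):

import os

# Placeholder path; assumption: HF_HOME is read at import time,
# so set it before importing datasets below.
os.environ["HF_HOME"] = "/path/to/my/hf-home"

from datasets import load_dataset

# Downloads/caches under the HF_HOME set above instead of the default ~/.cache/huggingface
dataset = load_dataset("cerebras/SlimPajama-627B")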
Thank you, @radames, but I don’t want to store the dataset in ANY cache folder. I want to stream it in memory from HF directly to the Dropbox website. I know how to do that; I just don’t know where the files on HF are.
hi @DR1912, I think all the files are split across folders; you can resolve them and stream them individually
https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/
e.g.:
https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/test/chunk1/example_holdout_0.jsonl.zst
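For example, a rough sketch (in Python rather than your Java setup) that lists the shards with huggingface_hub and streams one of them over HTTP without touching the disk; requests and huggingface_hub are assumed to be installed:

import requests
from huggingface_hub import list_repo_files, hf_hub_url

repo_id = "cerebras/SlimPajama-627B"

# Collect every .jsonl.zst shard across the train/validation/test folders
files = [f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith(".jsonl.zst")]
print(len(files), "shards found")

# Stream the first shard chunk by chunk, without writing anything to disk
url = hf_hub_url(repo_id, files[0], repo_type="dataset")
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=1 << 20):
        ...  # forward each ~1 MiB chunk wherever it needs to go (e.g. an upload API)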
Hey @radames, I have already done exactly that, but the files there sum to only about 8GB, not the roughly 900GB the dataset is supposed to be, so I figured these are just a small preview and the actual dataset is somewhere else.
If you load it via streaming=True, this is the metadata below; this is all the data in that repo. Have you tried decompressing the .zst files to check the total size?
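(For reference, a minimal sketch of the call that prints the metadata, assuming the datasets library is installed; streaming mode does not download anything up front:)

from datasets import load_dataset

# Streaming mode only reads the repo metadata up front; nothing is stored on disk yet
dataset = load_dataset("cerebras/SlimPajama-627B", streaming=True)
print(dataset)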
IterableDatasetDict({
    train: IterableDataset({
        features: Unknown,
        n_shards: 118332
    })
    validation: IterableDataset({
        features: Unknown,
        n_shards: 31428
    })
    test: IterableDataset({
        features: Unknown,
        n_shards: 31411
    })
})
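If it helps, here is a rough sketch of that size check for a single shard: stream it and count the decompressed bytes (requests and zstandard assumed installed; the URL is the example shard linked above):

import requests
import zstandard

# Example shard from the repo; the same check could be looped over every shard
url = ("https://huggingface.co/datasets/cerebras/SlimPajama-627B/"
       "resolve/main/test/chunk1/example_holdout_0.jsonl.zst")

decompressed_bytes = 0
dctx = zstandard.ZstdDecompressor()
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    r.raw.decode_content = True  # in case the server applies a transfer content-encoding
    with dctx.stream_reader(r.raw) as reader:
        while True:
            chunk = reader.read(1 << 20)
            if not chunk:
                break
            decompressed_bytes += len(chunk)

print(f"decompressed size: {decompressed_bytes / 1e6:.1f} MB")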
Please open a discussion at cerebras/SlimPajama-627B · Discussions and tag the author @rskuzma.
And unlike SlimPajama, the original dataset, RedPajama, contains a dataset script to download the data; SlimPajama doesn’t have one, so all the data is on that dataset repo.
ref: RedPajama-Data-V2.py · togethercomputer/RedPajama-Data-V2 at main