Where are the actual files to download?

Maybe I don’t understand the structure you have on HuggingFace, but when I download all the files from cerebras/SlimPajama-627B at main I get merely 8GB of files. Where are the 895GB that are supposed to be in the dataset? Was I downloading just a preview?

All HF libraries use a shared cache. Read all about it at Manage huggingface_hub cache-system and Cache management for the Datasets library specifically.

Thank you, @nielsr , but I only need to download the dataset once, so I don't need to use the cache.
I don't even want to store the dataset locally on disk; I want to stream it directly from HF to my Dropbox account, using Java.
All I need to know is where all the dataset files are. The files on the SlimPajama page are only a small preview, so I wonder where the actual dataset is.

hi @DR1912 ,
You could set a different cache dir during dataset load

from datasets import load_dataset
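# 'LOADING_SCRIPT' is a placeholder for the dataset repo id, e.g. 'cerebras/SlimPajama-627B'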
dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")

ref1: https://huggingface.co/docs/datasets/v2.15.0/en/cache#cache-directory
ref2: https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/loading_methods#datasets.load_dataset.cache_dir
ref3: https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhome

Thank you, @radames , but I don't want to store the dataset in ANY cache folder. I want to stream it in memory from HF directly to the Dropbox website. I know how to do that; I just don't know where the files on HF are.

hi @DR1912 , I think all the files are split across folders; you can resolve them and stream them individually

https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/

e.g.:
https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/test/chunk1/example_holdout_0.jsonl.zst
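If it helps, here is a minimal Python sketch (assuming only the huggingface_hub package) that lists every file in the repo and builds the same resolve URLs, so each shard can be streamed individually:

from huggingface_hub import list_repo_files, hf_hub_url

# List every file in the dataset repo (train/validation/test chunks)
files = list_repo_files("cerebras/SlimPajama-627B", repo_type="dataset")

# Build a direct "resolve" URL for each compressed shard
for name in files:
    if name.endswith(".jsonl.zst"):
        url = hf_hub_url("cerebras/SlimPajama-627B", name, repo_type="dataset")
        print(url)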

Hey @radames, I have already done exactly that, but the files there add up to only about 8GB, as opposed to the roughly 900GB the dataset is supposed to be, so I figured these are just a small preview and the actual dataset is somewhere else.

If you load it via streaming=True, this is the metadata; this is all the data in that repo. Have you tried decompressing the .zst files to check the total size?
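(For reference, a minimal sketch of how to get that metadata yourself, assuming only datasets.load_dataset in streaming mode:)

from datasets import load_dataset

# Streaming mode: nothing is downloaded up front, shards are fetched lazily
dataset = load_dataset("cerebras/SlimPajama-627B", streaming=True)
print(dataset)  # prints the IterableDatasetDict shown below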

IterableDatasetDict({
    train: IterableDataset({
        features: Unknown,
        n_shards: 118332
    })
    validation: IterableDataset({
        features: Unknown,
        n_shards: 31428
    })
    test: IterableDataset({
        features: Unknown,
        n_shards: 31411
    })
})
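For the size check, a rough sketch using the zstandard package; the filename is just an example and assumes one shard has already been fetched locally:

import zstandard as zstd

# Stream-decompress one shard and count the uncompressed bytes
total = 0
with open("example_holdout_0.jsonl.zst", "rb") as f:
    reader = zstd.ZstdDecompressor().stream_reader(f)
    while True:
        chunk = reader.read(1 << 20)  # 1 MiB at a time
        if not chunk:
            break
        total += len(chunk)
print(f"uncompressed size: {total} bytes")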

Please open a discussion at cerebras/SlimPajama-627B · Discussions and tag the author, @rskuzma.

Also, the original dataset, RedPajama, contains a dataset script to download the data, but SlimPajama doesn't have one, so all of its data is in that dataset repo.
ref: RedPajama-Data-V2.py · togethercomputer/RedPajama-Data-V2 at main