Maybe I don’t understand the structure you have on HuggingFace, but when I download all the files from cerebras/SlimPajama-627B at main I get only about 8GB of files. Where are the 895GB that are supposed to be in the dataset? Was I only downloading a preview?
All HF libraries use a shared cache. Read all about it in Manage huggingface_hub cache-system, and in Cache management for the Datasets library specifically.
Thank you, @nielsr, but I only need to download the dataset once, so I don’t need to use a cache.
I don’t even want to store the dataset locally on disk; I want to stream it directly from HF to my Dropbox account, using Java.
All I need to know is where all the dataset files are. The files on the SlimPajama page seem to be only a small preview, so I wonder where the actual dataset is.
hi @DR1912 ,
You could set a different cache dir when loading the dataset:
from datasets import load_dataset
# 'LOADING_SCRIPT' and the cache path are placeholders for your own values
dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")
ref1: https://huggingface.co/docs/datasets/v2.15.0/en/cache#cache-directory
ref2: https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/loading_methods#datasets.load_dataset.cache_dir
ref3: https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhome
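For completeness, here is a minimal sketch of the environment-variable route from ref3 (the path is a placeholder; this assumes HF_HOME is read when the libraries are imported, so it has to be set before the import):

import os

# Placeholder path; assumption: HF_HOME is read at import time,
# so set it before importing datasets below.
os.environ["HF_HOME"] = "/path/to/my/hf-home"

from datasets import load_dataset

# Downloads/caches under the HF_HOME set above instead of the default ~/.cache/huggingface
dataset = load_dataset("cerebras/SlimPajama-627B")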
Thank you, @radames, but I don’t want to store the dataset in ANY cache folder. I want to stream it in memory from HF directly to the Dropbox website. I know how to do that; I just don’t know where the files on HF are.
hi @DR1912, I think all the files are split across folders; you can resolve them and stream them individually
https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/
e.g.:
https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/test/chunk1/example_holdout_0.jsonl.zst
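For example, a rough sketch (in Python rather than your Java setup) that lists the shards with huggingface_hub and streams one of them over HTTP without touching the disk; requests and huggingface_hub are assumed to be installed:

import requests
from huggingface_hub import list_repo_files, hf_hub_url

repo_id = "cerebras/SlimPajama-627B"

# Collect every .jsonl.zst shard across the train/validation/test folders
files = [f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith(".jsonl.zst")]
print(len(files), "shards found")

# Stream the first shard chunk by chunk, without writing anything to disk
url = hf_hub_url(repo_id, files[0], repo_type="dataset")
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=1 << 20):
        ...  # forward each ~1 MiB chunk wherever it needs to go (e.g. an upload API)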
Hey @radames, I have already done exactly that, but the files there sum to only about 8GB, not the roughly 900GB the dataset is supposed to be, so I figured these are just a small preview and the actual dataset is somewhere else.
If you load it via streaming=True, this is the metadata below; this is all the data in that repo. Have you tried decompressing the .zst files to check the total size?
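(For reference, a minimal sketch of the call that prints the metadata, assuming the datasets library is installed; streaming mode does not download anything up front:)

from datasets import load_dataset

# Streaming mode only reads the repo metadata up front; nothing is stored on disk yet
dataset = load_dataset("cerebras/SlimPajama-627B", streaming=True)
print(dataset)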
IterableDatasetDict({
    train: IterableDataset({
        features: Unknown,
        n_shards: 118332
    })
    validation: IterableDataset({
        features: Unknown,
        n_shards: 31428
    })
    test: IterableDataset({
        features: Unknown,
        n_shards: 31411
    })
})
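If it helps, here is a rough sketch of that size check for a single shard: stream it and count the decompressed bytes (requests and zstandard assumed installed; the URL is the example shard linked above):

import requests
import zstandard

# Example shard from the repo; the same check could be looped over every shard
url = ("https://huggingface.co/datasets/cerebras/SlimPajama-627B/"
       "resolve/main/test/chunk1/example_holdout_0.jsonl.zst")

decompressed_bytes = 0
dctx = zstandard.ZstdDecompressor()
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    r.raw.decode_content = True  # in case the server applies a transfer content-encoding
    with dctx.stream_reader(r.raw) as reader:
        while True:
            chunk = reader.read(1 << 20)
            if not chunk:
                break
            decompressed_bytes += len(chunk)

print(f"decompressed size: {decompressed_bytes / 1e6:.1f} MB")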
Please open a discussion at cerebras/SlimPajama-627B · Discussions and tag the author @rskuzma.
And unlike SlimPajama, the original dataset, RedPajama, contains a dataset script to download the data; SlimPajama doesn’t have one, so all the data is on that dataset repo.
ref: RedPajama-Data-V2.py · togethercomputer/RedPajama-Data-V2 at main