How can you use a downloaded dataset in streaming mode offline?

jsmidt · May 5, 2024, 1:47am

I am trying to train an LLM, which requires large datasets. The streaming=True option is really helpful for that, as the Hugging Face tutorials explain. For example, I would like to stream the wikipedia dataset like this:

from datasets import load_dataset

raw_datasets = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

When I submit a job to a cluster, the compute nodes do not have internet access, so I need to run in offline mode:

import os
os.environ['HF_DATASETS_OFFLINE'] = "1"

When I combine streaming and offline mode, I get this error:

No such file or directory: 'data/20220301.en/train-00000-of-00041.parquet'

The first strange thing is that it does not seem to check the standard Hugging Face cache directory for files. This dataset has already been downloaded to

~/.cache/huggingface/datasets/wikipedia/20220301.en

and works just fine in offline mode with streaming=False. How do I get it to also work with streaming=True, which is helpful for giant datasets? Thanks!
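
One possible workaround, offered as a sketch under assumptions rather than as the library's confirmed answer: since the non-streaming load already works offline from the cache, the cached dataset can be loaded normally (the Arrow files are memory-mapped rather than read fully into RAM) and then wrapped as an iterable dataset with Dataset.to_iterable_dataset(). The helper name load_cached_as_iterable and its default arguments are hypothetical. The sketch also assumes a datasets version that provides to_iterable_dataset, and that the environment variable is set before datasets is imported.

```python
import os

# Assumption: the offline flag is read when `datasets` is imported,
# so it is set before any import of the library.
os.environ["HF_DATASETS_OFFLINE"] = "1"


def load_cached_as_iterable(name="wikipedia", config="20220301.en",
                            split="train", num_shards=1):
    """Hypothetical helper: load an already-cached dataset without
    streaming, then expose it through the iterable (streaming-style) API."""
    # Imported inside the function so the flag above is already set.
    from datasets import load_dataset

    # Non-streaming load: resolved from the local cache in offline mode.
    cached = load_dataset(name, config, split=split)

    # Wrap the map-style Dataset as an IterableDataset.
    return cached.to_iterable_dataset(num_shards=num_shards)
```

On a compute node, load_cached_as_iterable() would then return an object that can be iterated lazily, or passed through .shuffle(buffer_size=...) and .take(n), much like the result of streaming=True.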
