Loading a Dataset from Cached Data

!!SOLVED!!

Indeed, I had a corrupt cache file. Just re-downloading that specific file solved the issue.
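
In case it helps anyone else hitting the same OSError: here is a rough sketch of how one could locate the broken shard, so only that file needs re-downloading. The ./cc3m path is just my cache folder; adjust the glob to wherever your Arrow files actually live.

import glob
import pyarrow as pa

# Try to memory-map and fully read every Arrow shard in the cache;
# any file that raises is a corrupt shard to re-download.
for fname in sorted(glob.glob("./cc3m/**/*.arrow", recursive=True)):
    try:
        with pa.memory_map(fname) as source:
            pa.ipc.open_stream(source).read_all()
    except (pa.ArrowInvalid, OSError) as err:
        print(f"corrupt shard: {fname} ({err})")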

N.B.: The environment variables also need to be set before importing anything from the datasets library.
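
For example (a minimal sketch of the ordering; HF_DATASETS_CACHE and HF_HUB_OFFLINE are just the variables I'd expect to matter here, and the path is hypothetical):

import os

# Set cache/offline variables BEFORE the first datasets import;
# the library reads them at import time, so setting them later has no effect.
os.environ["HF_DATASETS_CACHE"] = "./cc3m"  # hypothetical cache path
os.environ["HF_HUB_OFFLINE"] = "1"          # avoid reaching the Hub at all

from datasets import load_dataset  # import only after the variables are set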


Hi there!

I’m trying to load the pixparse/cc3m-wds dataset from cached data. Long story short, my cluster’s network is having issues reaching the Hugging Face Hub, so I cannot download the dataset directly with the library. Instead, I used Google Colab to download the dataset into a cache folder, and I have since copied the entire cache folder to my node’s disk. Now I’m trying to load the dataset and build it from the existing downloaded shards in that cache folder using:

from datasets import load_dataset

ds = load_dataset("pixparse/cc3m-wds", cache_dir="./cc3m/")

But it keeps spitting out:

File ~/.conda/envs/lavis/lib/python3.8/site-packages/datasets/table.py:1017, in MemoryMappedTable.from_file(cls, filename, replays)
   1015 @classmethod
   1016 def from_file(cls, filename: str, replays=None):
-> 1017     table = _memory_mapped_arrow_table_from_file(filename)
   1018     table = cls._apply_replays(table, replays)
   1019     return cls(table, filename, replays)

File ~/.conda/envs/lavis/lib/python3.8/site-packages/datasets/table.py:64, in _memory_mapped_arrow_table_from_file(filename)
     62 def _memory_mapped_arrow_table_from_file(filename: str) -> pa.Table:
     63     opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
---> 64     pa_table = opened_stream.read_all()
     65     return pa_table

File ~/.conda/envs/lavis/lib/python3.8/site-packages/pyarrow/ipc.pxi:757, in pyarrow.lib.RecordBatchReader.read_all()

File ~/.conda/envs/lavis/lib/python3.8/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

OSError: Expected to be able to read 8457224 bytes for message body, got 6054384

I double-checked the downloaded files and their sizes match. In Google Colab the same load progresses past this point, but then runs out of disk space, as expected. Can someone help me rebuild this dataset from the cache I have on my HPC node?

Thanks!

The download step is correct, but the load step may be wrong.
The cache_dir= you pass at load time points into HF’s internal cache layout, and loading will fail if even one file or configuration file is slightly out of place.
Why not drop cache_dir= and use path= (the first argument to load_dataset) to point directly at where the dataset files live?
That generally works as long as you have the dataset files.
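
For example, a minimal sketch along those lines, assuming a datasets version with the built-in "webdataset" loader and that the downloaded .tar shards sit directly in ./cc3m/ (the glob is hypothetical; match it to your actual shard names):

from datasets import load_dataset

# Point the loader straight at the local shards instead of relying on
# the cache layout; the "webdataset" builder reads .tar shards directly.
ds = load_dataset(
    "webdataset",
    data_files={"train": "./cc3m/*.tar"},  # hypothetical glob; adjust to your shard names
    split="train",
)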
