Loading a Dataset from Cached Data

!!SOLVED!!

Indeed, I had a corrupt cache file. Just re-downloading that specific file solved the issue.
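
In case it helps anyone else hitting the same OSError: here is a rough sketch of how one could locate the broken shard, so only that file needs re-downloading. The ./cc3m path is just my cache folder; adjust the glob to wherever your Arrow files actually live.

import glob
import pyarrow as pa

# Try to memory-map and fully read every Arrow shard in the cache;
# any file that raises is a corrupt shard to re-download.
for fname in sorted(glob.glob("./cc3m/**/*.arrow", recursive=True)):
    try:
        with pa.memory_map(fname) as source:
            pa.ipc.open_stream(source).read_all()
    except (pa.ArrowInvalid, OSError) as err:
        print(f"corrupt shard: {fname} ({err})")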

N.B.: The environment variables also need to be set before importing anything from the datasets library.
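
For example (a minimal sketch of the ordering; HF_DATASETS_CACHE and HF_HUB_OFFLINE are just the variables I'd expect to matter here, and the path is hypothetical):

import os

# Set cache/offline variables BEFORE the first datasets import;
# the library reads them at import time, so setting them later has no effect.
os.environ["HF_DATASETS_CACHE"] = "./cc3m"  # hypothetical cache path
os.environ["HF_HUB_OFFLINE"] = "1"          # avoid reaching the Hub at all

from datasets import load_dataset  # import only after the variables are set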


Hi there!

I’m trying to load the pixparse/cc3m-wds dataset from cached data. Long story short, my cluster’s network is having issues reaching the Hugging Face Hub, so I cannot download the dataset directly with the library. Instead, I used Google Colab to download the dataset into a cache folder, and I have since copied the entire cache folder to my node’s disk. Now I’m trying to load the dataset and build it from the existing downloaded shards in that cache folder using:

from datasets import load_dataset

ds = load_dataset("pixparse/cc3m-wds", cache_dir="./cc3m/")

But it keeps spitting out:

File ~/.conda/envs/lavis/lib/python3.8/site-packages/datasets/table.py:1017, in MemoryMappedTable.from_file(cls, filename, replays)
   1015 @classmethod
   1016 def from_file(cls, filename: str, replays=None):
-> 1017     table = _memory_mapped_arrow_table_from_file(filename)
   1018     table = cls._apply_replays(table, replays)
   1019     return cls(table, filename, replays)

File ~/.conda/envs/lavis/lib/python3.8/site-packages/datasets/table.py:64, in _memory_mapped_arrow_table_from_file(filename)
     62 def _memory_mapped_arrow_table_from_file(filename: str) -> pa.Table:
     63     opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
---> 64     pa_table = opened_stream.read_all()
     65     return pa_table

File ~/.conda/envs/lavis/lib/python3.8/site-packages/pyarrow/ipc.pxi:757, in pyarrow.lib.RecordBatchReader.read_all()

File ~/.conda/envs/lavis/lib/python3.8/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

OSError: Expected to be able to read 8457224 bytes for message body, got 6054384

I double-checked the downloaded files and their sizes match. In Google Colab the same load progresses past this point, but then runs out of disk space, as expected. Can someone help me rebuild this dataset from the cache I have on my HPC node?

Thanks!

The download step is correct, but the load step may be wrong.
The cache_dir= you pass at load time points into HF’s internal cache layout, and loading will fail if even one file or configuration file is slightly out of place.
Why not drop cache_dir= and use path= (the first argument to load_dataset) to point directly at where the dataset files live?
That generally works as long as you have the dataset files.
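
For example, a minimal sketch along those lines, assuming a datasets version with the built-in "webdataset" loader and that the downloaded .tar shards sit directly in ./cc3m/ (the glob is hypothetical; match it to your actual shard names):

from datasets import load_dataset

# Point the loader straight at the local shards instead of relying on
# the cache layout; the "webdataset" builder reads .tar shards directly.
ds = load_dataset(
    "webdataset",
    data_files={"train": "./cc3m/*.tar"},  # hypothetical glob; adjust to your shard names
    split="train",
)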
