!!SOLVED!!
Indeed, I had a corrupt cache file. Simply re-downloading that specific file solved the issue.
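In case it helps anyone else, here is a minimal sketch of how one could locate the bad file: walk the cache folder and try to fully read every cached Arrow file with the same pyarrow calls that appear in the traceback below. The cache path is an assumption, adjust it to wherever your copied cache lives.

import pathlib

import pyarrow as pa
import pyarrow.ipc

# Assumed location of the copied cache; adjust to your setup.
CACHE_DIR = pathlib.Path("./cc3m/")

# A truncated or corrupt file raises the same OSError as in the traceback,
# which tells you exactly which file to re-download.
for arrow_file in sorted(CACHE_DIR.rglob("*.arrow")):
    try:
        with pa.memory_map(str(arrow_file), "r") as source:
            pa.ipc.open_stream(source).read_all()
    except (pa.ArrowInvalid, OSError) as err:
        print(f"CORRUPT: {arrow_file} -> {err}")
    else:
        print(f"OK: {arrow_file}")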
N.B.: The environment variables also need to be set before importing anything related to the datasets library.
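For example, a minimal sketch (the specific variables and the path below are assumptions, adjust to your setup):

import os

# These must be set BEFORE the first import of the datasets library,
# because the cache location and offline flags are read at import time.
os.environ["HF_DATASETS_CACHE"] = "/path/to/cc3m"  # assumed path to the copied cache
os.environ["HF_DATASETS_OFFLINE"] = "1"            # don't try to reach the Hub
os.environ["HF_HUB_OFFLINE"] = "1"

from datasets import load_dataset

ds = load_dataset("pixparse/cc3m-wds")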
Hi there!
I’m trying to load the pixparse/cc3m-wds dataset from cached data. Long story short, my cluster’s network is having trouble reaching the Hugging Face Hub, so I cannot download the dataset directly with the library. Instead, I used Google Colab to download the dataset into a cache folder and have since copied that entire cache folder to my node’s disk. Now I’m trying to load the dataset and build it from the already-downloaded shards in this cache folder using:
from datasets import load_dataset

ds = load_dataset("pixparse/cc3m-wds", cache_dir="./cc3m/")
But it keeps spitting out:
File ~/.conda/envs/lavis/lib/python3.8/site-packages/datasets/table.py:1017, in MemoryMappedTable.from_file(cls, filename, replays)
1015 @classmethod
1016 def from_file(cls, filename: str, replays=None):
-> 1017 table = _memory_mapped_arrow_table_from_file(filename)
1018 table = cls._apply_replays(table, replays)
1019 return cls(table, filename, replays)
File ~/.conda/envs/lavis/lib/python3.8/site-packages/datasets/table.py:64, in _memory_mapped_arrow_table_from_file(filename)
62 def _memory_mapped_arrow_table_from_file(filename: str) -> pa.Table:
63 opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
---> 64 pa_table = opened_stream.read_all()
65 return pa_table
File ~/.conda/envs/lavis/lib/python3.8/site-packages/pyarrow/ipc.pxi:757, in pyarrow.lib.RecordBatchReader.read_all()
File ~/.conda/envs/lavis/lib/python3.8/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
OSError: Expected to be able to read 8457224 bytes for message body, got 6054384
I double-checked the downloaded files and their sizes match. In Google Colab, loading progresses past this point but runs out of disk space, as expected. Can someone help me rebuild this dataset from the cache I have on my HPC node?
Thanks!