Loading dataset from disk taking more time than expected

I have a dataset of around 150 GB. I divided it into 5 splits and saved them to disk using the save_to_disk method. When I try to load the dataset using the load_from_disk method, loading is much slower than expected. I have not measured it precisely, but it takes approximately 25 minutes to load 1 GB of data. When I interrupted the script, I got the traceback below. I think it is extracting every row from the PyArrow table and converting it to a Python dict, but I may be wrong.

Is there any method to load the dataset from disk very quickly when working with large datasets?

  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1659, in _iter
    yield self._getitem(
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1910, in _getitem
    formatted_output = format_table(
  File "/opt/conda/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 532, in format_table
    return formatter(pa_table, query_type=query_type)
  File "/opt/conda/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 281, in __call__
    return self.format_row(pa_table)
  File "/opt/conda/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 310, in format_row
    row = self.python_arrow_extractor().extract_row(pa_table)
  File "/opt/conda/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 140, in extract_row
    return _unnest(pa_table.to_pydict())
KeyboardInterrupt

This conversion happens when trainer.train() is invoked.
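
To check whether the time is really spent in load_from_disk or in the per-row conversion seen in the traceback, the two can be timed separately. The following is only a minimal sketch, not a fix: `timed` and `profile_dataset` are hypothetical helper names, `path` is assumed to point to a single Dataset directory written by save_to_disk (not a DatasetDict), and the default (Python) output format is assumed.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def profile_dataset(path: str, n_rows: int = 1000) -> dict:
    """Time load_from_disk separately from reading the first n_rows rows.

    Assumption: `path` is a directory created by Dataset.save_to_disk.
    """
    # Imported here so that the timing helper above has no extra dependency.
    from datasets import load_from_disk

    ds, load_s = timed(load_from_disk, path)
    n = min(n_rows, len(ds))

    # Row-by-row access: every row is converted to Python objects on its
    # own, which is the code path that appears in the traceback.
    _, rows_s = timed(lambda: [ds[i] for i in range(n)])

    # A single slice: the same rows are converted in one call.
    _, slice_s = timed(lambda: ds[:n])

    return {"rows": n, "load_s": load_s, "rows_s": rows_s, "slice_s": slice_s}
```

Calling `profile_dataset("path/to/one/split")` (a placeholder path) on one of the saved splits would show whether load_s or rows_s dominates. If rows_s dominates, the slowdown is in iteration and formatting rather than in load_from_disk itself.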
