Using datasets 1.8.0.
Normal Situations
If I use load_dataset() to load data, it generates cache files. If you then apply .map() on that dataset, corresponding cache files are generated as expected. The following is a simple snippet to reproduce this behavior.
from datasets import load_dataset, Dataset

def add_prefix(example):
    example['sentence1'] = 'My sentence: ' + example['sentence1']
    return example

def main():
    dataset = load_dataset('glue', 'mrpc', split='train')
    print(dataset)
    dataset = dataset.map(add_prefix)
    print(dataset)

if __name__ == "__main__":
    main()
Running the above generates the following output. As we can see from the tqdm progress bar, .map() is applied and the cache files are saved accordingly.
Downloading: 28.8kB [00:00, 25.6MB/s]
Downloading: 28.7kB [00:00, 28.8MB/s]
Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /home/jasonyoun/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...
Downloading: 6.22kB [00:00, 7.73MB/s]
Downloading: 1.05MB [00:00, 9.16MB/s]
Downloading: 441kB [00:00, 4.83MB/s]
0 examples [00:00, ? examples/s]2021-06-26 00:07:58.740056: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Dataset glue downloaded and prepared to /home/jasonyoun/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.
Dataset({
features: ['idx', 'label', 'sentence1', 'sentence2'],
num_rows: 3668
})
100%|██████████████████████████████████████████████████| 3668/3668 [00:00<00:00, 20903.25ex/s]
Dataset({
features: ['idx', 'label', 'sentence1', 'sentence2'],
num_rows: 3668
})
Running the exact same code again will use the saved cache for both load_dataset() and .map().
Reusing dataset glue (/home/jasonyoun/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Dataset({
features: ['idx', 'label', 'sentence1', 'sentence2'],
num_rows: 3668
})
Loading cached processed dataset at /home/jasonyoun/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-53c01e7abeb2b20b.arrow
Dataset({
features: ['idx', 'label', 'sentence1', 'sentence2'],
num_rows: 3668
})
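For reference, a dataset loaded this way is file-backed, which can be confirmed through its cache_files attribute (a small illustrative check of my own, not part of the snippet above):

from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')
# A file-backed dataset lists the arrow file(s) it was loaded from,
# e.g. paths under ~/.cache/huggingface/datasets/glue/mrpc/...
print(dataset.cache_files)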
Not-so-normal Situations
However, using from_dict() to load a dataset from memory does not generate cache files. Because the dataset never had cache files to begin with, applying .map() to it does not generate cache files either. This can be observed in arrow_dataset.py, lines 1803-1813:
# Check if we've already cached this computation (indexed by a hash)
if self.cache_files:  # !!!THIS WILL BE AN EMPTY LIST when loaded with .from_dict()!!!
    if cache_file_name is None:
        # we create a unique hash from the function,
        # current dataset file and the mapping args
        cache_file_name = self._get_cache_file_path(new_fingerprint)
    if os.path.exists(cache_file_name) and load_from_cache_file:
        logger.warning("Loading cached processed dataset at %s", cache_file_name)
        info = self.info.copy()
        info.features = features
        return Dataset.from_file(cache_file_name, info=info, split=self.split)
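To make the contrast concrete, here is a minimal sketch of my own (reusing the add_prefix function from above) showing that a dataset created with from_dict() reports no cache files, before or after .map():

from datasets import Dataset

def add_prefix(example):
    example['sentence1'] = 'My sentence: ' + example['sentence1']
    return example

# The dataset lives entirely in memory; nothing was written to disk.
in_memory = Dataset.from_dict({'sentence1': ['Hello world', 'Another sentence']})
print(in_memory.cache_files)  # [] -- empty, so the branch above is skipped

# With cache_files empty and no cache_file_name given, .map() keeps the
# result in memory and writes no cache-*.arrow file.
mapped = in_memory.map(add_prefix)
print(mapped.cache_files)     # still []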
Why is this a problem?
In the thread below, it was suggested that one use torch.distributed.barrier() when processing a large dataset with a distributed framework. For my specific use case, I create the dataset with the .from_dict() method and then process it with .map() on the main process only. No cache files are saved automatically unless you specify the cache_file_name parameter of .map(), so once the non-main processes resume, they are unable to locate the cache files.
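As a rough sketch of the pattern I have in mind (the function name and the explicit arrow file are illustrative, and a process group is assumed to be initialized already), the main rank has to write the mapped data somewhere the other ranks can find it after the barrier:

import torch.distributed as dist
from datasets import Dataset

def prepare(dataset, rank):
    if rank == 0:
        # Without an explicit cache_file_name, a dataset built with
        # from_dict() keeps the .map() result in memory only, so the
        # other ranks would have nothing to load after the barrier.
        dataset = dataset.map(add_prefix, cache_file_name='mapped.arrow')
    dist.barrier()  # non-main ranks resume here
    if rank != 0:
        dataset = Dataset.from_file('mapped.arrow')
    return dataset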
Possible Workarounds
- I could save all the data (maybe as a csv) and load it with load_dataset(). That way cache files are generated and all is well. However, my dataset is large (~100GB), and multiple rounds of processing with .map() further increase its size, making it very difficult to store and load the data efficiently with limited hardware resources. That's why I'm using from_dict() to load smaller pieces of data on-the-fly.
- Instead of using in-memory data to create the dataset, I could first save the data as csv or json and then use load_dataset('csv' or 'json', ...). I will try this method for now; I think it will work (see the sketch after this list).
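A rough sketch of that second workaround (the file name and fields are just placeholders): dump the in-memory records to a JSON-lines file and let load_dataset() handle the caching from there.

import json
from datasets import load_dataset

records = [{'sentence1': 'Hello world'}, {'sentence1': 'Another sentence'}]
with open('data.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

# The dataset is now file-backed, so cache files are generated as usual.
dataset = load_dataset('json', data_files='data.jsonl', split='train')
print(dataset.cache_files)         # non-empty
dataset = dataset.map(add_prefix)  # now produces a cache-*.arrow file under the datasets cache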