Caching a dataset with map() when loaded with from_dict()

Using datasets 1.8.0.

Normal Situations
If I use load_dataset() to load data, it generates cache files. If I then apply .map() to that dataset, the corresponding cache files are generated as expected. Below is a simple script that reproduces this behavior.

from datasets import load_dataset, Dataset

def add_prefix(example):
    example['sentence1'] = 'My sentence: ' + example['sentence1']
    return example

def main():
    dataset = load_dataset('glue', 'mrpc', split='train')
    print(dataset)
    dataset = dataset.map(add_prefix)
    print(dataset)

if __name__ == "__main__":
    main()

Running the above generates the following output. As the tqdm progress bar shows, .map() is applied and the cache files are saved accordingly.

Downloading: 28.8kB [00:00, 25.6MB/s]                                                                                                                                                         
Downloading: 28.7kB [00:00, 28.8MB/s]                                                                                                                                                         
Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /home/jasonyoun/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...
Downloading: 6.22kB [00:00, 7.73MB/s]
Downloading: 1.05MB [00:00, 9.16MB/s]
Downloading: 441kB [00:00, 4.83MB/s]
0 examples [00:00, ? examples/s]2021-06-26 00:07:58.740056: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Dataset glue downloaded and prepared to /home/jasonyoun/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.
Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 3668
})
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3668/3668 [00:00<00:00, 20903.25ex/s]
Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 3668
})

Running the exact same code again will use the saved cache for both load_dataset() and .map().

Reusing dataset glue (/home/jasonyoun/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 3668
})
Loading cached processed dataset at /home/jasonyoun/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-53c01e7abeb2b20b.arrow
Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 3668
})

Not-so-normal Situations
However, using from_dict() to load a dataset from memory does not generate cache files. Because the dataset never produced cache files in the first place, applying .map() to it does not generate cache files either. This can be observed in arrow_dataset.py, lines 1803-1813.

# Check if we've already cached this computation (indexed by a hash)
if self.cache_files:  # !!!THIS WILL BE AN EMPTY LIST when loaded with .from_dict()!!!
    if cache_file_name is None:
        # we create a unique hash from the function,
        # current dataset file and the mapping args
        cache_file_name = self._get_cache_file_path(new_fingerprint)
    if os.path.exists(cache_file_name) and load_from_cache_file:
        logger.warning("Loading cached processed dataset at %s", cache_file_name)
        info = self.info.copy()
        info.features = features
        return Dataset.from_file(cache_file_name, info=info, split=self.split)
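
To make this concrete, here is a minimal sketch (the data is a placeholder): a dataset created with load_dataset() has a non-empty cache_files list, while one created with from_dict() does not, so the caching branch above is skipped entirely.

from datasets import Dataset, load_dataset

def add_prefix(example):
    # same mapping function as in the script at the top of this post
    example['sentence1'] = 'My sentence: ' + example['sentence1']
    return example

# Loaded from disk: cache_files describes the arrow file(s) backing the dataset.
d_disk = load_dataset('glue', 'mrpc', split='train')
print(d_disk.cache_files)  # non-empty

# Built from in-memory python objects: nothing on disk backs this dataset.
d_mem = Dataset.from_dict({'sentence1': ['Hello world', 'Another sentence']})
print(d_mem.cache_files)   # [] -- empty, so the caching branch above is skipped

# .map() still runs, but its result lives only in memory; no cache file is written.
d_mem = d_mem.map(add_prefix)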

Why is this a problem?
In the thread below, it was suggested that one use torch.distributed.barrier() when using a distributed framework to process a large dataset.

For my specific use case, I create the dataset using the .from_dict() method. I then process the dataset with .map() on the main process only (no cache files get saved automatically unless you specify the cache_file_name parameter of .map()). Once the non-main processes are resumed, they are not able to locate the cache files.
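
A rough sketch of that pattern, for illustration only (it assumes torch.distributed has already been initialized by the launcher, and that dataset and add_prefix are the objects from the example above; this is not code from the linked thread):

import torch.distributed as dist

# Intended pattern: rank 0 runs the expensive .map() first; the other ranks
# wait at the barrier, then call .map() themselves and are supposed to pick
# up the cache file written by rank 0 instead of recomputing.
if dist.get_rank() != 0:
    dist.barrier()  # non-main processes wait for rank 0 to finish mapping

# With a dataset built via from_dict(), cache_files is empty, so nothing is
# written to disk here and the other ranks find no cache when they resume.
dataset = dataset.map(add_prefix)

if dist.get_rank() == 0:
    dist.barrier()  # rank 0 is done; release the waiting processes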

Possible Workarounds

  1. I could save all the data (maybe as a csv) and load it using load_dataset(). This way, cache files are generated and all is well. However, my dataset is large (~100GB), and multiple processing steps with .map() further increase the data size, making it very difficult to store and load the data efficiently with limited hardware resources. That’s why I’m using from_dict() to load smaller data on-the-fly.
  2. Instead of using in-memory data to create the dataset, I could first save the data as csv or json and then use load_dataset('csv' or 'json', ...) (see the sketch after this list). I will try this method for now. I think it will work.
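
A minimal sketch of workaround 2 (the file name and data are placeholders):

import json
from datasets import load_dataset

# Dump the in-memory records to a JSON Lines file.
records = [{'sentence1': 'Hello world'}, {'sentence1': 'Another sentence'}]
with open('my_data.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

# load_dataset() builds arrow cache files for the json data, so the resulting
# dataset has a non-empty cache_files list and .map() results are cached as usual.
dataset = load_dataset('json', data_files='my_data.jsonl', split='train')
print(dataset.cache_files)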

Hi !

When dataset.cache_files is empty (i.e. when your dataset comes from Python objects rather than from files on your disk), the map() method doesn’t know where to write the resulting dataset.

In this case, you have to pass cache_file_name=<path/to/resulting/cache/file.arrow> manually to map().

For the subsequent calls to .map(), you won’t need to specify this anymore, since it will store the cache files in the same directory as the path you provided in the first place.
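
A minimal sketch of what this looks like (the path and data are placeholders, and add_prefix is the mapping function from the original post):

from datasets import Dataset

dataset = Dataset.from_dict({'sentence1': ['Hello world', 'Another sentence']})

# cache_files is empty here, so tell .map() explicitly where to write its
# result; the returned dataset is then backed by this arrow file.
dataset = dataset.map(
    add_prefix,  # the mapping function from the original post
    cache_file_name='/some/cache/dir/my_mapped_dataset.arrow',  # placeholder path
)

# Subsequent .map() calls on the result can cache automatically, next to this file.
print(dataset.cache_files)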

For what it’s worth, I got around this issue by using datasets.Dataset.from_file in conjunction with datasets.Dataset.from_dict. The pattern looks something like:

import os
import datasets

dataset_path = '/some/path'

# First run only: build the dataset from in-memory data and persist it to disk.
if not os.path.exists(dataset_path):
    d = datasets.Dataset.from_dict(...)  # your in-memory data goes here
    d.save_to_disk(dataset_path)

# Every run: reload the dataset from the arrow file written by save_to_disk(),
# so that d.cache_files is non-empty and .map() can cache its results.
d = datasets.Dataset.from_file(
    filename=os.path.join(dataset_path, 'dataset.arrow'),
    info=datasets.DatasetInfo(
        builder_name="my_dataset",
        config_name="my_dataset_config",
    ),
)

Note that the first time this code runs, after creating the dataset with from_dict, we have to save it and load it back so that dataset.cache_files will be non-empty. The reason this works is that dataset.cache_files is set automatically when the dataset is loaded from disk.
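
If you don’t need to pass a custom DatasetInfo, a variation on the same idea is to reload with datasets.load_from_disk() instead of Dataset.from_file(); it should likewise leave cache_files non-empty, since the dataset is backed by the files written by save_to_disk(). A minimal sketch, with placeholder path and data:

import os
import datasets

dataset_path = '/some/path'  # placeholder path, as above

if not os.path.exists(dataset_path):
    # Placeholder data; in practice this is your in-memory dict of columns.
    d = datasets.Dataset.from_dict({'sentence1': ['Hello world']})
    d.save_to_disk(dataset_path)

# Reload from the directory written by save_to_disk(); the dataset is now
# backed by the arrow file(s) on disk, so cache_files is non-empty.
d = datasets.load_from_disk(dataset_path)
print(d.cache_files)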


Also stumbled on this bug. Any plans to fix it?