I have a function get_dataset() that downloads a Dataset from s3 and then runs some maps and filters on it. When I run this code outside of a notebook, it loads the cached versions of these operations, but when I call get_dataset() from within a Jupyter notebook, it re-computes and creates its own set of cache files (which it will load if I run again from within the notebook). I did ensure that in both cases it’s downloading and loading the Dataset from the same path, and the new cache files are living alongside the original ones.
I’d love to be able to access my cached Datasets from within the notebook… any thoughts?
Hi, I’m sorry if I wasn’t clear. The code inside get_dataset() doesn’t change; only the context in which I’m calling it does. For example, within the notebook, if I call dset = get_dataset() it re-computes everything and creates its own cache files. But if I call !python /path/to/test/script.py, where dset = get_dataset() is called from inside script.py, then it loads the cached files.
It seems that the call coming from within the notebook itself changes the hash?
Outline of what I’m doing:
From inside /path/to/test/script.py I do:
def get_dataset():
    # cached download and load of the dataset
    dataset_dir = cached_download()  # avoid shadowing the builtin `dir`
    dset = datasets.load_from_disk(dataset_dir)
    # run some maps and filters
    dset = dset.filter(...)
    dset = dset.map(...)
    dset = dset.map(...)
    dset = dset.filter(...)
    return dset
if __name__ == '__main__':
# this will compute the first time and re-load the cache every time thereafter
dset = get_dataset()
Now inside the notebook
# this recomputes EVEN IF I ALREADY COMPUTED IT AS ABOVE
# if run a second time it loads the new cache files which live alongside the original cache files
# (i.e. in the same directory with the dataset arrow file)
from script import get_dataset
dset = get_dataset()
# interestingly, this will load the original cached arrow files from the script
!python /path/to/test/script.py
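For concreteness, here is roughly the workaround I’m considering so both contexts share the same cache files no matter what fingerprint gets computed: passing explicit cache_file_name arguments to map/filter. This is only a sketch — the paths and the placeholder lambdas are made up, and it assumes the `datasets` library’s cache_file_name parameter:

```python
# Sketch only: pin cache files explicitly so map/filter read and write the
# same .arrow files in every context, instead of deriving the file name
# from the (context-dependent) fingerprint. Paths/lambdas are placeholders.

def get_dataset(cache_dir="/path/to/cache"):
    import datasets  # imported lazily; requires the `datasets` library

    dset = datasets.load_from_disk(cache_dir)
    dset = dset.filter(
        lambda ex: ex["text"] != "",  # placeholder predicate
        cache_file_name=f"{cache_dir}/filtered.arrow",
    )
    dset = dset.map(
        lambda ex: ex,  # placeholder transform
        cache_file_name=f"{cache_dir}/mapped.arrow",
    )
    return dset
```

I’d still prefer to understand why the fingerprint differs, but pinning the file names would at least make the caching deterministic.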
Sorry if it wasn’t clear, but if you look at my pseudocode/outline, you can see that get_dataset is not defined inside the __main__ guard but before it — and both are in script.py. I just checked, and get_dataset.__module__ is script in both contexts.
However, your mention of __globals__ was useful. I inspected that attribute of the function when imported in the notebook versus in plain Python, and noticed that they are nearly identical, save that the notebook version includes the following two entries inside get_dataset.__globals__['__builtins__']:
I tried popping these entries before calling get_dataset inside the notebook, but it still triggered a re-compute instead of loading the cache. Are there any other attributes I could check that might affect the match? Also, is there a way to test whether the functions are the same without actually running get_dataset — like calling a hash function on them in the two contexts?
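Something like this stdlib sketch is what I had in mind — the helper name and all details are mine, not part of the library, and it deliberately ignores file names and line numbers so the same def imported in a script and in a notebook should compare equal:

```python
import hashlib

def code_hash(fn):
    """Hash a function's bytecode plus referenced names and constants.

    Ignores co_filename and line numbers, so an identical definition
    imported into different contexts should produce the same digest.
    Only reliable for simple functions: nested functions put code objects
    in co_consts, whose repr() includes memory addresses and is unstable.
    """
    code = fn.__code__
    payload = code.co_code + repr(
        (code.co_names, code.co_varnames, code.co_consts)
    ).encode()
    return hashlib.sha256(payload).hexdigest()
```

I could print code_hash(get_dataset) in both contexts and compare. (I realize datasets computes its fingerprints with its own pickling-based hasher rather than anything this simple, so matching digests here wouldn’t guarantee matching fingerprints — it would just rule out the function body itself differing.)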
I feel like this is a bug, in that I am not changing any code between the two contexts. I’m calling the exact same function, just from different contexts (plain Python vs. a Jupyter notebook).