Cache is not being loaded when code is called from a Jupyter notebook

I have a function get_dataset() that downloads a Dataset from S3 and then runs some maps and filters on it. When I run this code outside of a notebook, it loads the cached versions of these operations, but when I call get_dataset() from within a Jupyter notebook, it re-computes everything and creates its own set of cache files (which it will then load if I run again from within the notebook). I verified that in both cases it’s downloading and loading the Dataset from the same path, and that the new cache files live alongside the original ones.

I’d love to be able to access my cached Datasets from within the notebook… any thoughts?

Any change you make to the function you pass to map causes the result to be recomputed. Defining your function inside a Jupyter notebook versus outside of one may affect this.
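As a rough illustration (a sketch only; the add_prefix functions below are made up, and it uses the Hasher helper from datasets.fingerprint that also comes up later in this thread), the cache key for a map/filter call is derived from a hash of the function, so even a one-character edit produces a new key:

from datasets.fingerprint import Hasher

def add_prefix(example):
    example["text"] = "prefix: " + example["text"]
    return example

def add_prefix_v2(example):
    # identical except for one character in the string literal
    example["text"] = "Prefix: " + example["text"]
    return example

# the two hashes differ, so map() would not reuse cache files produced with the other function
print(Hasher.hash(add_prefix))
print(Hasher.hash(add_prefix_v2))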

Hi, I’m sorry if I wasn’t clear. The code inside get_dataset() doesn’t change; only the context in which I’m calling it does. For example, within the notebook, if I call dset = get_dataset(), it re-computes everything and creates its own cache files. But if I call !python /path/to/test/script.py, where dset = get_dataset() is called from inside script.py, then it loads the cached files.

It seems that calling it from within the notebook itself changes the hash?

Outline of what I’m doing:

From inside /path/to/test/script.py I do:

import datasets

def get_dataset():
    # cached download, then load the dataset from disk
    dataset_dir = cached_download()
    dset = datasets.load_from_disk(dataset_dir)
    # run some maps and filters
    dset = dset.filter(...)
    dset = dset.map(...)
    dset = dset.map(...)
    dset = dset.filter(...)

    return dset

if __name__ == '__main__':
    # this will compute the first time and re-load the cache every time thereafter
    dset = get_dataset()

Now inside the notebook

# this recomputes EVEN IF I ALREADY COMPUTED IT AS ABOVE
# if run a second time, it loads the new cache files, which live alongside the original ones
# (i.e. in the same directory as the dataset arrow file)
from script import get_dataset
dset = get_dataset()

# interestingly, this will load the original cached arrow files from the script
!python /path/to/test/script.py

I hope that helps.

I think it might be because the __module__ and __globals__ of your function differ between your script and the notebook.

For example, when you run your script directly, get_dataset.__module__ is __main__, while in the notebook it may be script.

Any slight change in the function will make the cache recompute the result.

As a workaround, I’d suggest defining your function somewhere other than your __main__ script.
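Something like this, for example (a sketch only; dataset_utils.py is a hypothetical file name):

# dataset_utils.py -- the function lives here, so __module__ is the same everywhere it is imported
import datasets

def get_dataset():
    ...

# script.py
from dataset_utils import get_dataset

if __name__ == '__main__':
    dset = get_dataset()

# notebook cell
from dataset_utils import get_dataset
dset = get_dataset()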

Hi, thanks again for your input.

Sorry if it wasn’t clear, but if you look at my pseudocode/outline, you can see that get_dataset is not defined under __main__ but before it, and both are in script. I just checked, and get_dataset.__module__ is script in both contexts.

However, your mention of __globals__ was useful. I inspected that attribute of the function when imported in the notebook versus in plain Python and noticed that they are nearly identical, save that the notebook version includes the following two entries inside get_dataset.__globals__['__builtins__']:

  'execfile': <function _pydev_bundle._pydev_execfile.execfile(file, glob=None, loc=None)>,
  'runfile': <function _pydev_bundle.pydev_umd.runfile(filename, args=None, wdir=None, namespace=None)>

I tried popping these entries before calling get_dataset inside the notebook, but it still triggered a re-compute as opposed to loading the cache. Are there any other attributes or anything else to check that might affect the match? Also, is there a way to test whether the functions are the same without running get_dataset, like calling a hash function on them in the two contexts?

I feel like this is a bug, in that I am not changing any code between the two contexts. I’m calling the exact same function, just from different contexts (Python vs. a notebook).

Thanks again!

OK, I see! I agree this looks like a bug then.

You can indeed hash your function:

from datasets.fingerprint import Hasher 

print(Hasher.hash(get_dataset))

And you may also hash your function’s attributes to find which one is changing, instead of staying the same, in the two contexts.
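Something along these lines, run in both contexts, should narrow it down (a sketch only; it assumes script.py is importable from the notebook, and some attributes may simply not be hashable):

from datasets.fingerprint import Hasher
from script import get_dataset

# hash the function itself, then its individual attributes;
# whichever hash differs between the two contexts is what invalidates the cache
print("function:", Hasher.hash(get_dataset))
for attr in ("__code__", "__module__", "__qualname__", "__defaults__", "__dict__", "__globals__"):
    try:
        print(attr, Hasher.hash(getattr(get_dataset, attr)))
    except Exception as err:
        print(attr, "could not be hashed:", err)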