Best way to access the cached transformation arrow file

I was experimenting with the cached Arrow dataset but ran into an error.

The steps I followed:

  1. Loaded the dataset from HF using load_dataset()
  2. I created a simple transformation by shuffling the data
    dataset = dataset["test"].shuffle(seed=42)

This step produced a new cache file in the Hugging Face cache directory:
~/.cache/huggingface/datasets/bt-tech___ofac/dataset/1.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-c7f4daa392c5ed6a.arrow

After I restart the session, what is the best way to load the transformed cache?
With load_dataset("bt-tech__ofac", "dataset"), I get the following error:

FileNotFoundError: Couldn't find a dataset script at /Users/home/Documents/bt/RD1615/bt-tech__ofac/bt-tech__ofac.py or any data file in the same directory. Couldn't find 'bt-tech__ofac' on the Hugging Face Hub either: FileNotFoundError: Dataset 'bt-tech__ofac' doesn't exist on the Hub

I made sure to export HF_DATASETS_OFFLINE=1.
I am not sure why it searched for the script in my working directory, since I can see the script in the HF cache. Also, I noticed that the delimiter between the namespace and the dataset name is "--" there instead of "___":
/Users/home/.cache/huggingface/modules/datasets_modules/datasets/bt-tech--ofac/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559
Files in this directory

-rw-r--r--  1 home  staff  6356 Jan 10 10:37 README.md
-rw-r--r--  1 home  staff     0 Jan 10 10:37 __init__.py
-rw-r--r--  1 home  staff   415 Jan 10 10:37 ofac.json
-rw-r--r--  1 home  staff  5307 Jan 10 10:37 ofac.py

So, why did it fail? I also realise that this would likely fetch the original cached file (ofac-test.arrow) instead of the transformed cache.

I tried to access the transformed cache by saving it to disk with dataset.save_to_disk("cached_dir/") and then loading it via load_from_disk("cached_dir/") - this worked.

I am just wondering why the earlier approach failed. Any hints?

Hi ! How did you load your dataset in the first place ? Using the same code it should reload the dataset from your cache.

@lhoestq

This is how I initially loaded it - load_dataset("bt-tech/ofac")

I then exported the environment variable HF_DATASETS_OFFLINE=1. The code ran, but I couldn't tell whether it picked up the cache because I was offline or because the cache file was already present. However, when I explicitly turned off the internet, it threw a different message, but it worked:

"""Using the latest cached version of the module from /Users/home/.cache/huggingface/modules/datasets_modules/datasets/bt-tech--ofac/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559 (last modified on Fri Jan 13 15:49:50 2023) since it couldn't be found locally at bt-tech/ofac., or remotely on the Hugging Face Hub."""

What does the above env variable do?

Since any transformation creates a new cache file, what is the best way to access it?

If I might add another observation:

when I save the transformed cache using dataset.save_to_disk(), the resulting Arrow file is significantly larger (~2.5x) compared to the transformed cache sitting in the user's home directory. I noticed that save_to_disk() flattens the indices. Why does it perform this operation, and is it the reason the file inflates in size?

It skips the timeout steps when trying to reach the HF Hub, so it falls back on the cache directly.
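In other words (for the datasets 2.x behaviour this thread is about), it is just an environment flag that has to be in place before datasets is imported:

```python
import os

# With HF_DATASETS_OFFLINE=1, `datasets` skips the Hub requests (and their
# timeouts) entirely and resolves load_dataset() from the local cache directly.
# Set it before importing `datasets`, as the flag is read at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"
```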

datasets may use an indices mapping when you shuffle/shard/select, to avoid copying all the data on your disk over and over again. When calling save_to_disk though, we remove the indices mapping, so that you end up with all the data you need, and nothing more, in one directory.

e.g. if you concatenate the same dataset 100 times, it doesn't copy the data 100 times on disk, but uses the same data 100 times. Then if you shuffle the dataset, it applies an indices mapping to know which row to pick from the data on disk for each example. Finally, if you call save_to_disk, it will write the full shuffled dataset, with 100x the size of the original one.


@lhoestq
Thanks for clarifying these queries. I really appreciate it.

With respect to accessing the transformed cached file
~/.cache/huggingface/datasets/bt-tech___ofac/dataset/1.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-c7f4daa392c5ed6a.arrow

Is it possible to use the load_dataset() without receiving the “FileNotFound” error?

If you re-run the same map function, it will reload from the cache automatically. To load one file manually you can use ds = Dataset.from_file(path)

@lhoestq
Sorry but I haven’t used any map function in this example. Could you please clarify?

I was speaking in general, sorry. In your case you used shuffle, which creates a shuffled list of indices on top of the Arrow table containing your data. The cache file aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-c7f4daa392c5ed6a.arrow generated by shuffle contains this list of indices.

To reload the dataset from the cache via load_dataset, you need to call load_dataset with the same parameters as you did the first time. If it was pointing to a local directory with a .py file, this directory must exist and contain the script, because the cache works using a hash based on the .py file content.