Best way to access the cached transformation arrow file

I was experimenting with the cached Arrow dataset but ran into an error.

The steps I followed:

  1. Loaded the dataset from HF using load_dataset()
  2. I created a simple transformation by shuffling the data
    dataset = dataset["test"].shuffle(seed=42)

This step produced a new cache file in the Hugging Face cache directory:
~/.cache/huggingface/datasets/bt-tech___ofac/dataset/1.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-c7f4daa392c5ed6a.arrow

After I restart the session, what is the best way to load the transformed cache?
With load_dataset("bt-tech__ofac", "dataset"), I get the following error:

FileNotFoundError: Couldn't find a dataset script at /Users/home/Documents/bt/RD1615/bt-tech__ofac/bt-tech__ofac.py or any data file in the same directory. Couldn't find 'bt-tech__ofac' on the Hugging Face Hub either: FileNotFoundError: Dataset 'bt-tech__ofac' doesn't exist on the Hub

I made sure to export HF_DATASETS_OFFLINE=1.
I am not sure why it searched for the script in my working directory, since I can see the script in the HF cache. Also, I noticed that the delimiter between the namespace and the dataset name is "--" there instead of "___":
/Users/home/.cache/huggingface/modules/datasets_modules/datasets/bt-tech--ofac/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559
Files in this directory

-rw-r--r--  1 home  staff  6356 Jan 10 10:37 README.md
-rw-r--r--  1 home  staff     0 Jan 10 10:37 __init__.py
-rw-r--r--  1 home  staff   415 Jan 10 10:37 ofac.json
-rw-r--r--  1 home  staff  5307 Jan 10 10:37 ofac.py

So, why did it fail? I also realise that this would likely fetch the original cached file (ofac-test.arrow) instead of the transformed cache.

I tried to access the transformed cache by saving it to disk with dataset.save_to_disk("cached_dir/") and then loading it via load_from_disk("cached_dir/") - this worked.

I am just wondering why the earlier approach failed. Any hints?

Hi ! How did you load your dataset in the first place ? Using the same code it should reload the dataset from your cache.

@lhoestq

This is how I initially loaded it - load_dataset("bt-tech/ofac")

I then exported the environment variable HF_DATASETS_OFFLINE=1. The code ran, but I couldn't tell whether it picked up the cache because I was offline or because the cache file was already present. However, when I explicitly turned off the internet, it threw a different message, but it worked:

"""Using the latest cached version of the module from /Users/home/.cache/huggingface/modules/datasets_modules/datasets/bt-tech--ofac/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559 (last modified on Fri Jan 13 15:49:50 2023) since it couldn't be found locally at bt-tech/ofac., or remotely on the Hugging Face Hub."""

What does the above env variable do?

Since any transformation creates a new cache file, what is the best way to access it?

If I might add another observation:

when I save the transformed cache using dataset.save_to_disk(), the resulting Arrow file is significantly larger (~2.5x) compared to the transformed cache sitting in the user's home directory. I noticed that save_to_disk() flattens the indices. Why does it perform this operation, and is it the reason the file inflates in size?

It skips the timeout steps when trying to reach the HF Hub, so it falls back on the cache directly.
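In other words (for the datasets 2.x behaviour this thread is about), it is just an environment flag that has to be in place before datasets is imported:

```python
import os

# With HF_DATASETS_OFFLINE=1, `datasets` skips the Hub requests (and their
# timeouts) entirely and resolves load_dataset() from the local cache directly.
# Set it before importing `datasets`, as the flag is read at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"
```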

datasets may use an indices mapping when you shuffle/shard/select, to avoid copying all the data on your disk over and over again. When calling save_to_disk though, we remove the indices mapping, so that you end up with all the data you need, and nothing more, in one directory.

e.g. if you concatenate the same dataset 100 times, it doesn't copy the data 100 times on disk, but uses the same data 100 times. Then if you shuffle the dataset, it applies an indices mapping to know which row to pick from the data on disk for each example. Finally, if you call save_to_disk, it will write the full shuffled dataset, with 100x the size of the original one.


@lhoestq
Thanks for clarifying these queries. I really appreciate it.

With respect to accessing the transformed cached file
~/.cache/huggingface/datasets/bt-tech___ofac/dataset/1.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-c7f4daa392c5ed6a.arrow

Is it possible to use the load_dataset() without receiving the “FileNotFound” error?

If you re-run the same map function, it will reload from the cache automatically. To load one file manually you can use ds = Dataset.from_file(path)

@lhoestq
Sorry but I haven’t used any map function in this example. Could you please clarify?

I was speaking in general, sorry. In your case you used shuffle, which creates a shuffled list of indices on top of the Arrow table containing your data. The cache file aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-c7f4daa392c5ed6a.arrow generated by shuffle contains this list of indices.

To reload the dataset from the cache via load_dataset, you need to call load_dataset with the same parameters as you did the first time. If it was pointing to a local directory with a .py file, this directory must exist and contain the script, because the cache works using a hash based on the .py file content.