Loading Huge Image Dataset seems to take a lot of time

Hello all,

I am training an image model on a large dataset of about 6M images. I use Hugging Face's load_dataset to load the dataset, apply a number of transforms with the Dataset.map method, and store the processed dataset with the Dataset.save_to_disk method.

My understanding is that after load_dataset is finished once, subsequent calls should use the cached dataset.

I am running in a SLURM environment and my process died after load_dataset and before Dataset.map.

I submitted the run again. It has started, but load_dataset has already taken about 2 hours and doesn't seem to have loaded the cached dataset.

I want to understand whether this is the expected behavior and, if not, what changes I should make.

Thanks in advance.

Hi ! Can you provide more information on how you load your images ? Are you using a dataset script or just loading local files ?

load_dataset first creates an arrow file that is cached so that subsequent calls reload it.

Hello @lhoestq ,

I am using dataset = load_dataset("imagefolder", data_dir="/path/to/dir", split="train") to load the dataset and the directory format is

data
├── class_0
│   ├── image_1.png
│   ├── .......
│   └── image_n1.jpeg
├── class_1
│   ├── image_1.png
│   ├── .......
│   └── image_n2.jpeg
├── class_2
├── ......
└── class_m
    ├── image_1.png
    ├── .......
    └── image_n2.jpeg

I forgot to mention earlier that in the SLURM cluster my dataset is in the scratch space. Will this affect how long it takes to load the dataset ?

Also could you explain what happens when loading the arrow file ?
Does it try to load the whole dataset into RAM, or does it read from the file whenever necessary ?

Is it advised to load my dataset as a Map-style dataset ?

Hello,

As an update:

[INFO]: Starting `load_dataset`.
Downloading and preparing dataset image_folder/default to /scratch/path/to/image_folder/default-1b3600f249b91a7f/0.0.0/48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597...
Dataset image_folder downloaded and prepared to /scratch/path/to/image_folder/default-1b3600f249b91a7f/0.0.0/48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597. Subsequent calls will reuse this data.
[INFO]: Finished `load_dataset`.

Total time: 0Y 0M 0D 8h 5m 40s

I have now called load_dataset five times, and each call prints the output above and prepares the dataset again instead of reusing it.

NOTE: The [INFO] lines and the Total time are printed by my own code to track how long the dataset takes to load.

I forgot to mention earlier that in the SLURM cluster my dataset is in the scratch space. Will this affect how long it takes to load the dataset ?

It depends on whether SCRATCH is deleted after each job and, if not, how often it is wiped.

If the dataset already exists in /scratch/path/to/image_folder/default-1b3600f249b91a7f then it will reload it, otherwise it will generate the dataset at this location.

The hash part in "default-1b3600f249b91a7f" is computed using the list of paths of the images, and other metadata such as their "last modified date".
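To make the consequence concrete, here is a toy sketch of a fingerprint built from file paths and modification times. It is not the code the datasets library uses; `toy_fingerprint` and `demo_touch_changes_fingerprint` are hypothetical names, and the hashing scheme is an assumption chosen only to show that touching a file changes such a hash even when the file contents stay the same.

```python
import hashlib
import os
import tempfile
import time


def toy_fingerprint(paths):
    """Hash file paths together with their last-modified times.

    Illustration only: this is NOT how `datasets` computes its cache hash.
    """
    digest = hashlib.sha256()
    for path in sorted(paths):
        mtime_ns = os.stat(path).st_mtime_ns
        digest.update(f"{path}:{mtime_ns}".encode())
    return digest.hexdigest()[:16]


def demo_touch_changes_fingerprint():
    """Return the fingerprint before and after changing only the mtime."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "image_1.png")
        with open(path, "wb") as f:
            f.write(b"placeholder bytes, not a real image")
        before = toy_fingerprint([path])
        # Simulate a "touch": same bytes, later modification time.
        later = time.time() + 60
        os.utime(path, (later, later))
        after = toy_fingerprint([path])
        return before, after
```

With a scheme like this, any job that refreshes modification times makes the files look like a different dataset, so a new cache directory gets created.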

Also could you explain what happens when loading the arrow file ?
Does it try to load the whole dataset into RAM, or does it read from the file whenever necessary ?

The arrow file is memory mapped, which means that it is not loaded in your RAM. In short, the portion of the disk that contains the data is mapped in memory (as virtual memory), and therefore loading the dataset takes almost 0 RAM and only uses your disk.

When you access one example from the dataset, then only this part is loaded in memory as a python object containing the image content.
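As a rough illustration of memory mapping in general, the sketch below uses Python's standard mmap module on a plain binary file. It is not the Arrow reader that datasets relies on; the file layout and the names `read_slice_via_mmap` and `demo_mmap` are assumptions made for the example.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, start, length):
    """Return `length` bytes at offset `start` without reading the whole file."""
    with open(path, "rb") as f:
        # Map the file into virtual memory; pages are fetched from disk on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[start:start + length]


def demo_mmap():
    """Write 1000 fixed-size records, then read one of them through the mapping."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.bin")
        with open(path, "wb") as f:
            for i in range(1000):
                # Each record is exactly 12 bytes, e.g. b"record-0500\n".
                f.write(f"record-{i:04d}\n".encode())
        # Record 500 starts at byte 12 * 500; skip its trailing newline.
        return read_slice_via_mmap(path, 12 * 500, 11)
```

Only the requested slice is copied into a Python object, which mirrors the idea that accessing one example materializes just that example.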

Hello @lhoestq
The scratch space is not wiped immediately; there is always a buffer of 2 weeks before permanent deletion.

So ideally, when I run two jobs one after another, each taking about a day, the second one should reuse the cached dataset.

Is there a way to minimize the time taken to compute the hash ?

UPDATE 1 :
The cache folder seems to contain 4 arrow files, one in each of the following directories.

[user@HPC image_folder]$ pwd
/scratch/user/huggingface/datasets/image_folder

[user@HPC image_folder]$ ll
total 100
drwxrwxr-x. 3 user user 25600 May 15 21:54 default-1b3600f249b91a7f
drwxrwxr-x. 3 user user 25600 May 14 17:32 default-6b48d0d2905d69ed
drwxrwxr-x. 3 user user 25600 May 13 22:05 default-823b2944394c48b9
drwxrwxr-x. 3 user user 25600 May 12 21:32 default-c43615b269769e8e

[user@HPC image_folder]$ 

[user@HPC image_folder]$ tree -L 4 ./default-1b3600f249b91a7f/
./default-1b3600f249b91a7f/
└── 0.0.0
    └── 48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597
        ├── dataset_info.json
        └── image_folder-train.arrow

2 directories, 2 files

[user@HPC image_folder]$ tree -L 4 ./default-6b48d0d2905d69ed/
./default-6b48d0d2905d69ed/
└── 0.0.0
    └── 48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597
        ├── dataset_info.json
        └── image_folder-train.arrow

2 directories, 2 files

[user@HPC image_folder]$ tree -L 4 ./default-823b2944394c48b9/
./default-823b2944394c48b9/
└── 0.0.0
    └── 48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597
        ├── dataset_info.json
        └── image_folder-train.arrow

2 directories, 2 files

[user@HPC image_folder]$ tree -L 4 ./default-c43615b269769e8e/
./default-c43615b269769e8e/
└── 0.0.0
    └── 48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597
        ├── dataset_info.json
        └── image_folder-train.arrow

2 directories, 2 files

0.0.0/48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597 is constant while the default-<hash_value> keeps varying.

UPDATE 2:

We have a recurring job that changes the modified date of the folder and image files to keep the dataset from being deleted. Hence the multiple hashes and multiple arrow files.

Is there a way for me to change the hash or provide a different method to compute the hash ?

This is probably related. Because the modification dates change, we can't guarantee that the images are the same between two jobs from the POV of the datasets cache.

You can work around this by generating the dataset once and then saving it somewhere with .save_to_disk("path/to/saved/dataset"). Then, in subsequent jobs, you can reload it with load_from_disk("path/to/saved/dataset").
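A minimal sketch of that workaround, under some assumptions: `load_or_build` is a hypothetical helper name, and the check for a state.json file assumes that a completed save_to_disk leaves that file in the target directory. If a job dies in the middle of saving, the partial directory may need to be removed by hand.

```python
from pathlib import Path


def load_or_build(saved_path, data_dir):
    """Reload a previously saved dataset, or build it once and save it.

    Hypothetical helper: the name and the completeness check are assumptions.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset, load_from_disk

    saved_path = Path(saved_path)
    # Assumption: a finished `save_to_disk` leaves a `state.json` in the directory.
    if (saved_path / "state.json").exists():
        return load_from_disk(str(saved_path))

    # First run: build from the image folder (slow), then persist the result
    # at a location that does not depend on file modification times.
    dataset = load_dataset("imagefolder", data_dir=data_dir, split="train")
    dataset.save_to_disk(str(saved_path))
    return dataset
```

In a job script this could be called as load_or_build("path/to/saved/dataset", "/path/to/dir"), so only the first job pays the cost of preparing the images.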

Thanks a lot. I will test this and get back if any issues arise.