Hello all,
I am working on training an image model on a large image dataset of about 6M images. I am using the Hugging Face `load_dataset` function to load the dataset, run a bunch of transforms using the `Dataset.map` method, and store the processed dataset using the `Dataset.save_to_disk` method.
My understanding is that after `load_dataset` has finished once, subsequent calls should use the cached dataset.
I am running in a SLURM environment, and my process died after `load_dataset` completed and before `Dataset.map` started.
I submitted the run again and it has started, but `load_dataset` has already taken about 2 hours, so it doesn't seem to have loaded the cached dataset.
I want to understand whether this is the expected outcome and, if not, what changes I should be making.
Thanks in advance.
Hi! Can you provide more information on how you load your images? Are you using a dataset script or just loading local files?
`load_dataset` first creates an arrow file that is cached so that subsequent calls reload it.
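For instance, you can check which cached arrow file backs a loaded dataset via its `cache_files` attribute (a generic sketch; the CSV file and arguments are just placeholders for however you load your data):

```python
from datasets import load_dataset

# First call: builds the dataset and writes an arrow file into the cache.
# Calling load_dataset again with the same arguments reloads that file.
ds = load_dataset("csv", data_files="my_data.csv", split="train")

# cache_files lists the cached arrow file(s) backing this Dataset.
print(ds.cache_files)
```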
Hello @lhoestq,
I am using `dataset = load_dataset("imagefolder", data_dir="/path/to/dir", split="train")` to load the dataset, and the directory format is:
data
├── class_0
│   ├── image_1.png
│   ├── .......
│   └── image_n1.jpeg
├── class_1
│   ├── image_1.png
│   ├── .......
│   └── image_n2.jpeg
├── class_2
├── ......
└── class_m
    ├── image_1.png
    ├── .......
    └── image_n2.jpeg
I forgot to mention earlier that on the SLURM cluster my dataset is in the scratch space. Will this affect how long it takes to load the dataset?
Also, could you explain what happens when loading the arrow file? Does it try to load the dataset into RAM, or does it read from the file whenever necessary?
Is it advisable to load my dataset as a map-style dataset?
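For reference, the rest of my pipeline after `load_dataset` is roughly the following (the `preprocess` function is a placeholder for my actual transforms):

```python
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/dir", split="train")

def preprocess(batch):
    # Placeholder for my actual transforms (resize, normalize, ...).
    return batch

# Apply the transforms and persist the processed dataset for the training jobs.
dataset = dataset.map(preprocess, batched=True)
dataset.save_to_disk("/path/to/processed_dataset")
```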
Hello,
As an update:
[INFO]: Starting `load_dataset`.
Downloading and preparing dataset image_folder/default to /scratch/path/to/image_folder/default-1b3600f249b91a7f/0.0.0/48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597...
Dataset image_folder downloaded and prepared to /scratch/path/to/image_folder/default-1b3600f249b91a7f/0.0.0/48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597. Subsequent calls will reuse this data.
[INFO]: Finished `load_dataset`.
Total time: 0Y 0M 0D 8h 5m 40s
I have now loaded the dataset for the 5th time, and each call produces the output above instead of reusing the cached dataset.
NOTE: The `[INFO]` and `Total time` lines are printed by my own code to track how long it takes for the dataset to load.
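For context, the timing wrapper is just something simple along these lines (simplified sketch; my real helper also formats years/months/days):

```python
import time
from datasets import load_dataset

print("[INFO]: Starting `load_dataset`.")
start = time.monotonic()

dataset = load_dataset("imagefolder", data_dir="/path/to/dir", split="train")

# Convert the elapsed seconds into hours / minutes / seconds for the log line.
elapsed = int(time.monotonic() - start)
hours, rest = divmod(elapsed, 3600)
minutes, seconds = divmod(rest, 60)
print("[INFO]: Finished `load_dataset`.")
print(f"Total time: {hours}h {minutes}m {seconds}s")
```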
> I forgot to mention earlier that on the SLURM cluster my dataset is in the scratch space. Will this affect how long it takes to load the dataset?
It depends on whether SCRATCH is deleted after each job, and how often it is wiped.
If the dataset already exists in /scratch/path/to/image_folder/default-1b3600f249b91a7f, then it will be reloaded; otherwise it will be generated at this location.
The hash part in "default-1b3600f249b91a7f" is computed from the list of image paths and other metadata such as their "last modified" dates.
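Roughly, the idea is something like this (a simplified illustration of the principle, not the actual code that `datasets` runs internally):

```python
import hashlib
from pathlib import Path

def illustrative_config_hash(data_dir: str) -> str:
    # Collect the image paths together with some of their metadata
    # (size, last modified time). If any of these change, e.g. because
    # a file was touched, the hash changes and a new cache directory
    # like "default-<hash>" is created.
    entries = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            stat = path.stat()
            entries.append((str(path), stat.st_size, stat.st_mtime))
    return hashlib.sha256(repr(entries).encode()).hexdigest()[:16]
```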
> Also, could you explain what happens when loading the arrow file? Does it try to load the dataset into RAM, or does it read from the file whenever necessary?
The arrow file is memory mapped, which means that it is not loaded in your RAM. In short, the portion of the disk that contains the data is mapped in memory (as virtual memory), and therefore loading the dataset takes almost 0 RAM and only uses your disk.
When you access one example from the dataset, only that part is loaded into memory, as a Python object containing the image content.
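For example (the path is a placeholder):

```python
from datasets import load_dataset

# Re-loading from the cache memory-maps the arrow file: the millions of
# images are NOT read into RAM at this point.
ds = load_dataset("imagefolder", data_dir="/path/to/dir", split="train")

# Only when you access an example is that row read from disk and decoded,
# here into a PIL image and its class label.
example = ds[0]
print(example["image"], example["label"])
```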
Hello @lhoestq,
The scratch space is not wiped immediately; there is always a buffer of 2 weeks before permanent deletion. So ideally, when I run two jobs one after another, each taking about one day, the second should reuse the cached dataset.
Is there a way to minimize the time taken to compute the hash?
UPDATE 1:
The cache folder seems to contain 4 arrow files, one under each default-<hash> directory.
[user@HPC image_folder]$ pwd
/scratch/user/huggingface/datasets/image_folder
[user@HPC image_folder]$ ll
total 100
drwxrwxr-x. 3 user user 25600 May 15 21:54 default-1b3600f249b91a7f
drwxrwxr-x. 3 user user 25600 May 14 17:32 default-6b48d0d2905d69ed
drwxrwxr-x. 3 user user 25600 May 13 22:05 default-823b2944394c48b9
drwxrwxr-x. 3 user user 25600 May 12 21:32 default-c43615b269769e8e
[user@HPC image_folder]$
[user@HPC image_folder]$ tree -L 4 ./default-1b3600f249b91a7f/
./default-1b3600f249b91a7f/
└── 0.0.0
    └── 48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597
        ├── dataset_info.json
        └── image_folder-train.arrow

2 directories, 2 files
[user@HPC image_folder]$ tree -L 4 ./default-6b48d0d2905d69ed/
./default-6b48d0d2905d69ed/
└── 0.0.0
    └── 48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597
        ├── dataset_info.json
        └── image_folder-train.arrow

2 directories, 2 files
[user@HPC image_folder]$ tree -L 4 ./default-823b2944394c48b9/
./default-823b2944394c48b9/
└── 0.0.0
    └── 48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597
        ├── dataset_info.json
        └── image_folder-train.arrow

2 directories, 2 files
[user@HPC image_folder]$ tree -L 4 ./default-c43615b269769e8e/
./default-c43615b269769e8e/
└── 0.0.0
    └── 48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597
        ├── dataset_info.json
        └── image_folder-train.arrow

2 directories, 2 files
The 0.0.0/48efdc62d40223daee675ca093d163bcb6cb0b7d7f93eb25aebf5edca72dc597 part is constant, while the default-<hash_value> part keeps varying.
UPDATE 2:
We have a recurring job that changes the modified date of the folder and image files to keep the dataset from being deleted. Hence the multiple hashes and multiple arrow files.
Is there a way for me to change the hash or provide a different method to compute the hash?
This is probably related. Because of this, we can't guarantee that the images are the same between two jobs from the point of view of the `datasets` cache.
You can work around this by generating the dataset once and then saving it somewhere with `.save_to_disk("path/to/saved/dataset")`. Then in subsequent jobs you can reload it with `load_from_disk("path/to/saved/dataset")`.
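Something like this (paths are placeholders, and the `imagefolder` call stands for however you build the dataset):

```python
from datasets import load_dataset, load_from_disk

# Job 1: generate the dataset once and save it to a stable location.
dataset = load_dataset("imagefolder", data_dir="/path/to/dir", split="train")
dataset.save_to_disk("/path/to/saved/dataset")

# Subsequent jobs: reload directly from the saved arrow files. No hashing
# of the image files is involved, so this should be much faster.
dataset = load_from_disk("/path/to/saved/dataset")
```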
Thanks a lot. I will test this and get back if any issues arise.