Duplicated cache- arrow files when uploading large folder?

Hello,

I have a folder with .arrow files that I previously created using save_to_disk. When I now use upload_large_folder, I see some cache- files being pushed to the Hub that I don't see in the directory I'm pushing from on my local machine. Is this normal? Are they duplicates, or is HF splitting each file into two?

It seems that all files from my local folder were uploaded and that these are additional.

This is an example of the "extra" files:

['uniref50_202401/arrow/train/cache-5438b1d15cbf9f5a_00004_of_00024.arrow', 'uniref50_202401/arrow/train/cache-77dd2d54eba47e69_00004_of_00024.arrow']

Can I just delete those in the repo?

It seems like it's okay to delete them, but if you're worried, ask lhonestq.

Cache files are the cached results of operations such as map() on the original data. You can delete them if you don't need the map() results anymore :wink:
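If it helps, here is a minimal sketch of how the cleanup could look, assuming the extra files can be recognised purely by their name (basename starting with `cache-` and ending in `.arrow`, as in the example above). The helper names `is_cache_file`, `find_cache_files` and `delete_cache_files_from_hub` are made up for illustration, and the repo id is something you would fill in yourself. It uses `HfApi.list_repo_files` and `HfApi.delete_files` from `huggingface_hub`; check the signatures against your installed version before running it for real.

```python
import posixpath


def is_cache_file(path: str) -> bool:
    """Return True if the file name looks like a datasets cache file."""
    # Assumption: cache files are named cache-<fingerprint>....arrow
    name = posixpath.basename(path)
    return name.startswith("cache-") and name.endswith(".arrow")


def find_cache_files(paths):
    """Keep only the paths whose file name looks like a cache file."""
    return [p for p in paths if is_cache_file(p)]


def delete_cache_files_from_hub(repo_id, repo_type="dataset", dry_run=True):
    """List cache files in a Hub repo and, unless dry_run, delete them.

    Hypothetical helper, not part of huggingface_hub.
    """
    # Imported here so the pure helpers above work without the library.
    from huggingface_hub import HfApi

    api = HfApi()
    cached = find_cache_files(api.list_repo_files(repo_id, repo_type=repo_type))
    if cached and not dry_run:
        # Each path is passed as its own pattern, so only the listed
        # files are removed, in a single commit.
        api.delete_files(
            repo_id=repo_id,
            delete_patterns=cached,
            repo_type=repo_type,
            commit_message="Remove datasets cache files",
        )
    return cached
```

Calling `delete_cache_files_from_hub("<your-repo-id>")` with the default `dry_run=True` only returns the matching paths, so you can inspect the list before deleting anything. To avoid uploading the cache files in the first place, `upload_large_folder` also accepts an `ignore_patterns` argument.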