Disk full, how to change download folder, downloading datasets to a new folder

First off, hello.

Now that the pleasantries are out of the way… let’s just be brutally honest.

You programmers suck at instructions. I mean epically, horrifically, SO(*FDU)(*H bad

Programming is a language and should be taught as such.

I’m here because of a near-miss with my sledgehammer and my computer.

Why? (as if you care, you don’t, but you’re trying to be polite)

Because of this:
  File "e:\Dev\CodeWriter4.0.venv\Lib\site-packages\huggingface_hub\file_download.py", line 860, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\Dev\CodeWriter4.0.venv\Lib\site-packages\huggingface_hub\file_download.py", line 1009, in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
  File "e:\Dev\CodeWriter4.0.venv\Lib\site-packages\huggingface_hub\file_download.py", line 1543, in _download_to_tmp_and_move
    http_get(
  File "e:\Dev\CodeWriter4.0.venv\Lib\site-packages\huggingface_hub\file_download.py", line 455, in http_get
    temp_file.write(chunk)
OSError: [Errno 28] No space left on device
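
For anyone decoding that last line: Errno 28 is the operating system's "no space left on device" code, which Python exposes as errno.ENOSPC. Here is a minimal sketch, standard library only, of checking how much room a drive has before kicking off a big download. The helper names (free_gib, has_room) are made up for illustration and are not part of huggingface_hub.

```python
import errno
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path, needed_gib):
    """True if the drive holding `path` has at least `needed_gib` GiB free."""
    return free_gib(path) >= needed_gib


# The number in the traceback and its symbolic name.
print(errno.ENOSPC, errno.errorcode[errno.ENOSPC])

# Free space on the drive that holds the current folder.
print(f"{free_gib('.'):.1f} GiB free here")
```

Something like has_room("G:\\", 200) before a download would have turned a 43 minute wait into an instant answer, assuming you have a rough idea of how big the download is.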

Console barf…

And to my fellow Asperger-types - yes I RTMFM (I added a letter, you figure it out)

I can’t tell you how many times I’ve said “It just can’t be this hard”.

Anyway.

Here was the fix:
os.environ["HUGGINGFACE_HUB_CACHE"] = r"G:\HuggingFace\hub"
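
A minimal sketch of that fix as a copy-paste helper. It assumes the variable needs to be in place before any Hugging Face library is imported, because as far as I can tell those libraries work out their cache paths when they are imported, so setting it afterwards may be too late. point_hub_cache_at is a hypothetical helper name, and the G: path is just the one from this post.

```python
import os
from pathlib import Path


def point_hub_cache_at(folder):
    """Create `folder` if needed and make it the Hub cache for this process."""
    path = Path(folder)
    path.mkdir(parents=True, exist_ok=True)  # make sure the folder exists
    os.environ["HUGGINGFACE_HUB_CACHE"] = str(path)
    return path


# Usage (at the very top of the script, before the Hugging Face imports):
#   point_hub_cache_at(r"G:\HuggingFace\hub")
#   from huggingface_hub import hf_hub_download  # imported only after the variable is set
```

The variable only lives for the current Python process, so nothing on the rest of the machine is changed.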

How do I know that was the fix?

Well, for the 97th time I “searched the site” and “read the MFM” and still got “disk full”. So I added

os.environ["HF_HOME"] = r"G:\huggingface"
OSError: [Errno 28] No space left on device
:face_with_symbols_over_mouth:

Then added…

os.environ["HF_DATASETS_CACHE"] = r"G:\HuggingFace\datasets"
OSError: [Errno 28] No space left on device
:face_with_symbols_over_mouth::face_with_symbols_over_mouth:

Then added…

os.environ["HF_MODELS_CACHE"] = r"G:\HuggingFace\models"
OSError: [Errno 28] No space left on device
:face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth:

Then added…

os.environ["TRANSFORMERS_CACHE"] = r"G:\HuggingFace\transformers"
OSError: [Errno 28] No space left on device
:face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth:

Then added…

os.environ["HF_DATASETS_DOWNLOADED_DATASETS_PATH"] = r"G:\HuggingFace\datasets"
OSError: [Errno 28] No space left on device
:face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage: :rage:

Then finally…

os.environ["HUGGINGFACE_HUB_CACHE"] = r"G:\HuggingFace\hub"
:partying_face::partying_face::partying_face::partying_face::partying_face::partying_face::partying_face::partying_face::partying_face::partying_face::partying_face:

And I know it worked because I left the disk full, so the process would fail in seconds instead of WASTING another 43 minutes of my life (which doesn’t seem like much until you add up all the above trials and the fact that it took 43 minutes to fail each time).

In the process, I Googled, “AI’d”, etc., etc., etc., etc., etc.,

I changed, rebooted, took copious notes because I could just hear some schmuck say “well did you try {insert condescending crap you say because you’re an arrogant programmer and unless someone has 234,232 leetcode solves, 1,234,611 Github stars, and 235 commits per day for the last 10 years straight with zero misses (or whatever) they’re just “not that good”}”

(admittedly, dealing with people will do that to you so no blame in the above, just making a blunt observation)

All in an effort to do what should be a simple :face_with_symbols_over_mouth::face_with_symbols_over_mouth: task.

Why is it so hard to just list all (ALL) of the variables involved?
Sorta like…

"Hey, HuggingFace will need to download a crap ton of data. We’re talking HUNDREDS of gigs. So be careful where you store this crap.

Here’s how we store stuff on your precocious little machine:
dir1
dir2,
etc…

If you want to change these it’s pretty simple

In your code:

examples

If you want to do it another way, figure that crap out because if you start mucking with environment variables, shell variables, session variables… well… your computer will become a smoking pile of hot garbage pretty quickly.

Anyway, simple works and is easy to test and debug, so we go with simple!

And we like to keep our docs updated so if this ever is outdated, we will pay the first person to find the mistake $100 cash and publicly flog our entire development team. And it comes out of our lead developer’s pocket so he gets the benefit of TWO painful lessons!

Happy coding!"

Anyway, that’s how I’d run things if I were in charge.

Which is probably why I’m not in charge now that I think about it…

Hope this helps you find your way to changing the download directory, change the download folder or just deal with HuggingFace filling up your MASSIVE hard drive in short order.

And for those of us spectrum types, I hope this has been entertaining, enlightening, and if not… sorry about your luck.


Yeah… I’m back.

If you’re a coder you can probably relate to that “I’m up at 3am because I can’t stop thinking about this project” state…

This morning, I realized I might have left out some data for a person who, like me, may have trouble processing the totally crappy “programmer speak” when it comes to getting stuff done with code.

This is NOT the same as saying you’re somehow “less than good”. It just means you process information differently - like me.

Anyway…

One thing to remember is that computers are epically, tragically stupid.

Programming languages, then, need to be very specific (the only defense to stupidity is specificity and/or a bludgeon).

That means if a program, like HuggingFace, needs to save something to your computer, it needs very specific instructions.

Otherwise it would ask “Well, what about THIS file?” 57,823 times.

On top of that the language you are using to do all this crap with ALSO needs specific instructions.

Put those two things together and you’ve got a big 4$$ mess on your hands. Which leads me right back to my opening statement that programmers absolutely ()(&KJ suck at teaching how to use programming to do stuff.

Anyway, that’s another show but I just can’t help complaining about it because most programmers will say stupid stuff like “oh that’s just part of learning” when someone expresses how insanely frustrated they are with the utterly stupid way both the construction of the language and the teaching methods are when paired in the way they are…

Whatever.

Back to saving stuff.

Since both your chosen programming language and the HuggingFace program need to ‘talk’ it is important you know all the stuff they will want to talk about.

Kind of like going to the store for someone else. But in this case you’re going to the store in another country, you’re blindfolded on the way and kicked in the head as you get thrown from a moving vehicle for maximum disorientation and distraction due to pain and suffering.

(and if you care, I’m typing this while waiting for crap to upload to Pinecone so, no I’m not just sitting here stewing in my own spleen complaining. And, no, it’s not working so that’s just fuel to my spleen which is overflowing with bile and vitriol due to programmer-speak induced rage)

Back to getting kicked in the head and thrown from a moving vehicle in front of the MegaMart.

To get the stuff for the other person you’ll need to know what they want and where it is in the store.

If you actually thought about all the steps required to actually do that you’d probably never leave your house for fear of forgetting a step and ending this particular branch of the multiverse. (I don’t actually think that particular theory has any merit but that’s another show)

But this is EXACTLY the problem with programming (and programmers and how programming is taught).

What seems like a few steps is usually a dozen or so. Even simple programming examples explode with complexity.

Let’s look at one example:
Once you’ve found an interesting dataset on the Hugging Face Hub, you can load the dataset using :hugs: Datasets. You can click on the Use this dataset button to copy the code to load a dataset.

First you need to Login with your Hugging Face account, for example using:

huggingface-cli login

And then you can load a dataset from the Hugging Face Hub using

from datasets import load_dataset

dataset = load_dataset("username/my_dataset")
# or load the separate splits if the dataset has train/validation/test splits
train_dataset = load_dataset("username/my_dataset", split="train")
valid_dataset = load_dataset("username/my_dataset", split="validation")
test_dataset = load_dataset("username/my_dataset", split="test")

Yes, I took this right from the “manual” all the arrogant programmers will tell me to read.

Just this line contains MASSIVE complexity: from datasets import load_dataset
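
To make that concrete, here is a rough sketch of what a "from X import Y" line does, shown with the standard library's json module so it runs without installing anything. The same shape applies to "from datasets import load_dataset": Python finds the datasets package, runs its setup code, and then hands you one name out of it. The variable names here are mine.

```python
# Form 1: pull a single name out of a module.
from json import loads

# Form 2: the long way round. Import the whole module, then pick the name off it.
import json
loads_the_long_way = json.loads

# Both names point at the very same function object.
print(loads is loads_the_long_way)

# And the function behaves the same through either name.
print(loads('{"split": "train"}'))
```

That is four separate ideas (modules, names, functions, objects) hiding in one short line, which is the point being made here.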

Obviously, HuggingFace can’t teach me the basics, that’s someone else’s job. I’m illustrating the point that, if you’re new enough, the smallest piece of code is overwhelmingly complex.

And I’m underscoring the point that a core element of both teaching and communication (which go hand in hand, I presume) is: “simplicity is the best policy.”

So instead of “And then you can load a dataset from the Hugging Face Hub using…” {insert nonsense}

Perhaps better said:

And then you can load a dataset from the Hugging Face Hub. Here is the step-by-step.

First, most datasets are HA-UGE. So you’ll need a lot of space. Based on the dataset you’re looking at you can expect {insert dynamic download size} of data to hit your hard drive.

We’re just as controlling as Bill Gates so we will store our stuff wherever we want - unless you tell us to stuff it somewhere specific.
If you want to change that location, here’s the way we store stuff and how you can be choosy about where we put stuff on that machine you spent all that hard-won cash on.

{link to other docs or not, dealer’s choice}
Like all good coders we love complexity. Our kind of complexity would make Dr. Strange’s spell to make everyone forget Peter Parker was SpiderMan (in that horrific soy-boy version of SpiderMan that we all want to forget) look like the instructions to boil water.

So when we store our stuff in your house, we want to do it like this:

First, we’ll make a HuggingFace directory (or folder if you’re a Mac type)
Then, we’re going to make directories inside of THAT directory (who doesn’t love NESTING!!??)
So look for:
HuggingFace

  • datasets
  • hub
  • modules
  • transformers

There will be some files in there too but those are just stuff to make us look super smart and you don’t need to worry about those. Unless you really want to… in that case go here and come right back {here links to the docs explaining those files}
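
If you want to see how much each of those folders is actually eating, here is a small sketch using only the standard library. It assumes the default location of ~/.cache/huggingface (on Windows that sits under your user folder); if you have moved things, pass your own root. folder_size_bytes and report are hypothetical helper names.

```python
from pathlib import Path


def folder_size_bytes(folder):
    """Add up the size of every regular file under `folder`."""
    total = 0
    for item in Path(folder).rglob("*"):
        # Skip symlinks: the cache may link to the same file from several
        # places, and each file should only be counted once.
        if item.is_file() and not item.is_symlink():
            total += item.stat().st_size
    return total


def report(root=Path.home() / ".cache" / "huggingface"):
    """Return {subfolder name: size in GiB} for each folder directly under `root`."""
    root = Path(root)
    if not root.is_dir():
        return {}
    return {
        child.name: folder_size_bytes(child) / 2**30
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


# Usage:
#   for name, gib in report().items():
#       print(f"{name:<15} {gib:8.2f} GiB")
```

On a big cache this can take a while because it touches every file.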

Since we store stuff in more than one directory, you can tell us where each directory can go. We do recommend you organize your stuff, feel free to organize it how you see fit.
Let’s say you wanted to save all your stuff on a new drive you got from Newegg’s Christmas Sale (BEST time, by the way {insert affiliate link}) and that new 12 TB drive is just WAITING for some data…

You’d be smart to put ALL of your stuff in one place.
Let’s say that new drive is “G” (if it is a different letter just replace it in these examples)
First is the “cache” for the datasets
os.environ["HF_DATASETS_CACHE"] = r"G:\HuggingFace\datasets"

Then the “cache” for the models
os.environ["HF_MODELS_CACHE"] = r"G:\HuggingFace\models"

Then the “cache” for the transformers
os.environ["TRANSFORMERS_CACHE"] = r"G:\HuggingFace\transformers"

Then the path for the downloaded datasets
os.environ["HF_DATASETS_DOWNLOADED_DATASETS_PATH"] = r"G:\HuggingFace\datasets"

Then the “cache” for the hub
os.environ["HUGGINGFACE_HUB_CACHE"] = r"G:\HuggingFace\hub"

If you’re an Uber coder, you can see how the choices abound but if you’re looking for a broadsword quality “just get this done so I can see the result NOW”… there you go.

That will put all the stuff on the drive in the place you want.
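
The same list can be wrapped in one helper so the drive letter only appears once. This is a sketch under assumptions: the variable names are copied straight from this post, and I can't confirm that every library version reads every one of them (in the trials above, only HUGGINGFACE_HUB_CACHE made the error go away). relocate_caches is a hypothetical name.

```python
import os
from pathlib import Path

# Variable names and subfolders copied from the list above. Which of them a
# given library version actually reads is an assumption to verify.
CACHE_VARIABLES = {
    "HF_DATASETS_CACHE": "datasets",
    "HF_MODELS_CACHE": "models",
    "TRANSFORMERS_CACHE": "transformers",
    "HF_DATASETS_DOWNLOADED_DATASETS_PATH": "datasets",
    "HUGGINGFACE_HUB_CACHE": "hub",
}


def relocate_caches(root):
    """Point every variable at a subfolder of `root`, creating folders as needed."""
    root = Path(root)
    chosen = {}
    for variable, subfolder in CACHE_VARIABLES.items():
        folder = root / subfolder
        folder.mkdir(parents=True, exist_ok=True)
        os.environ[variable] = str(folder)
        chosen[variable] = folder
    return chosen


# Usage (before importing datasets, transformers or huggingface_hub):
#   relocate_caches(r"G:\HuggingFace")
```

Setting a variable that nothing reads does no harm, so the broadsword approach of setting all of them is safe even if some turn out to be ignored.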

etc…


Even after decades, our everyday lives will still require us to use ESP abilities to decipher error messages.
Well, let’s just put that aside for now and hope for a little progress in humanity in the future.

I think what is unique to Hugging Face in this kind of problem is that the PC environment of its staff, its heavy users, and the big-company developers who write the samples is fundamentally different from the PC environment of 99% of the users in the world.
In many cases, the samples are written on the assumption that the VRAM is at least 40GB!
If users run them as they are, the lifespan of the average user’s PC will be shortened…

Well, Hugging Face is a very good place overall, but on the other hand, it has a lot of stupid little problems. :sweat_smile: If you notice something and think there might be a way to improve it in this world, it’s a good idea to suggest it below, as it’s likely to be adopted.