Process for download immediately hangs on particular machine

I’m trying to download a model (Llama 70B) that I have access to:

from transformers import AutoTokenizer, AutoModelForCausalLM


model_name = "meta-llama/Llama-2-7b-chat-hf"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        # load_in_8bit=True,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        # rope_scaling = {"type": "dynamic", "factor": scaling_factor}
    )

I have access to this model (I know because I can successfully download the model to other hosts on my cluster). When I try downloading on one particular machine, I see the following:

Downloading config.json: 100%|████████████████████████████████████████████████████████████████████████████| 614/614 [00:00<00:00, 2.71MB/s]

The process then becomes immediately unresponsive. I tried waiting overnight to no avail. I’ve tried on other days without success.

Why is this happening and how do I solve it?

As a possibly relevant observation, I think I interrupted a previous download of this model. I can’t quite remember. I’ve already tried clearing .cache/huggingface/hub but maybe I need to clear/reset something else?

I am also facing similar issue. Did you find any solution ? if yes, Please share that…

I did not find a solution.

Then, how are you using the machine. You reboot it or what ??

The problem went away on its own. No idea why :man_shrugging: