OSError: Unable to load weights from pytorch checkpoint file

Hi, everyone. I need some help. I have been developing a Flask website with one of my fine-tuned Transformers models embedded in it. I fine-tuned the model with PyTorch. I’ve tested the site on my local machine and it worked fine.

I used a fine-tuned model whose weights I had already saved for local use, as shown below:

The saved results contain:

  • config.json
  • pytorch_model.bin
  • special_tokens_map.json
  • tokenizer_config.json
  • vocab.txt

Then, I tried to deploy it to a cloud instance that I had reserved. Everything worked well until the model-loading step, which failed with:
OSError: Unable to load weights from PyTorch checkpoint file at <my model path/pytorch_model.bin>. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

I’ve searched around the internet for a solution but found nothing so far. Can I get some enlightenment?

By the way, I’m using an Ubuntu 18.04 instance, and the environment I’m using is:

  • torch 1.7.0
  • transformers 3.5.1

Thank you in advance!

Hi @aswincandra were you able to load the tokenizer in the Flask app without problems? My first guess is that the path you are pointing to in the app is not correct.

Thank you for your response, @lewtun. I’ve tried that too: I commented out the model-loading line of code and loaded just the tokenizer. Then… it worked :sweat_smile: So I think there is no issue with the path. :thinking:

Interesting. Then one possible way to debug the problem would be to try loading the state_dict in native PyTorch and seeing what the error is, e.g.

import torch

state_dict = torch.load(path_to_pytorch_bin_file, map_location="cpu")

This seems to be the step that is raising your OSError, so it could be a good starting point.

1 Like

Thank you so much, Lewis! I’ll try it in the next few hours. But may I first ask what the torch.load() function returns?

Is it the same thing as what I used previously? Previously, I used this function:
model = BertForSequenceClassification.from_pretrained(<my_model_path>)

Hi @aswincandra the state_dict is just a Python dict that maps each layer to its corresponding tensors: What is a state_dict in PyTorch — PyTorch Tutorials 1.7.1 documentation
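
For reference, a minimal sketch of what inspecting such a state_dict can look like (the path below is just a placeholder):

import torch

# Load the raw checkpoint on CPU; the result is a dict mapping parameter
# names to tensors.
state_dict = torch.load("path/to/pytorch_model.bin", map_location="cpu")

# Print each parameter name with its tensor shape to sanity-check the file.
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))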

The reason I mentioned it is because I think your error is coming from this line of the from_pretrained function: transformers/modeling_utils.py at 748006c0b35d64cdee23a3cdc2107a1ce64044b5 · huggingface/transformers · GitHub

Right now you can’t see the lower-level error message from PyTorch, so trying to load it directly might shed some light on what the problem is :slight_smile:

Thank you for the insights, @lewtun. The torch.load() function loaded all of the parameters in each layer properly and, I don’t know why, but the model can now be loaded too :sweat_smile:. Perhaps it’s because at some point I changed the library versions and then brought them back to the versions I had used initially. Now I’m having another issue that no longer relates to the framework. Thank you once again, Lewis!

1 Like

@lewtun hi, I’m having the same problem. I tried torch.load() and it gives this error: RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading zip archive: failed finding central directory

I’m trying to load a model that I saved using transformers.Trainer.save_model.

hey @imtrying, if you saved the model using the Trainer you should be able to use the from_pretrained function to load the model as follows:

# pick the appropriate Auto class for your task
from transformers import AutoModel

model = AutoModel.from_pretrained("path/to/folder/where/you/saved/your/model")

if that doesn’t work, perhaps you can share which version of transformers you are using and how you created the Trainer?
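
For reference, a rough sketch of the full round trip being assumed here (the directory name is a placeholder and the sequence-classification class is just an example; pick the Auto class that matches your task):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# After training, Trainer.save_model writes config.json and pytorch_model.bin
# into the chosen directory; the tokenizer may need to be saved separately:
#   trainer.save_model("my_model_dir")
#   tokenizer.save_pretrained("my_model_dir")

# Later, both can be restored from that same folder.
model = AutoModelForSequenceClassification.from_pretrained("my_model_dir")
tokenizer = AutoTokenizer.from_pretrained("my_model_dir")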

Hi, it’s solved. It turns out that the pytorch_model.bin was somehow corrupted, maybe because I saved the model on GCP AI Platform, downloaded it directly, and uploaded it back to Google Colab.
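
For anyone who hits the same PytorchStreamReader error later, a quick sanity-check sketch (assuming the checkpoint was saved with torch >= 1.6, which uses zip serialization by default; the path is a placeholder):

import os
import zipfile

path = "path/to/pytorch_model.bin"  # placeholder path

# A partially downloaded or re-uploaded file typically shows up as a
# smaller-than-expected size and/or an invalid zip archive.
print("size in bytes:", os.path.getsize(path))
print("valid zip archive:", zipfile.is_zipfile(path))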

1 Like

good to hear it’s solved!

I see your problem is solved, but for folks who may get this error for the same reason as me: I was loading my PyTorch checkpoint with torch v1.4, while the torch version I had used when pretraining my model and saving the checkpoint was v1.9 (I pretrained my models on one server and was loading them on another). So, double-checking the torch version might help to resolve this error in some cases.
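
As a hedged sketch of that advice: you can compare torch.__version__ on both machines, and if you control the saving side you can also fall back to the legacy (non-zip) serialization format so older torch releases can still read the file. The model below is only a placeholder:

import torch
import torch.nn as nn

print(torch.__version__)  # compare this on the saving and the loading machine

model = nn.Linear(4, 2)  # placeholder for whatever model was actually trained

# On the saving side (torch >= 1.6), the legacy non-zip format can be requested
# explicitly so that older torch versions (e.g. 1.4) can still load the file.
torch.save(model.state_dict(), "pytorch_model.bin",
           _use_new_zipfile_serialization=False)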

2 Likes

I had the same error message as @aswincandra , but I was just loading a GPTNeo model:

  File "<userDir>/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1285, in from_pretrained
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'EleutherAI/gpt-neo-2.7B' at '<userDir>/.cache/huggingface/transformers/0839a11efa893f2a554f8f540f904b0db0e5320a2b1612eb02c3fd25471c189a.a144c17634fa6a7823e398888396dd623e204dce9e33c3175afabfbf24bd8f56'If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. 

I tried setting from_tf to True, and got:

404 Client Error: Not Found for url: https://huggingface.co/EleutherAI/gpt-neo-2.7B/resolve/main/tf_model.h5
Traceback (most recent call last):
  File "<userDir>/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1253, in from_pretrained
    resolved_archive_file = cached_path(
  File "<userDir>/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1370, in cached_path
    output_path = get_from_cache(
  File "<userDir>/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1541, in get_from_cache
    r.raise_for_status()
  File "<userDir>/.local/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/EleutherAI/gpt-neo-2.7B/resolve/main/tf_model.h5

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<userDir>/.local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 384, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "<userDir>/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1270, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load weights for 'EleutherAI/gpt-neo-2.7B'. Make sure that:

- 'EleutherAI/gpt-neo-2.7B' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'EleutherAI/gpt-neo-2.7B' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
My environment:

  • torch 1.9.0
  • transformers 4.9.2
  • python 3.8.0

I’m running this over SSH, if it matters, so I’m not sure what kind of configuration the remote machine has. I can’t run it on my machine because I don’t have enough RAM :sweat_smile:

I tried to run

state_dict = torch.load(path_to_pytorch_bin_file, map_location="cpu")

But I’m not sure if this applies to me since I didn’t train my own model.
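
One hedged suggestion for this case: since the weights come from the Hub cache rather than a local training run, a corrupted cached file can sometimes be fixed by forcing a fresh download (force_download is a documented from_pretrained argument; the Auto class below is just one reasonable choice for GPT-Neo):

from transformers import AutoModelForCausalLM

# Re-download the weights instead of reusing a possibly corrupted cached copy.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-2.7B", force_download=True
)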

I had the same problem while training a RoBERTa model. I tried to resume my training from the last checkpoint without success. When I tried to load the second-to-last checkpoint, it worked fine; therefore, my last checkpoint was corrupted, and the solution was to restore a previous checkpoint.
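
As a sketch of that fix (the checkpoint path is a placeholder, and the masked-LM Auto class is only a guess at the task):

from transformers import AutoModelForMaskedLM

# Checkpoint folders written during training look like <output_dir>/checkpoint-<step>.
# If the latest one is corrupted, point at the previous one instead; the same path
# can also be passed to Trainer.train(resume_from_checkpoint=...) to keep training.
model = AutoModelForMaskedLM.from_pretrained("output/checkpoint-4500")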

I am also experiencing this error. I load the tokenizer without problem from the same checkpoint as the model, but on loading the model I get OSError: Unable to load weights from pytorch checkpoint file. The checkpoint was created under PyTorch 1.9; my current PyTorch is 1.10. The checkpoint is on a network drive; if I try my code and checkpoint on a local drive, I have no problem. It’s only when operating from the network drive.

Because the tokenizer is constructed with no problem from this same checkpoint, I was wondering if there is a difference in how the tokenizer and the model handle OS file types. Because there was a space in the name of one of the directories on the path, I tried relocating to a place with no space in the name, and that seemed to fix the problem, but only initially: when I tried again a day later, the problem reappeared.

Still, I feel there is something different in how the OS, path, or file is handled between the tokenizer’s and the model’s use of the checkpoint.
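
One hedged workaround sketch for the network-drive case: copy the checkpoint folder to a local temporary directory first and load from the local copy, which helps to rule out network or path handling as the culprit (both paths and the Auto class below are placeholders):

import os
import shutil
import tempfile

from transformers import AutoModelForSequenceClassification

# Copy the checkpoint folder from the network share to a fresh local directory.
local_dir = os.path.join(tempfile.mkdtemp(), "checkpoint")
shutil.copytree("/mnt/network_share/my checkpoint", local_dir)

# Load from the local copy instead of the network path.
model = AutoModelForSequenceClassification.from_pretrained(local_dir)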

1 Like


Hi everyone, I got this error while I was installing the Stable Diffusion web UI on Windows. How do I resolve this problem?

1 Like

hi, have you found the answer?

I also have an issue like this…

This seems like a serialisation error? Perhaps re-install pytorch or tensorflow and try again?

Hi @lewtun, I followed your suggestion and the traceback is:

Traceback (most recent call last):
  File "/switch/test_load.py", line 3, in <module>
    state_dict = torch.load(path, map_location='cpu')
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
ValueError: invalid buffering size

So how do I solve it?
Thanks.