ValueError: Incompatible safetensors file. File metadata is not ['pt', 'tf', 'flax', 'mlx'] but None

Hi experts,

I have trained a custom LLM from scratch using PyTorch and saved the model checkpoint. Following the documentation for custom PyTorch models, I used the PyTorchModelHubMixin in my model class to make it Hub-compatible. Now I push it to the Hub using the following code:

GPT_CONFIG = {
    "model_type": "gpt",
    "vocab_size": 26000,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 16,
    "n_layers": 12,
    "drop_rate": 0.2,
    "qkv_bias": False,
    "flash": True,
}

from model import GPTModel
import torch

model = GPTModel(GPT_CONFIG)

checkpoint = torch.load("/teamspace/studios/this_studio/model/gpt_model_checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint['model_state_dict'])

model.save_pretrained(
    save_directory="local-save-dir2",
    config=GPT_CONFIG,
)

repo_id = "angkul07/llm_100M"

model.push_to_hub(
    repo_id=repo_id,
    commit_message="Initial commit of GPTModel checkpoint",
    private=False
)

When I try to load it using AutoModel:

from transformers import AutoModel

model = AutoModel.from_pretrained("angkul07/my-awesome-model")

I get the following ValueError:

ValueError: Incompatible safetensors file. File metadata is not ['pt', 'tf', 'flax', 'mlx'] but None


I have tried searching for it on the internet, but with no luck. So, how can I fix it? How can I add the metadata?

This is a very rare error, but it may just be that there is no metadata.
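If so, one option is to re-save the weights with safetensors and set the framework metadata explicitly. A rough sketch (the file name and paths are assumptions based on your snippet; adjust them to your actual save directory):

from safetensors.torch import load_file, save_file

# Load the tensors from the safetensors file produced by save_pretrained()
state_dict = load_file("local-save-dir2/model.safetensors")

# Re-save them with the framework tag that from_pretrained() checks for
save_file(state_dict, "local-save-dir2/model.safetensors", metadata={"format": "pt"})

After re-saving, push the directory to the Hub again so the uploaded file carries the metadata.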


Hey @John6666, thanks, this works like a charm. Thank you so much.

By the way, I am facing one more issue. I have a custom-trained SentencePiece tokenizer, so I have two files: tokenizer.model and tokenizer.vocab. Now I want to convert them into the AutoTokenizer format for compatibility. I used the following code to convert them:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="/teamspace/studios/this_studio/model/tokenizer.model",
    model_max_length=256,                
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>"             
)

tokenizer.save_pretrained("my-tokenizer")

But I get the following error:

Exception: stream did not contain valid UTF-8

Do you have any idea how to convert this SentencePiece tokenizer to the AutoTokenizer format? Thanks.


Maybe it’s a character encoding issue?

For example, Windows 10 Notepad saves files in UTF-16, so comments that aren’t in English may cause errors…
This probably won’t happen if you’re using VSCode, and if you’re using a Colab environment, the cause is likely something else.
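
If the encoding isn't the culprit, also note that tokenizer_file expects a tokenizers JSON file, whereas tokenizer.model is a binary SentencePiece file, which would explain the invalid UTF-8 error. One possible conversion path, as a rough sketch (it assumes a Unigram-type SentencePiece model, the sentencepiece default, and from_spm needs the protobuf package available):

from tokenizers.implementations import SentencePieceUnigramTokenizer
from transformers import PreTrainedTokenizerFast

# Convert the binary SentencePiece model into a tokenizers object
spm_tokenizer = SentencePieceUnigramTokenizer.from_spm("/teamspace/studios/this_studio/model/tokenizer.model")

# Serialize it as the JSON format that tokenizer_file expects
spm_tokenizer.save("tokenizer.json")

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    model_max_length=256,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

tokenizer.save_pretrained("my-tokenizer")

The saved folder should then load with AutoTokenizer.from_pretrained("my-tokenizer").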

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.