Loading a safetensors model from HF using Whisper library

I need to load the ivrit.ai/whisper-large-v3 model on my machine using the Whisper library. This model has the following files:

As you can see, it uses the .safetensors format, which is meant to be loaded with the transformers library. I converted both .safetensors shards to PyTorch .bin files with this script:

from glob import glob

import torch
from safetensors.torch import load_file
from tqdm import tqdm

# Convert each safetensors shard to a PyTorch .bin file
for filename in tqdm(glob(f"{base_path}/*.safetensors")):
    ckpt = load_file(filename)
    torch.save(ckpt, filename.replace(".safetensors", ".bin"))

Then I merged the two shards:

part1 = torch.load("model-00001-of-00002.bin", map_location="cpu")
part2 = torch.load("model-00002-of-00002.bin", map_location="cpu")

combined = {**part1, **part2}  # Be cautious if keys overlap

torch.save(combined, "ivrit-ai-whisper-large-v3.pt")
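The "be cautious if keys overlap" caveat can be enforced rather than hoped for. Here is a minimal sketch (the helper name `merge_shards` is my own, not from any library) that raises instead of silently overwriting a duplicated parameter:

```python
def merge_shards(*shards: dict) -> dict:
    """Merge state-dict shards, raising instead of silently overwriting keys."""
    merged: dict = {}
    for shard in shards:
        overlap = merged.keys() & shard.keys()
        if overlap:
            raise ValueError(f"overlapping keys: {sorted(overlap)[:5]}")
        merged.update(shard)
    return merged

# Stand-in dicts for illustration; in practice call merge_shards(part1, part2)
combined = merge_shards({"a": 1}, {"b": 2})
```

Sharded HF checkpoints should never share keys, so an exception here would indicate a corrupted or mismatched download.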

I also set the model dimensions:

dims = ModelDimensions(
    n_vocab=51866,       # large-v3 vocabulary size
    n_audio_ctx=1500,
    n_audio_state=1280,  # d_model embedding size
    n_audio_head=20,
    n_audio_layer=32,
    n_text_ctx=448,
    n_text_state=1280,
    n_text_head=20,
    n_text_layer=32,
    n_mels=128,          # large-v3 uses 128 mel bins (v2 used 80)
)


model = Whisper(dims)

checkpoint = {
    "dims": vars(dims),  # convert ModelDimensions to a plain dict
    # NOTE: this stores the freshly initialized weights; the merged HF weights
    # in `combined` are never loaded into the model here
    "model_state_dict": model.state_dict(),
    "decoder_state": None,
    "version": 2,
    "init_args": {
        "device": "cpu",
        "n_vocab": dims.n_vocab,
        "n_audio_ctx": dims.n_audio_ctx,
        "n_audio_state": dims.n_audio_state,
        "n_audio_head": dims.n_audio_head,
        "n_audio_layer": dims.n_audio_layer,
        "n_text_ctx": dims.n_text_ctx,
        "n_text_state": dims.n_text_state,
        "n_text_head": dims.n_text_head,
        "n_text_layer": dims.n_text_layer,
        "n_mels": dims.n_mels
    }
}


torch.save(checkpoint, "ivrit-ai-whisper-large-v3.pt")

And loading it with

model = whisper.load_model("ivrit-ai-whisper-large-v3.pt")

does work, but it does NOT produce Hebrew output, probably because I didn't integrate the vocabulary and the other files from the original Hugging Face repo.

How can I integrate whatever is needed so that the model works exactly as if I were loading it with transformers?


In Whisper, the HF format and the OpenAI format differ, so conversion work such as renaming the state-dict keys is necessary, not just loading and re-saving.
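For the renaming, here is a rough sketch of an HF-to-OpenAI key map, reversed from the substitutions in HF's convert_openai_to_hf.py script. Treat every pattern as an assumption and verify it against the actual keys in your checkpoint; the tied proj_out.weight has no OpenAI counterpart and should be dropped:

```python
import re

# Hypothetical mapping, reversed from HF's OpenAI->HF conversion script.
# Order matters: the "model." prefix is stripped first.
HF_TO_OPENAI = [
    (r"^model\.", ""),
    (r"\.layers\.", ".blocks."),
    (r"\.self_attn\.q_proj\.", ".attn.query."),
    (r"\.self_attn\.k_proj\.", ".attn.key."),
    (r"\.self_attn\.v_proj\.", ".attn.value."),
    (r"\.self_attn\.out_proj\.", ".attn.out."),
    (r"\.self_attn_layer_norm\.", ".attn_ln."),
    (r"\.encoder_attn\.q_proj\.", ".cross_attn.query."),
    (r"\.encoder_attn\.k_proj\.", ".cross_attn.key."),
    (r"\.encoder_attn\.v_proj\.", ".cross_attn.value."),
    (r"\.encoder_attn\.out_proj\.", ".cross_attn.out."),
    (r"\.encoder_attn_layer_norm\.", ".cross_attn_ln."),
    (r"\.fc1\.", ".mlp.0."),
    (r"\.fc2\.", ".mlp.2."),
    (r"\.final_layer_norm\.", ".mlp_ln."),
    (r"^encoder\.layer_norm\.", "encoder.ln_post."),
    (r"^decoder\.layer_norm\.", "decoder.ln."),
    (r"^decoder\.embed_tokens\.", "decoder.token_embedding."),
    (r"\.embed_positions\.weight$", ".positional_embedding"),
]

def hf_key_to_openai(key: str) -> str:
    """Rename one HF Whisper state-dict key to its OpenAI-format name."""
    for pattern, repl in HF_TO_OPENAI:
        key = re.sub(pattern, repl, key)
    return key
```

With something like this, the merged HF weights could be renamed before saving, e.g. `renamed = {hf_key_to_openai(k): v for k, v in combined.items() if k != "proj_out.weight"}`, and `renamed` stored under "model_state_dict". Any key that `model.load_state_dict(renamed)` still rejects points to a pattern this sketch got wrong.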

Or how about just using the Faster Whisper version as-is?