RuntimeError: The size of tensor a (48) must match the size of tensor b (64) at non-singleton dimension 0

Hi there,
I have fine-tuned the phi-4-mini-instruct model and tried to deploy it on a Hugging Face Inference Endpoint, but the endpoint fails to start with the following error:
[Server message] Endpoint failed to start
Exit code: 1. Reason: [... truncated tensor dump of ones on device 'cuda:1' ...]
[rank1]: RuntimeError: The size of tensor a (48) must match the size of tensor b (64) at
[rank1]: non-singleton dimension 0"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2025-04-29T07:29:42.246550Z","level":"ERROR","fields":{"message":"Shard 1 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2025-04-29T07:29:42.246595Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2025-04-29T07:29:42.269764Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-04-29T07:29:42.269807Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-04-29T07:29:42.470177Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
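
For context, this class of RuntimeError comes from an elementwise operation on two tensors whose shapes neither match nor broadcast. A minimal PyTorch repro of the same error message (this only illustrates the error class, not what TGI is doing internally):

```python
import torch

# Elementwise ops require the operands' shapes to match or be
# broadcastable; two 1-D tensors of different lengths are neither,
# so adding them raises the same RuntimeError seen in the logs.
a = torch.ones(48)
b = torch.ones(64)
try:
    _ = a + b
except RuntimeError as e:
    print(e)
```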

I then tried deploying the base microsoft-phi-4-mini-instruct model and got the same error.

Can anyone help me resolve this issue?


This seems to be an unresolved issue in TGI?

rdaya
In case anyone runs into this, the trick (a bad one) is to set the environment variable USE_FLASH_ATTENTION=false.
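
If you are running the TGI container yourself, that variable can be passed at launch; a sketch, assuming the official `ghcr.io/huggingface/text-generation-inference` image (the image tag and port mapping here are placeholders you should adapt):

```shell
docker run --gpus all -p 8080:80 \
  -e USE_FLASH_ATTENTION=false \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id microsoft/Phi-4-mini-instruct
```

On a managed Inference Endpoint, the equivalent should be adding USE_FLASH_ATTENTION=false under the endpoint's environment-variable settings. Note this disables the Flash Attention kernels, so expect higher memory use and slower inference; it is a workaround, not a fix.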