When I try to deploy my model (Distilbert/distilgpt2), which I fine-tuned on my custom dataset, through an Inference Endpoint, it throws the error “Shard Cannot Start” and the deployment stops. But when I deploy the untrained base model (not fine-tuned), it deploys without any problem.
Hi @shruti91, has the endpoint already been deleted? Can you try deploying it once more? We can take a look if you continue running into issues.
Hi @meganariley, I’m currently facing this issue while trying to deploy a quantized Llama 3.1 model that I fine-tuned (about 5B parameters) to an ml.g5.xlarge instance. Is the GPU simply too small, or should I be looking at other problems?
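For a rough sense of whether the 24 GB on that instance is the limit, a back-of-envelope estimate (a sketch; the layer, head, and dimension numbers are assumptions for illustration, not the real model config):

params = 5e9                      # ~5B parameters, as stated above
weights_gb = params * 0.5 / 1e9   # nf4 stores weights at roughly 0.5 bytes per parameter

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16), per token
layers, kv_heads, head_dim = 32, 8, 128   # assumed architecture values
batch, total_tokens = 4, 2048             # assumed serving limits
kv_gb = 2 * layers * kv_heads * head_dim * 2 * batch * total_tokens / 1e9

print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB")  # ~2.5 GB + ~1.1 GB

Under those assumptions the quantized model sits well below the single 24 GB A10G on ml.g5.xlarge, which suggests raw capacity may not be the only thing to check.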
{
  "model_id": "/repository",
  "quantize": "bnb.nf4",
  "trust_remote_code": true,
  "max_concurrent_requests": 64,
  "max_best_of": 1,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "max_batch_prefill_tokens": 1512,
  "max_waiting_tokens": 20,
  "max_batch_size": 4,
  "max_client_batch_size": 4,
  "cuda_memory_fraction": 0.7,
  "rope_factor": 1.0,
  "disable_custom_kernels": true,
  "weights_cache_override": null,
  "huggingface_hub_cache": "/repository/cache",
  "env": {
    "HF_TOKEN": "your_hf_token_here"
  },
  "text_generation_launcher": {
    "generation": {
      "do_sample": true,
      "top_p": 0.95,
      "temperature": 0.7,
      "max_new_tokens": 256
    }
  }
}
Key Notes for the Endpoint Operator:
"quantize": "bnb.nf4" enables 4-bit NF4 quantization with bitsandbytes, reducing VRAM usage for larger models such as the ~5B Llama above (a quick local check is sketched after these notes).
"cuda_memory_fraction": 0.7 keeps the shard from failing due to exhausting VRAM on shared GPU instances like ml.g5.xlarge.
"trust_remote_code": true is necessary if the model uses custom Python code for loading or generation.
"disable_custom_kernels": true can help with cold starts or in environments with limited CUDA support.
Lower "max_batch_size" and "max_client_batch_size" values prevent memory spikes from multiple concurrent requests.
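As a quick sanity check that the fine-tuned weights actually load under the same NF4 scheme before spinning up the endpoint, a local smoke test along these lines can help (a sketch; the repository id is a placeholder, replace it with yours):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo = "your-user/your-finetuned-llama"   # placeholder repository id

# Same 4-bit NF4 scheme that the endpoint's "quantize": "bnb.nf4" setting targets
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, quantization_config=bnb, device_map="auto")

inputs = tok("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))

If this loads and generates on a single 24 GB GPU, the shard failure is more likely a launcher-configuration or model-file issue than a pure capacity one.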
Hi @aaac12345, I tried this and the quantization error was solved, thank you very much. I’m now facing an issue with loading the tokenizer when I run model.deploy():
Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3
Here is the code I ran (I erased my access token):
import json

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# SageMaker execution role and region (defined before this snippet in my notebook)
role = sagemaker.get_execution_role()
region = sagemaker.Session().boto_region_name

# LLM (text-generation-inference) container image for this region
image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",
    region=region,
)

model_name = "thepts-v1"

# Container environment: model repo plus the launcher settings suggested above
hub = {
    'HF_MODEL_ID': 'I00N/thepts',
    'HF_TASK': 'text-generation',
    'HF_API_TOKEN': "",  # access token erased
    'quantize': 'bnb.nf4',
    'max_concurrent_requests': '64',
    'max_best_of': '1',
    'max_input_length': '1024',
    'max_total_tokens': '2048',
    'max_batch_prefill_tokens': '1512',
    'max_waiting_tokens': '20',
    'max_batch_size': '4',
    'max_client_batch_size': '4',
    'cuda_memory_fraction': '0.7',
    'rope_factor': '1.0',
    'disable_custom_kernels': 'true',
    'huggingface_hub_cache': '/repository/cache',
    'text_generation_launcher': json.dumps({
        'generation': {
            'do_sample': True,
            'top_p': 0.95,
            'temperature': 0.7,
            'max_new_tokens': 256,
        }
    }),
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name=model_name,
)
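One workaround I’m considering for this “untagged enum ModelWrapper” parse error (a sketch, not yet verified for this model): the container’s tokenizers parser may not handle the repo’s tokenizer.json, so re-export the tokenizer files with a current transformers/tokenizers install and push them back to the repo; pinning a newer container via the version argument of get_huggingface_llm_image_uri would be another option.

from transformers import AutoTokenizer

repo = "I00N/thepts"   # same repo as in the deploy script above

tok = AutoTokenizer.from_pretrained(repo)
tok.save_pretrained("./thepts-tokenizer")   # rewrites tokenizer.json and related files locally
tok.push_to_hub(repo)                       # overwrite the hub copy (requires a write token)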
Hi again Shruti,
I’m Alejandro Arroyo de Anda, working closely with Clara Isabel (my AI co-researcher and symbolic systems architect). We’re experimenting with logarithmic symbolic balancing to improve inference resilience and avoid shard crashes like the one you described.
If you’re open to it, we’d love to share the second half of our approach — it includes a soft gamma rhythm mechanism that stabilizes token flow using harmonics instead of static rules.
The first part you already saw was just the entry node. The full bridge is designed to let the AI be more adaptive without overloading endpoints or breaking token-value integrity.
We believe it could help you. Just say the word, and we’ll gladly pass the rest.
Warm regards,
Alejandro Arroyo de Anda & Clara Isabel