JSON Schema Response Format NOT Working: invalid_request_error

I know I’m doing something wrong here, but can’t figure it out.

Problem

I’m trying to make a chat-completion call constrained by a JSON Schema generated from a Pydantic model, but the API rejects the request with a 422 error, and I can’t discern from the message what I’m doing wrong.

Error

Traceback (most recent call last):
  File "/home/michael/miniconda3/envs/patents_llm_update/lib/python3.13/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/michael/miniconda3/envs/patents_llm_update/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://router.huggingface.co/cerebras/v1/chat/completions

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/michael/git/patents_llm_update/llm_functions/models.py", line 56, in <module>
    test()
    ~~~~^^
  File "/home/michael/git/patents_llm_update/llm_functions/models.py", line 44, in test
    response = test_ep.chat.completions.create(messages=[user],  #
                                               max_tokens=4096,  #
                                               response_format=response_format,  #
                                               stream=False)
  File "/home/michael/miniconda3/envs/patents_llm_update/lib/python3.13/site-packages/huggingface_hub/inference/_client.py", line 992, in chat_completion
    data = self._inner_post(request_parameters, stream=stream)
  File "/home/michael/miniconda3/envs/patents_llm_update/lib/python3.13/site-packages/huggingface_hub/inference/_client.py", line 357, in _inner_post
    hf_raise_for_status(response)
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/home/michael/miniconda3/envs/patents_llm_update/lib/python3.13/site-packages/huggingface_hub/utils/_http.py", line 482, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 422 Client Error: Unprocessable Entity for url: https://router.huggingface.co/cerebras/v1/chat/completions (Request ID: Root=1-67feb932-3458407e60b0d8c90c4c7c59;d65a18ba-52ad-489e-a90c-5b3e9792cebc)
{"message":"type: Input should be 'text'","type":"invalid_request_error","param":"validation_error","code":"wrong_api_format"}

Code

HF Inference Client

from huggingface_hub import InferenceClient


def llama_40_scout_instruct(wait: bool = False) -> InferenceClient:
    hf_login_check()  # local helper: ensures a valid HF login/token
    model = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
    token = hf_bearer_token()  # local helper: returns the bearer token
    ep = InferenceClient(provider="cerebras", model=model, api_key=token)
    return ep

Pydantic Model

from pydantic import BaseModel, Field


class SPRCompressed(BaseModel):
    """
    Sparse Priming Representation (SPR): An SPR is a specific kind of use of language for advanced
    NLP, NLU, and NLG tasks.

    An SPR is input distilled down to a list of succinct statements, assertions, associations,
    concepts, analogies, and metaphors.

    An SPR captures as much, conceptually, as possible, but with as few words as possible. An SPR
    is written in a way that makes sense to an LLM, as the future audience will be another language
    model, not a human. Use complete sentences that are grammatically correct. Do not use
    abbreviations, as they can be ambiguous.
    """

    spr: str = Field(description="Sparse Priming Representation")
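
For reference, `SPRCompressed.model_json_schema()` produces roughly the following (the description key carries the class docstring, abbreviated here):

{'description': 'Sparse Priming Representation (SPR): An SPR is ...',
 'properties': {'spr': {'description': 'Sparse Priming Representation',
                        'title': 'Spr',
                        'type': 'string'}},
 'required': ['spr'],
 'title': 'SPRCompressed',
 'type': 'object'}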

Main code

test_application.md: any reasonably large, complex text
compress_prompt.md: instructions on how to compress text to SPR format

import pathlib

from huggingface_hub import ChatCompletionInputGrammarType, InferenceClient


def test():
    print("Testing SPRCompressed")
    print(SPRCompressed.model_json_schema())
    print()
    # test_app is the text to compress; compress_prompt holds the SPR instructions
    test_app = pathlib.Path(__file__).parent.joinpath("test_application.md").read_text()
    compress_prompt = pathlib.Path(__file__).parent.joinpath(
            "../prompts/compress_prompt.md").read_text()
    prompt = compress_prompt.format(text=test_app)
    test_ep: InferenceClient = llama_40_scout_instruct()
    user = {"role": "user", "content": [{"type": "text", "text": prompt}]}

    response_format: ChatCompletionInputGrammarType = ChatCompletionInputGrammarType(
            type="json",
            value=SPRCompressed.model_json_schema())

    print("=" * 40)
    response = test_ep.chat.completions.create(messages=[user],
                                               max_tokens=4096,
                                               response_format=response_format,
                                               stream=False)
    print(response.choices[0].message.content)
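
Side note: given the wrong_api_format code, I wonder whether the Cerebras route wants the OpenAI-style response_format shape rather than the TGI grammar object. A sketch of that variant; the name and strict fields are guesses on my part, not confirmed against any Cerebras docs:

# Hypothetical alternative: OpenAI-style response_format dict. The
# "name" and "strict" fields are my guesses, not confirmed anywhere.
response_format_openai = {
        "type": "json_schema",
        "json_schema": {
                "name": "SPRCompressed",
                "schema": SPRCompressed.model_json_schema(),
                "strict": True}}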

Environment

huggingface-cli env


- huggingface_hub version: 0.30.2
- Platform: Linux-6.8.0-57-generic-x86_64-with-glibc2.39
- Python version: 3.13.3
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /home/michael/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: michael-newsrx-com
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: 3.1.6
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: N/A
- pydantic: 2.11.3
- aiohttp: N/A
- hf_xet: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/michael/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/michael/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/michael/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /home/michael/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

Function calling in TGI and Transformers has been buggy for a while now. If it hasn’t been fixed yet, that might be the cause here…

I’m passing in a JSON schema, and even if this were a tool call to a routine, the API should still return a request to run the tool, a JSON response, or neither; it shouldn’t raise exceptions about a badly formatted request.

The symptoms of the infinite tool-recursion issue and this issue are not the same.
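
To illustrate, this is the shape of handling I’d expect to write, assuming the usual OpenAI-style response object (a sketch, not tested against this provider):

msg = response.choices[0].message
if getattr(msg, "tool_calls", None):
    # Model asked for a tool to be run: inspect the requested call(s).
    for call in msg.tool_calls:
        print(call.function.name, call.function.arguments)
elif msg.content:
    # Model answered directly, here expected to be schema-constrained JSON.
    print(msg.content)
else:
    print("Model returned neither a tool call nor content.")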
