Serverless Inference API doesn't seem to support a dedicated JSON mode

Hi Hugging Face community,

I’m encountering an issue when trying to generate JSON output using the Meta-Llama-3-70B-Instruct model hosted on the :hugs: Serverless Inference API. My code attempts to parse the generated text as JSON, but I’m receiving responses like the following:

[{'generated_text': '\n\nI\'ll create a comprehensive set of diverse problematics with precise outputs in a JSON file, intended for training professional embedding models from authoritative textual content, aiming to provide domain-specific knowledge to a foundation model.\n\nHere are the problematics:\n\n```\n[\n  {\n    "query": "What are the key principles of jurisdictional competence in international private law, considering the concepts of domicile, residence, and nationality?"\n  },\n  {\n    "query": "How does the concept of \'centre of main interests\' influence the determination of a company\'s COMI in the context of international insolvency law?"\n  },\n  {\n    "query": "What are the implications of the\'piercing the corporate veil\' doctrine on the liability of shareholders and directors in French company law?"\n  },\n  {\n    "query": "Under what circumstances can a French court assert jurisdiction over a foreign company in a dispute involving international trade, considering the rules of the French Civil Procedure Code?"\n  },\n  {\n    "query": "What are the requirements for a valid arbitration agreement in France, and how does the French Civil Code regulate the arbitrability of disputes?"\n  },\n  {\n    "query": "How does the French concept of \'ordre public\' impact the'}]

This is a sample of my code:

import json
import logging
import requests

from typing import Optional

# Set up logging
logging.basicConfig(
    level=logging.INFO, 
    format="%(asctime)s - %(levelname)s - %(message)s"
)

class Retriever:
    """
    A class for retrieving completions from an API client.

    This class provides methods for generating text completions using an API client,
    either synchronously or asynchronously.

    Attributes
    ----------
    api_key : str
        The API key for authenticating requests.

    headers : dict
        Headers to be included in the API request.

    Methods
    -------
    completion(payload)
        Generate a completion using the API.

    async_completion(payload)
        Asynchronously generate a completion using the API.
    """
    def __init__(
        self, 
        api_key: str, 
        headers: Optional[dict] = None
    ):
        """
        Initialize the Retriever with an API key and optional headers.

        Parameters
        ----------
        api_key : str
            The API key for authenticating requests.

        headers : dict, optional
            Headers to be included in the API request (default is None, which sets the Authorization header using the provided api_key).
        """
        self.api_key: str = api_key
        self.headers: dict = headers if headers else {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }


    def completion(
        self, 
        payload: dict,
        api_url: str = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct",
        validate_payload: bool = False
    ) -> dict:
        """
        Generate a completion using the API.

        Parameters
        ----------
        payload : dict
            A dictionary containing the input and optional parameters for the API call.
            - top_k : int, optional
                The number of highest-probability tokens considered at each sampling step.
            - top_p : float, optional
                Nucleus sampling: only the smallest set of tokens whose cumulative probability exceeds top_p is considered.
            - temperature : float, optional
                The temperature of the sampling operation (default is 1.0).
            - repetition_penalty : float, optional
                The penalty for repeating tokens (default is None).
            - max_new_tokens : int, optional
                The number of new tokens to be generated (default is None).
            - max_time : float, optional
                The maximum time in seconds for the query (default is None).
            - return_full_text : bool, optional
                Whether to return the full text including the input (default is True).
            - num_return_sequences : int, optional
                The number of generated sequences to return (default is 1).
            - do_sample : bool, optional
                Whether to use sampling (default is True).
            - options : dict, optional
                A top-level dictionary (a sibling of "parameters", not nested inside it) containing additional options:
                    - use_cache : bool, optional
                        Whether to use caching (default is True).
                    - wait_for_model : bool, optional
                        Whether to wait for the model if not ready (default is False).

            Example structure:
            {
                "inputs": "Your input string here",
                "parameters": {
                    "top_k": int, optional,
                    "top_p": float, optional,
                    "temperature": float, optional (default is 1.0),
                    "repetition_penalty": float, optional,
                    "max_new_tokens": int, optional,
                    "max_time": float, optional,
                    "return_full_text": bool, optional (default is True),
                    "num_return_sequences": int, optional (default is 1),
                    "do_sample": bool, optional (default is True)
                },
                "options": {
                    "use_cache": bool, optional (default is True),
                    "wait_for_model": bool, optional (default is False)
                }
            }
        
        api_url : str, optional
            The URL endpoint of the API (default is "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct").

        validate_payload : bool, optional
            Whether to validate the payload before sending it.
            If True, the payload is validated; if False (default), it is sent as-is.

        Returns
        -------
        dict
            A dictionary containing the generated completion data.

        Raises
        ------
        ValueError
            If the payload or its components have invalid types.

        requests.HTTPError
            If the API request returns an error status code.

        requests.RequestException
            If there is an error making the API request.
        """
        try:
            if validate_payload:
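                # _validate_payload is a type-checking helper defined
                # elsewhere in my code (omitted from this sample).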
                self._validate_payload(
                    payload=payload
                )

            response = requests.post(
                api_url, 
                headers=self.headers, 
                json=payload
            )

            response.raise_for_status()  # Raise an exception for HTTP errors
            print("Raw response:", response.json())  # Print the raw response for debugging

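            # NOTE: this assumes "generated_text" contains pure JSON; in
            # practice the model often wraps it in prose and a Markdown
            # fence, so this json.loads call is where parsing fails.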
            return json.loads(
                response.json()[0]["generated_text"]
            )

        except ValueError:
            # Covers json.JSONDecodeError, which subclasses ValueError.
            raise

        except requests.exceptions.HTTPError as http_err:
            raise requests.HTTPError(
                f"HTTP error occurred: {http_err}"
            ) from http_err

        except requests.exceptions.RequestException as req_err:
            raise requests.RequestException(
                f"Request error occurred: {req_err}"
            ) from req_err

payload = {
    "parameters": {
        "temperature": 0.9,
        "return_full_text": False,
        "max_new_tokens": 250,
        "do_sample": True,
        "top_k": 50,
        "top_p": 0.95
    },
    "options": {
        "use_cache": False,
        "wait_for_model": True
    }
}

# Local function to apply a chat template...
# inputs = gspl.apply_chat_template(
#     column="queries",
#     system_prompt=completion_system_prompt, 
# )

# payload["inputs"] = inputs

client = Retriever(
    api_key="api_key"
)

result = client.completion(
    payload=payload,
)

This is the system prompt I used:

Objective: Development of a comprehensive set of diverse issues with their precise outputs in a JSON file, intended for the training of professional embedding models based on authoritative textual content, aiming to impart the knowledge of a specific professional domain to a foundational model.

Desired JSON schema:
  {
      "query" : "xxx"
  }

Requirements to be met:
1. Use of elaborate sentence structures. Favor the use of complex sentence structures that expand the scope of expression.
2. Linguistic quality. The query must be written in French without any spelling, syntax, punctuation, or grammatical errors.
3. Professional and academic language. The query must be reformulated to adopt a professional and academic discourse, characterized by its rigor, justification, and detailed structure.
4. Neutrality or nuance. The perspective must remain neutral or nuanced.
5. Contextualization of legal themes. The query must explicitly refer to the legal theme and the subject of the source to contextualize the result.
6. Literary style and exemplification. The query must be formulated in a literary style. Examples, when relevant, can be used to reinforce the query while ensuring a high degree of certainty.
7. Directiveness of instructions. Use a direct style favoring impersonal formulations.
8. Use of the source text. Use the provided source text to formulate the query. The source text should be considered of high quality and authoritative.
9. Purpose of the response. Only the dictionary in JSON format should constitute the response to this request. No introduction or conclusion is required.
10. Use of all knowledge. The query must encompass all the information present in the source text to ensure no knowledge is omitted. It is essential to maintain the specialized nature of the text without generalizing its content.
11. When the source text includes an example, notably citing flows between named or unnamed persons or entities, or numerical values, this data must be included in the query. The instruction should be adapted to reflect the legal theme in question and reduce the overall ambiguity of the dictionary.
12. Absence of source citation. Never mention the book, code, or article number of the source text. The query must be precise enough to avoid ambiguity without mentioning the article or code to align with reality. Example:
Do not produce: 
"query": "Conformément à l'article 4 B du Code général des impôts, quels sont les critères permettant de considérer une personne, y compris les dirigeants d'entreprises et les agents de l'Etat, des collectivités territoriales et de la fonction publique hospitalière exerçant à l'étranger, comme ayant son domicile fiscal en France, en fonction de leur foyer ou lieu de séjour, de l'exercice d'une activité professionnelle, du centre de leurs intérêts économiques ou de leur statut ?"
  
But rather:
"query": "Quels sont les critères permettant de considérer une personne, y compris les dirigeants d'entreprises et les agents de l'Etat, des collectivités territoriales et de la fonction publique hospitalière exerçant à l'étranger, comme ayant son domicile fiscal en France, en fonction de leur foyer ou lieu de séjour, de l'exercice d'une activité professionnelle, du centre de leurs intérêts économiques ou de leur statut ?"
13. Prohibition of implicit citation. Never use phrases to refer to the source text (the "source text", "the cited article", "the mentioned article", "provided source text", etc.). The query must be formulated to be independent of the referenced content.

It seems the model sometimes generates text that includes JSON, but the JSON is wrapped in additional content and may be cut off at the end. I’m trying to parse this using json.loads(response.json()[0]["generated_text"]), but it fails because of the surrounding text and the potentially incomplete JSON.
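
As a stopgap, I’ve been experimenting with a best-effort extractor that strips the surrounding prose and Markdown fence and, when the array is cut off by max_new_tokens, closes it after the last complete object. This is only a sketch tailored to the flat {"query": ...} schema from my prompt; extract_json_array is my own helper, not part of any library:

import json
import re

def extract_json_array(generated_text: str) -> list:
    """Best-effort extraction of a JSON array from free-form model output."""
    # Prefer the content of a ``` fence when one is present; if the fence
    # is never closed (truncated output), match up to the end of the string.
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*(?:```|$)", generated_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else generated_text

    # Keep everything from the first opening bracket onwards.
    start = candidate.find("[")
    if start == -1:
        raise ValueError("No JSON array found in the generated text.")
    candidate = candidate[start:]

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Truncated array: drop the trailing incomplete object and close
        # the array. This only works for flat objects like {"query": "..."}.
        last_complete = candidate.rfind("}")
        if last_complete == -1:
            raise
        return json.loads(candidate[: last_complete + 1] + "\n]")

This recovers the complete entries from the sample output above, but it silently drops the truncated final object and would break on nested schemas, which is why a real JSON mode would be preferable.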

After further investigation, I’ve realized that the Inference API doesn’t seem to support a dedicated JSON mode. This explains why I’m receiving text output that includes JSON-like content rather than pure, parseable JSON.

Given this limitation, I’m wondering:

  1. Are there any recommended workarounds or best practices for extracting structured data from the text output of this model? (Something like the validate-and-retry loop sketched after this list is what I currently have in mind.)
  2. Is there a plan to implement a JSON mode or structured output option for the Meta-Llama-3-70B-Instruct model in future versions of the Inference API?
  3. Are there alternative models available through the Hugging Face Inference API that do support direct JSON output and also provide high-quality support for French language processing and advanced text labeling capabilities?
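
For question 1, this is the kind of validate-and-retry loop I mean; completion_with_retries is a hypothetical helper built on the Retriever above, whose completion method raises ValueError when json.loads fails:

def completion_with_retries(
    client: Retriever,
    payload: dict,
    max_attempts: int = 3
) -> dict:
    """Re-query the API until the generated text parses as JSON."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.completion(payload=payload)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            logging.warning(
                "Attempt %d/%d produced unparseable JSON: %s",
                attempt, max_attempts, err
            )
    raise RuntimeError(f"No parseable JSON after {max_attempts} attempts.")

Since the payload sets use_cache to False and do_sample to True, each retry can genuinely produce a different completion, so this sometimes succeeds, but it multiplies cost and latency compared to a native JSON mode.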

Any insights on these questions or suggestions for handling unstructured text output when structured data is needed would be greatly appreciated. Thank you for your help and for continuing to improve these powerful tools!

Louis Brulé Naudet :hugs: