OpenAI API standards & on-prem model options

Hi, total AI newb here. We are looking to install a model on-prem. The supplier of the software requiring the model suggested Azure OpenAI (Microsoft), but as I understand it, that cannot be installed 100% on-prem.

The other option from the supplier is Llama, but they have asked me whether Llama 3 supports the OpenAI API standard. I can’t seem to find any information on this.

Thanks for any info around this.


There are quite a few ways to set up a server that is compatible with the OpenAI API. To be clear, a model by itself has no API: OpenAI-API compatibility comes from the inference server you run the model behind. The model suggested below is Llama 3.1 405B, but in reality there is no particular reason to stick with Llama; you can simply pick a strong model from a public leaderboard instead.

The overview below was generated with Hugging Chat:


To run Llama-series models on-premises behind an OpenAI-compatible API, here are some concrete options:

1. Using Ollama

  • Description: Ollama simplifies running large language models (LLMs) like Llama on macOS or Linux. You can fetch and run Llama models from its library, and it also serves an OpenAI-compatible API at http://localhost:11434/v1 (see the request sketch below).
  • Steps:
    1. Install Ollama on your machine.
    2. Pull and run a Llama model from Ollama’s library, as shown below.
  • Example:
    # Pull a Llama model, then run it with a prompt
    ollama pull llama3.1
    ollama run llama3.1 "Your prompt"
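
Ollama also speaks the OpenAI wire format directly, so software written against the OpenAI API can simply point at it. A minimal request sketch, assuming the llama3.1 model has been pulled as above:

    # Standard OpenAI-style chat completion against Ollama's
    # OpenAI-compatible endpoint (no real API key is required locally)
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'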
    

2. Deploying with vLLM (Llama 3.1 405B)

  • Description: vLLM is an efficient inference server for large language models, supporting models like Llama 3.1 405B. It allows you to expose the model through an OpenAI-compatible API.
  • Steps:
    1. Quantize the model if needed (e.g., to FP8) so it fits on your hardware.
    2. Load the model using vLLM and start the OpenAI-compatible API server.
  • Example:
    # Example script to start the OpenAI-compatible API server
    python -m vllm.entrypoints.openai.api_server \
        --model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic \
        --tensor-parallel-size 8
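
By default the vLLM server listens on port 8000 and exposes the standard OpenAI routes. A minimal request sketch against it, reusing the model id from the command above:

    # Sanity-check which models the server exposes, then send a chat completion
    curl http://localhost:8000/v1/models
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'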
    

3. Using Hugging Face Endpoints

  • Description: Hugging Face Inference Endpoints let you deploy models like Llama behind an OpenAI-compatible API. Note that Inference Endpoints are a managed, hosted service; for a strictly on-prem setup, the underlying server, Text Generation Inference (TGI), can be self-hosted instead (see the sketch after this section).
  • Steps:
    1. Deploy the Llama model as a Hugging Face Inference Endpoint (or self-host it with TGI for on-prem).
    2. Access the model through the provided API endpoint.
  • Example:
    • Visit the Llama model page on Hugging Face and use its deploy options to create an Inference Endpoint.
    • Access the model through the endpoint’s URL.
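
For the strictly on-prem case, here is a minimal sketch of self-hosting TGI with Docker. The model id is a smaller Llama repo chosen purely for illustration, and gated Llama repos also require passing a Hugging Face access token:

    # Self-hosted TGI serving a Llama model (requires a GPU);
    # TGI's Messages API is OpenAI Chat Completions compatible
    docker run --gpus all -p 8080:80 -v $PWD/data:/data \
      ghcr.io/huggingface/text-generation-inference:latest \
      --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
    # Then query it like any other OpenAI-style server:
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello!"}]}'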

4. Local Installation with Fine-Tuning

  • Description: You can install Llama models locally and expose them via a custom API that mimics the OpenAI API standard.
  • Steps:
    1. Download the Llama model.
    2. Fine-tune the model for specific tasks if needed.
    3. Deploy the model locally and create an API endpoint to access it.
  • Example:
    • Use a local inference server (or a small custom wrapper) to expose the model via an OpenAI-compatible API, as in the sketch below.
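
As one concrete sketch of this approach: llama.cpp ships a built-in server with an OpenAI-compatible chat endpoint. The GGUF file path below is a placeholder for whichever model file you downloaded:

    # Serve a local GGUF model with llama.cpp's built-in server
    ./llama-server -m ./models/llama-3.1-8b-instruct.gguf --port 8080
    # The server then accepts standard OpenAI-style chat requests:
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello!"}]}'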

By leveraging these solutions, you can run Llama-series models entirely on-premises while remaining compatible with the OpenAI API standard.