OpenAI API standards & on-prem model options

Hi, total AI newb here. We are looking to install a model on-prem. The supplier of the software requiring the model suggested Azure OpenAI (Microsoft), but as I understand it, that cannot be installed 100% on-prem.

The other option from the supplier is Llama, but they have asked me whether Llama 3 supports the OpenAI API standard. I can’t seem to find any information on this.

Thanks for any info around this.


There are quite a few ways to set up a server that is compatible with the OpenAI API. To be clear, a model by itself has no API: OpenAI-API compatibility comes from the inference server you run the model behind. The model suggested below is Llama 3.1 405B, but in reality there is no particular reason to stick with Llama; you can simply pick a strong model from a public leaderboard instead.

The overview below was generated with Hugging Chat:


To run Llama-series models on-premises behind an OpenAI-compatible API, here are some concrete options:

1. Using Ollama

  • Description: Ollama simplifies running large language models (LLMs) like Llama on macOS or Linux. You can fetch and run Llama models from its library, and it also serves an OpenAI-compatible API at http://localhost:11434/v1 (see the request sketch below).
  • Steps:
    1. Install Ollama on your machine.
    2. Pull and run a Llama model from Ollama’s library, as shown below.
  • Example:
    # Pull a Llama model, then run it with a prompt
    ollama pull llama3.1
    ollama run llama3.1 "Your prompt"
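
Ollama also speaks the OpenAI wire format directly, so software written against the OpenAI API can simply point at it. A minimal request sketch, assuming the llama3.1 model has been pulled as above:

    # Standard OpenAI-style chat completion against Ollama's
    # OpenAI-compatible endpoint (no real API key is required locally)
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'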
    

2. Deploying with vLLM (Llama 3.1 405B)

  • Description: vLLM is an efficient inference server for large language models, supporting models like Llama 3.1 405B. It allows you to expose the model through an OpenAI-compatible API.
  • Steps:
    1. Quantize the model if needed (e.g., to FP8) so it fits on your hardware.
    2. Load the model using vLLM and start the OpenAI-compatible API server.
  • Example:
    # Example script to start the OpenAI-compatible API server
    python -m vllm.entrypoints.openai.api_server \
        --model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic \
        --tensor-parallel-size 8
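
By default the vLLM server listens on port 8000 and exposes the standard OpenAI routes. A minimal request sketch against it, reusing the model id from the command above:

    # Sanity-check which models the server exposes, then send a chat completion
    curl http://localhost:8000/v1/models
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'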
    

3. Using Hugging Face Endpoints

  • Description: Hugging Face Inference Endpoints let you deploy models like Llama behind an OpenAI-compatible API. Note that Inference Endpoints are a managed, hosted service; for a strictly on-prem setup, the underlying server, Text Generation Inference (TGI), can be self-hosted instead (see the sketch after this section).
  • Steps:
    1. Deploy the Llama model as a Hugging Face Inference Endpoint (or self-host it with TGI for on-prem).
    2. Access the model through the provided API endpoint.
  • Example:
    • Visit the Llama model page on Hugging Face and use its deploy options to create an Inference Endpoint.
    • Access the model through the endpoint’s URL.
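
For the strictly on-prem case, here is a minimal sketch of self-hosting TGI with Docker. The model id is a smaller Llama repo chosen purely for illustration, and gated Llama repos also require passing a Hugging Face access token:

    # Self-hosted TGI serving a Llama model (requires a GPU);
    # TGI's Messages API is OpenAI Chat Completions compatible
    docker run --gpus all -p 8080:80 -v $PWD/data:/data \
      ghcr.io/huggingface/text-generation-inference:latest \
      --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
    # Then query it like any other OpenAI-style server:
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello!"}]}'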

4. Local Installation with Fine-Tuning

  • Description: You can install Llama models locally and expose them via a custom API that mimics the OpenAI API standard.
  • Steps:
    1. Download the Llama model.
    2. Fine-tune the model for specific tasks if needed.
    3. Deploy the model locally and create an API endpoint to access it.
  • Example:
    • Use a local inference server (or a small custom wrapper) to expose the model via an OpenAI-compatible API, as in the sketch below.
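
As one concrete sketch of this approach: llama.cpp ships a built-in server with an OpenAI-compatible chat endpoint. The GGUF file path below is a placeholder for whichever model file you downloaded:

    # Serve a local GGUF model with llama.cpp's built-in server
    ./llama-server -m ./models/llama-3.1-8b-instruct.gguf --port 8080
    # The server then accepts standard OpenAI-style chat requests:
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello!"}]}'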

By leveraging these solutions, you can run Llama-series models entirely on-premises while remaining compatible with the OpenAI API standard.