Llama: local model doesn't match hosted model, why?

We can’t get the same (or even similar) results using a llama model running locally vs. the same model in the cloud. Shouldn’t they be identical? How can we get them to be?

Context.
We have been sending prompts to several models in the cloud via OpenRouter and getting the performance we need. For example, we've got good benchmarks using meta-llama/llama-2-13b-chat.

We need the same performance on "our own server", but we have been getting very different, non-comparable results, and we don't understand why or how to fix it.

We've been running the "same" model, meta-llama/Llama-2-13b-chat-hf, on Google Colab. The performance metrics are nowhere near the same, and even from single prompts it's obvious that the responses are qualitatively completely different. We can't even get a straight answer to trivial test prompts like variants of "2+2=???": locally we get long, wandering nonsense where the cloud version simply answers "4". We've tried adjusting parameters and made no progress. We don't know the exact details of how OpenRouter's hosted version of the model is set up. We've tried several Llama models, and in each case we can't get the local and hosted versions to behave even broadly similarly, much less quantitatively the same.
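One thing we suspect (this is a guess on our part, since we don't know OpenRouter's setup): the Llama-2 chat models were fine-tuned on a specific `[INST]`-style prompt template, which hosted endpoints typically apply to the raw user input, while our local pipeline call passes the bare string. A minimal sketch of what we believe that template looks like (the system prompt here is just an example):

```python
# Llama-2-chat models expect user input wrapped in a specific template;
# a raw local pipeline call does not apply it automatically.
def build_llama2_chat_prompt(user_message,
                             system_prompt="You are a helpful assistant."):
    # <s>[INST] <<SYS>> ... <</SYS>> user message [/INST]
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(build_llama2_chat_prompt("2+2=???"))
```

If the hosted endpoint applies this wrapping and our local code doesn't, that alone could explain the "long wandering nonsense" on trivial prompts.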

Here’s our current code, running on google colab:
from flask import Flask, request, jsonify

model_id = "meta-llama/Llama-2-13b-chat-hf"
pipeline = load_model(model_id)

# Create Flask app
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    print("Received request")
    data = request.json
    input_text = data.get('input_text', '')

    if not input_text:
        print("No input_text provided")
        return jsonify({'error': 'No input_text provided'}), 400

    try:
        outputs = pipeline(
            input_text,
            temperature=0.5,
            top_p=1,
            max_new_tokens=12,  # reduce max_new_tokens to save memory
            do_sample=True,
        )

        print("Outputs:", outputs)  # Log the entire output response
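(The `load_model` helper isn't shown above; for reference, ours is roughly the following. `device_map="auto"` and `float16` are our own choices, not anything we know about the hosted setup:)

```python
def load_model(model_id):
    # Imports kept inside the function so this sketch can be read/loaded
    # without a GPU session or the model weights present.
    import torch
    from transformers import pipeline as hf_pipeline

    # Standard text-generation pipeline; device_map="auto" lets
    # transformers place the 13B model across available devices.
    return hf_pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )
```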