Llama: local model doesn't match hosted model, why?

We can’t get the same (or even similar) results using a llama model running locally vs. the same model in the cloud. Shouldn’t they be identical? How can we get them to be?

Context.
We have been sending prompts to several models in the cloud via OpenRouter and getting the performance we need. For example, we've got good benchmarks using meta-llama/llama-2-13b-chat.

We need the same performance on "our own server", but we have been getting very different, non-comparable results, and we don't understand why or how to fix it.

We've been running the "same" model, meta-llama/Llama-2-13b-chat-hf, on Google Colab. The performance metrics are nowhere near the same, and even from single prompts it's obvious that the responses are qualitatively completely different. We can't even get a straight answer to trivial test prompts like variants of "2+2=???": locally we get long, wandering nonsense where the cloud version simply answers "4". We've tried adjusting parameters and made no progress. We don't know the exact details of how OpenRouter's hosted version of the model is set up. We've tried several Llama models, and in each case we can't get the local and hosted versions to behave even broadly similarly, much less quantitatively the same.
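One thing we suspect (this is a guess on our part, since we don't know OpenRouter's setup): the Llama-2 chat models were fine-tuned on a specific `[INST]`-style prompt template, which hosted endpoints typically apply to the raw user input, while our local pipeline call passes the bare string. A minimal sketch of what we believe that template looks like (the system prompt here is just an example):

```python
# Llama-2-chat models expect user input wrapped in a specific template;
# a raw local pipeline call does not apply it automatically.
def build_llama2_chat_prompt(user_message,
                             system_prompt="You are a helpful assistant."):
    # <s>[INST] <<SYS>> ... <</SYS>> user message [/INST]
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(build_llama2_chat_prompt("2+2=???"))
```

If the hosted endpoint applies this wrapping and our local code doesn't, that alone could explain the "long wandering nonsense" on trivial prompts.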

Here’s our current code, running on google colab:
from flask import Flask, request, jsonify

model_id = "meta-llama/Llama-2-13b-chat-hf"
pipeline = load_model(model_id)

# Create Flask app
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    print("Received request")
    data = request.json
    input_text = data.get('input_text', '')

    if not input_text:
        print("No input_text provided")
        return jsonify({'error': 'No input_text provided'}), 400

    try:
        outputs = pipeline(
            input_text,
            temperature=0.5,
            top_p=1,
            max_new_tokens=12,  # reduce max_new_tokens to save memory
            do_sample=True,
        )

        print("Outputs:", outputs)  # Log the entire output response
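(The `load_model` helper isn't shown above; for reference, ours is roughly the following. `device_map="auto"` and `float16` are our own choices, not anything we know about the hosted setup:)

```python
def load_model(model_id):
    # Imports kept inside the function so this sketch can be read/loaded
    # without a GPU session or the model weights present.
    import torch
    from transformers import pipeline as hf_pipeline

    # Standard text-generation pipeline; device_map="auto" lets
    # transformers place the 13B model across available devices.
    return hf_pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )
```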