meta-llama/Llama-2-7b-chat-hf returns weird responses compared to the ones returned by the HF API

Or is that something that affects only speed?

Basically, that should be the case; there are few situations where you get half-baked results due to insufficient hardware performance. Either it works or it doesn't, and it's either fast or slow.

I found the official HF implementation for Llama 2. The setting tokenizer.use_default_system_prompt = False may be significant.

Since Llama 2 has been around for a long time, it has been affected by various HF specification changes, so there is likely some confusion about how to use it correctly.