WARN/ERRORs when loading `togethercomputer/LLaMA-2-7B-32K`

I am using the HF text-generation-inference (TGI) Docker container.

When I load the model togethercomputer/LLaMA-2-7B-32K, the log shows the warnings and errors below. I am trying to get a better feel for what they mean and how best to handle them going forward (e.g., report a problem with this particular model, ignore them because they only make the download/build less efficient than it needs to be without causing real harm, or something else). So below I list the warnings and errors, with great hope that someone will help me understand them, and, in the case of the errors, suggest next steps.


WARN download: text_generation_launcher: No safetensors weights found for model togethercomputer/LLaMA-2-7B-32K at revision None. Converting PyTorch weights to safetensor.

My Interpretation

Bummer… the team that posted the model didn’t serialize their weights in the safetensors format, so the PyTorch weights have to be converted at load time. Not a big deal, but it takes longer to load and hence costs more GPU/CPU time.
→ Is that correct?
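To make that cost concrete: as I understand it, the conversion only has to happen when no `.safetensors` shards are present, so one way to check whether a local copy has already been converted is to scan the model directory. A minimal sketch (the directory path is an assumption; adjust it to wherever your Docker volume stores the downloaded weights):

```python
from pathlib import Path

def has_safetensors(model_dir: str) -> bool:
    """True if any *.safetensors shard exists under model_dir (recursively)."""
    return any(Path(model_dir).rglob("*.safetensors"))

# Hypothetical cache location inside the container; adjust for your volume mount.
# print(has_safetensors("/data/models--togethercomputer--LLaMA-2-7B-32K"))
```

If this returns True, the converted shards are already cached and subsequent loads should skip the conversion step.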


WARN text_generation_router: router/src/main.rs:136: Could not find a fast tokenizer implementation for togethercomputer/LLaMA-2-7B-32K

My Interpretation

According to the HF fast tokenizers explanation video on YouTube, this is another inefficient bummer. Because no fast (Rust-backed) tokenizer implementation exists for this model, tokenization falls back to the slow Python implementation, which can take up to 20x longer. But the model can still be used.
→ Is this a correct interpretation?


WARN text_generation_router: router/src/main.rs:139: Rust input length validation and truncation is disabled

My Interpretation

I assume this means another inefficiency: because there is no fast tokenizer, the router cannot count tokens in Rust, so input length validation and truncation are skipped entirely rather than just being slower?
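If that reading is right, the practical consequence would be that the client has to enforce the input length limit itself before sending a request. A minimal sketch of client-side truncation, using whitespace splitting as a stand-in for the model’s real tokenizer (actual LLaMA token counts will differ, so a safety margin is needed):

```python
def truncate_prompt(prompt: str, max_input_tokens: int) -> str:
    """Trim prompt to at most max_input_tokens, keeping the earliest tokens.

    Whitespace splitting is only a stand-in: the server's limit is counted
    in model tokens, so leave yourself a generous safety margin.
    """
    tokens = prompt.split()
    if len(tokens) <= max_input_tokens:
        return prompt
    return " ".join(tokens[:max_input_tokens])

print(truncate_prompt("one two three four five", 3))  # → "one two three"
```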

Then I can access the model; however, the answer is a garbled bunch of bytes. For example:

### What nematodes would you recommend to get rid of fungus gnats in the soil? 

I am plodding my way through these steps so that I can better evaluate which models are best to use, both in terms of “precision” and in terms of efficiency/robustness.

Please help me understand how to figure these things out, especially in this case, where other models give me answers yet here I get garbage out.

Thank you.