How can I deploy a Llama2-like model in int4/int8 on inference endpoints?

Hello team,

I am attempting to deploy a Llama2 model for testing purposes and have run into some challenges, as I am fairly new to this domain.

The model I want to deploy is a Llama2 model (hdnh2006/llama-test · Hugging Face). This is only for testing, but I have many doubts about the process and so far I have not been able to deploy the model on Inference Endpoints.

Let me provide some more context. I loaded the model in int8 precision by running the following lines of code:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "llama-hf"  # local Llama2 checkpoint already converted to the Hugging Face format
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# Load the weights in 8-bit (via bitsandbytes) on GPU 0
model = LlamaForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map=0,
    torch_dtype=torch.float16,
)

# Export the quantized model and tokenizer to a new folder
tokenizer.save_pretrained('llama-test')
model.save_pretrained('llama-test')
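
For reference, the exported 8-bit model generates fine for me locally. A minimal check along these lines (continuing from the snippet above; the prompt is just an example) runs without issues:

# Minimal local sanity check of the 8-bit model (prompt is arbitrary)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))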

You can see the resulting config.json in the public repo, and it clearly contains the parameter "load_in_8bit": true. Given this setup, I expected a seamless deployment on a Tesla T4, which is what I get with our cloud provider and on my personal laptop (RTX 4060 8GB).
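
In case it is useful, this is roughly how I double-check that the exported folder carries the 8-bit flag (assuming the llama-test folder produced above; the attribute name reflects how recent transformers versions store it, so treat it as a sketch):

from transformers import AutoConfig

# Inspect the exported config to confirm the 8-bit quantization settings were saved
cfg = AutoConfig.from_pretrained("llama-test")
print(getattr(cfg, "quantization_config", None))  # expected to mention load_in_8bit: True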

Nonetheless, I have not been able to deploy the model on Inference Endpoints. The logs suggest an oversight on my part. I would like to understand how to deploy the model via your service, particularly with half-precision or int8 configurations. Could you point me to the relevant documentation or to what I am missing?
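
One thing I noticed in the launcher logs below is quantize: None, so I wonder whether the quantization setting has to be set explicitly on the endpoint rather than picked up from config.json. I also considered a custom handler.py; the sketch below is untested and only reflects my reading of the custom handler interface (class name, method signatures, and the "inputs"/"generated_text" keys are my assumptions):

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

class EndpointHandler:
    def __init__(self, path: str = ""):
        # Load the repository weights in 8-bit, mirroring the local setup above
        self.tokenizer = LlamaTokenizer.from_pretrained(path)
        self.model = LlamaForCausalLM.from_pretrained(
            path,
            load_in_8bit=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )

    def __call__(self, data: dict) -> list:
        # The endpoint passes the request payload as a dict; "inputs" holds the prompt
        prompt = data["inputs"]
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=64)
        return [{"generated_text": self.tokenizer.decode(output[0], skip_special_tokens=True)}]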

Part of the logs:
9kb8c 2023-10-27T09:35:24.335Z INFO | Repository Revision: fd2b689bc5fb8f38dfd016b81d7459ea330fc794
9kb8c 2023-10-27T09:35:24.335Z INFO | Used configuration:
9kb8c 2023-10-27T09:35:24.335Z INFO | Start loading image artifacts from huggingface.co
9kb8c 2023-10-27T09:35:24.335Z INFO | Repository ID: hdnh2006/llama-test
9kb8c 2023-10-27T09:35:24.646Z INFO | Ignore regex pattern for files, which are not downloaded: openvino, mlmodel, flax, *tflite, onnx, ckpt, rust, *safetensors, tar.gz, tf
9kb8c 2023-10-27T09:35:48.801Z Token will not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
9kb8c 2023-10-27T09:35:48.801Z Token is valid.
9kb8c 2023-10-27T09:35:48.801Z Your token has been saved to /root/.cache/huggingface/token
9kb8c 2023-10-27T09:35:48.801Z Login successful
9kb8c 2023-10-27T09:38:12.898Z {"timestamp":"2023-10-27T09:38:12.898077Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
9kb8c 2023-10-27T09:38:12.898Z {"timestamp":"2023-10-27T09:38:12.897947Z","level":"INFO","fields":{"message":"Args { model_id: "/repository", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 1512, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 2048, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "pen-istemas76bb-my-llama2-7646c7965-9kb8c", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: , watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
9kb8c 2023-10-27T09:38:15.902Z {"timestamp":"2023-10-27T09:38:15.902389Z","level":"WARN","fields":{"message":"No safetensors weights found for model /repository at revision None. Converting PyTorch weights to safetensors.\n"},"target":"text_generation_launcher"}