How to speed up CodeLlama inference?

I’m running CodeLlama 7B on 8×A100 PCIe 40GB GPUs, loaded in 4-bit NF4 via bitsandbytes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, bf16 compute
q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# model_path points at the CodeLlama 7B checkpoint (local path or hub id)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=q_config,
    # cap usable memory on each of the 8 GPUs
    max_memory={i: "39GB" for i in range(8)},
)

With max_new_tokens=4096, each inference takes about 20 seconds.
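
For reference, this is roughly how I’m timing a single generation (the prompt and timing code below are illustrative, not my exact benchmark):

import time

prompt = "def fibonacci(n):"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
output_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=False,  # greedy decoding
)
print(f"generation took {time.time() - start:.1f}s")
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
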
I’m wondering whether there are other ways to speed up inference besides lowering max_new_tokens.
Thanks