I’m running CodeLlama 7B on 8x A100 PCIe 40GB GPUs, loaded in 4-bit with bitsandbytes:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bf16 compute
q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Shard the quantized model across all 8 GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=q_config,
    max_memory={0: "39GB", 1: "39GB", 2: "39GB", 3: "39GB", 4: "39GB", 5: "39GB", 6: "39GB", 7: "39GB"},
)
With max_new_tokens=4096, each inference takes 20 seconds.
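For reference, the generation call itself is just a plain model.generate() along these lines (simplified sketch; the prompt is only a placeholder):

# Minimal sketch of the generation call (prompt is a placeholder)
prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))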
I’m wondering if there are other ways to speed up inference besides lowering max_new_tokens.
Thanks