Hi,
I am trying to convert a GPTQ Llama 2 model to bfloat16 so that ALL computations are carried out in bfloat16.
I have done the following:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

gptq_config = GPTQConfig(bits=4, disable_exllama=True)
model_path = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=False,
    revision="main",
    quantization_config=gptq_config,
    torch_dtype=torch.bfloat16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
input_ids = tokenizer("Tell me an interesting fact!", return_tensors="pt").input_ids.to("cuda").to(torch.bfloat16)
output = model.generate(input_ids)
but I get the following error:
File "/home/jeevan/miniconda3/envs/llama2_env/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got CUDABFloat16Type instead (while checking arguments for embedding)
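For reference, the same failure can be reproduced with a plain `nn.Embedding` on CPU, independent of Llama or GPTQ (a minimal sketch with toy sizes, not the real model): the embedding layer is an integer index lookup, so it rejects floating-point indices.

```python
import torch
import torch.nn as nn

# nn.Embedding looks up rows by integer index, so the input must be
# int64/int32 token ids, not floating-point values
emb = nn.Embedding(10, 4)
ids = torch.tensor([[1, 2, 3]])     # int64 ids: lookup works
_ = emb(ids)

try:
    emb(ids.to(torch.bfloat16))     # casting the indices themselves fails
except RuntimeError as err:
    print(type(err).__name__)       # RuntimeError, same as in the traceback
```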
Edit:
Update: this is not specific to GPTQ. If I use the standard Llama model, e.g.
model_path = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=False,
    revision="main",
    torch_dtype=torch.bfloat16,
).to("cuda")
the same issue occurs.
If I convert the model to bfloat16 but keep the input as int64, it works fine. However, I want all activations in the model to be bfloat16, hence I would like the input to also be cast to bfloat16.
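To illustrate the working int64 path (a toy sketch with made-up sizes, not the real Llama weights): even when the token ids stay int64, the first hidden state already comes out in bfloat16, because the lookup output inherits the dtype of the embedding weights.

```python
import torch
import torch.nn as nn

# toy embedding cast to bfloat16, standing in for the model's embed layer
emb = nn.Embedding(32000, 8).to(torch.bfloat16)
ids = torch.tensor([[1, 5, 7]])   # int64 token ids, never cast
hidden = emb(ids)                 # first activation inside the model
print(hidden.dtype)               # torch.bfloat16
```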