GPTQ model to bfloat16

Hi,
I am trying to load a GPTQ-quantized Llama 2 model in bfloat16 so that ALL computations are conducted in bfloat16.

I have done the following:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

gptq_config = GPTQConfig(bits=4, disable_exllama=True)

model_path = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=False, revision="main", quantization_config=gptq_config, torch_dtype=torch.bfloat16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# cast the token ids to bfloat16 as well, since I want everything in bfloat16
input_ids = tokenizer("Tell me an interesting fact!", return_tensors="pt").input_ids.to("cuda").to(torch.bfloat16)
output = model.generate(input_ids)

but I get the following error:

  File "/home/jeevan/miniconda3/envs/llama2_env/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got CUDABFloat16Type instead (while checking arguments for embedding)
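
For what it's worth, a minimal standalone check (plain PyTorch, not tied to the model above) shows the same constraint comes straight from torch.nn.Embedding: the indices must be integer-typed regardless of the weight dtype, and the embedding output is already bfloat16 when the weights are bfloat16:

import torch

emb = torch.nn.Embedding(10, 4, dtype=torch.bfloat16)  # embedding weights in bfloat16
ids = torch.tensor([[1, 2, 3]])                         # int64 token indices

print(emb(ids).dtype)          # torch.bfloat16 -- the lookup output is already bfloat16
emb(ids.to(torch.bfloat16))    # raises the same "Expected ... Long, Int" RuntimeError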

Edit:
This is not specific to GPTQ. If I use the standard (unquantized) Llama 2 model, e.g.

model_path = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=False, revision="main", torch_dtype=torch.bfloat16).to("cuda")

the same issue occurs.

If I convert the model to bfloat16 but keep the input_ids as int64, then it works fine. But I want all activations in the model to be bfloat16, hence I assumed the input would also need to be cast to bfloat16.
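
For completeness, this is the variant that runs without the error, with a forward hook added purely to spot-check that the hidden activations really come out in bfloat16 (the hook and the model.model.layers[0] path are just for illustration and may differ across transformers versions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# keep the token ids as int64 -- they are indices into the embedding table, not activations
input_ids = tokenizer("Tell me an interesting fact!", return_tensors="pt").input_ids.to(model.device)

# print the dtype of the hidden states leaving the first decoder layer
def report_dtype(module, inputs, output):
    print("decoder layer output dtype:", output[0].dtype)  # expected: torch.bfloat16

hook = model.model.layers[0].register_forward_hook(report_dtype)
output = model.generate(input_ids, max_new_tokens=20)
hook.remove()

print(tokenizer.decode(output[0], skip_special_tokens=True))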