Loading a specific model configuration in TGI

Hi Everyone,

I trained a Llama model with multiple QLoRA adapters and wanted to move this setup to TGI (Text Generation Inference), but I'm running into an issue.

Everything runs without errors, but the output quality is very poor: the model no longer gives coherent responses, even though it performed well on similar examples before the move.

Here are some areas where the issue might be coming from:

  1. Model loading: the original setup uses several parameters that I can't pass to TGI, such as `bfloat16` dtype (which can't be combined with `--quantize`), `bnb_4bit_use_double_quant` (which is specific to QLoRA), and the tokenizer configuration:
    compute_dtype = getattr(torch, "bfloat16")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
    )

    original_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map=device_map,
        quantization_config=bnb_config,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )

    print("Model loaded successfully")
    # print model architecture
    print(original_model)

    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
        padding_side="left",
        add_eos_token=True,
        add_bos_token=True,
        use_fast=False,
    )
    tokenizer.pad_token = tokenizer.eos_token

  2. Loading the adapters locally (I'm not sure whether this could cause any issue)

One thing I'm considering: loading the model with the LoRA configuration that works, merging and saving it, and then loading the merged model in TGI without `--quantize`.
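For reference, here's a minimal sketch of that merge-and-save idea using PEFT's `merge_and_unload()` (the model name, adapter path, and output directory below are placeholders for my actual setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "meta-llama/Llama-2-7b-hf"  # placeholder: your base model
adapter_path = "./my-qlora-adapter"           # placeholder: your local adapter

# Load the base model in bf16 (NOT 4-bit) so the merged weights can be
# saved as regular tensors that TGI can serve directly.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply the QLoRA adapter, then fold its low-rank update into the base weights.
model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer; point TGI's --model-id at this directory.
merged_model.save_pretrained("./merged-model", safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.save_pretrained("./merged-model")
```

One caveat: merging into bf16 weights is not numerically identical to running the adapter on top of an NF4-quantized base, so outputs may differ slightly from the training-time setup.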

So I wanted to ask: has anyone faced something like this before? Does TGI properly support QLoRA, and if so, how, given that I can't pass the double-quant config? I'm also not sure what to do about the tokenizer configuration.
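On the tokenizer point: one thing I'd double-check is whether the prompt formatting the tokenizer produced during training (special tokens, instruction template) actually reaches TGI, since TGI tokenizes the raw string you send. A minimal sketch of sending a manually formatted prompt to TGI's `/generate` endpoint — the URL and the template are placeholders, not my actual setup:

```python
import requests

# If fine-tuning relied on the tokenizer adding special tokens
# (add_bos_token / add_eos_token) or on a specific prompt template,
# reproduce that formatting in the request string itself.
prompt = "### Instruction:\nSummarize the text.\n\n### Response:\n"  # hypothetical template

resp = requests.post(
    "http://localhost:8080/generate",  # placeholder TGI endpoint
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
)
print(resp.json()["generated_text"])
```

If the training prompts and the strings sent to TGI differ, that alone can explain incoherent responses even when the weights are loaded correctly.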