Loading a specific model configuration in TGI

Hi Everyone,

I trained a Llama model with multiple QLoRA adapters and wanted to move this setup to TGI (Text Generation Inference), but I'm running into an issue.

Everything runs without errors, but the output quality is very poor: the model no longer gives coherent responses, even though it performed well on similar examples before the move.

Here are some areas where the issue might be coming from:

  1. Model loading: the original setup uses several parameters that I can't pass to TGI, such as `bfloat16` dtype (which can't be combined with `--quantize`), `bnb_4bit_use_double_quant` (which is specific to QLoRA), and the tokenizer configuration:
    compute_dtype = getattr(torch, "bfloat16")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
    )

    original_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map=device_map,
        quantization_config=bnb_config,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )

    print("Model loaded successfully")
    # print model architecture
    print(original_model)

    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
        padding_side="left",
        add_eos_token=True,
        add_bos_token=True,
        use_fast=False,
    )
    tokenizer.pad_token = tokenizer.eos_token

  2. Loading the adapters locally (I'm not sure whether this could cause any issue)

One thing I'm considering: loading the model with the LoRA configuration that works, merging and saving it, and then loading the merged model in TGI without `--quantize`.
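For reference, here's a minimal sketch of that merge-and-save idea using PEFT's `merge_and_unload()` (the model name, adapter path, and output directory below are placeholders for my actual setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "meta-llama/Llama-2-7b-hf"  # placeholder: your base model
adapter_path = "./my-qlora-adapter"           # placeholder: your local adapter

# Load the base model in bf16 (NOT 4-bit) so the merged weights can be
# saved as regular tensors that TGI can serve directly.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply the QLoRA adapter, then fold its low-rank update into the base weights.
model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer; point TGI's --model-id at this directory.
merged_model.save_pretrained("./merged-model", safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.save_pretrained("./merged-model")
```

One caveat: merging into bf16 weights is not numerically identical to running the adapter on top of an NF4-quantized base, so outputs may differ slightly from the training-time setup.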

So I wanted to ask: has anyone faced something like this before? Does TGI properly support QLoRA, and if so, how, given that I can't pass the double-quant config? I'm also not sure what to do about the tokenizer configuration.
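On the tokenizer point: one thing I'd double-check is whether the prompt formatting the tokenizer produced during training (special tokens, instruction template) actually reaches TGI, since TGI tokenizes the raw string you send. A minimal sketch of sending a manually formatted prompt to TGI's `/generate` endpoint — the URL and the template are placeholders, not my actual setup:

```python
import requests

# If fine-tuning relied on the tokenizer adding special tokens
# (add_bos_token / add_eos_token) or on a specific prompt template,
# reproduce that formatting in the request string itself.
prompt = "### Instruction:\nSummarize the text.\n\n### Response:\n"  # hypothetical template

resp = requests.post(
    "http://localhost:8080/generate",  # placeholder TGI endpoint
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
)
print(resp.json()["generated_text"])
```

If the training prompts and the strings sent to TGI differ, that alone can explain incoherent responses even when the weights are loaded correctly.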