Adapter-aware chat_template

Aurelien-Morgan · January 20, 2025, 1:39pm

Hello,

I do LoRa SFT with custom chat_template.
So, I have a pair “LoRa Adapter (config+weights)” and “chat_template”.
Adapter and chat_template were used together for training and, of course, work hand-in-hand for inferring.

Now, the thing is, I have LoRa adapters specialized for different tasks for one base_model.
Meaning I actually have several such pairs.

PEFT (and TGI) support multi-LoRa but, I’m not clear as to what the right way to handle the respective associated chat_template is in TGI.

Any pointer on a cookbook on how it can be done nicely would be of tremendous help to me.
I’d like not to have to invent custom logic for this if it exists.

I’d like to switch LoRa adapter with TGI (thus, switch LoRa AND chat_template)
I don’t know if people have done that before. In the tokenizer “chat_template” attribute, have Jinja conditional switch based on active adapter_name, maybe, but how ?

Looking for directions.
Please let me know if you’ve seen such chat_template switch in TGI (with LoRa adapter switch).

And, I apologize in advance if “Tokenizer” is not the best category for this topic.

Alanturner2 · January 22, 2025, 8:49am

Hello,

It sounds like you’re working with a setup that requires dynamically switching between multiple LoRa adapters and corresponding chat templates. For handling this scenario in TGI (Transformers-based Generative Inference) with PEFT (Parameter Efficient Fine-Tuning), you’re on the right track with considering how to manage the relationship between adapters and templates.

Here’s a suggestion on how to approach it:

Adapter and Template Mapping: For each task-specific LoRa adapter, you’ll need to maintain a mapping of which chat_template corresponds to which adapter. The goal is to dynamically switch between these pairs based on the active adapter.
Custom Logic for Switching Templates: While there might not be a cookbook that directly handles this specific situation, you can implement custom logic using a chat_template manager. The idea would be to have a structure that holds the templates for each adapter, and a function that, when you switch adapters, also switches the template.
Using Jinja for Conditional Template Switching: One way to manage conditional template switching based on the active adapter is by using a placeholder or identifier within your chat_template. Here’s a conceptual approach:

from transformers import pipeline
from jinja2 import Template

# A hypothetical mapping of adapters to templates
adapter_to_template = {
    "task1_adapter": "chat_template_1.jinja",
    "task2_adapter": "chat_template_2.jinja",
    # Add more adapters and their corresponding templates
}

def get_chat_template(adapter_name):
    # Load the corresponding template for the adapter
    template_path = adapter_to_template.get(adapter_name)
    if template_path:
        with open(template_path, 'r') as file:
            template_content = file.read()
        return Template(template_content)
    return None

def switch_adapter_and_template(adapter_name):
    # Load the correct LoRa adapter and the associated template
    chat_template = get_chat_template(adapter_name)
    if chat_template:
        # Now, configure TGI to use the new adapter and its corresponding template
        # This is where you plug the new adapter and template into the inference pipeline
        print(f"Switching to adapter: {adapter_name}")
        # Example: Call your PEFT method or TGI logic to load the new adapter and template
        # You might also update the tokenizer if needed
        # tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        # Or handle the tokenization flow based on the selected template
        return chat_template
    else:
        print(f"No template found for {adapter_name}")
        return None

Switching Based on Active Adapter: You can then switch the LoRa adapter and template based on user input or the task at hand, updating the tokenizer and pipeline as necessary. The chat_template can include placeholders or conditions to tailor the responses for different tasks.
TGI Integration: If you’re using TGI for inference, ensure that the new adapter and template are correctly loaded into the inference pipeline. You might need to pass the correct tokenizer and adapter to TGI’s model.

This approach should give you flexibility without reinventing the wheel. If TGI and PEFT have built-in support for multi-adapter, your work may just involve linking the templates correctly based on the active adapter.

Feel free to adapt this to your specific use case! Let me know if you need further help or clarification.

Aurelien-Morgan · January 23, 2025, 4:23pm

Obviously LLM-generated non-sense.

Aurelien-Morgan · February 21, 2025, 11:51am

I have seen a adapter_tokenizer field in the TGI traces
But it seems that the respective adapter tokenizers (with their respective chat_template from config) are never picked up by TGI and the field to always remain None so maybe it’s a placeholder for later.

In the meantime and for future references, I ended up making a toy implementation of this in a litserve server, the first version of which can be found on the retrain-pipelines GitHub [here].
A single-model, multi-lora endpoint with adapter-aware chat-template switch.
It’s not TGI with Rust for HTTP layer and scheduler, neither with flash inference and custom kernels, but it does the job for me for now.

Topic		Replies	Views
Using Text Generation Inference with LoRA adapter Beginners	3	2936	September 2, 2024
Addition of lm_head and embed_tokens layers to the lora adapter Beginners	3	559	February 16, 2025
Multiple chat templates Beginners	0	270	June 20, 2024
Fine Tune with/without LORA 🤗Transformers	1	232	October 7, 2024
`get_peft_model` or `model.add_adapter` Beginners	2	1174	February 17, 2025

Adapter-aware chat_template

Related topics