Pipelines for Chat Generation with Memory

Good evening, dear community,

I'm trying to build a chatbot using a Pipeline with a text-generation model. So far, I have been able to generate a response from the LLM using the following snippet:

Vicuna_pipe = pipeline("text-generation", model=llm_Vicuna, tokenizer=Vicuna_tokenizer, max_new_tokens=512, temperature=0.7, top_p=0.9, do_sample=True)

However, every time I instantiate the pipeline with a new user prompt, the LLM starts from scratch, forgetting all the previous context. I've done some research, and some LangChain libraries resolve this issue without using the pipeline abstraction. Has anyone here been able to add a memory buffer to the pipeline? What would be the best approach to do so? If someone can point me in the right direction, I would really appreciate it! Thanks!

Hi @mavaron

I don't think it is possible to pass a chat buffer to a pipeline. One option is to pass the complete history, or part of it, in your prompt, for example:
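A minimal, untested sketch of that first approach, assuming a model whose tokenizer ships with a chat template (the history list and the chat helper are just illustrative names, not part of any API):

from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

# keep the history yourself and rebuild the prompt from it on every call
history = [
    {"role": "system", "content": "You are a friendly chatbot who answers user questions."},
]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # render the accumulated history with the model's chat template
    prompt = pipe.tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
    reply = pipe(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
    history.append({"role": "assistant", "content": reply})
    return reply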
Another possibility would be to create a custom pipeline. I have tested this as follows:

import torch
from pprint import pprint
from transformers import Pipeline, AutoTokenizer, AutoModelForCausalLM


# chat buffer
buffer = []

# function to embed messages in template format
def embed_message(message, role):
    return {
        "role": role,
        "content": message
    }

# custom pipeline
class ChatBufferPipeline(Pipeline):
    
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        
        # lookback on chat
        if "lookback" in kwargs:
            preprocess_kwargs["lookback"] = kwargs["lookback"]
            
        return preprocess_kwargs, {}, {}

    def preprocess(self, prompt, lookback=None):
        # initial system message
        messages = [
            {
                "role": "system",
                "content": "You are a friendly chatbot who answers user questions. You can use the previous examples if this helps you.",
            },
        ]
        # get chat history
        if lookback:
            buffer_messages = buffer[-(lookback):]
            messages += buffer_messages
        # embed user message in template format
        user_message = embed_message(prompt, "user")
        messages.append(user_message)
        # add new message to buffer
        buffer.append(user_message)
            
        messages = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        return self.tokenizer(messages, return_tensors="pt").input_ids.cuda()

    def _forward(self, model_inputs):
        outputs = self.model.generate(model_inputs, max_new_tokens=250, min_new_tokens=20)
        return {"outputs": outputs, "inputs": model_inputs}

    def postprocess(self, model_outputs):
        outputs = model_outputs["outputs"]
        inputs = model_outputs["inputs"]
        assistant_output = self.tokenizer.decode(outputs[0][len(inputs[0]):])
        buffer.append(embed_message(assistant_output, "assistant"))
        full_dialog = self.tokenizer.decode(outputs[0])
        return assistant_output, full_dialog

Now I load the model that I want to use as the assistant and instantiate the pipeline:

# model
model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

# pipeline
chatpipe = ChatBufferPipeline(model=model, tokenizer=tokenizer)

The interaction with it looks like this (note that I am currently returning the new assistant message and the full dialog as a tuple to debug the chat history):

_, dialog = chatpipe("My favorite color is yellow, what is yours?")

pprint(dialog)
--------------------------
'<s> <|system|>\n'
 'You are a friendly chatbot who answers user questions. You can use the '
 'previous examples if this helps you.</s> \n'
 '<|user|>\n'
 'My favorite color is yellow, what is yours?</s> \n'
 '<|assistant|>\n'
 "I don't have a favorite color as I'm not capable of having preferences or "
 'feelings. However, my design and interface are primarily blue and green, '
 'which are calming and soothing colors that help users feel more relaxed and '
 'comfortable while interacting with me.</s></s> \n'

If I then ask another question, my message is in the buffer and the model uses it to answer.

_, dialog = chatpipe("What did I tell you my favorite color was?", lookback=10)

pprint(dialog)
--------------------------
'<s> <|system|>\n'
 'You are a friendly chatbot who answers user questions. You can use the '
 'previous examples if this helps you.</s> \n'
 '<|user|>\n'
 'My favorite color is yellow, what is yours?</s> \n'
 '<|assistant|>\n'
 "I don't have a favorite color as I'm not capable of having preferences or "
 'feelings. However, my design and interface are primarily blue and green, '
 'which are calming and soothing colors that help users feel more relaxed and '
 'comfortable while interacting with me.</s></s> \n'
 '<|user|>\n'
 'What did I tell you my favorite color was?</s> \n'
 '<|assistant|>\n'
 'You told me that your favorite color is yellow. Is there anything else I can '
 'help you with today?</s>'

That would be my idea of how to achieve this behavior, but there may be even simpler ways.
The important thing is that I created the buffer outside the pipeline so that I can empty it whenever I want. I also used the chat template that comes with the model. You can create your own chat templates and add them to the model's tokenizer.
The documentation is very helpful for doing that.
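For example, a custom chat template is just a Jinja string assigned to the tokenizer's chat_template attribute; the template below is only an illustrative sketch, not the Zephyr template used above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# illustrative custom template: wraps each message in <|role|> markers
tokenizer.chat_template = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>\n{{ message['content'] }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
)

print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi!"}], tokenize=False, add_generation_prompt=True
))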

I hope that this idea helps you.


@CKeibel Awesome, man! You are the best… Quick question: I'm using a ctransformers model class (it's a GGUF model), so the apply_chat_template method of the tokenizer doesn't work! Any idea on how to get around this? Best!

Hi, you’re welcome!
I have to be honest, I've never worked with CTransformers before. Does it work with a "normal" Hugging Face pipeline?
A simple approach would be to implement a helper function that converts our list of message dictionaries to a string. But I will have a closer look later.

def embed_message(message, role) -> dict:
    return {
        "role": role,
        "content": message
    }

def message_to_string(messages: list[dict]) -> str:
    prompt = ""
    for message in messages:
        prompt += f"<|{message['role']}|>\n"
        prompt += f"{message['content']}\n"
    prompt += "<|Assistant|>\n"
    return prompt

And use it like this:

buffer = []

message = embed_message("How are you doing?", "User")

buffer.append(message)

pprint(message_to_string(buffer))
----------------------
'<|User|>\n
How are you doing?\n
<|Assistant|>\n'
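With a prompt string like that, you should then be able to call the GGUF model directly. Here is an untested sketch using ctransformers; the repo and file names are only placeholders for whatever model you actually use:

from ctransformers import AutoModelForCausalLM

# placeholder repo / file names, adjust to your own GGUF model
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral",
)

buffer.append(embed_message("How are you doing?", "User"))
prompt = message_to_string(buffer)
reply = llm(prompt, max_new_tokens=256)
buffer.append(embed_message(reply, "Assistant"))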