It looks as if it is not possible to use a model from CTransformers with the ChatPromptTemplate and a RAG chain. The only thing I could find on the internet is using it with the PromptTemplate from langchain.prompts.
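The llm used in the snippets below is the CTransformers model loaded through LangChain. A minimal sketch of how it could be created (the model name and config values are just example assumptions):

from langchain_community.llms import CTransformers

# example GGML model; replace with the model you actually use
llm = CTransformers(
    model="TheBloke/Llama-2-7B-Chat-GGML",
    model_type="llama",
    config={"max_new_tokens": 256, "temperature": 0.1},
)

With that llm, the PromptTemplate example looks like this: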
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

template = """Question: {question}
Answer:"""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)  # llm is the CTransformers model from above

response = llm_chain.run("What is AI?")
Reference
If you want to use a CTransformers model in a RAG setup, you could use a FAISS index or ChromaDB as vector store and an SBERT model for document/text embeddings. You would then search the vector store with the SBERT model and retrieve documents, which you pass on to your llm.
init embedding model
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, Pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Inference pipeline for the embedding model
class EmbeddingPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        return preprocess_kwargs, {}, {}

    def preprocess(self, text):
        encoded_text = self.tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device)
        return encoded_text

    def _forward(self, model_inputs):
        outputs = self.model(**model_inputs)
        return {"outputs": outputs, "attention_mask": model_inputs["attention_mask"]}

    def postprocess(self, model_outputs):
        sentence_embeddings = mean_pooling(model_outputs["outputs"], model_outputs["attention_mask"])
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)  # L2-normalize for inner-product search
        return sentence_embeddings[0].numpy()

model_id = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = EmbeddingPipeline(model=model, tokenizer=tokenizer, device=device)
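As a quick sanity check (not part of the original snippet), you can embed a short string; with all-MiniLM-L6-v2 the result should be a 384-dimensional, L2-normalized vector:

# embed a single string and inspect the output shape
vec = encoder("What is AI?")
print(vec.shape)  # (384,)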
setup chromadb index
import chromadb

chroma_client = chromadb.Client()  # in-memory client; use chromadb.PersistentClient(path=...) to persist the index
collection = chroma_client.create_collection(name="squad_v2", metadata={"hnsw:space": "ip"})

# embed a document and add it to the collection
document = "Some document text..."
embedding_vector = encoder(document).tolist()

collection.add(
    embeddings=[embedding_vector],
    documents=[document],
    ids=["1"]  # string ids, one per document
)
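To index more than one document, you can pass lists of the same length (a sketch with example data):

# index several documents at once
documents = ["First document ...", "Second document ...", "Third document ..."]

collection.add(
    embeddings=[encoder(doc).tolist() for doc in documents],
    documents=documents,
    ids=[str(i) for i in range(len(documents))],
)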
query the vector store
question = "Some question..."
embedded_question = encoder(question).tolist()

result = collection.query(
    query_embeddings=[embedded_question],  # pass the embedded question, not the raw string
    n_results=5  # getting the 5 best results
)

# result["documents"] holds one list of documents per query
contexts = "\n".join(result["documents"][0])
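The result also contains the distances of the matches, which you could use to drop weak hits before building the prompt (a sketch; the 0.5 cutoff is an arbitrary example, and with "hnsw:space": "ip" a smaller distance means a better match):

# optionally keep only documents that are close enough to the question
filtered = [
    doc
    for doc, dist in zip(result["documents"][0], result["distances"][0])
    if dist < 0.5  # arbitrary example threshold
]
contexts = "\n".join(filtered)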
use llm to generate answer
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

response = llm_chain.run(context=contexts, question="What is AI?")
If llm_chain.run does not accept multiple inputs like this in your LangChain version, you could use a simple function to create your prompts and call the llm directly.
def get_prompt(question, context):
    return f"""Answer the question based only on the following context:
{context}

Question: {question}"""

response = llm(get_prompt(question, contexts))
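Putting the pieces together, the whole retrieval-and-answer flow from the snippets above could be wrapped in one function (a sketch, not a polished implementation):

def answer_question(question, n_results=5):
    # 1. embed the question and retrieve the most similar documents
    embedded_question = encoder(question).tolist()
    result = collection.query(query_embeddings=[embedded_question], n_results=n_results)
    contexts = "\n".join(result["documents"][0])

    # 2. build the prompt and let the CTransformers llm generate the answer
    return llm(get_prompt(question, contexts))

print(answer_question("What is AI?"))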
Here are some notebooks I implemented when I learned about RAG (definitely not best practices):
hybrid search - just embedding model tests
Notebooks with different tests