GPU on Google Colab not being used with langchain

I am trying to perform Retrieval Augmented Generation with langchain and models from Hugging Face. It works, but it is very slow, as the GPU does not seem to be used.

I checked the following:

  • The GPU on Google Colab is on (I chose a T4).
  • Some links suggest specifying `device` or `n_gpu_layers` as arguments, but that doesn't work (see the code below). For the `device` argument specifically, a numerical id was mentioned, so I tried 0 (since `cuda:0` is retrieved). The CUDA build of ctransformers is installed through `!pip install ctransformers[cuda]`.
  • I also tried `model.to(device)`, but that does not seem to be possible with my configuration.
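
Among the variations, I also tried loading the GGUF model with ctransformers directly, bypassing the LangChain wrapper, to rule out the wrapper swallowing the GPU arguments. The config keys are taken from the ctransformers README, and the layer count here is just a value I picked, so treat this as a sketch rather than a confirmed setup:

```python
def make_gpu_config():
    # Config keys from the ctransformers README; gpu_layers > 0 is what
    # should trigger CUDA offloading when ctransformers[cuda] is installed.
    return {
        "context_length": 4096,
        "max_new_tokens": 1024,
        "gpu_layers": 50,  # number of layers to offload to the GPU (assumed value)
    }

def load_llm():
    # Imported here so the config helper above can be used on its own.
    from ctransformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-Chat-GGUF",
        model_type="llama",
        **make_gpu_config(),
    )
```

With this, `load_llm()("Is a license compulsory for triathlon?")` generates directly, but it was just as slow for me, which is why I suspect the GPU is never engaged.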

Here’s the code, but I also tried variations as mentioned above:

# Test Llama2
from langchain_community.llms import CTransformers
from langchain.chains import RetrievalQA

llm = CTransformers(
    model="TheBloke/Llama-2-7B-Chat-GGUF",
    model_type="llama",
    device="cuda:0",   # also tried device=0
    n_gpu_layers=110,
    n_batch=512,
    config={"context_length": 4096, "max_new_tokens": 1024},
)

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=retriever
)

import time
start_time = time.time()
output = rag_pipeline.invoke("Is a license compulsory for triathlon?")
print("--- %s seconds ---" % (time.time() - start_time))

print(output)