I am trying to perform Retrieval-Augmented Generation (RAG) with LangChain and models from Hugging Face. It works, but it is very slow because the GPU does not seem to be used.
I checked the following:
- GPU on Google Colab is on (I chose T4; see the check after this list)
- Some links suggest specifying device or n_gpu_layers as arguments, but neither works (see code below)
- Specifically for the device argument, a numeric id was mentioned; I tried 0 (since cuda:0 is what is reported)
- The CUDA version of ctransformers is installed via !pip install ctransformers[cuda]
- I also tried model.to(device), but that does not seem to be possible with my configuration
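For the first point, this is roughly how I verified that the runtime actually exposes the GPU (a minimal check using torch, which is preinstalled on Colab; `!nvidia-smi` in a cell shows the same thing):

```python
# Quick sanity check that the Colab runtime exposes the T4
import torch

print(torch.cuda.is_available())      # should print True on a GPU runtime
print(torch.cuda.get_device_name(0))  # should print something like "Tesla T4"
```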
Here's the code, though I also tried the variations mentioned above:
```python
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA
import time

# Test Llama 2 through the LangChain CTransformers wrapper
llm = CTransformers(
    device="cuda:0",
    n_gpu_layers=110,
    n_batch=512,
    model="TheBloke/Llama-2-7B-Chat-GGUF",
    model_type="llama",
    config={"context_length": 4096, "max_new_tokens": 1024},
)

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,  # retriever is built earlier in my notebook
)

start_time = time.time()
output = rag_pipeline.invoke("Is a license compulsory for triathlon?")
print("--- %s seconds ---" % (time.time() - start_time))
print(output)
```
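For completeness, the GPU arguments above are my attempt to map the plain ctransformers API onto the LangChain wrapper; the ctransformers README documents a gpu_layers parameter for direct use, like this sketch (I may be translating it to the wrapper incorrectly, and the model_file name here is my assumption about which quantized file to pick from the repo):

```python
# Direct ctransformers usage per its README: gpu_layers sets how many
# transformer layers are offloaded to the GPU (requires ctransformers[cuda])
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_K_M.gguf",  # assumed file name in the repo
    model_type="llama",
    gpu_layers=50,  # number of layers to offload; 0 means CPU only
)
print(llm("Is a license compulsory for triathlon?"))
```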