I’m currently using a GGML-format model (13b-chimera.ggmlv3.q4_1.bin) in an app built with LangChain. I’ve found that the program only uses the CPU, even though it’s running on a VM with a GPU.
I’ve tried calling
torch.cuda.set_device(torch.device("cuda:0")), which raises
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'. However,
torch.cuda.is_available() returns True, indicating that CUDA can see the GPU.
Can GGML models use the GPU in the first place? Do I need to switch to another format like GPTQ? If so, how would I load it in the program, given that GGML models are a single .bin file while GPTQ models seem to be a collection of files?
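For context, here is the kind of configuration I understand is used to offload GGML layers to the GPU via llama-cpp-python in LangChain. This is a sketch, not code I have verified on my setup: it assumes llama-cpp-python was built with cuBLAS support, and the model_path and n_gpu_layers values are illustrative. (As I understand it, GGML inference goes through llama.cpp rather than PyTorch, which may be why the torch.cuda calls have no effect on it.)

```python
from langchain.llms import LlamaCpp

# Sketch: load a GGML model with GPU offload.
# Assumes llama-cpp-python was installed with cuBLAS enabled, e.g.:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
llm = LlamaCpp(
    model_path="13b-chimera.ggmlv3.q4_1.bin",  # path to the single .bin file
    n_gpu_layers=40,  # illustrative: number of layers to offload to the GPU
    n_ctx=2048,       # context window size
)
```

If this is roughly the right approach, does n_gpu_layers alone move inference to the GPU, or is the cuBLAS build flag also required?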