Hello, I'm trying to get some hardware to work with Llama 2. My current hardware works fine, but it's a bit slow and I can't load the full models.
I had to go with quantized versions, even though inference is still a bit slow with those.
I saw that NVIDIA P40s aren't badly priced and come with a good amount of VRAM (24 GB), and I'm wondering if I could use one or two of them to run Llama 2 and improve inference speed.
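For context, here's the rough back-of-the-envelope math I've been using to judge whether the models fit in 24 GB. This only counts the weights (it ignores KV cache and runtime overhead, which add a few more GB), and it assumes roughly 2 bytes per parameter for fp16 and about 0.5 bytes per parameter for 4-bit quantization:

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed for model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / (1024 ** 3)

for size in (7, 13, 70):
    fp16 = weight_vram_gb(size, 2.0)  # full fp16 weights
    q4 = weight_vram_gb(size, 0.5)    # ~4-bit quantized weights
    print(f"Llama 2 {size}B: fp16 ~{fp16:.1f} GiB, 4-bit ~{q4:.1f} GiB")
```

By that estimate a single P40 should hold 13B in fp16 only barely (and probably not once the KV cache is counted), while 70B would need quantization even across two cards. Please correct me if this math is off.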
I've seen a lot of people saying there are some limitations, others that these cards are a pain, and others that they should work just fine, so I opened my own question to see if someone can shed some light on this topic.
Thanks for any responses.