I have a large language model that I’m using for text generation, and I have it deployed right now with GPU pinning enabled.
I’m extremely happy with the results so far.
But I frequently get CUDA out-of-memory errors if I supply too many tokens in my prompt, or if I request too many tokens in the completion. I don’t have exact numbers yet, but requests seem to fail whenever the total token count (prompt + completion) exceeds roughly 500.
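As a stopgap, I’ve been thinking about capping the requested completion length so the total stays under that observed budget. A minimal sketch of the idea (the 500-token budget is my empirical observation, not a documented quota, and `max_completion_tokens` is a hypothetical helper name):

```python
# Empirical limit from my tests, not a documented quota.
TOKEN_BUDGET = 500

def max_completion_tokens(prompt_token_count: int, budget: int = TOKEN_BUDGET) -> int:
    """Return the largest completion length that keeps prompt + completion under the budget."""
    return max(0, budget - prompt_token_count)

# A 350-token prompt would leave room for at most 150 completion tokens.
print(max_completion_tokens(350))
```

This works as a workaround, but it obviously defeats the purpose of testing longer contexts.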
The model is based on GPT-J, which can theoretically handle 2048 context tokens given sufficient memory, and I’d like to run some tests with the model closer to the limits of its capabilities.
So I’d like to ask: is it possible to upgrade my account to get a larger allotment of GPU memory?