I am trying to work with a long-context LLM. I have read that there are several models with long context windows; Llama 3.1, for example, supports 128K. So I have tried creating endpoints with meta-llama/Meta-Llama-3.1-8B and similar models.
I am having a terrible time creating these endpoints. They either fail to start, or the MAX_INPUT_TOKENS setting doesn't seem to take effect, and I get ValidationErrors when I send requests.
Can someone point me to a doc, or give me a step-by-step procedure for setting up an LLM (text generation) inference endpoint with a long context window, and then making requests to it? I would like to get to 100K tokens, but I will be happy with 32K or even 16K to start.
hi @dburfoot
I’m not sure whether it helps, but have you checked Longformer or LED?
These are mentioned in this course:
Hi, I figured out how to get this to work, so I’m typing up my notes to benefit others.
- On the Create Endpoint screen, open the advanced configuration pane. Select Text Generation, and enter the desired values for max input tokens, max total tokens, and max batch prefill tokens (the same limits can also be set programmatically; see the first sketch after this list). There are some restrictions on how these values must relate to each other, but the UI should tell you if you make a mistake.
- When endpoint creation fails, the most likely reason is that you didn’t pick an instance with enough VRAM. Note that the instance-selection UI shows the total number of GPUs and the total VRAM across all GPUs, so you will need to divide to get VRAM per GPU. To get to 100K input tokens, I had to pick the A100 instance with 80 GB of VRAM on a single GPU.
- The logs contain good information about reasons for endpoint creation failure. You will have to sort through a lot of cruft, but you should be able to find the failure reason.
- If the endpoint fails to start, or if you are having other problems with it, I recommend just deleting the endpoint and creating a new one (containers should be immutable). I don’t trust the system’s ability to update config for an existing container.
- You shouldn’t need to modify your caller code to handle larger inputs; just assemble bigger input data and send it to the endpoint. But it is a good idea to check the token counts reported in the response, and to scale up incrementally from 8K to 16K, 32K, and so on, possibly spinning up new endpoints as you go (see the second sketch after this list).
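For anyone who prefers to script this instead of using the Create Endpoint screen, here is a minimal sketch of the same configuration done with huggingface_hub’s `create_inference_endpoint`, assuming a TGI (text-generation-inference) container. The endpoint name, vendor/region, instance type, container tag, and the specific token limits below are placeholders, and older TGI images spell the input limit MAX_INPUT_LENGTH rather than MAX_INPUT_TOKENS, so adjust to what your account and the image you pick actually support.

```python
from huggingface_hub import create_inference_endpoint

# Sketch: create a long-context text-generation endpoint with explicit TGI limits.
# Instance names, region, container tag, and limits are placeholders.
endpoint = create_inference_endpoint(
    "llama31-long-context",                          # placeholder endpoint name
    repository="meta-llama/Meta-Llama-3.1-8B",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-a100",                     # needs enough VRAM per GPU
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            # The three limits from the first bullet. Max input tokens must be
            # strictly less than max total tokens, and max batch prefill tokens
            # must be at least as large as max input tokens.
            "MAX_INPUT_TOKENS": "32768",
            "MAX_TOTAL_TOKENS": "33792",
            "MAX_BATCH_PREFILL_TOKENS": "32768",
        },
    },
)

endpoint.wait()        # block until the endpoint reports running
print(endpoint.url)
```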
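And here is a sketch of the incremental scale-up from the last bullet: count your input tokens locally with the model’s own tokenizer before sending, call the endpoint through huggingface_hub’s `InferenceClient`, and read the token details TGI returns. The endpoint URL, token, and prompt are placeholders, and loading the Llama tokenizer assumes you have access to the gated repo.

```python
from huggingface_hub import InferenceClient
from transformers import AutoTokenizer

ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."                                                  # placeholder

# Count input tokens locally so you know how close you are to MAX_INPUT_TOKENS
# before sending anything. Requires access to the gated Llama 3.1 repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
prompt = "..."  # your long input
n_input = len(tokenizer.encode(prompt))
print(f"input tokens: {n_input}")

client = InferenceClient(model=ENDPOINT_URL, token=HF_TOKEN)
response = client.text_generation(
    prompt,
    max_new_tokens=512,
    details=True,   # ask TGI to report token-level details
)
print(response.generated_text)
print(f"generated tokens: {response.details.generated_tokens}")
```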