How does Codex, a descendant of GPT-3, allow a context length of 4096 tokens while GPT-3 allows only 2048, given that both use the Transformer architecture?
I have gone through the OpenAI Codex paper, but couldn't find any information about this. Could anyone explain how this token limit was increased and what technique was used?
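To clarify what I mean by "context length": my understanding is that in GPT-style models the limit is tied to the size of the learned positional-embedding table. Here is a minimal sketch using HuggingFace's GPT-2 config as a stand-in (the Codex/GPT-3 weights are not public, so the numbers below are only illustrative):

```python
# Sketch: the maximum context length of a GPT-style model corresponds to the
# number of rows in its learned positional-embedding table.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_positions=4096)   # positional-embedding table size
model = GPT2LMHeadModel(config)

# One embedding row per position, so a single forward pass can cover
# at most n_positions tokens.
print(model.transformer.wpe.weight.shape)  # torch.Size([4096, 768])
```

So my question is whether Codex simply trained with a larger positional-embedding table (and the quadratic attention cost that comes with it), or whether some other technique was used to extend the context window.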