How does Gemini 1.5 achieve a 10M-token context window?

In the paper [2403.05530] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, the authors claim their model passes the Needle In A Haystack test near-perfectly at up to 10M tokens of context. Naively, that would mean computing the full 10M x 10M attention matrix, which is on the order of ~100-200 terabytes (depending on precision). I would appreciate it if someone could cite papers introducing the mechanisms/tricks that make this possible. Thanks in advance!
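For reference, here is the back-of-envelope arithmetic behind that estimate, as a minimal sketch (it assumes a single attention head and one score per token pair, i.e. a dense n x n score matrix):

```python
# Back-of-envelope: memory needed to materialize a dense attention
# score matrix for a context of n tokens, at various precisions.
def attention_matrix_bytes(context_len: int, bytes_per_entry: int) -> int:
    """Size in bytes of a dense (context_len x context_len) score matrix."""
    return context_len * context_len * bytes_per_entry

n = 10_000_000  # 10M-token context
for label, b in [("1-byte entries (e.g. fp8)", 1),
                 ("2-byte entries (fp16/bf16)", 2),
                 ("4-byte entries (fp32)", 4)]:
    tb = attention_matrix_bytes(n, b) / 1e12
    print(f"{label}: {tb:,.0f} TB")
# -> 100 TB, 200 TB, 400 TB respectively
```

This is exactly why no long-context implementation materializes the full matrix; blockwise/streaming attention kernels compute the same result while only keeping small tiles in memory at a time.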