How does Gemini 1.5 achieve a 10M-token context window?

In the paper [2403.05530] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, the authors claim their model passes the Needle In A Haystack test near-perfectly at up to 10M tokens of context. Naively, that would mean computing the full 10M x 10M attention matrix, which is on the order of ~100-200 terabytes (depending on precision). I would appreciate it if someone could cite papers introducing the mechanisms/tricks that make this possible. Thanks in advance!
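For reference, here is the back-of-envelope arithmetic behind that estimate, as a minimal sketch (it assumes a single attention head and one score per token pair, i.e. a dense n x n score matrix):

```python
# Back-of-envelope: memory needed to materialize a dense attention
# score matrix for a context of n tokens, at various precisions.
def attention_matrix_bytes(context_len: int, bytes_per_entry: int) -> int:
    """Size in bytes of a dense (context_len x context_len) score matrix."""
    return context_len * context_len * bytes_per_entry

n = 10_000_000  # 10M-token context
for label, b in [("1-byte entries (e.g. fp8)", 1),
                 ("2-byte entries (fp16/bf16)", 2),
                 ("4-byte entries (fp32)", 4)]:
    tb = attention_matrix_bytes(n, b) / 1e12
    print(f"{label}: {tb:,.0f} TB")
# -> 100 TB, 200 TB, 400 TB respectively
```

This is exactly why no long-context implementation materializes the full matrix; blockwise/streaming attention kernels compute the same result while only keeping small tiles in memory at a time.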