How to train a transformer (seq-to-seq) on very long sequences?

I have a seq-to-seq task, but my input sequences are very long (around 10k tokens). The output sequences are a normal length (less than 512 tokens).
I've noticed that a standard transformer does not work very well in my case.
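For context on why a standard transformer struggles here: full self-attention materializes a score matrix that grows quadratically with sequence length. A rough sketch of the memory cost (the head/layer counts and fp32 assumption below are illustrative, not from any particular model):

```python
# Back-of-the-envelope memory for the seq_len x seq_len attention score
# matrix in full self-attention. Assumptions (illustrative only): fp32
# scores, 12 attention heads, per-layer cost before activations/gradients.

def attention_matrix_bytes(seq_len: int, num_heads: int = 12,
                           bytes_per_float: int = 4) -> int:
    """Bytes needed for the attention score matrix of one layer."""
    return seq_len * seq_len * num_heads * bytes_per_float

# At 10k input tokens, one layer's attention scores alone are huge:
print(attention_matrix_bytes(10_000) / 1e9)  # 4.8 GB per layer

# Compare with a typical 512-token context:
print(attention_matrix_bytes(512) / 1e6)     # ~12.6 MB per layer
```

The ~400x gap between those two numbers is the usual motivation for long-sequence variants (sparse/windowed attention, chunking, etc.) rather than training a vanilla transformer directly on 10k-token inputs.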

I'm also decoding in an autoregressive style, if that helps.

I was wondering if folks here have any suggestions on how to approach this?