I have a seq-to-seq task, but my input sequence is very long (10k tokens). The output sequence is of normal length, though (fewer than 512 tokens).
I've noticed that a standard transformer does not work very well in my case.
In case it helps, I was doing it in an autoregressive style.
Does anyone here have suggestions on how to handle this?
Thanks