I’m looking for ways to infer w/ a Transformer model in a continuous manner — basically, I want it to retain some information about the previous sample in case it was part of the same text segment.
One approach I’m trying out now is inference with overlapping windows (stride < window length), aggregating the encoder embeddings over the overlapping part of the sequence (i.e. using information from window N when inferring window N+1). I aggregate by summing rather than mean/dot product, since that gives the result closest to plain inference, but the output still doesn’t account for earlier context, so the approach doesn’t work.
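To make the setup concrete, here's a minimal sketch of what I mean by summed overlapping windows. A toy embedding table stands in for the real encoder (the table, dimensions, and the `sliding_infer` helper are all made up for illustration; a real encoder would produce context-dependent embeddings, so the overlap regions of consecutive windows would not be identical):

```python
import numpy as np

D, VOCAB = 4, 100
# Toy stand-in for a Transformer encoder: a fixed per-token embedding
# table (hypothetical; a real encoder's outputs depend on the window).
emb_table = np.random.default_rng(0).normal(size=(VOCAB, D))

def encode(window):
    # Look up "embeddings" for each token id in the window.
    return emb_table[window]

def sliding_infer(tokens, length=8, stride=4):
    """Slide a window of `length` tokens forward by `stride` tokens
    and SUM the embeddings where consecutive windows overlap."""
    tokens = np.asarray(tokens)
    out = np.zeros((len(tokens), D))
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + length]
        out[start:start + len(window)] += encode(window)
        if start + length >= len(tokens):
            break
    return out

tokens = list(range(12))
h = sliding_infer(tokens)  # windows cover [0:8] and [4:12]
# positions 4..7 lie in both windows, so their vectors are summed twice
```

With a context-free encoder like this, summing just doubles the overlap region; the hope with a real encoder is that window N+1's representation of the overlap carries information from window N forward, which is exactly what doesn't seem to happen in my experiments.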
Has this problem been addressed already? Is the typical solution just to increase the maximum input length? (What if I don’t have enough compute to train a model with long inputs?)