One other thing regarding long contexts: you can also try a sliding-window approach, as suggested in this thread.
This is actually how the question-answering pipelines handle long contexts: they split the context into overlapping chunks and run the model on each one. I think the suggestion above could also work nicely in your case.
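
To make the idea concrete, here is a minimal sketch of a sliding window in plain Python. The helper name `sliding_windows` and its `window_size` and `overlap` parameters are made up for illustration and are not part of any library. In practice you would window over tokenizer output rather than whitespace-split words, and you would still need to decide how to combine the per-window predictions for your task.

```python
def sliding_windows(tokens, window_size, overlap):
    """Split `tokens` into windows of at most `window_size` items.

    Consecutive windows share `overlap` items, so content near a window
    boundary is always seen with some surrounding context.
    (Hypothetical helper for illustration only.)
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    if not tokens:
        return []

    step = window_size - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the sequence.
        if start + window_size >= len(tokens):
            break
        start += step
    return windows


# Usage: whitespace "tokens" stand in for real tokenizer output.
words = "a long document that does not fit into the model at once".split()
for window in sliding_windows(words, window_size=5, overlap=2):
    print(" ".join(window))
```

You would then run the model on each window separately and merge the results, for example by keeping the highest-scoring prediction across windows. A larger overlap reduces the chance of cutting relevant text in half, at the cost of more forward passes.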