Which base model should I use for screenshot-to-HTML generation?

I want to fine-tune a model that takes an image, a text description, or both, and outputs a very long text.

The idea is to take a screenshot of a web page, optionally with a description of the page, and output its HTML code.

I’m worried that the context window won’t be large enough for the generated HTML.
Which model could I use as a base? Maybe BLIP-2?
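
To make the context-window concern concrete, here is a rough sketch for checking whether a target HTML document would fit. It assumes about 4 characters per token, which is only a rule of thumb and not the behavior of any particular tokenizer; markup-heavy HTML may tokenize less efficiently. The function names, the sample page, and the window sizes are all hypothetical and chosen for illustration.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed heuristic,
    not a measurement from a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    html: str,
    context_window: int,
    prompt_tokens: int = 0,
    chars_per_token: float = 4.0,
) -> bool:
    """Check whether the prompt (image embeddings plus description,
    counted in prompt_tokens) and the HTML output fit in the window."""
    needed = prompt_tokens + estimate_tokens(html, chars_per_token)
    return needed <= context_window


# Hypothetical page: 500 repeated cards inside a minimal document.
sample_html = (
    "<html><body>"
    + "<div class='card'><p>Hello</p></div>" * 500
    + "</body></html>"
)

for window in (2048, 8192):  # illustrative window sizes, not tied to a model
    print(window, fits_in_context(sample_html, context_window=window))
```

For a real decision, the estimate should be replaced by the candidate model's own tokenizer run over a sample of the actual training targets, since that gives the true length distribution.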