I want to fine-tune a model that takes an image, a text description, or both as input and outputs a very long text.
The idea is to take a screenshot of a web page, with an optional description of the page, and output the corresponding HTML code.
I’m worried that the model’s context window won’t be large enough to hold the full HTML output.
Which model could I use as a base? Maybe BLIP-2?
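
To make the context-window concern concrete, here is a rough sketch of how the size of an HTML target could be estimated before picking a model. It assumes about 4 characters per token, which is only a rule of thumb (real tokenizers can differ a lot, especially on markup). The function names and the 2048-token window are hypothetical values chosen for illustration, not the limit of any particular model.

```python
import math

# Assumption: ~4 characters per token. This is a rough rule of thumb only;
# an actual tokenizer may produce noticeably more tokens on HTML markup.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Estimate the token count of a string from its character length."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(html: str, context_window: int, prompt_tokens: int = 0) -> bool:
    """Check whether the estimated HTML tokens plus prompt tokens fit in the window."""
    return estimate_tokens(html) + prompt_tokens <= context_window


# Synthetic page: 500 repeated cards wrapped in a minimal document.
sample_html = (
    "<html><body>"
    + "<div class='card'><p>Hello</p></div>" * 500
    + "</body></html>"
)

# 2048 is a hypothetical window size used only for this example.
print("estimated tokens:", estimate_tokens(sample_html))
print("fits in 2048 tokens:", fits_in_context(sample_html, context_window=2048))
```

Running the same check over a sample of real pages with the tokenizer of a candidate model would give a more reliable picture of how often the output would be truncated.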