What model to use?

AeroDEmi · July 11, 2024, 11:20pm

I want to fine-tune a model that takes an image, a text description, or both, and outputs a very long text.

The idea is to take a screenshot of a web page with an optional front page description and output the HTML code.

I’m worried that the context window won’t be large enough to hold the full HTML output.
Which model could I use as a base? Maybe BLIP-2?
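To make the context-window concern concrete, here is a rough sketch of how one might check whether a screenshot, description, and target HTML fit a given token budget. Everything in it is an assumption for illustration: the 4-characters-per-token ratio is only a rule of thumb (HTML markup may tokenize less efficiently), and `image_tokens` and `max_tokens` are hypothetical placeholders that would need to come from the actual model and its tokenizer.

```python
import math

# Assumption: ~4 characters per token is a rough rule of thumb, not a
# measured value; a real check should use the candidate model's tokenizer.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a crude token-count estimate based on character length."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(
    html: str,
    description: str = "",
    image_tokens: int = 0,   # hypothetical: positions taken by the image features
    max_tokens: int = 2048,  # hypothetical: context limit of the candidate model
) -> bool:
    """Check whether image, description, and HTML together fit the budget."""
    total = image_tokens + estimate_tokens(description) + estimate_tokens(html)
    return total <= max_tokens
```

Running `estimate_tokens` over a sample of real target pages would show how many of them exceed a candidate model's limit, which in turn indicates whether a longer-context base model or some form of output chunking is needed.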
