Looking for some honest feedback from people who prototype vision tasks often
In my experience, even the simplest vision tasks take quite some time: collecting and balancing a dataset, labeling it, training a CNN, and evaluating. VLMs give you much more out of the box, but running them at scale is too slow and expensive to be sustainable, and they still aren't trained for your specific use case.
So my idea is to let you upload a reference set of images (e.g. one per class) and "annotate" each with text specific to your use case, to help the VLM reason about the contents of new images.
This way, simple applications (e.g. this bottle classification case) can be solved in minutes, without any training or fine-tuning, while still getting use-case-specific output in your own format (2L bottle, 50cl bottle, …).
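To make that concrete, here's a rough sketch of what the prompting could look like, assuming an OpenAI-compatible chat API. The class names, file paths, and prompt wording are all made up for illustration; this isn't what kasqade.ai actually runs under the hood:

```python
# Hypothetical sketch: few-shot VLM classification from a reference set.
import base64
from openai import OpenAI

client = OpenAI()

def encode(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Reference set: one annotated example image per class (paths/notes invented).
references = [
    ("2L bottle",   "refs/bottle_2l.jpg",   "tall bottle, roughly 30 cm, wide base"),
    ("50cl bottle", "refs/bottle_50cl.jpg", "small bottle, fits in one hand"),
]

def classify(new_image_path: str) -> str:
    content = [{"type": "text",
                "text": "Classify the last image into exactly one of the labels "
                        "shown in the reference images. Reply with the label only."}]
    # Interleave each text annotation with its reference image.
    for label, path, note in references:
        content.append({"type": "text", "text": f"Reference '{label}': {note}"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode(path)}"}})
    content.append({"type": "text", "text": "New image:"})
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encode(new_image_path)}"}})
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content.strip()
```

The point is that the reference images plus their text annotations do the work a labeled training set would normally do.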
While this runs, the VLM can quietly set aside every image-label pair until there are enough examples of each class to train a CNN and go from prototype to robust application.
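The hand-off step could be as simple as a counter over the VLM's outputs. Again a hypothetical sketch: the per-class threshold, the image stream, and the `train_fn` hook are placeholders you'd supply, not part of any existing API:

```python
# Hypothetical sketch: bank VLM predictions until a CNN can take over.
from collections import defaultdict

class LabelBank:
    """Accumulates (image, label) pairs produced by the VLM until every
    class has enough examples to train a conventional CNN."""

    def __init__(self, classes, min_per_class=200):  # threshold is arbitrary
        self.classes = set(classes)
        self.min_per_class = min_per_class
        self.pairs = defaultdict(list)  # label -> list of image paths

    def add(self, image_path, label):
        if label in self.classes:  # ignore labels outside the known set
            self.pairs[label].append(image_path)

    def ready(self):
        return all(len(self.pairs[c]) >= self.min_per_class
                   for c in self.classes)

def bootstrap(image_stream, classify_fn, train_fn, classes):
    """Run the VLM prototype while silently building a CNN training set.
    image_stream yields image paths, classify_fn is e.g. classify() from
    the sketch above, train_fn is whatever CNN training you'd normally do."""
    bank = LabelBank(classes)
    for path in image_stream:
        bank.add(path, classify_fn(path))
        if bank.ready():
            train_fn(bank.pairs)  # prototype -> robust application
            return bank
    return bank
```

One wrinkle worth flagging: since the labels come from the VLM itself, some noise filtering (confidence thresholds, human spot checks) before training would probably be needed.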
The details are on kasqade.ai; I'm happy to hear any thoughts, comments, or suggestions on this approach.