Looking for some honest feedback from people who prototype vision tasks often
In my experience, even the simplest vision tasks take quite some time: collecting and balancing a dataset, labeling it, training a CNN, and evaluating. VLMs give you much more out of the box, but running them at scale is too slow and expensive to be sustainable, and they still aren't trained for your specific use case.
So my idea is to let you upload a reference set of images (e.g. one per class) and "annotate" each with text specific to your use case, to help the VLM reason about the contents of new images.
This way, simple applications (e.g. this bottle classification case) can be solved in minutes, without any training or fine-tuning, while still getting use-case-specific output in your own format (2L bottle, 50cl bottle, …).
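To make that concrete, here's a rough sketch of what the prompting could look like, assuming an OpenAI-compatible chat API. The class names, file paths, and prompt wording are all made up for illustration; this isn't what kasqade.ai actually runs under the hood:

```python
# Hypothetical sketch: few-shot VLM classification from a reference set.
import base64
from openai import OpenAI

client = OpenAI()

def encode(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Reference set: one annotated example image per class (paths/notes invented).
references = [
    ("2L bottle",   "refs/bottle_2l.jpg",   "tall bottle, roughly 30 cm, wide base"),
    ("50cl bottle", "refs/bottle_50cl.jpg", "small bottle, fits in one hand"),
]

def classify(new_image_path: str) -> str:
    content = [{"type": "text",
                "text": "Classify the last image into exactly one of the labels "
                        "shown in the reference images. Reply with the label only."}]
    # Interleave each text annotation with its reference image.
    for label, path, note in references:
        content.append({"type": "text", "text": f"Reference '{label}': {note}"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode(path)}"}})
    content.append({"type": "text", "text": "New image:"})
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encode(new_image_path)}"}})
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content.strip()
```

The point is that the reference images plus their text annotations do the work a labeled training set would normally do.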
While this runs, the VLM can quietly set aside every image-label pair until there are enough examples of each class to train a CNN and go from prototype to robust application.
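The hand-off step could be as simple as a counter over the VLM's outputs. Again a hypothetical sketch: the per-class threshold, the image stream, and the `train_fn` hook are placeholders you'd supply, not part of any existing API:

```python
# Hypothetical sketch: bank VLM predictions until a CNN can take over.
from collections import defaultdict

class LabelBank:
    """Accumulates (image, label) pairs produced by the VLM until every
    class has enough examples to train a conventional CNN."""

    def __init__(self, classes, min_per_class=200):  # threshold is arbitrary
        self.classes = set(classes)
        self.min_per_class = min_per_class
        self.pairs = defaultdict(list)  # label -> list of image paths

    def add(self, image_path, label):
        if label in self.classes:  # ignore labels outside the known set
            self.pairs[label].append(image_path)

    def ready(self):
        return all(len(self.pairs[c]) >= self.min_per_class
                   for c in self.classes)

def bootstrap(image_stream, classify_fn, train_fn, classes):
    """Run the VLM prototype while silently building a CNN training set.
    image_stream yields image paths, classify_fn is e.g. classify() from
    the sketch above, train_fn is whatever CNN training you'd normally do."""
    bank = LabelBank(classes)
    for path in image_stream:
        bank.add(path, classify_fn(path))
        if bank.ready():
            train_fn(bank.pairs)  # prototype -> robust application
            return bank
    return bank
```

One wrinkle worth flagging: since the labels come from the VLM itself, some noise filtering (confidence thresholds, human spot checks) before training would probably be needed.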
The details are on kasqade.ai; I'm happy to hear any thoughts, comments, or suggestions on this approach.