Mac compatible model for identifying and naming objects in pictures?

Is there any macOS ARM-compatible model best suited for detailed analysis of pictures (either real-world photos or cartoons), in order to identify as many objects as possible and name them?
The goal is to help automate vocabulary learning for foreign-language learners.


Considering the constraints of Mac and ARM, using smaller models natively supported by Transformers is the simplest approach.

Florence-2 isn’t particularly strong on languages other than English when used alone, but using it as the first stage in a pipeline can significantly reduce the processing required by other models, making it a viable option. Florence-2 is a popular model, so troubleshooting is also relatively straightforward.
Combining Florence-2 with other multilingual models should allow for the creation of a relatively lightweight pipeline.

Thanks for the feedback, appreciated. As I’m more of a newbie, could you point me to a specific installer or way to install? Currently I use LM Studio.


I see. :grinning_face:

When using models from the GUI, it’s difficult to perform overly complex tasks (beyond the capabilities of the model itself), and the types of models you can use are somewhat limited.
However, the fundamental concepts and the models you should use remain unchanged.

You can find LM Studio-compatible models here.


I’ll focus on what you already use: LM Studio.
Goal: “Install a vision model that can look at images, list objects, and help with vocabulary.”

I’ll walk through:

  1. What you need conceptually (very briefly).
  2. Exactly which models to look for in LM Studio.
  3. Step-by-step: download → load → use with images in the LM Studio app.
  4. Optional: how to call LM Studio from other tools later.

1. Mental model: LM Studio + VLMs

  • LM Studio is a local “host” for models. It:

    • Downloads models from Hugging Face (and similar).
    • Runs them locally on your Mac.
    • Provides a chat UI and a local API. (LM Studio)
  • Some models are text-only LLMs.

  • Some models are VLMs (Vision–Language Models): they accept text + image input. LM Studio calls these “vision-capable models”. (LM Studio)

For your vocabulary idea you specifically want a VLM: “look at this picture, list all objects, and translate them”.


2. Concrete model choices that work in LM Studio

LM Studio’s docs and community examples repeatedly use Qwen2-VL-2B-Instruct as the starter vision model:

  • LM Studio docs: they show qwen2-vl-2b-instruct as a standard example of a VLM, including lms get qwen2-vl-2b-instruct in the image-input docs. (LM Studio)
  • LM Studio community GGUF: lmstudio-community/Qwen2-VL-2B-Instruct-GGUF is an official community quantization for LM Studio (GGUF format for llama.cpp/MLX). (Hugging Face)
  • Qwen2-VL background: 2B, 7B, 72B vision–language models; 2B is small and designed to run locally, including on Mac; supports Japanese and other languages. (AIAI)

So for a beginner on LM Studio:

Start with Qwen2-VL-2B-Instruct in LM Studio.

Later, if your Mac has more RAM, you can experiment with:

  • Qwen2.5-VL-3B/7B (bigger, newer).
  • Gemma-3 vision models from the LM Studio Model Catalog (Google’s image + text models). (LM Studio)

But first, get one working model: Qwen2-VL-2B-Instruct.


3. Make sure LM Studio is ready for images

  1. Update LM Studio to a recent version (0.3.x or newer):

    • LM Studio’s “Getting up and running” docs say: first install the latest app, then download a model via the Discover tab. (LM Studio)
    • Vision/image features were improved in v0.3.3x (new Image resize bounds in Settings → Chat → Image Inputs). (LM Studio)
  2. In LM Studio:

    • Open Settings → Chat → Image Inputs.
    • Confirm there are “Image resize bounds” controls. These define how big images can be before being resized for the vision model. (LM Studio)
    • For now, keep defaults (they’re chosen to balance quality and speed).

If you see those options, LM Studio is ready to send images to vision models.


4. Download a vision model in LM Studio (GUI, no terminal)

LM Studio has a built-in downloader:

  • Docs: “Head over to the Discover tab to download models. Pick one of the curated options or search for models by query.” (LM Studio)

Step 4.1 – Open the Discover tab

  1. Start LM Studio.
  2. Click the Discover tab in the top navigation (or press the shortcut; on Mac it’s usually something like ⌘+2, depending on your version). (LM Studio)

You’ll see a search field and lists of models.

Step 4.2 – Find Qwen2-VL-2B-Instruct

In the Discover search field, type:

qwen2-vl-2b-instruct

You’re looking for a GGUF variant like:

  • “Qwen2-VL-2B-Instruct-Q4_K_M.gguf” or similar, from lmstudio-community/Qwen2-VL-2B-Instruct-GGUF. (Hugging Face)

LM Studio will usually show:

  • model name,
  • size,
  • and quantization.

As a beginner:

  • Choose a Q4 or Q5 quantization (balanced size + quality).
  • Avoid Q6/Q8 until you know your Mac’s memory limits.
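
As a rough sanity check before downloading, you can estimate how much memory a quantized model needs: the weights take roughly parameter count × bits-per-weight ÷ 8 bytes, plus extra for the KV cache and runtime. A hedged back-of-envelope sketch (the bits-per-weight numbers below are approximations, not exact GGUF figures):

```python
# Rough memory estimate for quantized GGUF weights.
# Bits-per-weight values are approximations; real usage also adds
# KV cache and runtime overhead on top of this.

BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}

def estimate_gb(n_params: float, quant: str) -> float:
    """Approximate in-RAM size of the weights in GiB."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1024**3

# A 2B-parameter model at Q4_K_M:
print(f"~{estimate_gb(2e9, 'Q4_K_M'):.1f} GiB")  # → ~1.1 GiB
```

Anything comfortably below your Mac’s free RAM should load; the 2B model at Q4 is only around 1 GiB of weights, which is part of why it’s a good starter.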

Step 4.3 – Download the model

  1. Click the model entry.
  2. Click Download.

LM Studio downloads the GGUF file to your local models folder. The docs also note that you need a working internet connection to download models via Discover. (LM Studio)

Once it finishes:

  • The model will appear under My Models in LM Studio. (Pinggy)

5. Load the model into memory (so you can chat)

Downloading just saves the file. To actually use it, you must load it:

  • LM Studio docs: after downloading, you go to the Chat tab, press the model loader (cmd+L on macOS) and select a model. (leanpub.com)

Step 5.1 – Open Chat and the model loader

  1. Click the Chat tab.
  2. Use the shortcut ⌘+L (or the UI button) to open the model loader popup. (leanpub.com)

You should see your downloaded models listed.

Step 5.2 – Select Qwen2-VL-2B-Instruct

  1. In the model dropdown, pick your Qwen2-VL-2B-Instruct GGUF model.
  2. Click Load.

LM Studio now:

  • Allocates memory,
  • Loads the weights,
  • Prepares the model for chat.

You’re now ready to talk to a vision–language model.


6. Use the model with images in the LM Studio UI (no code)

Now you want to:

  • attach a picture (photo or cartoon),
  • ask the model to list objects and give vocabulary.

Step 6.1 – Attach an image to your message

LM Studio doesn’t spell out every UI detail in docs, but the Image Input docs explain that VLMs can accept images, and you send them alongside text. (LM Studio)

In the Chat tab:

  1. Make sure Qwen2-VL-2B-Instruct is the active model (shown at the top).

  2. In the message box area, look for:

    • an image icon, or
    • a “+” / attachment icon.
  3. Click it and select a PNG/JPEG image from your Mac.

LM Studio attaches the image to your next message (internally it will send it to the model as base64, as the API docs describe). (LM Studio)

Step 6.2 – A good beginner prompt for vocabulary

After attaching the image, write something like:

You are a language tutor.

1. Look carefully at the image.
2. List every distinct object you can clearly see.
3. For each object, give:
   - english: the English word (singular, lowercase)
   - target: the translation into Japanese
   - example_en: a very simple English sentence
   - example_target: the same sentence in Japanese

Return ONLY valid JSON in this format:

{
  "objects": [
    {
      "english": "cup",
      "target": "コップ",
      "example_en": "The cup is on the table.",
      "example_target": "ă‚łăƒƒăƒ—ăŻăƒ†ăƒŒăƒ–ăƒ«ăźäžŠă«ă‚ă‚ŠăŸă™ă€‚"
    }
  ]
}

Then send the message.

The model will see both:

  • your image, and
  • your instruction,

and should respond with a JSON structure containing object names and translations.

If it adds extra text (e.g., “Here is the JSON:”), tighten the prompt:

Return ONLY valid JSON. Do not write any explanations or extra text.

This “schema + strict instructions” pattern is exactly how LM Studio and other guides show structured JSON generation with vision models. (LM Studio)
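
If you later want to turn these replies into flashcards automatically, a few lines of Python can pull the JSON out of a reply even when the model wraps it in extra prose. This is only an illustrative sketch; the `extract_vocab` helper is not part of LM Studio:

```python
import json
import re

def extract_vocab(reply: str) -> list:
    """Pull the first {...} JSON object out of a model reply and
    return its "objects" list. Returns [] if no valid JSON is found."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return data.get("objects", [])

reply = 'Here is the JSON: {"objects": [{"english": "cup", "target": "コップ"}]}'
print(extract_vocab(reply))  # → [{'english': 'cup', 'target': 'コップ'}]
```

The regex-then-parse step is what makes the script tolerant of models that ignore the “JSON only” instruction and add a leading sentence.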


7. What if nothing happens or it errors?

A few common issues:

  1. Image not used / ignored

    • Make sure the model really is a VLM (vision-capable). qwen2-vl-2b-instruct is one. (LM Studio)
    • Make sure you attach the image to the message (not just paste a file path).
  2. Model fails to load or crashes

    • Try a smaller quantization (Q3/Q4) if you’re low on RAM.
    • Ensure you’re on a somewhat recent LM Studio (0.3.x) where vision support is stable; older versions had Discover bugs on some Macs. (GitHub)
  3. Discover tab empty / broken

    • A GitHub issue describes cases where Discover stopped working in some 0.3.x builds; users sometimes revert to a known good version (0.3.4/0.3.5) or wait for a fix. (GitHub)
    • As a fallback, you can side-load models: download GGUF from Hugging Face and put it into LM Studio’s models folder (see LM Studio “Offline Operation / sideloading” docs). (LM Studio)

8. Optional: using LM Studio models from other apps

Later, if you want to automate this (e.g., a script that takes many pictures, calls LM Studio, and builds vocab lists):

  1. In LM Studio, open the Developer tab and start the local server:

    • Docs (Elastic example): switch to Developer, click Start server, note the host/port (default http://localhost:1234), and choose your model in the dropdown. (Elastic)
  2. Use LM Studio’s OpenAI-compatible REST API from Python/TypeScript:

    • The Image Input docs show TypeScript and Python examples: connect to LM Studio, prepare an image (client.files.prepareImage()), then call .respond() with images: [image]. (LM Studio)

That’s an advanced step; for now, staying in the GUI is enough.


9. Very short “do this first” checklist

For your situation right now:

  1. Update LM Studio to a recent 0.3.x release. (LM Studio)

  2. Open Discover → search qwen2-vl-2b-instruct → download a Q4 GGUF variant. (LM Studio)

  3. Go to Chat → open the model loader (⌘+L) → select the Qwen2-VL-2B model → Load. (leanpub.com)

  4. Attach an image in the chat UI, and use the JSON-style prompt above to get:

    • object names,
    • translations,
    • simple sentences.

Wow, huge thanks for the detailed instructions.
OK, so I got the qwen2-vl-2b-instruct-q4_k_m.gguf model as it has the most downloads (not too many, though).
After some trial and error, I believe I managed to make it work. The first error I made was to choose “attach a file” rather than attach an image (yea!). But then it worked with plain natural-language commands. No need for JSON. So thanks!