Which major-provider VLM/MMLM systems perform best at precisely marking an object in an image

Query: Research which major-provider VLM/MMLM systems perform best at precisely marking an object in an image, with emphasis on point-level localization for a point-of-mark interaction. Compare OpenAI GPT vision models, Google Gemini vision models, and Anthropic Claude vision models as the primary targets, using open VLMs only as benchmark context where helpful. Focus on evidence from 2025-2026 papers, benchmarks, eval repos, and strong practitioner comparisons that measure or illuminate precise localization of small, detailed, intricate objects in images. Include related evidence from box grounding, phrase grounding, segmentation, and pointing benchmarks when direct point-mark benchmarks are scarce, but keep the main question centered on single-point localization accuracy. Bias toward tasks most relevant to small mechanical parts and cluttered scenes. Output should be a research memo with: (1) ranked conclusions on likely best providers/models for point-of-mark use, (2) the most relevant papers and benchmarks, (3) notes on limitations of existing evidence, and (4) a concrete recommended evaluation setup we can run ourselves for Assemble Anything.

Model: o4-mini-deep-research
Date: 2026-04-10
Searches performed: 32
Sources cited: 14

Likely Best Models for Point-of-Mark

Note: Several open-source models (e.g. Qwen2.5-VL, Molmo, RoboRefer) have shown even better grounding performance when trained specifically for pointing or grounding (www.themoonlight.io) (openreview.net). They may outperform the proprietary models, but among proprietary provider models the ranking above is the one best supported by current evidence.

Key Papers and Benchmarks

Other relevant references include referring-expression and phrase-localization benchmarks (e.g. RefCOCO) and segmentation evaluations (e.g. SAM tests), but few of them directly test single-point accuracy. The papers above most directly address our goal of point-level localization in cluttered scenes.

Limitations of Current Evidence

Proposed Evaluation for “Assemble Anything”

To measure point-of-mark performance in our context, we recommend building a targeted test suite and evaluation pipeline:

By focusing on actual assembly images and requiring explicit coordinate output, this evaluation will directly measure the models’ precision and robustness on our point-of-mark tasks. Use small, iterative batches to refine prompts and metrics before scaling up. The comparison can then show which vision-language models achieve acceptable accuracy in pinpointing small parts under clutter.
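As a minimal sketch of how explicit coordinate output could be scored, the Python below parses a point from a model reply and compares it with a ground-truth box around the target part. Everything named here is an assumption made for illustration, not something taken from the cited sources: the `Target` schema, the `parse_point` and `score` helpers, the expectation that replies contain a pixel pair such as "(x, y)", and the choice of two metrics (hit if the point falls inside the box, plus distance to the box center normalized by the image diagonal).

```python
import math
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Target:
    """Ground truth for one image (hypothetical schema): a pixel box around the part."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    width: int   # image width in pixels
    height: int  # image height in pixels


def parse_point(reply: str) -> Optional[tuple[float, float]]:
    """Return the first 'x, y' number pair found in a model reply, or None."""
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)", reply)
    return (float(m.group(1)), float(m.group(2))) if m else None


def score(reply: str, t: Target) -> dict:
    """Score one reply: parse success, box hit, and diagonal-normalized center error."""
    p = parse_point(reply)
    if p is None:
        # Unparseable replies count as misses rather than being dropped.
        return {"parsed": False, "hit": False, "norm_error": None}
    x, y = p
    hit = t.x_min <= x <= t.x_max and t.y_min <= y <= t.y_max
    cx, cy = (t.x_min + t.x_max) / 2, (t.y_min + t.y_max) / 2
    err = math.hypot(x - cx, y - cy) / math.hypot(t.width, t.height)
    return {"parsed": True, "hit": hit, "norm_error": err}
```

Averaging `hit` over a batch gives a per-model hit rate, and reporting `parsed` separately keeps formatting failures distinct from localization failures. If segmentation masks are available for the parts, a point-in-mask test would be a stricter substitute for the box test assumed here.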
