Query: Research which major-provider VLM/MMLM systems perform best at precisely marking an object in an image, with emphasis on point-level localization for a point-of-mark interaction. Compare OpenAI GPT vision models, Google Gemini vision models, and Anthropic Claude vision models as the primary targets, using open VLMs only as benchmark context where helpful. Focus on evidence from 2025-2026 papers, benchmarks, eval repos, and strong practitioner comparisons that measure or illuminate precise localization of small, detailed, intricate objects in images. Include related evidence from box grounding, phrase grounding, segmentation, and pointing benchmarks when direct point-mark benchmarks are scarce, but keep the main question centered on single-point localization accuracy. Bias toward tasks most relevant to small mechanical parts and cluttered scenes. Output should be a research memo with: (1) ranked conclusions on likely best providers/models for point-of-mark use, (2) the most relevant papers and benchmarks, (3) notes on limitations of existing evidence, and (4) a concrete recommended evaluation setup we can run ourselves for Assemble Anything.
Model: o4-mini-deep-research
Date: 2026-04-10
Searches performed: 32
Sources cited: 14


Likely Best Models for Point-of-Mark

  1. Google Gemini Vision (latest Ultra/Pro) – The available evidence suggests Gemini is the strongest of the three at visual reasoning. One tech review notes that Gemini 3 Pro “sees most clearly…excels at distinguishing overlapping objects” and is the least prone to hallucination in cluttered scenes (www.techradar.com). In benchmarks, models fine-tuned on grounding (e.g. “Gemini-2.5-Pro” in Point-It-Out) outperform generalist GPT-4-class vision models (openreview.net) (www.themoonlight.io). For small, detailed parts, Gemini-based models have demonstrated superior scene parsing and text reading in complex images (www.techradar.com). We therefore expect Google’s Gemini Vision (especially “Pro” or the latest versions) to be the top choice for precise pointing.

  2. OpenAI GPT-4 Vision (and successors) – GPT-4/5 with vision still leads on broad CV tasks but lags on fine-grained grounding. In one CV benchmark, GPT-4o topped most semantic tasks but showed quirks on geometric tasks and a tendency to hallucinate (arxiv.org) (arxiv.org). Point-It-Out finds that “strong general-purpose models such as GPT-4o…underperform compared to some open-source models in precise visual grounding” (openreview.net). Likewise, synthetic tests (e.g. “Can Vision-Language Models See Squares?”) show GPT-based models collapsing on pixel-level localization (accuracy ~60–73%) in non-text visuals (arxiv.org). In practice, GPT vision is likely to identify objects well but to miss exact part locations unless given explicit cues or fine-tuning.

  3. Anthropic Claude Vision – Claude variants tend to favor descriptive output over precise localization. The “Vision language models are blind” study found that Claude 3.5 slightly outperformed GPT-4 on simple shape tests (74.9% vs ~58%) but remained far below human-level localization (arxiv.org). In complex visual tasks, Claude tends to “wax lyrical” rather than pinpoint details (www.techradar.com) (www.techradar.com). In the grid experiment, Claude fell to ~60% accuracy and severely under-counted objects (arxiv.org). Overall, Claude Vision is unlikely to match Gemini or GPT-4 when asked to sharply localize small parts; its strength lies more in high-level image description than in exact pointing.

Note: Several open-source models (e.g. Qwen-2.5-VL, Molmo, RoboRefer) have shown even better grounding performance when specially trained (www.themoonlight.io) (openreview.net). They may outperform all three providers, but among proprietary provider models the ranking above is the one best supported by current evidence.

Key Papers and Benchmarks

Other relevant references include benchmarks of referring-expression / phrase localization (e.g. RefCOCO) and segmentation (e.g. SAM tests), but few directly test single-point accuracy. The papers above most directly address our goal of point-level localization in cluttered scenes.

Limitations of Current Evidence

Proposed Evaluation for “Assemble Anything”

To measure point-of-mark performance in our context, we recommend building a targeted test suite and evaluation pipeline:

By focusing on actual assembly images and requiring explicit coordinate output, this evaluation will directly measure the models’ precision and robustness for our point-mark tasks. Use small, iterative batches to refine prompts/metrics before scaling up. The comparison can then clearly show which vision-language models achieve acceptable accuracy in pinpointing small parts under clutter.
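The coordinate-output scoring described here could be sketched roughly as follows. This is a minimal illustration, not a finished harness: the `Target` schema, the JSON reply shape `{"x": ..., "y": ...}` in pixel coordinates, and the 2%-of-diagonal threshold are all assumptions made for the example, and each provider may need its own prompt and coordinate convention.

```python
import json
import math
import re
from dataclasses import dataclass


@dataclass
class Target:
    """Ground-truth annotation for one query (hypothetical schema)."""
    x: float      # annotated target point, in pixels
    y: float
    box: tuple    # (x_min, y_min, x_max, y_max) of the part, in pixels
    width: int    # image width, in pixels
    height: int   # image height, in pixels


def parse_point(reply: str):
    """Return (x, y) from the first flat JSON object in a model reply.

    Assumes the prompt requested {"x": ..., "y": ...} in pixel coordinates.
    Returns None when no usable point is found.
    """
    match = re.search(r"\{[^{}]*\}", reply)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
        return float(obj["x"]), float(obj["y"])
    except (ValueError, KeyError, TypeError):
        return None


def score(pred, target: Target) -> dict:
    """Score one prediction: parse success, in-box hit, normalized distance."""
    if pred is None:
        return {"parsed": False, "hit": False, "norm_dist": None}
    px, py = pred
    x0, y0, x1, y1 = target.box
    diag = math.hypot(target.width, target.height)
    return {
        "parsed": True,
        "hit": x0 <= px <= x1 and y0 <= py <= y1,
        # Distance to the annotated point as a fraction of the image diagonal.
        "norm_dist": math.hypot(px - target.x, py - target.y) / diag,
    }


def summarize(results, threshold=0.02):
    """Aggregate per-query scores; unparsed replies count as misses."""
    n = len(results)
    parsed = [r for r in results if r["parsed"]]
    return {
        "parse_rate": len(parsed) / n,
        "hit_rate": sum(r["hit"] for r in results) / n,
        "acc_at_threshold": sum(r["norm_dist"] <= threshold for r in parsed) / n,
    }
```

Counting unparseable replies as misses rather than dropping them keeps the comparison fair between models that refuse or ramble and models that always answer. Normalizing by the image diagonal makes scores comparable across image sizes, though for very small parts a threshold tied to the part's own size may be more informative.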

Summary

Based on the literature, Google’s Gemini Vision models (especially the latest “Pro/Ultra” versions) are our first candidates to test, followed by OpenAI’s GPT-4/5 Vision. Anthropic’s Claude is likely third. Our proposed evaluation – image-and-point queries with distance/IoU scoring – will provide direct data to confirm (or revise) this ranking for the assembly domain.
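For cases where a model returns a bounding box rather than a single point, the IoU half of that scoring could be computed as below. This is a small sketch under assumptions: boxes are taken to be `(x_min, y_min, x_max, y_max)` tuples in pixels, and the function name `box_iou` is chosen for this example only.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max) in pixels (assumed convention).
    Returns 0.0 when the boxes do not overlap or both have zero area.
    """
    # Corners of the overlap rectangle, if one exists.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

IoU is sensitive to box size, so for very small parts a few pixels of offset can drop the score sharply; reporting it alongside the point-distance metric rather than in place of it is probably the safer choice.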


Sources

  1. Testing ChatGPT, Gemini, and Claude in the multimodal maze
  2. Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding | OpenReview
  3. [Paper Review] Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
  4. How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
  5. Vision language models are blind
  6. Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
  7. Testing ChatGPT, Gemini, and Claude in the multimodal maze
  8. Testing ChatGPT, Gemini, and Claude in the multimodal maze
  9. GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
  10. Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
  11. SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
  12. PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
  13. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
  14. [Paper Review] Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding