Deep Research memo · point-of-mark VLM comparison · 2026-04-10
Query: Research which major-provider VLM/MMLM systems perform best at precisely marking an object in an image, with emphasis on point-level localization for a point-of-mark interaction. Compare OpenAI GPT vision models, Google Gemini vision models, and Anthropic Claude vision models as the primary targets, using open VLMs only as benchmark context where helpful. Focus on evidence from 2025-2026 papers, benchmarks, eval repos, and strong practitioner comparisons that measure or illuminate precise localization of small, detailed, intricate objects in images. Include related evidence from box grounding, phrase grounding, segmentation, and pointing benchmarks when direct point-mark benchmarks are scarce, but keep the main question centered on single-point localization accuracy. Bias toward tasks most relevant to small mechanical parts and cluttered scenes. Output should be a research memo with: (1) ranked conclusions on likely best providers/models for point-of-mark use, (2) the most relevant papers and benchmarks, (3) notes on limitations of existing evidence, and (4) a concrete recommended evaluation setup we can run ourselves for Assemble Anything.
Model: o4-mini-deep-research
Date: 2026-04-10
Searches performed: 32
Sources cited: 14
Google Gemini Vision (latest Ultra/Pro) – The available evidence points to Gemini as the strongest of the three providers at visual grounding. Tech reviews note Gemini 3 Pro “sees most clearly…excels at distinguishing overlapping objects” and is least prone to hallucination in clutter (www.techradar.com). In benchmarks, models trained with explicit location supervision (e.g. “Gemini-2.5-Pro” in Point-It-Out) outperform generalist models such as GPT-4o (openreview.net) (www.themoonlight.io). For small, detailed parts, Gemini-based models have demonstrated superior scene parsing and text-reading in complex images (www.techradar.com). We therefore expect Google’s Gemini Vision (especially “Pro” or latest versions) to be the top choice for precise pointing.
OpenAI GPT-4 Vision (and successors) – GPT-4/5 with vision still leads on broad CV tasks but lags on fine-grained grounding. In one CV benchmark GPT-4o topped most semantic tasks but showed quirks in geometry and hallucinations (arxiv.org) (arxiv.org). Point-It-Out finds that “strong general-purpose models such as GPT-4o…underperform compared to some open-source models in precise visual grounding” (openreview.net). Likewise, synthetic tests (e.g. “Can VLMs See Squares?”) show GPT-based models collapse on pixel-level localization (accuracy ~60–73%) in non-text visuals (arxiv.org). In practice, GPT vision should identify objects well but will often miss exact part locations unless given explicit cues or fine-tuning.
Anthropic Claude Vision – Claude variants are generally more creative than precise. The “Vision-Language Models are Blind” study found Claude 3.5 scored best on simple shape tests (74.9%, against an average of ~58.6% across the models tested) but still far below human-level localization (arxiv.org). In complex visual tasks, Claude tends to “wax lyrical” rather than pinpoint details (www.techradar.com) (www.techradar.com). In the “Can VLMs See Squares?” grid experiment, Claude fell to ~60% accuracy and severely under-counted objects (arxiv.org). Overall, Claude Vision is unlikely to match Gemini or GPT-4 when asked to sharply localize small parts; its strength lies more in high-level image description than exact pointing.
Note: Several open-source models (e.g. Qwen-2.5-VL, Molmo, RoboRefer) have shown even better grounding performance when specially trained (www.themoonlight.io) (openreview.net). They may outperform all three providers, but among proprietary models the ranking above is the one best supported by current evidence.
Point-It-Out (ICLR 2026) – A new visual-grounding benchmark (stages S1–S3) focusing on fine localization and “pointing” tasks. It shows GPT-4o performs worse than specialized or fine-tuned models (e.g. Molmo, Gemini-2.5-Pro) in object localization (openreview.net). In S1/S2 (object and part pointing), “explicit-location supervised” models (Gemini-2.5-Pro, Molmo etc.) beat GPT-4o and Claude, highlighting that GPT-4 Vision needs more grounding data to localize precisely (www.themoonlight.io).
GroundingME (2025) – A benchmark of visual grounding with emphasis on tiny/occluded objects and ambiguous queries. It finds even state-of-the-art vision-LMs fail badly: the best model scored only 45.1% accuracy across tasks involving fine distinctions, and nearly 0% on “rejection” (unknowable) queries (arxiv.org). This underscores that none of the leading MLLMs (including GPT, Gemini, Claude) truly ground descriptions with human-level precision, especially for small targets.
“Can VLMs See Squares?” (2026) – A synthetic test of pure spatial reasoning. Three top models – Claude Opus, GPT-5.2 (ChatGPT 5.2), and Gemini 3 Thinking – were asked to transcribe 15×15 binary grids. When symbols were real text, Claude/GPT achieved ~91% accuracy, but with identical patterns rendered as graphic shapes all models collapsed to 60–73% accuracy (F1 ≈ 29–39%) (arxiv.org). The study concludes all exhibit “severely degraded spatial localization for non-textual elements” (arxiv.org). This suggests that without textual cues, even the best Vision-LMs fail at fine-grained localization. (Notably, GPT-5.2 performed best on the text-grid version but still suffered similarly on the filled-grid.)
“Vision-Language Models are Blind” (2024) – A set of simple geometric puzzles (circle overlap, counting, etc.). State-of-the-art MLLMs averaged only ~58.6% accuracy; the best case was Claude 3.5 at 74.9% (arxiv.org). Models struggled whenever precise spatial reasoning was needed. This early evaluation foreshadows the consistent difficulty modern VLMs have with tasks requiring exact positioning.
“How Well Does GPT-4o Understand Vision?” (2025) – Evaluates GPT-4o (and Gemini 1.5/2.0) on standard CV tasks (detection, segmentation, depth, etc.). GPT-4o was best among the non-reasoning models on 4 of 6 tasks, but generally far from specialist SOTA (arxiv.org). Crucially, it notes that VLMs perform semantic tasks far better than geometric ones (depth, localization), and that GPT-4o exhibits spatial hallucinations (arxiv.org). This aligns with other findings that GPT-4 Vision is strong at category recognition but still weak at precise localization.
SpatialLadder (ICCV 2025) – A challenge/training framework for spatial reasoning. The trained “SpatialLadder” model (3B) achieved huge gains on localization tasks: it beat GPT-4o by 20.8% and Gemini-2.0 by 10.1% on their benchmarks (arxiv.org). In other words, out-of-the-box GPT/Gemini were far behind. This shows the potential of targeted training, but also highlights current gaps: even large models need specialized training to reach good spatial accuracy.
PIN: Positional Insert (2024) – Demonstrates that frozen VLMs have latent localization ability. By adding a tiny learnable “positional prompt” to a frozen VLM, the authors achieved strong zero-shot box localization on COCO, LVIS etc (arxiv.org). Because the insert is a learned module inside the model, the method needs access to model internals and cannot be applied as-is to API-only models such as GPT-4V. It still suggests that localization ability can be present but unexpressed, hinting that in our evaluation we may unlock better performance with smart prompting or calibration. (Without such guidance, out-of-the-box accuracy would be much lower.)
Molmo2 (2026) – An open VLM specialized for video grounding. Its best 8B model outperforms Gemini 3 Pro on new pointing/tracking tasks: e.g. video-pointing F1 38.4 vs Gemini’s 20.0 (arxiv.org). While this is a video benchmark, it signals that state-of-the-art proprietary models still lag on explicit pointing tasks – even small open models can surpass them when trained on grounding data.
Other relevant references include benchmarks of referring-expression / phrase localization (e.g. RefCOCO) and segmentation (e.g. SAM tests), but few directly test single-point accuracy. The papers above most directly address our goal of point-level localization in cluttered scenes.
Indirect Metrics. Almost no existing benchmark asks for a single coordinate as output. Instead they use boxes or masks (e.g. IoU on RefCOCO, or F1 in “squares” tasks). We will have to map their metrics to our needs. For example, Point-It-Out uses a “normalized IoU” of predicted box vs ground-truth mask (www.themoonlight.io) to compare different output formats. In practice we may simply check whether a predicted point lies inside the true object region, and measure its pixel distance from the annotated target. Because directly comparable “pointing” benchmarks are lacking, any extrapolation from published results to our task is approximate.
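The mapping from box outputs to a point-level check can be sketched as follows. This is a minimal sketch under stated assumptions: masks are stored as row-major lists of 0/1 values, boxes are (x_min, y_min, x_max, y_max) in pixels, and the function names are illustrative rather than taken from any benchmark's code.

```python
import math
from typing import List, Tuple

Mask = List[List[int]]                    # assumed format: mask[y][x] is 0 or 1
Box = Tuple[float, float, float, float]   # assumed format: (x_min, y_min, x_max, y_max)


def box_center(box: Box) -> Tuple[float, float]:
    """Reduce a predicted box to a single (x, y) point at its center."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def point_in_mask(point: Tuple[float, float], mask: Mask) -> bool:
    """Return True if the point falls on a foreground pixel of the mask."""
    x, y = math.floor(point[0]), math.floor(point[1])  # pixel containing the point
    if y < 0 or y >= len(mask) or x < 0 or x >= len(mask[0]):
        return False  # points outside the image count as misses
    return mask[y][x] == 1


# Toy 4x4 mask with a 2x2 object in the lower-right corner.
toy_mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
hit = point_in_mask(box_center((2, 2, 4, 4)), toy_mask)   # box center is (3.0, 3.0)
miss = point_in_mask(box_center((0, 0, 2, 2)), toy_mask)  # box center is (1.0, 1.0)
```

A box center can fall outside a concave or ring-shaped part, so a hit test against the mask is a stricter check than box IoU for such objects.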
Domain Gap. Most evaluations use natural or synthetic scenes, not mechanical parts. Real assembly images may have reflective metal, fine threads, connectors, etc. Performance on MSCOCO, grids, or kitchen scenes does not guarantee similar results in an engineering context. We should be cautious interpreting generalized model rankings without domain-specific data.
Model Variants and Access. Many studies test only certain versions (e.g. GPT-4o, GPT-5.1, Claude 3.5/4.5, Gemini 1.5/2.5). Models evolve rapidly (Gemini 3/Ultra, Claude 4/Opus etc). Vendor API modes (“Pro” vs “Flash” in Gemini, “Opus” vs “Sonnet” in Claude) can change results. Published results may not cover the very newest versions. Until we test directly, we only have secondhand estimates of current model capabilities.
Few Comparative Benchmarks. There is no widely-shared leaderboard comparing GPT, Gemini, Claude on visual grounding. Many papers evaluate only one model or a mix of open models. For proprietary models, results often come from company blogs or press, not peer-reviewed tests. We must combine sources (OpenAI papers, arXiv comparisons, tech reviews) which differ in methodology and prompt style.
Evaluation Criteria Variability. Some papers penalize hallucination or require “rejection” answers (GroundingME). Others focus on grid transcription (squares), or multi-step trajectories (Point-It-Out). These different tasks stress different skills (visual detail vs inference vs planning). None isolates “point-to-object” in full generality. Our domain requires precise low-level localization, so results emphasizing multi-step reasoning may not fully predict point accuracy.
To measure point-of-mark performance in our context, we recommend building a targeted test suite and evaluation pipeline:
Data & Tasks: Assemble a dataset of images depicting typical assembly scenarios (e.g. circuit boards, engine parts, tool benches) with multiple small objects. Manually annotate key target locations (e.g. centers of screws, connectors, fasteners, or functional parts). For each image, create natural-language queries of two types:
Model Interface: Query each provider via its vision-capable API: OpenAI’s GPT vision models, Google’s Gemini API, and Anthropic’s Claude API with image input. In the prompt, instruct the model explicitly to output coordinates in a fixed format and a stated convention (pixel or normalized units, origin at top-left), e.g. “Coordinates of target: (x,y).” If the model returns a bounding box (coordinate pairs) instead of a point, take the box center as the “anchor point”; a corner would usually lie off the object. (Alternatively, ask for a segmentation mask and convert it to a point.) Keep prompt templates consistent across models, and record each model’s output in a parseable form.
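Turning free-text replies into coordinates could look like the sketch below. The reply conventions here are assumptions for illustration, not any provider's documented output format: a bracketed pair is read as (x, y), a bracketed group of four as a box (x_min, y_min, x_max, y_max), and anything else as a failure to answer. `parse_point` and `to_pixels` are hypothetical helper names.

```python
import re
from typing import Optional, Tuple

_NUM = r"-?\d+(?:\.\d+)?"
# A bracketed, comma-separated group of numbers, e.g. "(412, 305)" or "[10, 20, 30, 40]".
_TUPLE = re.compile(r"[\(\[]\s*(" + _NUM + r"(?:\s*,\s*" + _NUM + r")+)\s*[\)\]]")


def parse_point(reply: str) -> Optional[Tuple[float, float]]:
    """Extract an (x, y) point from a model reply, or None if none is found."""
    for match in _TUPLE.finditer(reply):
        values = [float(v) for v in match.group(1).split(",")]
        if len(values) == 2:
            return (values[0], values[1])
        if len(values) == 4:
            # Assumed box order: x_min, y_min, x_max, y_max; use the center.
            x_min, y_min, x_max, y_max = values
            return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
    return None


def to_pixels(point: Tuple[float, float], width: int, height: int,
              scale: float = 1.0) -> Tuple[float, float]:
    """Map a point expressed on a 0..scale grid to pixel coordinates."""
    return (point[0] / scale * width, point[1] / scale * height)
```

For example, `parse_point("Coordinates of target: (412, 305).")` yields `(412.0, 305.0)`, and `to_pixels((500, 250), 640, 480, scale=1000)` yields `(320.0, 120.0)`. Axis order and normalization differ between providers, so each provider's documentation should be checked before trusting a shared parser.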
Metrics: For each predicted point or box, compute its spatial error relative to the ground truth. Useful metrics include: Point Accuracy – fraction of times the predicted point falls inside the true object’s mask; Distance Error – e.g. L2 distance (in pixels or fraction of image size) between prediction and GT center; and IoU if using boxes (e.g. IoU≥0.5 considered correct). We could adopt a “normalized IoU” as in Point-It-Out, factoring out mask shape (www.themoonlight.io). Aggregate results over the test set as mean distance or % within threshold, etc. Also note cases where the model fails to output any coordinate.
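The distance, IoU, and aggregation steps above can be sketched as follows. The choices here are assumptions, not part of any cited benchmark: distance is normalized by the image diagonal, boxes are (x_min, y_min, x_max, y_max), and a query with no parseable coordinate is recorded as `None` and counted as a miss.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def normalized_distance(pred: Point, gt: Point, width: int, height: int) -> float:
    """L2 distance between prediction and target, as a fraction of the image diagonal."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1]) / math.hypot(width, height)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def summarize(errors: List[Optional[float]], threshold: float = 0.05) -> Dict[str, Optional[float]]:
    """Aggregate per-query distance errors; None marks a query with no coordinate."""
    answered = [e for e in errors if e is not None]
    total = len(errors)
    return {
        "no_answer_rate": (total - len(answered)) / total if total else 0.0,
        "mean_error": sum(answered) / len(answered) if answered else None,
        # Unanswered queries stay in the denominator, so they count as misses.
        "within_threshold": sum(e <= threshold for e in answered) / total if total else 0.0,
    }
```

Keeping unanswered queries in the denominator prevents a model from improving its score by declining hard cases; the 0.05 threshold is a placeholder to be tuned on the pilot batch.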
Condition Variations: Include a range of object sizes and clutter levels. For example, have some query objects tiny (<5% image area) and others larger, to see how performance drops. Also test queries that involve occlusion or unusual wording.
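Reporting accuracy per size bucket makes the drop-off on tiny objects visible. The sketch below assumes each result carries the object's mask area and a boolean hit flag; the 5% cut follows the text above, while the bucket labels and function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def size_bucket(object_area: float, image_area: float, tiny_share: float = 0.05) -> str:
    """Label a query by the object's share of the image area."""
    return "tiny" if object_area / image_area < tiny_share else "larger"


def accuracy_by_bucket(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the hit rate per bucket from (bucket, hit) pairs."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [hits, total]
    for bucket, hit in results:
        counts[bucket][0] += int(hit)
        counts[bucket][1] += 1
    return {bucket: hits / total for bucket, (hits, total) in counts.items()}
```

The same grouping function could be reused for clutter level or occlusion by swapping the bucket label.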
Baseline & Controls: As a sanity check, run open-source or simpler models on the same tasks (e.g. BLIP-2, ViLD, or SAM segmenter) to establish an anchor. This helps gauge how much gap remains.
Evaluation Setup: Automate the above in a script or evaluation toolkit. For each query, log the model answer and compute the metrics. Visualize some results for error analysis. Compare the three providers quantitatively (e.g. GPT vs Gemini vs Claude) on these metrics. The goal is to have a clear score (accuracy, error) that directly reflects point-of-mark ability on our use-case images.
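A skeleton for such a script might look like the sketch below. It is deliberately provider-agnostic: each model is wrapped behind a hypothetical callable that maps an image path and a query to a point or `None`, and the case schema (`image`, `query`, `gt_x`, `gt_y`) is an assumption. Stub models stand in for real API calls.

```python
import csv
import io
import math
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical interface: real provider calls would be wrapped behind this signature.
Model = Callable[[str, str], Optional[Point]]


def run_eval(models: Dict[str, Model], cases: List[dict], radius: float) -> str:
    """Run every model on every case and return the per-query log as CSV text.

    A prediction counts as a hit when it lies within `radius` pixels of the target.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["model", "image", "query", "pred_x", "pred_y", "error_px", "hit"])
    for case in cases:
        for name, model in models.items():
            pred = model(case["image"], case["query"])
            if pred is None:  # no parseable coordinate: log it as a miss
                writer.writerow([name, case["image"], case["query"], "", "", "", 0])
                continue
            err = math.hypot(pred[0] - case["gt_x"], pred[1] - case["gt_y"])
            writer.writerow([name, case["image"], case["query"],
                             pred[0], pred[1], round(err, 2), int(err <= radius)])
    return out.getvalue()


# Stub models and a single made-up case, for illustration only.
cases = [{"image": "board_001.png", "query": "the screw nearest the fan header",
          "gt_x": 100.0, "gt_y": 50.0}]
stub_models = {
    "near_target": lambda image, query: (103.0, 54.0),
    "no_answer": lambda image, query: None,
}
log = run_eval(stub_models, cases, radius=10.0)
```

Logging one row per model and query keeps the raw predictions available for later visualization and for rescoring under different thresholds.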
By focusing on actual assembly images and requiring explicit coordinate output, this evaluation will directly measure the models’ precision and robustness for our point-mark tasks. Use small, iterative batches to refine prompts/metrics before scaling up. The comparison can then clearly show which vision-language models achieve acceptable accuracy in pinpointing small parts under clutter.
Based on the literature, Google’s Gemini Vision models (especially the latest “Pro/Ultra” versions) are our first candidates to test, followed by OpenAI’s GPT-4/5 Vision. Anthropic’s Claude is likely third. Our proposed evaluation – image-and-point queries with distance/IoU scoring – will give direct, domain-specific data to confirm (or revise) these ranks for the assembly domain.