Short answer
Real image already tested
Real drone frame used for Gemini localization tests
Results summary
gemini-2.5-pro
  • Calibration board: 5/5 landmarks exact, 0 px mean error
  • One-shot fastener search: 12 candidate fasteners, mostly symmetric central structure
  • Tiled crop-search: 27 merged candidates, many more peripheral / local-detail hits
  • Takeaway: Good raw grounding + crop search expands coverage, but likely overcalls.

gemini-robotics-er-1.5-preview
  • Calibration board: 5/5 landmarks exact, 0 px mean error
  • One-shot fastener search: 8 candidate fasteners, coarser / more selective than 2.5 Pro
  • Tiled crop-search: 27 merged candidates, similar expansion under crop-search
  • Takeaway: Also understands pixel frame correctly on calibration; crop-based prompting materially changes behavior.

Important caveat: these counts are not ground-truth precision/recall numbers yet. They are behavior probes.
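
Once a labeled set of fastener positions exists, the candidate counts could be turned into precision/recall along these lines. This is only a sketch: the function name, the greedy nearest-first matching, and the 15 px match radius are all assumptions for illustration, not part of the tests reported here.

```python
import math

def score_candidates(candidates, truths, radius_px=15.0):
    """Greedy one-to-one matching of predicted points to labeled points.

    candidates, truths: lists of (x, y) pixel coordinates.
    radius_px: hypothetical match tolerance; tune it to fastener size.
    """
    # Every candidate/truth pair within the tolerance, nearest first.
    pairs = sorted(
        (math.dist(c, t), ci, ti)
        for ci, c in enumerate(candidates)
        for ti, t in enumerate(truths)
        if math.dist(c, t) <= radius_px
    )
    used_c, used_t = set(), set()
    for _, ci, ti in pairs:
        # Each candidate and each labeled point may be matched only once.
        if ci not in used_c and ti not in used_t:
            used_c.add(ci)
            used_t.add(ti)
    tp = len(used_t)
    return {
        "tp": tp,
        "fp": len(candidates) - tp,   # candidates with no labeled match (overcalls)
        "fn": len(truths) - tp,       # labeled fasteners that were missed
        "precision": tp / len(candidates) if candidates else 0.0,
        "recall": tp / len(truths) if truths else 0.0,
    }
```

Scoring the one-shot and crop-search outputs against the same labels would show whether the jump to 27 candidates is added recall or mostly overcalls.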

Gemini 2.5 Pro one-shot versus crop-search
Method 3 result — Gemini 2.5 Pro
  • One-shot prompt found 12 likely fasteners.
  • Tiled crop-search found 27 merged candidates.
  • The extra candidates cluster on smaller peripheral hardware and partially obscured regions.
  • Interpretation: the crop strategy is doing real work. It changes the search behavior, not just the formatting.
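
A minimal sketch of the tiling step behind a crop-search like the one above. The tile size, the overlap, and both function names are hypothetical; the actual crop layout used in these tests is not recorded here.

```python
def make_tiles(width, height, tile=512, overlap=128):
    """Return (x0, y0, x1, y1) crop boxes that cover the frame with overlap.

    tile and overlap are illustrative defaults, not the values used in the tests.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap

    def starts(size):
        # A frame no larger than one tile needs a single crop.
        if size <= tile:
            return [0]
        s = list(range(0, size - tile, step))
        s.append(size - tile)  # final tile sits flush with the far edge
        return s

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]

def to_frame_coords(point, box):
    """Shift a crop-local (x, y) pixel point into full-frame coordinates."""
    x0, y0, _, _ = box
    return (point[0] + x0, point[1] + y0)
```

Each crop would be sent to the model separately, and every returned point passed through `to_frame_coords` before merging. The overlap is there so that a fastener sitting on a tile boundary appears whole in at least one crop.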
Gemini Robotics ER one-shot versus crop-search
Method 3 result — Gemini Robotics-ER
  • One-shot prompt found 8 likely fasteners.
  • Tiled crop-search found 27 merged candidates.
  • So the crop-search effect is not just a 2.5 Pro quirk; it also changes the Robotics-ER model's behavior materially (8 candidates one-shot versus 27 merged).
  • Interpretation: if we let Gemini act like an inspection agent, small-parts grounding becomes much more aggressive.
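
Both crop-search runs report merged candidates, which implies some deduplication of detections from overlapping crops. How that merge was actually done is not described here; one plausible approach, shown purely as an assumption, is a greedy distance-based merge with a hypothetical 12 px radius.

```python
import math

def merge_candidates(points, radius_px=12.0):
    """Greedily merge full-frame points that fall within radius_px of a cluster.

    points: list of (x, y) pixel coordinates, already in full-frame space.
    radius_px: hypothetical merge radius; the real value is unknown.
    """
    clusters = []  # each cluster is [sum_x, sum_y, count]
    for x, y in points:
        for c in clusters:
            cx, cy = c[0] / c[2], c[1] / c[2]  # current cluster centroid
            if math.dist((x, y), (cx, cy)) <= radius_px:
                c[0] += x
                c[1] += y
                c[2] += 1
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([x, y, 1])
    return [(c[0] / c[2], c[1] / c[2]) for c in clusters]
```

The merge radius matters for interpreting the counts: too small and one fastener seen in two crops is counted twice, too large and neighboring fasteners collapse into one.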
Calibration board
Method 1 result — coordinate calibration

Both models returned the exact centers for all five labeled circles on the synthetic calibration image.

  • gemini-2.5-pro: 0 px mean error
  • gemini-robotics-er-1.5-preview: 0 px mean error

This is good news: the weirdness we saw earlier is probably not a generic “Gemini does not know image coordinates” problem. It is more likely prompt/task-specific or scene-specific.
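
For reference, a mean pixel error like the one reported above can be computed as the average Euclidean distance between each returned center and its labeled center. The function name and the label keys are hypothetical, and the sketch assumes both sets of points are already in the same pixel frame.

```python
import math

def mean_pixel_error(predicted, reference):
    """Mean Euclidean distance in pixels between matching labeled points.

    predicted, reference: dicts mapping a label to an (x, y) pixel center.
    """
    if predicted.keys() != reference.keys():
        raise ValueError("label sets differ")
    total = sum(math.dist(predicted[k], reference[k]) for k in reference)
    return total / len(reference)
```

A result of 0 means every returned center matched its label exactly, which is what both models achieved on the five circles.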

Distractor board
2. Similar-part distractor board

Still worth running next. This would separate “bad localization” from “category confusion among tiny metal parts.”

Multi-view consistency
4. Multi-view / temporal consistency test

Still pending. Important if the end goal is stable part identity across multiple views rather than single-image marking.

Counterfactual occlusion
5. Counterfactual + occlusion reasoning

Still pending. Useful for distinguishing “I see it” from “I infer it’s behind the occluder.”

What worked

What remains unresolved