Gemini small-parts reasoning test ideas

Short answer

Method 1 (coordinate calibration) worked cleanly. Both gemini-2.5-pro and gemini-robotics-er-1.5-preview returned the exact synthetic landmark coordinates, for a mean error of 0 px.
Method 3 (tiled crop-search) also worked, in the sense that it surfaced many more candidate fasteners on the real image than one-shot prompting did.
However, more points do not automatically mean more correct points. Crop-search increased exploratory recall, but it almost certainly introduced false positives as well.

Real image already tested

[Figure: real drone frame used for the Gemini localization tests]

Results summary

Model	Calibration board	One-shot fastener search	Tiled crop-search	Takeaway
gemini-2.5-pro	5/5 landmarks exact; 0 px mean error	12 candidate fasteners; mostly symmetric central structure	27 merged candidates; many more peripheral / local-detail hits	Good raw grounding; crop-search expands coverage but likely overcalls.
gemini-robotics-er-1.5-preview	5/5 landmarks exact; 0 px mean error	8 candidate fasteners; coarser / more selective than 2.5 Pro	27 merged candidates; similar expansion under crop-search	Also handles the pixel frame correctly on calibration; crop-based prompting materially changes behavior.

Important caveat: these counts are not ground-truth precision/recall numbers yet. They are behavior probes.

[Figure: Gemini 2.5 Pro, one-shot versus crop-search]

Method 3 result — Gemini 2.5 Pro

One-shot prompt found 12 likely fasteners.
Tiled crop-search found 27 merged candidates.
The extra candidates cluster on smaller peripheral hardware and partially obscured regions.
Interpretation: the crop strategy is doing real work. It changes the search behavior, not just the formatting.
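
The notes do not record how the crops were generated, so the following is only a minimal sketch of one way to tile an image for crop-search. The function names, the tile size, and the overlap are all assumptions for illustration. It produces overlapping crop boxes and maps a point reported inside a crop back to full-image pixels.

```python
def make_tiles(width, height, tile, overlap):
    """Return (left, top, right, bottom) crop boxes covering the whole image.

    Neighbouring tiles share `overlap` pixels so that a fastener sitting on
    a tile border is fully visible in at least one crop.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    stride = tile - overlap

    def starts(length):
        # Start offsets along one axis; the last tile sits flush with the edge.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)
        return positions

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def to_full_image(point, box):
    """Map an (x, y) point in crop pixels back to full-image pixels."""
    left, top, _, _ = box
    return (point[0] + left, point[1] + top)


# Hypothetical image size and tiling parameters, for illustration only.
boxes = make_tiles(width=1024, height=512, tile=512, overlap=128)
# Each crop would be sent to the model separately (model call not shown).
print(to_full_image((10, 20), boxes[1]))  # (394, 20)
```

Overlap matters here because a part cut in half by a tile boundary may be missed in both crops. The cost is that the same part can be reported twice, which is why the per-tile results need merging afterwards.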

[Figure: Gemini Robotics-ER, one-shot versus crop-search]

Method 3 result — Gemini Robotics-ER

One-shot prompt found 8 likely fasteners.
Tiled crop-search found 27 merged candidates.
So the crop-search effect is not just a 2.5 Pro quirk; it changes the robotics model's behavior materially as well.
Interpretation: when Gemini is prompted to inspect the image crop by crop, like an inspection agent, it marks small parts much more aggressively.
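
The "merged candidates" counts imply that hits from different crops were deduplicated, but the notes do not say how. One simple possibility, shown here purely as an assumption, is a greedy merge: any point within a fixed pixel radius of an existing cluster center joins that cluster. The function name and the radius are hypothetical.

```python
import math


def merge_candidates(points, radius):
    """Greedily merge (x, y) points lying within `radius` px of a cluster center.

    Each cluster is represented by the running mean of its members.
    The result depends on input order, which is acceptable for a rough probe.
    """
    clusters = []  # each entry: [sum_x, sum_y, count]
    for x, y in points:
        for c in clusters:
            center = (c[0] / c[2], c[1] / c[2])
            if math.dist((x, y), center) <= radius:
                c[0] += x
                c[1] += y
                c[2] += 1
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([x, y, 1])
    return [(c[0] / c[2], c[1] / c[2]) for c in clusters]


# Two hits on the same screw from overlapping crops, plus one separate hit.
print(merge_candidates([(100, 100), (102, 100), (300, 300)], radius=5))
# [(101.0, 100.0), (300.0, 300.0)]
```

The choice of radius directly affects the final count: too small and duplicates survive, too large and neighbouring fasteners collapse into one. That is one more reason to treat the merged counts as behavior probes rather than measurements.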

Method 1 result — coordinate calibration

Both models returned the exact centers for all five labeled circles on the synthetic calibration image.

gemini-2.5-pro: 0 px mean error
gemini-robotics-er-1.5-preview: 0 px mean error

This is good news: the odd behavior we saw earlier is probably not a generic “Gemini does not know image coordinates” problem. It is more likely specific to the prompt, the task, or the scene.
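
For reference, the mean error reported above can be computed as the average Euclidean distance between each returned center and its known position. The sketch below assumes both sets of coordinates are already in full-image pixels and keyed by landmark label. The labels and coordinates are made up for illustration and are not the actual calibration board.

```python
import math


def mean_pixel_error(predicted, expected):
    """Mean Euclidean distance in pixels between matched landmark centers.

    Both arguments map a landmark label to an (x, y) pixel coordinate.
    """
    if set(predicted) != set(expected):
        raise ValueError("predicted and expected must cover the same labels")
    distances = [math.dist(predicted[k], expected[k]) for k in expected]
    return sum(distances) / len(distances)


# Hypothetical five-circle board (illustrative coordinates only).
expected = {"A": (100, 100), "B": (400, 100), "C": (250, 250),
            "D": (100, 400), "E": (400, 400)}

exact = dict(expected)
print(mean_pixel_error(exact, expected))  # 0.0

# One landmark off by (3, 4) px contributes 5 px, averaged over 5 landmarks.
shifted = dict(expected, A=(103, 104))
print(mean_pixel_error(shifted, expected))  # 1.0
```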

Method 2: similar-part distractor board

Still worth running next. This would separate “bad localization” from “category confusion among tiny metal parts.”

Method 4: multi-view / temporal consistency test

Still pending. Important if the end goal is stable part identity across multiple views rather than single-image marking.

Method 5: counterfactual + occlusion reasoning

Still pending. Useful for distinguishing “I see it” from “I infer it’s behind the occluder.”

What worked

Pixel coordinates can be exact on a clean synthetic target.
Crop-based prompting clearly changes model behavior on tiny hardware.

What remains unresolved

We still do not have human-labeled ground truth for the real drone image.
So the crop-search result should be read as “higher exploratory recall” rather than “higher measured accuracy.”
The next rigorous step is a small hand-labeled benchmark on 3–10 real images.
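
Once hand labels exist, candidate points could be scored along the following lines. This is a sketch under stated assumptions, not an established protocol: a prediction counts as correct if it falls within a chosen pixel tolerance of a label that has not been matched yet, and matching is greedy by distance. The function name, the tolerance, and the sample points are hypothetical.

```python
import math


def score_points(predicted, labeled, tolerance):
    """Greedy one-to-one matching of predicted points to labeled points.

    Returns (precision, recall). Each label can be matched at most once,
    so duplicate hits on the same fastener count as false positives.
    """
    pairs = sorted(
        (math.dist(p, g), i, j)
        for i, p in enumerate(predicted)
        for j, g in enumerate(labeled)
    )
    used_pred, used_label = set(), set()
    for distance, i, j in pairs:
        if distance > tolerance:
            break  # all remaining pairs are even farther apart
        if i in used_pred or j in used_label:
            continue
        used_pred.add(i)
        used_label.add(j)
    true_positives = len(used_pred)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall


# Illustrative data: three labeled fasteners, four model candidates.
labeled = [(10, 10), (50, 50), (90, 90)]
predicted = [(11, 10), (52, 49), (200, 200), (10, 12)]
precision, recall = score_points(predicted, labeled, tolerance=5)
print(precision)  # 0.5
```

In this example two candidates match distinct labels, one is a duplicate hit on an already matched fastener, and one is far from any label, so precision is 2/4 while recall is 2/3. Scoring of this kind would show whether the extra crop-search candidates are real finds or overcalls.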

