Query: Research the 2025-2026 state of the art for exemplar- or image-prompted detection, segmentation, and tracking of physical parts in images and video, with emphasis on small mechanical parts. Compare pipelines that start from a single annotated example image versus a 3D model or rendered views, but do not overfocus on CAD-specific methods unless directly helpful. Prioritize practical methods and repos usable today for a full pipeline: exemplar/image-prompted detection -> mask refinement/segmentation -> multi-frame tracking and re-detection after occlusion. Include arXiv papers and repos from 2025-2026, with brief mention of older foundational systems only when needed for context. Identify which methods are open-source and production-usable versus closed/existence-proof systems. Output should be a research memo with concrete recommendations and a ranked table of models/repos the user can actually try.
Model: o4-mini-deep-research
Date: 2026-04-10
Searches performed: 41
Sources cited: 12

Exemplar-Based Part Detection

Modern “foundation” segmentation models excel at finding specific object instances in images. In particular, Meta’s Segment Anything 3 (SAM3) model explicitly supports text or image exemplars: it uses a DETR-based detector conditioned on text, geometry or an example image to exhaustively segment all matching objects (github.com) (huggingface.co). SAM3 is open-source (facebookresearch/sam3, released under Meta’s SAM license (github.com)) and comes with pretrained weights. In practice, prompting SAM3 with a single example of the part yields masks for every instance of that part in a scene, which is far more directly applicable to small mechanical parts than vague text descriptions (huggingface.co). Detection with large vision-language models (e.g. CLIP) or open-vocabulary detectors like GroundingDINO/GLIP can work too, but is generally weaker for precise, similar-shaped parts. For example, CLIP-based detection suffers from coarse semantics (openaccess.thecvf.com), whereas an image-as-prompt approach like SAM3 stays tied to the part’s actual appearance.

If a CAD model is available, 3D-to-2D matching techniques can be used. For instance, MUSE (2025) uses multi-view renders of a 3D object as “exemplar” templates and matches them to 2D proposals, achieving state-of-the-art zero-shot detection (arxiv.org). (MUSE is “training-free” – it requires no finetuning, just precomputed templates (arxiv.org).) This is promising for known part models on assembly lines. Alternatively, older “one-shot” frameworks like OS2D (2020) perform dense feature matching from a single example (code available (arxiv.org)), but they’ve been largely superseded by today’s foundation models. In summary, for detection from exemplars, our top recommendation is SAM3 (GitHub “facebookresearch/sam3”, HuggingFace integration (huggingface.co)) for image prompts, with MUSE or similar 3D template matchers if CAD data exists.
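The template-matching idea behind render-based detectors can be illustrated with a small scoring routine. This is a generic sketch, not MUSE's actual scoring function: it assumes each rendered view and each 2D proposal has already been reduced to a feature vector by some encoder, and the names `cosine`, `match_proposals`, and the `threshold` value are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_proposals(template_views, proposals, threshold=0.8):
    """Score each proposal by its best-matching template view.

    Returns (proposal_index, best_view_index, score) tuples for proposals
    whose best score reaches the threshold, sorted by descending score.
    """
    if not template_views:
        return []
    matches = []
    for p_idx, proposal in enumerate(proposals):
        scores = [cosine(proposal, view) for view in template_views]
        best_view = max(range(len(scores)), key=scores.__getitem__)
        if scores[best_view] >= threshold:
            matches.append((p_idx, best_view, scores[best_view]))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Toy data: two rendered views and three proposals (hypothetical features).
views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
props = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.1, 0.95, 0.0]]
result = match_proposals(views, props)
```

Taking the maximum over views means a proposal only has to resemble the part from one viewpoint, which is why more rendered views generally help when the part can lie in arbitrary poses.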

Mask Refinement & Segmentation

After locating instances (via bounding box or coarse mask), we refine masks for precise segmentation. Here, Segment-Anything models shine: SAM3 returns full masks, and even standard SAM2 (2024) can refine coarse contours with minimal cues. New diffusion approaches also improve boundaries: e.g., Discrete Diffusion Contour Refinement (2026) iteratively denoises a sparse contour conditioned on an initial mask, yielding sharper edges under scarce data. Other hybrid pipelines are emerging: Trident (ICCV 2025) is a training-free open-vocabulary segmentation method that splits a high-res image into patches, extracts CLIP/DINO features, then uses SAM’s encoder to correlate and merge them. Crucially, Trident converts coarse CLIP masks into “prompts” for SAM, producing much finer segmentation (openaccess.thecvf.com). In practice, one could run a CLIP- or embedding-based matcher for the part to get a rough mask and then let SAM3 refine that mask precisely.

Another line of work is open-vocabulary segmentation with object priors. For example, LoGoSeg (2026) introduces a dual-stream fusion of global semantics (from CLIP) and local structure, plus an object-presence prior, to reduce spurious masks (www.catalyzex.com). These methods are open-source and training-free but may be overkill for our tightly-defined object set. In our pipeline, a pragmatic choice is to rely on SAM3 (or SAM2) masks directly, possibly supplemented by one of these refinement tools if needed. All mentioned systems (SAM3, SAM2, Trident, LoGoSeg) provide code or models publicly. SAM3’s repository includes inference scripts and models (github.com), and Trident’s code is on GitHub.
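Turning a coarse mask into prompts for a promptable segmenter needs no model code. The following is a minimal sketch, assuming the coarse mask is a 2D list of 0/1 values and that the downstream segmenter accepts a box and one positive point in pixel coordinates. The function names `mask_to_box` and `mask_to_point` are illustrative and are not part of SAM or Trident.

```python
def mask_to_box(mask, pad=2):
    """Padded bounding box (x0, y0, x1, y1), inclusive, or None if the mask is empty."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None
    return (
        max(cols[0] - pad, 0),
        max(rows[0] - pad, 0),
        min(cols[-1] + pad, width - 1),
        min(rows[-1] + pad, height - 1),
    )


def mask_to_point(mask):
    """Foreground pixel (x, y) closest to the mask centroid, or None if empty.

    The raw centroid is not used directly because it can fall in a hole,
    e.g. the centre of a washer or a nut.
    """
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        return None
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)
    return min(pixels, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


# Toy example: a 3x3 ring (washer-like) mask with an empty centre.
ring = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
box = mask_to_box(ring, pad=0)
point = mask_to_point(ring)
```

Snapping the point to the nearest foreground pixel matters for ring-shaped parts, where a centroid click would land on the background and mislead the segmenter.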

Multi-Frame Tracking & Re-Detection

Once parts are identified in one frame, they must be tracked through the video, handling occlusions and out-of-view cases. There are two main paradigms:

Tracking by Detection/Segmentation: Perform detection on each frame and associate object IDs. For example, one could rerun SAM3 or a smaller detector every N frames and apply a tracker like ByteTrack or DeepSORT on the resulting boxes (with SAM3, the DETR-style detector outputs can be passed to the tracker directly). However, for small, similar parts in clutter, running a heavy detector on every frame can be slow.
Segmentation-based tracking: Foundation models now allow tracking via masks directly. Notably, SAM2MOT (2025) proposes “tracking by segmentation”: it takes segmentation masks (from SAM) and extracts bounding boxes, focusing on associating masks over time (arxiv.org). By bypassing explicit detection, it generalizes zero-shot. Similarly, Seg2Track-SAM2 (2025) integrates any pretrained object detector with SAM2 plus a specialized module for track initialization and re-association (arxiv.org). These “Seg→Track” frameworks are mostly research prototypes, but they show that a SAM-based tracker can maintain identities purely on mask continuity.
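The association step shared by both paradigms can be sketched as greedy IoU matching between existing tracks and new detections. This is a deliberately simplified illustration: real trackers such as ByteTrack add motion prediction and multi-stage matching, and mask-based trackers would use mask IoU rather than box IoU. The function names and the `min_iou` threshold are assumptions made for this example.

```python
def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Greedily match tracks (dict: id -> box) to detections (list of boxes).

    Returns (matches, lost, new): matches maps track id -> detection index,
    lost lists track ids with no match (candidates for re-detection),
    new lists detection indices that should start new tracks.
    """
    pairs = sorted(
        (
            (box_iou(t_box, d_box), t_id, d_idx)
            for t_id, t_box in tracks.items()
            for d_idx, d_box in enumerate(detections)
        ),
        reverse=True,
    )
    matches, used = {}, set()
    for iou, t_id, d_idx in pairs:
        if iou < min_iou:
            break  # remaining pairs overlap even less
        if t_id in matches or d_idx in used:
            continue
        matches[t_id] = d_idx
        used.add(d_idx)
    lost = [t_id for t_id in tracks if t_id not in matches]
    new = [d_idx for d_idx in range(len(detections)) if d_idx not in used]
    return matches, lost, new


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches, lost, new = associate(tracks, detections)
```

For many near-identical parts packed closely together, pure IoU matching is fragile, which is one reason the mask- and memory-based trackers discussed here are attractive.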

Practically, open-source toolkits exist to simplify this. Track-Anything (MIT license) is a Python pipeline built on SAM+XMem: the user clicks an object in one frame and Track-Anything uses SAM to segment it, then XMem (a memory-based VOS network) propagates the mask through the video (github.com). It’s interactive by design, but can also run in “auto” mode on multiple objects. Likewise, Segment-and-Track-Anything (SAM-Track) uses SAM for initial key-frame segmentation plus the DeAOT tracker (a decoupled variant of “AOT”, Associating Objects with Transformers) to propagate segments (github.com). These are full pipelines (with demos, notebooks and code) that handle multiple objects in dynamic scenes. For non-interactive use, one can seed SAM-Track with the target part’s masked region on frame 1 and let it run.

In the absence of these specialized tools, classical MOT trackers (ByteTrack, FairMOT, etc.) can still be used: treat each part as a “class” by repeatedly detecting it (e.g. via SAM3’s text prompt or a dense template search) and feed the detections to ByteTrack for ID tracking. For mask-based temporal coherence, video object segmentation networks like XMem (ECCV 2022) or STCN (NeurIPS 2021) are effective: given an initial mask of the part, they propagate it across frames by matching each new frame against a memory of past frame features. XMem is open-source (PyTorch) and known to handle occlusions reasonably well.

Practical Pipeline Recommendations

For a full pipeline tailored to small mechanical parts, the following open-source stack is recommended (arranged roughly by priority):

Detection/Instance Segmentation: SAM3 (facebookresearch/sam3 on GitHub, facebook/sam3 on Hugging Face), open code with pretrained models (github.com). Use the image prompt interface. Of the options here, we expect it to give the highest accuracy on arbitrary parts.
Refinement: SAM3 output masks usually suffice. Optionally apply a boundary refinement (e.g. a diffusion-based contour model) if the part edges must be extremely precise. The Trident pipeline (CLIP+SAM) is available (Python code) and can amplify SAM’s accuracy (openaccess.thecvf.com).
Tracking: Use SAM3’s built-in tracker or a mask-propagation tracker such as Track-Anything (MIT, code on GitHub (github.com)) or SAM-Track (AGPL-3.0, integrates DeAOT) to propagate masks and IDs. SAM-Track can re-run SAM at key frames to pick up newly appearing objects; recovery after occlusion otherwise depends on the propagation memory, so plan to re-run the prompt detector when a track is lost. For simpler needs, the ByteTrack bounding-box tracker can manage IDs, assuming detection is run periodically.
3D Model Matching: If 3D CAD is available, run MUSE (arXiv code available) to generate multiple 2D template views and match them to the video frames (arxiv.org). This is “training-free” and excels in industrial scenarios (it topped the 2025 BOP challenge).
Open-Vocab/Text (Auxiliary): If the part has a clear textual descriptor (e.g. “allen bolt M6”), one could try open-vocab detectors (Grounding DINO, OWL-ViT) for coarse localization. However, for uniform parts these often underperform an image exemplar.
Small-Object Tuning: Many object models degrade on very small items. If using any CNN detector, ensure high-resolution input or tiling. Some recent detectors explicitly “borrow” features from larger objects to boost small-object recall (arxiv.org) (e.g. the “BorrowFeatures” paper), but such models aren’t readily available as code.
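The tiling advice for small parts can be made concrete with a window generator. This is a minimal sketch under stated assumptions: the tile size of 640 pixels and the overlap fraction are placeholder values to tune per detector, and `tile_windows` and `to_frame_coords` are illustrative helper names, not functions from any of the libraries above.

```python
def tile_windows(width, height, tile=640, overlap=0.25):
    """Overlapping (x0, y0, x1, y1) crop windows that cover the whole image."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(tile * (1 - overlap)))

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # last tile sits flush with the border
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def to_frame_coords(box, window):
    """Shift a box detected inside a tile back into full-frame coordinates."""
    x0, y0, x1, y1 = box
    off_x, off_y = window[0], window[1]
    return (x0 + off_x, y0 + off_y, x1 + off_x, y1 + off_y)


windows = tile_windows(1000, 700, tile=640, overlap=0.25)
shifted = to_frame_coords((10, 20, 30, 40), windows[-1])
```

Because tiles overlap, the same part can be detected twice near tile borders, so the shifted detections would normally be merged with non-maximum suppression before tracking.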

Most recommended components are usable in production today: SAM3, XMem, Track-Anything, and ByteTrack all ship code, pretrained weights, and Python APIs. Check licenses before deployment, though: Track-Anything and ByteTrack are MIT, while SAM3 uses Meta’s SAM license and SAM-Track is AGPL-3.0. Conversely, research demos such as SAM2MOT and Seg2Track-SAM2 are closer to proofs of concept, though their design principles are worth borrowing. For part-finding tasks we expect image-prompt approaches (SAM3) to clearly outperform text-based segmentation: an image prompt encodes the part’s geometry directly, which a CLIP text embedding often cannot.

Ranked Table of Models/Repos to Try

Method/Repo	Task	Source & License	Notes
facebookresearch/sam3	Image-prompt detection & segmentation	GitHub (Meta AI; open) (github.com) (huggingface.co)	General “detect-anything” model; accepts image or text prompt; state-of-the-art promptable segmentation; includes video tracking capabilities. Heavy but highly flexible.
gaomingqi/Track-Anything	Video tracking+segmentation	GitHub (MIT) (github.com)	Interactive video tracker using SAM+XMem; can auto-track multiple clicked objects; code/demo available; MIT-licensed.
z-x-yang/Segment-and-Track-Anything (SAM-Track)	Video multi-object tracking	GitHub (AGPL-3.0) (github.com)	Uses SAM for key-frame segmentation + DeAOT for tracking; auto-detects new objects; code available (AGPL).
Shi et al. Trident (2025) (YuHengsss/Trident)	Open-vocab segmentation	GitHub (Open) (openaccess.thecvf.com)	Refinement pipeline combining CLIP+DINO+SAM to improve segmentation of novel concepts; training-free, code on GitHub. Outperforms naive CLIP masks.
jk-choi/MUSE-Object-Detection (or similar)	Template matching (3D->2D)	Arxiv/SIG (preprint) – (likely open code) (arxiv.org)	Uses multi-view renders from CAD for zero-shot detection; achieves SOTA on BOP 2025; no training needed. (If CAD known, plug into MUSE.)
hkchengrex/XMem	Video object segmentation	GitHub (Apache-2.0)	ECCV’22 model: propagate an initial mask through video. Useful for mask-based tracking (semi-supervised VOS).
ultralytics/YOLO series (YOLOv8+)	Bounding-box detection	GitHub (AGPL-3.0 + commercial)	If classes are defined, can train a detector. Can detect parts if abundant labeled examples exist (does not use exemplars directly).
ZQPei/deep_sort_pytorch (PyTorch port of nwojke/deep_sort)	Multi-object tracking (bbox)	GitHub (MIT)	Associative tracker using appearance features. Use with any detector’s boxes.
groundingdino / glip	Text-prompt detection	GitHub (open)	Zero-shot detector via text; try if a descriptive name is known for parts. Less reliable for precise part shapes.
OS2D (aosokin/os2d)	One-shot 2D detection	GitHub (Apache) (arxiv.org)	Older method (ECCV’20). Matches CNN features from a single template. May fail on heavy clutter. Code available for experimentation.

Each entry above can be tried in a prototype pipeline. In practice, we expect SAM3 (and SAM-based trackers) to be the core enablers: image-prompt detection to get initial part masks, then tracking (via XMem or Track-Anything) to carry those masks through video, re-running the prompt detector whenever tracking confidence drops. This leverages the latest open-source foundation models (github.com) (arxiv.org).
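The detect, track, and re-detect loop can be sketched as a small control routine. This is an illustrative skeleton only: `detect` and `propagate` are hypothetical callables standing in for a prompt detector (such as SAM3) and a mask propagator (such as XMem), and the confidence threshold and refresh interval are placeholder values, not settings taken from any of those projects.

```python
def run_pipeline(frames, detect, propagate, min_conf=0.5, redetect_every=30):
    """Track one part through frames, re-detecting when tracking looks unreliable.

    detect(frame) -> mask or None (None means the part was not found).
    propagate(prev_mask, frame) -> (mask, confidence in [0, 1]).
    Returns a list of (source, mask), where source is "detect", "track" or "lost".
    """
    results = []
    mask = None
    frames_since_detect = 0
    for frame in frames:
        # Re-detect if there is no current mask or the periodic refresh is due.
        need_detect = mask is None or frames_since_detect >= redetect_every
        if not need_detect:
            mask, confidence = propagate(mask, frame)
            need_detect = confidence < min_conf
        if need_detect:
            mask = detect(frame)
            frames_since_detect = 0
            source = "detect" if mask is not None else "lost"
        else:
            frames_since_detect += 1
            source = "track"
        results.append((source, mask))
    return results


# Toy stand-ins: the part is occluded in frames 2 and 3.
hidden = {2, 3}


def detect(frame):
    return None if frame in hidden else f"mask@{frame}"


def propagate(prev_mask, frame):
    return prev_mask, (0.1 if frame == 2 else 0.9)


log = run_pipeline(range(6), detect, propagate)
sources = [source for source, _ in log]
```

Keeping the detector out of the per-frame path is the point of this design: the expensive prompt model runs only at start-up, on a timer, or when the propagator reports low confidence.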

Sources summary

Recent literature and code repositories (mainly 2025–2026, plus older foundational work) were surveyed. Key references: MUSE (zero-shot detection from CAD) (arxiv.org); SAM2MOT/SAM3 (prompted tracking) (github.com) (arxiv.org); Trident (SAM+CLIP segmentation) (openaccess.thecvf.com); Track-Anything and SAM-Track frameworks (github.com) (github.com). Older one-shot methods (e.g. OS2D (arxiv.org)) are noted but have largely been superseded.