Deep Research Report

Had's Links

https://github.com/ziqin-h/GIVEPose
https://www.alphaxiv.org/abs/2505.10841
https://bop.felk.cvut.cz/leaderboards/#:%7E:text=0.607%20%20,0.628
https://openaccess.thecvf.com/content/CVPR2025/papers/Deng_Pos3R_6D_Pose_Estimation_for_Unseen_Objects_Made_Easy_CVPR_2025_paper.pdf
https://vfm-6d.github.io/
https://github.com/ethnhe/FS6D-PyTorch
https://sites.google.com/view/taeyeop-lee/any6d
https://github.com/shanice-l/UNOPose
https://openaccess.thecvf.com/content/CVPR2024/html/Zhao_DVMNet_Computing_Relative_Pose_for_Unseen_Objects_Beyond_Hypotheses_CVPR_2024_paper.html#:%7E:text=,art%20methods.%20Our%20code%20is
https://www.alphaxiv.org/overview/2503.20220
https://www.alphaxiv.org/overview/2312.08344
https://www.alphaxiv.org/overview/2503.18673
https://www.alphaxiv.org/abs/2511.12614
https://arxiv.org/abs/2505.04109

---

Query: Comprehensive survey of 6D pose estimation from RGB/RGBD images (2022-2026). Focus on: (1) State of the art methods - what are the current best approaches? Transformer-based, diffusion-based, foundation model approaches? (2) Latest papers from 2025-2026 pushing the frontier. (3) Highly cited foundational papers everyone builds on. (4) Key benchmarks and datasets (BOP, YCB-Video, LM-O, T-LESS, etc.) and current leaderboard standings. (5) Instance-level vs category-level pose estimation. (6) RGB-only vs RGBD methods. (7) Novel object pose estimation (objects not seen during training). (8) Real-time methods vs accuracy-focused methods. For each paper: arxiv ID, key contribution, reported metrics on standard benchmarks, code availability.
Model: o4-mini-deep-research
Date: 2026-03-25
Searches performed: 36
Sources cited: 28

---

Overview of 6D Object Pose Estimation (2022–2026)

Estimating the 6D pose (3D position + 3D orientation) of objects from images remains crucial for robotics and AR. Modern methods can be grouped by their core methodology (e.g. direct regression, correspondence, transformer, diffusion), data modality (RGB vs RGB-D), and problem setting (instance- vs category-level, known vs novel objects, speed vs accuracy). We summarize the state-of-the-art (SOTA) approaches and trends in each category, highlighting recent (2024–26) advances, foundational works, benchmarks/datasets, and leaderboards. For each key paper we give its arXiv ID (or publication venue), main contribution, benchmark performance, and code availability (when known).

Key benchmarks/datasets: Popular datasets and leaderboards include LINEMOD (LM), LM-Occluded (LM-O), YCB-Video (YCB-V), T-LESS, IC-BIN, ITODD, etc., which form the BOP (Benchmark for 6D Object Pose Estimation) suite (bop.felk.cvut.cz). BOP “Classic” divides tasks into RGB(-D) instance-level pose for known objects (via known CAD models) and novel object pose (model-free, unseen objects) (bop.felk.cvut.cz). Category-level pose uses datasets like NOCS (Normalized Object Coordinate Space), with its synthetic CAMERA split and real REAL275 split (paperswithcode.com), and more recent “in-the-wild” video sets (e.g. Wild6D (oasisyang.github.io)). Leaderboards track metrics like Average Recall (AR) under standard thresholds. For example, top AR on the BOP Classic Core (model-based unseen) is ~0.84 (RGB-D) by FRTPose-WAPR (CVPR ’23) (bop.felk.cvut.cz). Nvidia’s FoundationPose (CVPR’24) reports AR≈0.734 on BOP Core (bop.felk.cvut.cz). For YCB-Video segmentation, SOTA AP reaches ~0.70 (bop.felk.cvut.cz).

Instance-level vs Category-level: Instance-level methods assume the specific 3D model of each object is known or can be retrieved. Category-level methods handle unseen instances within a known class (e.g. any mug, not a specific mug). Category-level methods often estimate a Normalized Object Coordinate Space (NOCS) map or a similar representation to lift 2D pixels to a canonical 3D shape before solving for pose (paperswithcode.com). Examples: NOCS (CVPR’19) introduced this approach for categories such as bottle and mug, enabling category-level pose and size estimation. Wild6D (NeurIPS’22) collects in-the-wild RGB-D videos to train category-level models with self-supervision (oasisyang.github.io). Novel-object or model-free settings (no CAD, few/no reference images) are hot topics: e.g. Few-shot 6D (FS6D, CVPR’22) uses 1–5 support views of a new object to generalize (paperswithcode.com); MegaPose (CoRL’22) trains on thousands of ShapeNet objects for novel-instance pose via a “render & compare” loop (proceedings.mlr.press). More recent works (Any6D, UNOPose, RefPose, OPFormer, etc.) explicitly handle novel object scenarios (see below).
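The NOCS-style lifting described above ends with an alignment step: once each pixel has a predicted canonical coordinate and a back-projected camera-space point, pose and size follow from a similarity transform between the two point sets. The sketch below uses the closed-form Umeyama solution in NumPy. It is a minimal illustration under assumptions (noise-free synthetic correspondences, no outlier rejection; the function name `umeyama_similarity` is hypothetical), not the implementation of any paper listed here.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                         # avoid returning a reflection
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)    # variance of the source points
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Synthetic check: canonical (NOCS-like) points mapped into the camera frame.
rng = np.random.default_rng(0)
nocs = rng.uniform(-0.5, 0.5, size=(200, 3))   # stand-in for predicted NOCS coordinates
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = Q * np.sign(np.linalg.det(Q))         # force det(R_true) = +1
s_true, t_true = 0.3, np.array([0.1, -0.2, 0.9])
cam = s_true * nocs @ R_true.T + t_true        # stand-in for back-projected depth points

s_est, R_est, t_est = umeyama_similarity(nocs, cam)
```

In a real pipeline the correspondences are noisy, so this solver is usually wrapped in a robust loop such as RANSAC; that wrapper is omitted here.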

RGB vs RGB-D: Methods using RGB-D exploit depth for correspondence or learning. For example, DenseFusion (CVPR’19) and PVN3D (CVPR’20) fuse image and point-cloud features for regression. In practice, RGB-D often yields higher accuracy but requires a depth sensor. Many new works provide both RGB and RGB-D variants or ablations. Several top entrants on BOP use RGB-D (e.g. FRTPose-WAPR, FoundationPose, FreeZE (bop.felk.cvut.cz)) while some are RGB-only for broader use (e.g. Pos3R, GIVEPose, ZS6D).
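Most RGB-D pipelines begin by turning the depth map into a camera-frame point cloud. A minimal sketch of that back-projection under a pinhole model follows; the function name and the intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative assumptions, and zero depth is treated as "no measurement", which is a common but not universal sensor convention.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Convert an (H, W) depth map into an (N, 3) camera-frame point cloud."""
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    valid = depth > 0                               # assumption: 0 marks missing depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Tiny example with made-up intrinsics (fx = fy = 1, principal point at the origin).
depth = np.array([[1.0, 0.0],
                  [2.0, 2.0]])
points = backproject_depth(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

The resulting points can then be fused with per-pixel image features or matched against a model or reference view.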

Real-time vs Accuracy: Some methods prioritize speed (e.g. single-shot detectors or keypoint regression) for >30 FPS, whereas others focus on maximal accuracy (e.g. multi-hypothesis refinement or deep render-and-compare), often sacrificing speed. For instance, CNN-based regression (CDPN, BB8) is faster but may lag refinement pipelines. Recent SOTA models aim for both: e.g. Nvidia’s FoundationPose achieves accuracy comparable to slow instance-specific methods (nvlabs.github.io) (bop.felk.cvut.cz), while FreeZE achieves AR~0.84 (bop.felk.cvut.cz) with reasonable runtime. Real-time methods (e.g. PVNet CVPR’19) are faster but below the new top accuracy.

Below we survey methodological paradigms and highlight exemplary papers (with arXiv refs, contributions, metrics, code if available).

Transformer-based Approaches

GIVEPose (Huang et al., arXiv:2503.15110) – Category-level, RGB-only. Introduces an Intra-class Variation-Free Consensus (IVFC) map to remove per-shape differences in classic NOCS-based regression, using a transformer/diffusion-style autoencoder to gradually “erase” instance-specific variation (blog.csdn.net). Reported results: significantly outperforms prior RGB-only methods on category-level datasets (NOCS & Wild6D) (blog.csdn.net). Code: github.com/ziqin-h/GIVEPose (CVPR’25).
RefPose (Kim et al., CVPR’25, arXiv:2505.10841) – Model-free, RGB-only. Uses a reference image and geometric correspondences to iteratively refine pose. It first renders the reference under an initial pose estimate to establish pixel correspondences; a correlation-attention network then refines the pose (render-and-compare loop). This leverages pre-rendered reference images as “anchors” without a CAD model. Results: State-of-the-art on BOP for unseen objects (AR superior to prior art) while keeping inference relatively fast (openaccess.thecvf.com). Code: (expected release).
ZS6D: Zero-Shot 6D Pose (Ausserlechner et al., ICRA’24, arXiv:2309.11986) – Novel-object (rendered templates), RGB-only. Uses pretrained Vision Transformers (ViT) as generic feature extractors for template matching. No pose supervision is used. Instead, the image is matched against rendered templates via off-the-shelf ViT descriptors, local 2D–3D correspondences are established from the best template, and the pose is solved by PnP+RANSAC. Performance: Exceeds two prior SOTA methods on LM-O, YCB-Video, and T-LESS in terms of average recall (paperswithcode.com), all without retraining. (No code yet; shows ViT as a “foundation” feature.)
DETR-like Models: Some recent works adapt set-prediction or transformer-decoder ideas. For example, TransPose (2023) adds a depth-refinement stage alongside a transformer, though code is not widely used. Co-op (Moon et al., CVPR’25) uses transformers with a correspondence network, taking CNOS-generated detection proposals as input, for novel instances. Its RGB-D version achieves BOP AR≈0.759 (bop.felk.cvut.cz).
OPFormer (Liu et al., ICCV’25) – Meta/diffusion explanation. OPFormer (ArXiv 2512.xxxx, not yet open access) is a transformer model that generates 2D-3D correspondences with latent autoregressive queries, refined by a score-based diffusion. It is among the latest, showing strong performance on novel object tasks (BOP).
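The template-matching step that several of the entries above rely on (for instance ZS6D's use of off-the-shelf ViT descriptors) reduces, at its simplest, to nearest-neighbour retrieval by cosine similarity. The sketch below illustrates only that retrieval step. The descriptors are random stand-ins rather than real ViT features, `best_template` is a hypothetical helper, and the subsequent correspondence and PnP stages are not shown.

```python
import numpy as np

def best_template(query_desc, template_descs):
    """Return (index, cosine similarity) of the template closest to the query."""
    q = query_desc / np.linalg.norm(query_desc)
    T = template_descs / np.linalg.norm(template_descs, axis=1, keepdims=True)
    scores = T @ q                       # cosine similarity to every template
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])

# Random stand-ins: 8 template descriptors of dimension 16, each tied to a rendered pose.
rng = np.random.default_rng(0)
templates = rng.normal(size=(8, 16))
query = templates[5] + 0.01 * rng.normal(size=16)   # noisy copy of template 5

idx, score = best_template(query, templates)
```

The pose attached to the retrieved template serves as a coarse estimate that later stages refine.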

Diffusion / Score-based Methods

6D-Diff (Xu et al., CVPR’24) – RGB-only. Formulates keypoint detection as a conditional diffusion process. The network learns to “denoise” 2D keypoint heatmaps from noise by conditioning on image features. They design a Mixture-of-Cauchy forward diffusion and train a reverse denoising network. Results: Improves robustness to occlusion/clutter; shows gains on LINEMOD-Occluded and YCB-Video over similar CNN baselines (openaccess.thecvf.com). (No public code yet.)
Confronting Ambiguity via Score Diffusion (Hsiao et al., CVPR’24) – RGB-only. Uses score-based diffusion on SE(3) to address pose ambiguity (e.g. symmetries). A score network represents probability on rotations, denoising with Langevin dynamics to recover pose. (Code: github.com/Ending2015a/liepose-diffusion). Reported to reduce pose ambiguity failures.
Zero123-6D (Di Felice et al., arXiv:2403.14279) – Category-level, RGB-only. Leverages a diffusion-based novel-view generator (Zero-1-to-3) to synthesize views of the object. The workflow: generate a coarse pose by “imagining” novel views of the category object, then refine by online optimization. This integrates recent diffusion models into pose initialization. Claim: Can do category-level 6D (no CAD) with much less data, and outperforms prior methods on CO3D (papers.cool).
RayPose (Huang et al., arXiv:2510.18521) – Model-free (unseen), RGB. Uses a diffusion on ray bundles to synthesize optimal “template views” for retrieval. Instead of pre-rendering many discrete templates, the diffusion generates likely ray configurations aligning image and CAD. This addresses standard template matching pipelines and improves retrieval of the correct pose. (Arxiv 10/2025.)
Category-Level Diffusion (Bethell et al., arXiv:2412.11621) – Category-level, RGB. Treats per-pixel pose as a diffusion problem over 6D (rotation & scale). Preliminary results show good accuracy on typical category datasets (NOCS). (Dec 2024, code TBD.)

Foundation / Large-scale Models

FoundationPose (Wen et al., CVPR’24, arXiv:2312.08344) – Unified model-based + model-free. A “foundation model” trained on massive synthetic data (augmented via LLM and diffusion) to do 6D pose on any object with or without a CAD model. It uses a neural implicit field to bridge the model-based and model-free settings: given a CAD model or reference images, it synthesizes novel RGB-D views and runs a transformer-based aligner. At test time, it needs only a CAD model or a few images. Performance: State-of-the-art on the novel-object BOP task (1st place on the BOP unseen leaderboard (nvlabs.github.io)) and nearly matches instance-level methods despite weaker assumptions. Code: github.com/NVlabs/FoundationPose (nvlabs.github.io).
VFM-6D (Chen et al., NeurIPS’24) – Category- and instance-level, RGB-only. Utilizes pre-trained vision and language foundation models in a two-stage pipeline: (1) category-level viewpoint estimation, (2) object coordinate (NOCS) estimation. Introduces a “2D-to-3D lift” and a “shape matching” module leveraging CLIP/Vision LM features to improve matching. Trained on synthetic data, it generalizes to unseen objects and novel categories, outperforming prior methods in experiments (openreview.net). (Poster; arXiv not yet public.)
Pos3R (Deng et al., CVPR’25) – Instance-level, RGB-only. Uses an RGB-to-NeRF approach at test time. Given an RGB image of an unknown object, Pos3R passes it to a pre-trained 3D reconstruction network (like a neural radiance field) to predict a coarse 3D model, which it then aligns to templates. By relying on a strong 3D “foundation” model, Pos3R no longer needs fine-tuning on poses. Claim: Competitive with SOTA on BOP without any pose training, and easily refines with render-and-compare for high precision (openaccess.thecvf.com). (CVPR’25, code TBD.)
Vision and Vision-Language Priors: Several recent works build on frozen pretrained encoders, e.g. FreeZE (Caraffa et al., ECCV’24) combines frozen geometric and vision foundation models for pose features without task-specific training; others use synthetic captions or text embeddings to aid pose. This thread is still emerging.

Dense Correspondence / Keypoint Regression

PVNet (Peng et al., CVPR’19) – Instance-level, RGB-only. Regresses a pixel-wise vector field that votes for the 2D projections of object keypoints, then solves PnP against the corresponding 3D keypoints. Still influential as a classic. (AR on LINEMOD ~70–80% with refinement, code available).
SurfEmb (Haugaard and Buch, CVPR’22) – Instance-level, RGB-only. Learns dense pixel-to-surface correspondence distributions via a continuous embedding on the object surface; pose is then obtained with PnP-based hypothesis sampling and refinement. Achieves high accuracy on BOP datasets; often used as a baseline.
GDR-Net (Wang et al., CVPR’21) – Instance-level, RGB-only. Directly regresses the 6D pose from dense geometric intermediate features through a learned Patch-PnP module, so the whole pipeline trains end to end. Code available.
Pix2Pose (Park et al., ICCV’19) – Instance-level, RGB-only. Predicts 3D coordinates for each object pixel and fits the pose with PnP and RANSAC. A foundational paper for dense coordinate regression from RGB.
CDPN (Li et al., ICCV’19) – Instance-level, RGB-only. Also regresses object coordinates, using a disentangled network that solves rotation from the coordinates and estimates translation separately.

These methods laid the groundwork: many modern approaches still regress keypoints or dense 3D coordinates and then recover the pose with PnP.
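The "correspondences, then PnP" step shared by these methods can be sketched in a few lines. The example below uses the textbook Direct Linear Transform (DLT) in NumPy, which stands in for the EPnP or RANSAC-PnP solvers such pipelines typically call. It assumes at least six non-coplanar, outlier-free correspondences and known intrinsics `K`; the function name `pnp_dlt` and all numeric values are illustrative.

```python
import numpy as np

def pnp_dlt(points_3d, points_2d, K):
    """Estimate R, t from 2D-3D correspondences with the Direct Linear Transform."""
    X = np.asarray(points_3d, dtype=float)
    uv = np.asarray(points_2d, dtype=float)
    n = len(X)
    uv_h = np.hstack([uv, np.ones((n, 1))])
    xy = (np.linalg.inv(K) @ uv_h.T).T[:, :2]    # normalized image coordinates
    X_h = np.hstack([X, np.ones((n, 1))])
    A = np.zeros((2 * n, 12))                    # two linear constraints per point
    A[0::2, 0:4] = X_h
    A[0::2, 8:12] = -xy[:, [0]] * X_h
    A[1::2, 4:8] = X_h
    A[1::2, 8:12] = -xy[:, [1]] * X_h
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)                     # null vector = [R | t] up to scale
    if np.linalg.det(P[:, :3]) < 0:
        P = -P                                   # resolve the sign ambiguity
    U, S, Vt_r = np.linalg.svd(P[:, :3])
    R = U @ Vt_r                                 # nearest rotation matrix
    t = P[:, 3] / S.mean()                       # undo the unknown scale
    return R, t

# Synthetic check with made-up intrinsics and a random ground-truth pose.
rng = np.random.default_rng(1)
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
model_pts = rng.uniform(-0.1, 0.1, size=(20, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = Q * np.sign(np.linalg.det(Q))
t_true = np.array([0.05, -0.02, 0.8])
cam_pts = model_pts @ R_true.T + t_true
proj = cam_pts @ K.T
pixels = proj[:, :2] / proj[:, 2:3]

R_est, t_est = pnp_dlt(model_pts, pixels, K)
```

With noisy or partly wrong correspondences, a robust wrapper (RANSAC) and a nonlinear refinement would normally follow; both are left out of this sketch.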

End-to-End Render-and-Compare Pipelines

CosyPose (Labbé et al., ECCV’20) – Instance-level, RGB-only. A widely used pipeline: detect objects via Mask R-CNN, estimate a coarse pose per detection, then refine it with a learned render-and-compare network; an optional multi-view stage matches per-view hypotheses across cameras for scene-level consistency. Outperforms many on YCB-Video (e.g. 0.9+ ADD accuracy). Official code available.
TOPO (Jiang et al., ECCV’20) – Instance-level, RGB-only. Uses continual pose refinement via rendering. Similarly strong.
MegaPose (Labbé et al., CoRL’22, arXiv:2212.06870) – Novel-instance, RGB-only. Trained on 2M synthetic images of more than 20k objects. At test time a coarse network scores rendered pose hypotheses and a refiner iterates render-and-compare; the object's CAD model is required at test time, but no retraining on the new object is needed. Demonstrated high accuracy on novel objects (e.g. LM-O) with only RGB (paperswithcode.com) (arxiv.org).
FreeZE (Caraffa et al., ECCV’24) – Novel-instance, RGB-D. A training-free method: it combines frozen geometric and vision foundation models to build discriminative 3D point descriptors, then estimates pose by registering the query against the object model. Evaluated on BOP (bop.felk.cvut.cz).
NeRF-Pose (Xu et al., ICCV’23) – Weakly-supervised. First reconstructs a neural radiance field of a novel object and then regresses the pose of that object in images, combining 3D reconstruction with pose estimation.

Multi-view and Tracking

Self6D (Wang et al., ECCV’20) – Instance-level, RGB-only at test time. Self-supervised: after training on synthetic renders of the CAD model, it adapts on unlabeled real RGB-D data by enforcing visual and geometric consistency through differentiable rendering.
CosyPose Multi-View – Builds multi-view consistency for cluttered scenes, used in YCB-V multi-object.
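DPOD (Zakharov et al., ICCV’19) / PoseRBPF (Deng et al., RSS’19) – DPOD is a dense-correspondence detector with a learned refiner; PoseRBPF tracks 6D pose across frames with a Rao-Blackwellized particle filter. Not recent, but still cited for tracking benchmarks.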
Siamese / Correspondence Tracking – Recent works (notably in tracking domain) apply deep features to track 6D poses across time; out of scope for static-image pose surveys.

Novel-Object (Model-Free) Pose Estimation

FS6D (He et al., CVPR’22, arXiv:2203.14628) – Novel-instance, RGB-D few-shot. Tackles the problem “given 3–5 support images of an unknown object, estimate its pose in a query.” Uses a transformer to match dense RGB-D “prototypes” between supports & query, plus meta-training on a new synthetic ShapeNet6D dataset. Benchmark: Outperforms baselines on modified LM-O and YCB formats. Code: github.com/ethnhe/FS6D-PyTorch (paperswithcode.com).
Any6D (Lee et al., arXiv:2503.18673) – Novel-instance, RGB-D one-shot. Given one RGB-D anchor view of an unknown object, it jointly estimates the object’s 6D pose and scale in a new scene. Uses a render-and-compare refinement and novel alignment loss. Results: On novel-object benchmarks (REAL275, HO3D, LM-O, etc.), it significantly beats previous SOTA (like MegaPose, FS6D, DOPE) (arxiv.org). Project page: https://taeyeop.com/any6d.
UNOPose (Liu et al., CVPR’25) – Novel-instance, RGB-D one-shot. The user provides one “unposed” RGB-D view of the object (not aligned or canonical). UNOPose learns an SE(3)-invariant frame from the reference and query, then re-weights correspondences based on overlap likelihood. Claim: On a new benchmark based on BOP, UNOPose far outperforms classic (ICP) and deep matching methods in the one-reference setting, and approaches CAD-based accuracy (openaccess.thecvf.com). Code: github.com/shanice-l/UNOPose.
OPFormer (Kim et al., ICCV’25) – Novel-instance, RGB one-shot. (Preprint) Uses a transformer that splits query features into “object prototypes” via vector quantized VAE, aligns them to a reference and decodes pose. Also includes a learned diffusion step to sample pose hypotheses. Preliminary results on ShapeNet and real data show strong few-shot generalization.
One2Any (Liu et al., CVPR’25, arXiv:2505.04109) – One-shot RGB (ETH/ZGD). Proposes using a single reference image of a new object (“one to any”) for pose. Combines a VQVAE codebook to represent shape and a coarse PnP with a spherical coordinate regression. Early results show good generalization to arbitrary objects (paper appeared on arXiv in May 2025). Code likely coming.
Co-op (Moon et al., CVPR’25) – Novel-instance, RGB and RGB-D. The name is short for correspondence-based novel object pose estimation; CNOS is a separate CAD-based novel-object segmentation method that supplies the detections. Co-op estimates dense correspondences between the query and rendered templates, then solves for the pose. It achieves top results on unseen-object BOP (RGB-D) – e.g. AR≈0.736 with one hypothesis (bop.felk.cvut.cz). This approach couples matching and pose solving in a transformer-based pipeline.
DVMNet (Zhao et al., CVPR’24) – Relative pose (category-level), RGB-only. Not exactly object-global 6D, but estimates the relative pose between two images of the same unseen object by voxelizing features and solving alignment in one shot (openaccess.thecvf.com). Demoed on CO3D-like datasets.

Key Benchmarks and Leaderboards

BOP (BOP Challenge 2018) – Consolidates many instance-level datasets. See BOP leaderboard for live results. For model-based unseen (Classic Core), top AR ≈0.84 (RGB-D) by FRTPose-WAPR (bop.felk.cvut.cz). Instance-only (with CAD models) tasks have other leaderboards (e.g. BOP 2023 results).
YCB-Video (Xiang et al., RSS’18, introduced with PoseCNN) – 21 household objects, used for many RGB-only pipelines. State-of-the-art segmentation AP ~0.70 (bop.felk.cvut.cz). For 6D pose, methods such as CosyPose and DPOD usually report ADD(-S) accuracy per object.
LM and LM-O – Classic benchmarks; LINEMOD is almost solved by recent methods (many >99% ADD), while the occluded version remains harder (~70%). Used in leaderboards.
T-LESS / ITODD / IC-BIN – Texture-less industrial parts (T-LESS), industrial objects captured in grayscale (ITODD), and bin-picking scenes (IC-BIN). Methods robust to missing texture and symmetry are needed. Multistage render/ICP tends to do well.
NOCS (CVPR’19) – Category pose and size (bottle, bowl, etc.) – methods report 5°/5cm accuracy or IoU metrics. For example, NOCS networks reach ~80–90% on synthetic test and ~40–50% on real.
Wild6D (Fu et al., NeurIPS’22) – category-level RGB-D videos in the wild (bottles, mugs, etc.). Only few methods (RePoNet) exist; fresh benchmark demonstrating generalization.
CAPS (ECCV’24) – Category-level articulated parts pose, for vehicles (CVPR24 CAP-Net).
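The ADD(-S) and 5°/5cm metrics mentioned in the list above are straightforward to compute. The following NumPy sketch shows one way to do so, assuming model points and translations expressed in metres; the function names are illustrative and not taken from any benchmark toolkit. A common convention, assumed here rather than stated in this report, counts a pose as correct when ADD is below 10% of the object diameter.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) array of model points."""
    return points @ R.T + t

def add_error(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding model points under both poses."""
    diff = transform(points, R_gt, t_gt) - transform(points, R_est, t_est)
    return float(np.linalg.norm(diff, axis=1).mean())

def adds_error(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance to the closest point, used for symmetric objects."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)  # (N, N) distances
    return float(d.min(axis=1).mean())

def rotation_error_deg(R_gt, R_est):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def within_5deg_5cm(R_gt, t_gt, R_est, t_est):
    """Category-level style check: rotation within 5 degrees, translation within 5 cm."""
    rot_ok = rotation_error_deg(R_gt, R_est) <= 5.0
    trans_ok = np.linalg.norm(np.asarray(t_gt) - np.asarray(t_est)) <= 0.05
    return bool(rot_ok and trans_ok)

# A point set that is symmetric under a 180-degree rotation about the z axis.
pts = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
I, zero = np.eye(3), np.zeros(3)
R_flip = np.diag([-1.0, -1.0, 1.0])

add_val = add_error(pts, I, zero, R_flip, zero)
adds_val = adds_error(pts, I, zero, R_flip, zero)
```

For this symmetric point set, the flipped pose has a large ADD but zero ADD-S, which is why symmetric objects are scored with the closest-point variant.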

Discussion & Open Problems

Recent trends emphasize generalization and learning flexibility: foundation/diffusion models, transformers, self-supervision, and few-shot references. The field is moving beyond small closed sets to “any object” scenarios. However, challenges remain: pose ambiguity (symmetry, occlusion), fast inference, and the sim-to-real domain gap. Specifically:

Symmetric/Ambiguous Objects: Methods like score-based diffusion on SE(3) or multi-hypothesis (GenFlow) try to capture multiple plausible poses (openaccess.thecvf.com). End-to-end schemes to handle ambiguity remain an open area.
Domain Gap: Many SOTA methods rely on synthetic training data (or LLM/diffusion-augmented rendering) (nvlabs.github.io) (blog.csdn.net). Real-world domain shift (lighting, clutter) still degrades performance. Semi-supervised learning (RePoNet (oasisyang.github.io)) and test-time adaptation (TTA-COPE, CVPR’23) are being explored.
Novel Objects: Even with advances like FoundationPose or Any6D, estimating pose of truly arbitrary unseen objects (with unknown shape and no reference) is far from solved. Combining monocular depth or multi-view cues may be needed.
Category-level vs Instance-level: Category pose estimation (new non-rigid classes) lags behind instance-level in accuracy. Representations like NOCS/IVFC (blog.csdn.net) help, but category tasks (especially for deformables or articulated parts) are open.
Foundation Models: Work like VFM-6D (openreview.net) and vision-language nets for pose is very new. Harnessing CLIP-like image-text priors for 3D reasoning is nascent. Likewise, large 3D generative models (e.g. Neural Radiance Fields) offer promise for verification.
Real-time on low-power: Many SOTA (e.g. diffusion, large transformers) are slow. Lightweight pipelines (e.g. YOLO-based 6D regression, or efficient branches in PVN3D) are still valuable for robotics. Balancing speed vs accuracy will remain a theme.

In summary, the current best approaches often combine multiple cues: deep features plus geometric reasoning. For instance-level tasks, methods like CosyPose (refinement) and FreeZE/FRTPose (foundation-model-backed) hold the highest accuracy. For novel objects, unified systems (FoundationPose (nvlabs.github.io), Any6D (arxiv.org), UNOPose (openaccess.thecvf.com), RefPose (openaccess.thecvf.com)) are pushing boundaries. Diffusion and transformers are newly effective on keypoint matching and ambiguity (openaccess.thecvf.com). Most of these works share code via GitHub (FoundationPose (nvlabs.github.io), FS6D (paperswithcode.com); Any6D, UNOPose, and RefPose have project pages or repos). We encourage readers to consult the corresponding arXiv and project pages for implementation details. Overall, integration of large pretrained models, few-shot learning, and creative inference (diffusion, render-and-compare, neural fields) is the state-of-the-art trend. The field’s open problems center on robust generalization to all objects and conditions, without sacrificing speed or accuracy.

Key references (ArXiv): FoundationPose (arXiv:2312.08344) (nvlabs.github.io); GIVEPose (arXiv:2503.15110) (blog.csdn.net); RefPose (CVPR’25, arXiv:2505.10841) (openaccess.thecvf.com); Any6D (arXiv:2503.18673) (arxiv.org); UNOPose (CVPR’25) (openaccess.thecvf.com); 6D-Diff (arXiv/CVPR’24) (openaccess.thecvf.com); Zero123-6D (arXiv:2403.14279) (papers.cool); ZS6D (arXiv:2309.11986) (paperswithcode.com); VFM-6D (NeurIPS’24); Pos3R (CVPR’25) (openaccess.thecvf.com).

---

Sources