Deep Research Report

Had's Links

---

Query: Comprehensive survey of 6D pose estimation from RGB/RGBD images (2022-2026). Focus on: (1) State of the art methods - what are the current best approaches? Transformer-based, diffusion-based, foundation model approaches? (2) Latest papers from 2025-2026 pushing the frontier. (3) Highly cited foundational papers everyone builds on. (4) Key benchmarks and datasets (BOP, YCB-Video, LM-O, T-LESS, etc.) and current leaderboard standings. (5) Instance-level vs category-level pose estimation. (6) RGB-only vs RGBD methods. (7) Novel object pose estimation (objects not seen during training). (8) Real-time methods vs accuracy-focused methods. For each paper: arxiv ID, key contribution, reported metrics on standard benchmarks, code availability.

Model: o4-mini-deep-research
Date: 2026-03-25
Searches performed: 36
Sources cited: 28

---

Overview of 6D Object Pose Estimation (2022–2026)

Estimating the 6D pose (3D position + 3D orientation) of objects from images remains crucial for robotics and AR. Modern methods can be grouped by their core methodology (deep regression, correspondence, transformer, diffusion, and so on), data modality (RGB vs RGB-D), and problem setting (instance- vs category-level, known vs novel objects, speed vs accuracy). We summarize the state-of-the-art (SOTA) approaches and trends in each category, highlighting recent (2024–26) advances, foundational works, benchmarks/datasets, and leaderboards. For each key paper we give its arXiv ID (or publication venue), main contribution, benchmark performance, and code availability (when known).

Key benchmarks/datasets: Popular datasets include LINEMOD (LM), LM-Occluded (LM-O), YCB-Video (YCB-V), T-LESS, IC-BIN, and ITODD, among others, which form the BOP (Benchmark for 6D Object Pose Estimation) suite (bop.felk.cvut.cz). BOP “Classic” divides tasks into RGB(-D) instance-level pose for known objects (via known CAD models) and novel-object pose (model-free, unseen objects) (bop.felk.cvut.cz). Category-level pose uses datasets like NOCS (Normalized Object Coordinate Space), whose CAMERA split composites synthetic 3D models into real scenes (paperswithcode.com), and more recent “in-the-wild” video sets (e.g. Wild6D (oasisyang.github.io)). Leaderboards track metrics such as Average Recall (AR), the fraction of correctly estimated poses averaged over a set of standard error thresholds. For example, the top AR on the BOP Classic Core (model-based unseen) is ~0.84 (RGB-D) by FRTPose-WAPR (CVPR’23) (bop.felk.cvut.cz). Nvidia’s FoundationPose (CVPR’24) reports AR≈0.734 on BOP Core (bop.felk.cvut.cz). For YCB-Video segmentation, SOTA AP reaches ~0.70 (bop.felk.cvut.cz).
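To make the Average Recall figures above more concrete, here is a minimal sketch of a recall averaged over several error thresholds. It assumes a per-instance pose error has already been computed by some error function. The function names, the example errors, and the threshold values are illustrative assumptions, not the BOP toolkit's API; the official BOP score additionally averages over several pose-error functions.

```python
from typing import Sequence


def recall_at(errors: Sequence[float], threshold: float) -> float:
    """Fraction of object instances whose pose error is below the threshold.

    A missed detection can be encoded as float("inf") so it never counts.
    """
    if not errors:
        return 0.0
    return sum(e < threshold for e in errors) / len(errors)


def average_recall(errors: Sequence[float], thresholds: Sequence[float]) -> float:
    """Mean of the recall rates obtained at each correctness threshold."""
    if not thresholds:
        raise ValueError("at least one threshold is required")
    return sum(recall_at(errors, t) for t in thresholds) / len(thresholds)


# Hypothetical per-instance errors (e.g. relative to the object diameter).
errors = [0.02, 0.08, 0.30, 0.12]
# Hypothetical thresholds; real benchmarks define their own ranges.
thresholds = [0.05, 0.10, 0.15, 0.20]

print(average_recall(errors, thresholds))  # 0.5625
```

Here the recalls at the four thresholds are 0.25, 0.5, 0.75, and 0.75, so the average is 0.5625. A single number like this rewards methods that are accurate at tight thresholds as well as loose ones.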

Instance-level vs Category-level: Instance-level methods assume the specific 3D model of each object is known or can be retrieved. Category-level methods handle unseen instances within a known class (e.g. any mug, not a specific mug). Category-level methods often estimate a Normalized Object Coordinate Space (NOCS) map or a similar representation to lift 2D image pixels to 3D shape before solving for pose (paperswithcode.com). Examples: NOCS (CVPR’19) introduced this approach for categories such as bottle and mug, enabling category-level pose and size estimation. Wild6D (NeurIPS’22) collects in-the-wild RGB-D videos to train category-level models with semi-supervision (oasisyang.github.io). Novel-object or model-free settings (no CAD model, few or no reference images) are hot topics: e.g. Few-shot 6D (FS6D, CVPR’22) uses 1–5 support views of a new object to generalize (paperswithcode.com); MegaPose (CoRL’22) trains on thousands of ShapeNet objects for novel-instance pose via a “render & compare” loop (proceedings.mlr.press). More recent works (Any6D, UNOPose, RefPose, OPFormer, etc.) explicitly handle novel-object scenarios (see below).
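The normalization behind a NOCS-style representation can be sketched as follows. This is a minimal illustration, assuming object points are already expressed in a canonical, category-aligned frame: the points are centred and uniformly scaled so that the tight bounding-box diagonal has length 1 and the object sits in the middle of the unit cube. The function name `to_nocs` and the example points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def to_nocs(points: Sequence[Point]) -> Tuple[List[Point], float]:
    """Map canonical-frame points into the unit cube.

    Returns the normalized points and the scale (bounding-box diagonal),
    which is what lets a method recover metric object size afterwards.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    diagonal = math.dist(mins, maxs)
    if diagonal == 0.0:
        raise ValueError("degenerate object: all points coincide")
    normalized = [
        tuple((p[i] - center[i]) / diagonal + 0.5 for i in range(3))
        for p in points
    ]
    return normalized, diagonal


# Hypothetical object spanning a 2 x 2 x 1 box (diagonal = 3).
nocs_points, scale = to_nocs([(0.0, 0.0, 0.0), (2.0, 2.0, 1.0), (2.0, 0.0, 1.0)])
print(scale)        # 3.0
print(nocs_points)  # every coordinate lies in [0, 1]
```

A network that predicts such coordinates per pixel yields 2D-to-3D correspondences; pose and size then follow from aligning the predicted coordinates with observed depth or from a PnP-style solver.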

RGB vs RGB-D: Methods using RGB-D exploit depth for correspondence or learning. For example, DenseFusion (CVPR’19) and PVN3D (CVPR’20) fuse image and point-cloud features for regression. In practice, RGB-D often yields higher accuracy but requires a depth sensor. Many new works provide both RGB and RGB-D variants or ablations. Several top entrants on BOP use RGB-D (e.g. FRTPose-WAPR, FoundationPose, FreeZE (bop.felk.cvut.cz)), while some are RGB-only for broader applicability (e.g. Pos3R, GIVEPose, ZS6D).
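As a rough illustration of what depth adds, the sketch below back-projects a depth map into a 3D point cloud under a standard pinhole camera model. It assumes depth is given in metres with 0 marking invalid pixels; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the tiny depth map are made-up values, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_to_points(depth_map: Sequence[Sequence[float]],
                    fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Convert a row-major depth map to a point cloud, skipping invalid pixels."""
    points = []
    for v, row in enumerate(depth_map):
        for u, z in enumerate(row):
            if z > 0.0:  # assumption: 0 encodes a missing depth reading
                points.append(backproject(u, v, z, fx, fy, cx, cy))
    return points


# Toy 2 x 2 depth map with two valid pixels and hypothetical intrinsics.
depth_map = [[0.0, 1.0],
             [2.0, 0.0]]
points = depth_to_points(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(points)  # [(0.25, -0.25, 1.0), (-0.5, 0.5, 2.0)]
```

RGB-D methods typically consume such a point cloud alongside image features, which gives them metric scale directly. RGB-only methods have to infer scale from the known model or from learned priors.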

Real-time vs Accuracy: Some methods prioritize speed (e.g. single-shot detectors or keypoint regression) to run at >30 FPS, whereas others focus on maximal accuracy (e.g. multi-hypothesis refinement or deep render-and-compare), often sacrificing speed. For instance, CNN-based regression (CDPN, BB8) is faster but may lag refinement pipelines in accuracy. Recent SOTA models aim for both: e.g. Nvidia’s FoundationPose achieves accuracy comparable to slow instance-specific methods (nvlabs.github.io) (bop.felk.cvut.cz), while FreeZE achieves AR~0.84 (bop.felk.cvut.cz) with reasonable runtime. Real-time methods (e.g. PVNet, CVPR’19) are faster but fall below the current top accuracy.

Below we survey methodological paradigms and highlight exemplary papers (with arXiv refs, contributions, metrics, code if available).

Transformer-based Approaches

Diffusion / Score-based Methods

Foundation / Large-scale Models

Dense Correspondence / Keypoint Regression

These methods laid the groundwork: many modern approaches still regress 3D keypoints or dense object coordinates and then recover the pose with a Perspective-n-Point (PnP) solver.
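The quantity a PnP step works with can be sketched in a few lines. The example below is not a PnP solver; it only evaluates the mean reprojection error of a candidate pose, which is the objective such solvers typically minimize (in practice one would call an existing solver such as OpenCV's `solvePnP`). The keypoints, intrinsics, and function names here are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]
Mat3 = Sequence[Sequence[float]]


def transform(R: Mat3, t: Vec3, p: Vec3) -> Vec3:
    """Apply the rigid transform (R, t) to a model-frame point."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p_cam: Vec3, fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0.0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(R: Mat3, t: Vec3,
                            keypoints_3d: Sequence[Vec3],
                            keypoints_2d: Sequence[Vec2],
                            fx: float, fy: float, cx: float, cy: float) -> float:
    """Average pixel distance between projected model keypoints and detections."""
    total = 0.0
    for p3, p2 in zip(keypoints_3d, keypoints_2d):
        total += math.dist(project(transform(R, t, p3), fx, fy, cx, cy), p2)
    return total / len(keypoints_3d)


# Hypothetical setup: three model keypoints, identity rotation, object 2 m away.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model_kps = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
true_t = (0.0, 0.0, 2.0)
detected = [project(transform(identity, true_t, p), **K) for p in model_kps]

err_true = mean_reprojection_error(identity, true_t, model_kps, detected, **K)
err_shifted = mean_reprojection_error(identity, (0.02, 0.0, 2.0), model_kps, detected, **K)
print(err_true, err_shifted)  # zero for the true pose, several pixels when shifted
```

Keypoint- and coordinate-regression networks supply the 2D-3D correspondences. The pose that minimizes this error, usually inside a RANSAC loop for robustness to outliers, is the final estimate.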

End-to-End Render-and-Compare Pipelines

Multi-view and Tracking

Novel-Object (Model-Free) Pose Estimation

Key Benchmarks and Leaderboards

Discussion & Open Problems

Recent trends emphasize generalization and flexible learning: foundation and diffusion models, transformers, self-supervision, and few-shot references. The field is moving beyond small closed sets of objects to “any object” scenarios. However, challenges remain: pose ambiguity (symmetry, occlusion), fast inference, and the sim-to-real domain gap.
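The symmetry ambiguity mentioned here can be illustrated with a small symmetry-aware error, in the spirit of the maximum-surface-distance style of metric used in benchmarks. The sketch takes the maximum point distance between estimated and ground-truth poses, minimized over a list of known symmetry transforms of the object. The toy object, its 180° symmetry about the z-axis, and all function names are assumptions made for the illustration.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]
Pose = Tuple[Mat3, Vec3]  # (rotation matrix, translation)


def apply(pose: Pose, p: Vec3) -> Vec3:
    """Apply a rigid transform to a point."""
    R, t = pose
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def symmetry_aware_error(est: Pose, gt: Pose, points: Sequence[Vec3],
                         symmetries: Sequence[Pose]) -> float:
    """Min over symmetry transforms of the max distance between posed points."""
    best = math.inf
    for sym in symmetries:
        worst = max(
            math.dist(apply(est, p), apply(gt, apply(sym, p))) for p in points
        )
        best = min(best, worst)
    return best


ZERO = (0.0, 0.0, 0.0)
IDENTITY: Pose = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], ZERO)
ROT_Z_180: Pose = ([[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]], ZERO)

# Toy object that looks identical after a 180-degree turn about z.
points = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, -0.5, 0.0)]

# The estimate is "wrong" by 180 degrees, which is visually indistinguishable.
naive = symmetry_aware_error(ROT_Z_180, IDENTITY, points, [IDENTITY])
aware = symmetry_aware_error(ROT_Z_180, IDENTITY, points, [IDENTITY, ROT_Z_180])
print(naive, aware)  # 2.0 0.0
```

Without the symmetry set, a visually perfect estimate is penalized as if it were far off. Handling this at evaluation time is comparatively easy; making the network's training target well-defined under symmetry and occlusion is the harder open problem.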

In summary, the current best approaches often combine multiple cues: deep features plus geometric reasoning. For instance-level tasks, methods like CosyPose (refinement) and FreeZE/FRTPose (CLIP-backed) hold the highest accuracy. For novel objects, unified systems (FoundationPose (nvlabs.github.io), Any6D (arxiv.org), UNOPose (openaccess.thecvf.com), RefPose (openaccess.thecvf.com)) are pushing boundaries. Diffusion models and transformers have recently proven effective for keypoint matching and for handling pose ambiguity (openaccess.thecvf.com) (openaccess.thecvf.com). Most of these works share code: FoundationPose (nvlabs.github.io) and FS6D (paperswithcode.com) link to GitHub repositories, and Any6D, UNOPose, and RefPose have project pages or repos. We encourage readers to consult the corresponding arXiv and project pages for implementation details. Overall, the state-of-the-art trend is the integration of large pretrained models, few-shot learning, and creative inference (diffusion, render-and-compare, neural fields). The field’s open problems center on robust generalization to all objects and conditions without sacrificing speed or accuracy.

Key references (arXiv): FoundationPose (arXiv:2312.08344) (nvlabs.github.io); GIVEPose (arXiv:2503.15110) (blog.csdn.net); RefPose (CVPR’25, arXiv:2505.10841) (openaccess.thecvf.com); Any6D (arXiv:2503.18673) (arxiv.org); UNOPose (CVPR’25) (openaccess.thecvf.com); 6D-Diff (arXiv/CVPR’24) (openaccess.thecvf.com); Zero123-6D (arXiv:2403.14279) (papers.cool); ZS6D (arXiv:2309.11986) (paperswithcode.com); VFM-6D (NeurIPS’24); Pos3R (CVPR’25) (openaccess.thecvf.com).

---

Sources

1. BOP: Benchmark for 6D Object Pose Estimation
2. Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation | Papers With Code
3. Category-Level 6D Object Pose Estimation in the Wild: A Semi-Supervised Learning Approach and A New Dataset
4. BOP: Benchmark for 6D Object Pose Estimation
5. BOP: Benchmark for 6D Object Pose Estimation
6. BOP: Benchmark for 6D Object Pose Estimation
7. FS6D: Few-Shot 6D Pose Estimation of Novel Objects | Papers With Code
8. MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare
9. BOP: Benchmark for 6D Object Pose Estimation
10. https://nvlabs.github.io/FoundationPose/#:~:text=We%20present%20FoundationPose%2C%20a%20unified,Extensive%20evaluation%20on%20multiple%20public
11. [CVPR 2025] Computer Vision | GIVEPose: New SOTA for RGB Pose Estimation, Beating LaPose! | CSDN Blog
12. [CVPR 2025] Computer Vision | GIVEPose: New SOTA for RGB Pose Estimation, Beating LaPose! | CSDN Blog
13. CVPR 2025 Open Access Repository
14. ZS6D: Zero-shot 6D Object Pose Estimation using Vision Transformers | Papers With Code
15. BOP: Benchmark for 6D Object Pose Estimation
16. CVPR 2024 Open Access Repository
17. Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation | Cool Papers - Immersive Paper Discovery
18. https://nvlabs.github.io/FoundationPose/#:~:text=BOP%20Leaderboard
19. https://nvlabs.github.io/FoundationPose/#:~:text=Paper%20Code
20. Vision Foundation Model Enables Generalizable Object Pose Estimation | OpenReview
21. CVPR 2025 Open Access Repository
22. ZS6D: Zero-shot 6D Object Pose Estimation using Vision Transformers | Papers With Code
23. Any6D: Model-free 6D Pose Estimation of Novel Objects
24. FS6D: Few-Shot 6D Pose Estimation of Novel Objects | Papers With Code
25. CVPR 2025 Open Access Repository
26. BOP: Benchmark for 6D Object Pose Estimation
27. CVPR 2024 Open Access Repository
28. FS6D: Few-Shot 6D Pose Estimation of Novel Objects | Papers With Code