Deep Research Report

Query: Comprehensive survey of assembly task research in AI/ML from 2022-2026. Cover: (1) Assembly datasets - furniture assembly, mechanical assembly, industrial assembly, instruction-following datasets, video datasets showing assembly processes. (2) Video models for assembly - action recognition, step detection, procedure learning, video transformers applied to assembly. (3) Foundation models for assembly - LLMs/VLMs for assembly planning, multimodal models that understand assembly instructions, robot learning from assembly demonstrations. (4) Arxiv papers on assembly sequence planning, assembly state estimation, part detection for assembly. Include dataset names, sizes, download links, benchmark results, and code repositories.
Model: o4-mini-deep-research
Date: 2026-03-25
Searches performed: 68
Sources cited: 30

---

Assembly Datasets

Assembly101 (Sener et al., 2022) (arxiv.org): A multi-view video dataset of 4,321 human assembly/disassembly recordings (toy vehicles) with 100K+ coarse and 1M+ fine-grained action segments and 18M 3D hand-pose annotations. Videos are captured from 8 static cameras + 4 egocentric views. Benchmarks include action recognition, anticipation, temporal segmentation, and a novel mistake-detection task (arxiv.org). (Dataset details: ~100 hrs, no public download link found; see paper).
AssemblyHands (Ohkawa et al., 2023) (arxiv.org): A large-scale egocentric hand-pose dataset built on Assembly101. It provides 3.0M images (including 490K egocentric) with accurate 3D hand keypoints (∼4.2 mm error) via a multi-view fusion pipeline (arxiv.org). This is (as of 2023) the largest benchmark for egocentric 3D hand-pose estimation in assembly videos. (No direct link – see arXiv).
ATTACH (Aganian et al., 2023) (arxiv.org): 51.6 hours of two-handed industrial assembly video (RGB-D) with 95.2K fine-grained action annotations (each hand labeled separately) captured by 3 cameras. Actions often overlap (68% simultaneous) to reflect realistic assembly; it includes Azure Kinect skeletons. Baselines on action recognition/detection (video and skeleton) are reported. Dataset: TU Ilmenau ATTACH Page (arxiv.org).
MECCANO (Ragusa et al., 2022) (arxiv.org): A multimodal egocentric industrial dataset (wearable camera + gaze + depth) for assembly-like tasks. Contains annotated first-person videos (with gaze and depth) of people assembling objects; labeled for action recognition, active-object detection, human-object interaction, anticipation, etc. Dataset: MECCANO Project Page (arxiv.org).
FurnitureBench (Heo et al., 2023) (arxiv.org): A real-world furniture-assembly benchmark with 200+ hours and 5,000+ human or robot demonstrations. Provides 3D-printable furniture models, an easy-to-reproduce rig, and a realistic simulator (“FurnitureSim”), targeting long-horizon manipulation challenges. (Reproducibility focus; see arXiv summary (arxiv.org) for dataset stats.)
AssembleRL (Aslan et al., 2022) (arxiv.org): A simulator-based furniture-assembly dataset (point-cloud inputs) for learning assembly policies with minimal supervision. Contains simulation episodes of assembling IKEA-like furniture from raw 3D scans. (Dataset details in paper; code may be available on authors’ sites.)
REASSEMBLE (Sliwowski et al., 2025) (arxiv.org): A multi-modal dataset focused on contact-rich robotic assembly/disassembly. Built on the NIST Assembly Task Board, it includes 4 action types (pick, insert, remove, place) over 17 objects to capture complex physical dynamics. (Data generation code is public via project page – see paper (arxiv.org).)
2BY2 (Two-by-Two) (Yu et al., 2025) (arxiv.org): A large-scale 3D assembly dataset of everyday paired objects. Contains 1,034 instances of 517 object-pair assemblies (e.g. pairings of common appliances/toys) with annotated SE(3) poses and part symmetries. Designed to promote generalizable assembly pose estimation. (See arXiv for details (arxiv.org); code not linked.)
IKEA-Manual (Wang et al., 2023) (arxiv.org): A 3D-object+instruction dataset linking IKEA furniture to manuals. Contains 102 IKEA objects with annotated assembly parts, assembly graphs, image manuals and fine segmentation to relate 3D parts with 2D instruction diagrams (arxiv.org). Supports tasks like assembly-plan generation, part segmentation, 6D pose estimation, and 3D assembly from a single image. (Dataset not directly linked – see paper for details.)
IKEA Video Manuals (“Manuals at Work”) (Liu et al., 2024) (arxiv.org): A 4D grounding dataset that aligns IKEA assembly instructions with real-world videos. It provides 3D furniture models, step-by-step manuals, and Internet-sourced assembly videos with dense spatiotemporal alignment annotations (arxiv.org). Enables instruction grounding and tracking of parts over time. (See arXiv for tasks (arxiv.org).)
SCANet (LEGO-ECA Dataset) (Wan et al., 2024) (arxiv.org): A single-step assembly error-correction dataset. LEGO-ECA contains manual images for each assembly step of LEGO models, including examples of misassemblies (arxiv.org). Accompanying code and data are available on the project site: SCANet IROS 2024 (arxiv.org). (Used to train models that detect and correct assembly errors.)
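
Statistics such as the share of simultaneous actions reported for ATTACH (68%) can be computed from any interval-style annotation list. The following is a minimal sketch, assuming a hypothetical annotation format of (start, end) pairs in seconds; it is not the ATTACH file format or the authors' tooling, and the example segments are invented for illustration.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_seconds, end_seconds) per action segment.
Segment = Tuple[float, float]


def overlap_fraction(segments: List[Segment]) -> float:
    """Return the fraction of segments that overlap in time with at least one other."""
    if not segments:
        return 0.0
    overlapping = 0
    for i, (s1, e1) in enumerate(segments):
        for j, (s2, e2) in enumerate(segments):
            # Two half-open intervals overlap when each starts before the other ends.
            if i != j and s1 < e2 and s2 < e1:
                overlapping += 1
                break
    return overlapping / len(segments)


# Invented example: one list of segments per hand.
left_hand = [(0.0, 2.0), (3.0, 5.0)]
right_hand = [(1.0, 2.5), (6.0, 7.0)]
print(overlap_fraction(left_hand + right_hand))
```

The quadratic loop is fine for a sketch; a sweep over sorted endpoints would be the usual choice for tens of thousands of annotations.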

Video Models for Assembly

3D CNNs and Two-Stream Networks: Traditional video models such as C3D, I3D, two-stream CNNs (RGB + optical flow), and SlowFast (Feichtenhofer et al.) have served as baselines for assembly action recognition. These capture spatio-temporal motion cues around hands and parts.
Vision Transformers: Architectures such as TimeSformer and Video Swin apply self-attention over space-time. For example, the Video Swin Transformer (Liu et al., 2021) achieved strong action-recognition accuracy using shifted-window attention (arxiv.org). Girdhar et al. (2018) proposed an Action Transformer network that attends to human hands and faces for fine-grained actions, improving recognition on complex tasks (arxiv.org). These models can be adapted to assembly videos to capture long-range action dependencies.
Skeleton/Graph Models: Skeletal representations from pose estimation (e.g. Azure Kinect) have been used with graph-convolutional networks (GCNs). For assembly, Aganian et al. (2023) showed that adding object-centered “virtual joints” to a skeleton GCN significantly boosts accuracy on the IKEA ASM assembly dataset (arxiv.org). These graph-based models (ST-GCN / DGCNN variants) encode human-object interactions in assembly.
Hand-Focused Models: Since hands and tools are key to assembly, specialized models emphasize hand regions. Myers et al. (2022) proposed a high-resolution, hand-location-guided CNN for action segmentation in fine-grained assembly videos (arxiv.org). This network combines wide-field features with high-resolution hand crops to accurately segment rapid assembly steps.
Temporal Segmentation Models: Techniques like MS-TCN, ASFormer or Seq2Seq LSTMs are applied to detect the boundaries of assembly steps. Some works (e.g. Aganian et al.) report baselines obtained by fine-tuning I3D or SlowFast on assembly data. Collectively, these video models support tasks such as action recognition, step detection, and mistake localization in assembly demonstrations.
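
Temporal segmentation models typically emit one label per frame, and step boundaries are then read off where the label changes. The following is a minimal sketch of that post-processing step, assuming hypothetical per-frame labels; the function name and label vocabulary are illustrative and do not come from any of the cited models.

```python
from itertools import groupby
from typing import List, Tuple


def frames_to_segments(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse per-frame labels into (label, start_frame, end_frame_exclusive) runs."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Invented per-frame predictions for a short clip.
frame_labels = ["pick", "pick", "insert", "insert", "insert", "screw"]
segments = frames_to_segments(frame_labels)

# A step boundary is the first frame of every segment after the first one.
boundaries = [start for _, start, _ in segments[1:]]
print(segments)
print(boundaries)
```

Working on segments rather than raw frames is also what segment-level metrics (for example edit score or segmental F1) operate on.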

Foundation Models for Assembly

Manual2Skill (Tie et al., 2025) (arxiv.org): A vision-language framework that reads IKEA instruction manuals to build hierarchical assembly graphs. It uses a VLM to extract structured info (parts, subassemblies) from diagrams (arxiv.org), then applies 6D pose estimation for each step. A motion planner executes the generated sequence. Manual2Skill demonstrates robots assembling real IKEA parts purely from manual images. (Code/URL: see paper.)
Manual2Skill++ (Tie et al., 2025) (arxiv.org): Extends Manual2Skill by treating connectors (screws, pegs) as first-class elements. A large-scale vision-language model parses symbolic diagrams and annotations in manuals, extracting connector type, count, and placement (arxiv.org). The system represents assemblies as graphs with explicit connection edges, enabling general assembly across many connector types.
Gemini Robotics (1.5/ER 1.5) (DeepMind 2025) (www.livescience.com): A pair of robot models built on Google’s Gemini. Gemini Robotics 1.5 and ER 1.5 are tuned for physical tasks requiring long-horizon reasoning. Google reports that these models enable multi-step manipulation (e.g. sorting objects by color) and more general “physical AI” planning beyond fixed scripts (www.livescience.com). (They translate natural-language commands into robot actions; see DeepMind releases.)
Rho-α (Microsoft 2026) (www.techradar.com): A vision-language-action model by Microsoft (derived from its Phi VLM). Rho-α translates English instructions into control for dual-arm robots, focusing on fine-grained, bimanual manipulation (www.techradar.com). It integrates vision, force/tactile sensing, and language in a unified policy, aiming to let robots adapt assembly strategies online. (Tech announcement only; implementation details forthcoming.)
Others: Models that connect language models to robot control (e.g. PaLM-E, SayCan) have shown promise in robotics. For assembly specifically, multimodal LLMs and VLMs are being explored to interpret instructions and demonstrations. For example, ROSA (Wen et al., 2025) aligns VLM features with robot states for better action grounding, which could benefit instruction following. (See arXiv.)
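
The hierarchical assembly graphs described for Manual2Skill can be pictured as a tree whose leaves are parts and whose internal nodes are subassemblies. The following is a minimal sketch under that assumption; the `Node` class, the part names, and the traversal are hypothetical illustrations, not the paper's data structures or code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A part (leaf) or a subassembly (node with children). Hypothetical structure."""
    name: str
    children: List["Node"] = field(default_factory=list)


def assembly_steps(node: Node) -> List[str]:
    """Post-order traversal: build every subassembly before the node that joins it."""
    steps: List[str] = []
    for child in node.children:
        steps.extend(assembly_steps(child))
    if node.children:
        parts = " + ".join(child.name for child in node.children)
        steps.append(f"join {parts} -> {node.name}")
    return steps


# Invented example hierarchy.
chair = Node("chair", [
    Node("frame", [Node("left_leg"), Node("right_leg"), Node("seat")]),
    Node("backrest"),
])
steps = assembly_steps(chair)
for step in steps:
    print(step)
```

A connector-aware variant in the spirit of Manual2Skill++ would attach connector type and count to each join, which this sketch omits.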

Sequence Planning, State Estimation, and Part Detection (arXiv)

Assembly Sequence Planning: Many recent works tackle the ordering of assembly steps. Ma et al. (2022) cast sequence planning on LEGO models as a graph-Transformer problem (arxiv.org). Tian et al. (2023, “ASAP”) use physics-aware tree search with a graph-neural heuristic to generate feasible sequences (accounting for gravity and stability) on hundreds of products (arxiv.org). Liu et al. (2024) train an RL policy with a physics-based action mask, achieving 100% valid assembly of 250+ LEGO structures (arxiv.org). Shu et al. (2024, “S2A”) use graph-attention RL to scale sequence planning to many parts (arxiv.org). Li et al. (2023) introduce GPAT, a transformer that predicts 6D part poses for general (unseen) assemblies (arxiv.org). (See also Atad et al. (2023) (arxiv.org) for graph-based robotic assembly sequence planning.)
Assembly State Estimation & Error Detection: Estimating the current assembly state from vision is an active area. Schoonbeek et al. (2024) propose ISIL, a loss for supervised representation learning of assembly states; they show that their model distinguishes multiple error types even when trained only on correct states (arxiv.org). Schieber et al. (2024) present ASDF, which fuses 6D pose estimates of all parts to detect misplaced components in an assembly (arxiv.org). Lehman et al. (2024) introduce StateDiffNet, a change-detection network that takes a “correct” assembly image and a test frame and localizes where an error (e.g. a missing or misaligned part) occurred (arxiv.org). (They provide synthetic data and code for training error detectors (arxiv.org).)
Part Detection/Segmentation: Recognizing and localizing parts is fundamental. The IKEA-Manual dataset (2023) (arxiv.org) provides finely labeled 2D–3D correspondences: each part in an IKEA object is segmented in images and matched to the 3D CAD part, enabling segmentation and 6D pose estimation tasks (arxiv.org). In the LEGO domain, SCANet’s LEGO-ECA includes images of each assembly step; models like SCANet use these to classify assembled part poses. (The LEGO-ECA website and code are at scanet-iros2024.github.io (arxiv.org).) Such datasets allow training models to detect part alignment and verify each assembly step.
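
At its simplest, assembly sequence planning means finding an order of parts that respects precedence constraints. The following is a minimal sketch using a classical topological sort from Python's standard library as a stand-in; it is not the tree search, graph network, or RL method of any paper cited above, and the parts and constraints are invented for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical precedence constraints: part -> parts that must be placed first.
precedence: Dict[str, Set[str]] = {
    "base": set(),
    "axle": {"base"},
    "wheel": {"axle"},
    "cover": {"base", "axle"},
}


def plan_sequence(constraints: Dict[str, Set[str]]) -> List[str]:
    """Return one order that places every part after all of its prerequisites."""
    return list(TopologicalSorter(constraints).static_order())


def is_valid(sequence: List[str], constraints: Dict[str, Set[str]]) -> bool:
    """Check that every prerequisite appears earlier in the sequence than its part."""
    position = {part: index for index, part in enumerate(sequence)}
    return all(
        position[pre] < position[part]
        for part, prerequisites in constraints.items()
        for pre in prerequisites
    )


plan = plan_sequence(precedence)
print(plan, is_valid(plan, precedence))
```

Several valid orders usually exist (here `wheel` and `cover` can swap), which is where learned heuristics or physics checks such as stability come in to choose among them.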

Resources: Most works above are on arXiv (links in citations); Gemini Robotics 1.5 and Rho-α are cited from news coverage. Dataset download pages and GitHub repositories are cited above where available. For example, the ATTACH and MECCANO datasets have public webpages (arxiv.org) (arxiv.org); LEGO-ECA data and code are on the SCANet site (arxiv.org); error-detection code from Lehman et al. is linked in their paper (arxiv.org). Each entry above includes citations for its results or dataset statistics.

---

Sources

1. Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities
2. AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation
3. ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding
4. MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain
5. FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation
6. AssembleRL: Learning to Assemble Furniture from Their Point Clouds
7. REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly
8. Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation
9. IKEA-Manual: Seeing Shape Assembly Step by Step
10. IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos
11. SCANet: Correcting LEGO Assembly Errors with Self-Correct Assembly Network
12. SCANet: Correcting LEGO Assembly Errors with Self-Correct Assembly Network
13. Video Swin Transformer
14. Video Action Transformer Network
15. How Object Information Improves Skeleton-based Human Action Recognition in Assembly Tasks
16. Hand Guided High Resolution Feature Enhancement for Fine-Grained Atomic Action Segmentation within Complex Human Assemblies
17. Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models
18. Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models
19. Robots receive major intelligence boost thanks to Google DeepMind's 'thinking AI' - a pair of models that help machines understand the world
20. Microsoft unveils first robotics model targeted at boosting physical AI in a bid to free robots from the production line
21. Planning Assembly Sequence with Graph Transformer
22. ASAP: Automated Sequence Planning for Complex Robotic Assembly with Physical Feasibility
23. Physics-Aware Combinatorial Assembly Sequence Planning using Data-free Action Masking
24. Subassembly to Full Assembly: Effective Assembly Sequence Planning through Graph-based Reinforcement Learning
25. Rearrangement Planning for General Part Assembly
26. Efficient and Feasible Robotic Assembly Sequence Planning via Graph Representation Learning
27. Supervised Representation Learning towards Generalizable Assembly State Recognition
28. ASDF: Assembly State Detection Utilizing Late Fusion by Integrating 6D Pose Estimation
29. Find the Assembly Mistakes: Error Segmentation for Industrial Applications
30. IKEA-Manual: Seeing Shape Assembly Step by Step