2025 Report: Research and Practice on Brain-Cerebellum Model Collaboration Algorithms for Embodied Intelligence
Research and Practice on Brain-Cerebellum Model Collaboration Algorithms for Embodied Intelligence
Lu Sheng (盛律) | School of Software, 2025-08-23

## Basic Concept of Embodied Intelligence
An embodied intelligent system perceives and acts through a physical body: by interacting with its environment, the agent acquires information, understands problems, makes decisions, and takes action, thereby producing intelligent behavior and adaptivity.

- Traditional AI: passive, abstract acceptance ("someone tells me this is a box")
- Embodied AI: active, concrete experience ("I can open the box and put things in it, so I actively experience what a box is")

Significance: because embodied intelligence autonomously generates intelligent behavior and adaptivity, it is a possible starting point for artificial general intelligence.

## Key Tasks of Embodied Intelligence
Navigation, question answering, and manipulation.

## Core Elements of Embodied Intelligence
- Embodied agent: the physical body
- Embodied model: the intelligent algorithms

Status quo: compared with the increasingly mature embodied hardware, algorithm research on embodied models is still in its infancy and faces many challenges.

## What Capabilities Should an Embodied Model Have?
Skill (skill generalization), Reality (real-world interaction), Embodiment (embodiment extension). Adapted from Jim Fan's talk.

## Types of Embodied Models
Brain-cerebellum collaboration vs. end-to-end. Representative recent work includes end-to-end VLA models (2024.10), brain-cerebellum systems such as hi robot (2025.02), hybrid designs (2025.04), on-device SDKs (2025.03), and embodied-brain plus end-to-end VLA combinations.

## Embodied Foundation Models Are Not Yet Practical
- 2023 and earlier: large models, large data, basic capability; single task, single embodiment, single scene
- 2024: multi-task, but still single embodiment and single scene
- 2025 and beyond: toward general intelligent systems across multiple embodiments and scenes

The scaling law has been validated on large language models and multimodal large models, spanning perception and understanding, decision and planning, execution and collaboration, and evaluation and feedback. Yet current end-to-end multimodal robot systems (hand-eye coordination for perception, manipulation, and navigation) are not usable enough, not easy to use, and not general:
- Model capability is weak; embodied AI has not reached its "ChatGPT moment"
- Adapting brain, cerebellum, and body is difficult
- One model typically fits only one embodiment

What is needed: a "smart" large brain model plus a cross-embodiment brain-cerebellum collaboration framework, enabling cross-embodiment, cross-scene, generalizable embodied intelligence.

## The Brain-Cerebellum Collaboration Route Still Has Opportunities
End-to-end models decide efficiently but are limited in generalization and extensibility, constrained by environment interaction and hardware adaptation, and struggle to adapt to diverse scenes. Modular brain-cerebellum collaboration frameworks, with strong generalization and interpretability, are becoming a research hotspot in both academia and industry:
- Modular: a scalable architecture, efficient development, and strong adaptability
- Generalizable: a VLM-based brain has rich multimodal cognitive capability, independent of the cerebellum model
- Interpretable: more transparent decision processes improve human-robot collaboration efficiency

The brain-cerebellum collaboration framework is currently the more deployable technical route toward embodied agents.

## Can Traditional Multimodal Large Models Serve as the "Brain"?
Traditional VLMs face severe challenges in embodied scenarios such as long-horizon closed-loop manipulation and spatio-temporal intelligence. Take "put the pot into the drawer" as an example: the task involves multi-step, long-duration interaction, including moving, grasping, and placing, with continuous interaction with objects such as the pot and the drawer. GPT-4o performs poorly on such embodied tasks.

## Recap: What Capabilities Should an Embodied Model Have?
- Skill (skill generalization), Reality (real-world interaction), Embodiment (embodiment extension). Adapted from Jim Fan's talk.

## Skill Generalization: Multi-Agent Solving of Long-Horizon Open Embodied Tasks
Example task in an open world (Minecraft): "Gather wood from the forest, craft a stone sword on the plains, and then use it to kill a pig during the daytime near water and grass." Long-horizon embodied tasks combine context dependency with process dependency: intermediate objectives (O1: log, O5: wooden pickaxe, O6: stone, O7: stone sword, O8: pig) must be achieved in order, each conditioned on the current environment.

MP5 (CVPR 2024) tackles this with five (M)LLMs in different roles, communicating for different purposes: a Planner decomposes the task into sub-objectives, a Parser grounds them, a Percipient answers perception queries, a Patroller performs active perception to check whether action preconditions hold, and a Performer executes primitive actions (Equip, Find, Move, Fight, Craft, Mine), with knowledge and performer memories and both single-round and multi-round communication. A condensed exchange for "Kill a pig with a wooden sword during the daytime near the water with grass next to it":

- Planner: "Can you tell me what important environmental information I need to know?"
- Patroller: "I conducted active perception with the Percipient on your current observation; there is no pig in the scene."
- Planner: "1. Equip, 2. Find, 3. Move, 4. Fight."
- Performer: "Start executing 'Equip'." Later: "Having completed a move in the 'Find' action, based on my current view, tell me whether I should continue this action or whether the next action is ready to execute."
- Patroller: "You must continue with the current action, since there is no river near the pig."
- Performer: "Continue executing 'Find'."
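The MP5-style role division can be sketched as a message loop between a Planner, a Patroller performing active perception, and a Performer executing primitives. This is a toy illustration under stated assumptions, not the paper's actual prompts or interfaces: the `Env` class, the hard-coded plan, and the `pig_visible` precondition are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Env:
    """Toy world state: the set of conditions that currently hold."""
    facts: set = field(default_factory=lambda: {"daytime", "near_water"})

def planner(task: str) -> list:
    # Decompose the task into ordered sub-objectives.
    # Hard-coded here; MP5 uses an LLM for this step.
    return ["equip:wooden_sword", "find:pig", "move:pig", "fight:pig"]

def patroller(env: Env, required: set) -> bool:
    # Active perception: check whether the next action's preconditions hold.
    return required <= env.facts

def performer(action: str, env: Env) -> str:
    # Execute one primitive action and update the toy environment.
    if action.startswith("find:"):
        env.facts.add("pig_visible")  # finding the pig makes it visible
    return f"done:{action}"

def run(task: str, env: Env) -> list:
    log = []
    for action in planner(task):
        # Fighting requires the pig to be visible first.
        required = {"pig_visible"} if action.startswith("fight:") else set()
        if not patroller(env, required):
            log.append(f"replan:{action}")  # Patroller veto triggers re-planning
            continue
        log.append(performer(action, env))
    return log
```

The point of the structure is that perception checks sit between planning and execution, so a stale plan is caught before a primitive fails.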
- Patroller: "You can execute the next action, since all conditions are satisfied." (On failure, error feedback triggers re-planning.)

Reference: MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception, CVPR 2024.

Results: MP5 precisely understands environmental context, can solve diamond-level tasks, and can continuously carry out open-ended survival tasks.

## Skill Generalization: Learning Unseen Skills via Composition
RA-P (IROS 2025; NeurIPS 2024 Open-World Agents workshop) builds composable, generalizable agents in the real world: complicated tasks are decomposed into fine-grained primitive skills that generalize to new physical skills. The work provides a baseline agent and a comprehensive primitive-level dataset, RH20T-P.

Reference: RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents, IROS 2025.

More demos of the dataset and RA-P?
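The primitive-skill idea behind RA-P and RH20T-P can be illustrated in a few lines, assuming a hypothetical three-primitive library and a hard-coded decomposition (a real agent would obtain the plan from a VLM, and RH20T-P defines its own primitive vocabulary):

```python
# Each fine-grained primitive skill is a small state transition; complicated
# tasks are compositions of these reusable pieces.
PRIMITIVES = {
    "move_to": lambda state, obj: state | {f"at:{obj}"},
    "grasp":   lambda state, obj: state | {f"holding:{obj}"},
    "release": lambda state, obj: (state - {f"holding:{obj}"}) | {f"placed:{obj}"},
}

def decompose(instruction: str) -> list:
    # Stand-in for the high-level planner; pick-and-place is hard-coded here.
    obj, target = "cup", "shelf"
    return [("move_to", obj), ("grasp", obj), ("move_to", target), ("release", obj)]

def execute(plan, state=frozenset()):
    state = set(state)
    for skill, arg in plan:
        # Unseen tasks reuse the same primitive library, which is the
        # source of compositional generalization.
        state = PRIMITIVES[skill](state, arg)
    return state
```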
See the project page.

## Real-World Interaction: A Chain of Imagination Adapts Execution to Environment Dynamics
MineDreamer (IROS 2025; NeurIPS 2024 Open-World Agents workshop): when facing a hard problem, a reliable strategy is to predict the likely outcome of execution and assess the feasibility of the current action, using that prediction to guide more reliable execution. A Chain-of-Imagination strengthens instruction following during embodied action execution: an imagination-conditioned VPT policy is applied sequentially, supplying visual prompts that are more relevant to the dynamic environment, the language instruction, and the current state, and therefore more precise in effect.

Reference: MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control, IROS 2025.

## Real-World Interaction: Real-Time Monitoring Raises Task Success Rates
How to increase the success rate? Reduce the rate of failure. This requires both reactive and proactive failure detection, combining 3D perception capability with real-time efficiency, which a plain VLM alone cannot provide.

Code-as-Monitor (CVPR 2025) proposes constraint-aware visual programming. It is the first framework to integrate both reactive and proactive failure detection; it simplifies real-time failure detection while keeping high precision; it achieves state-of-the-art performance in both simulated and real-world environments; and it generalizes strongly to unseen scenarios, tasks, and objects.

Reference: Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection, CVPR 2025.

## Improving the Embodied Brain's Basic Capabilities: Spatial Perception + Deep Thinking
RoboRefer targets accurate spatial referring with reasoning in vision-language models for robotics. Capability gains come from large-scale data:
- 2D web images (OpenImages)
- 3D embodied videos (CA-1M)
- Simulation data generated by Infinigen with generative assets

Reference: Zhou E, et al. RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics (in submission).
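The "code as monitor" idea of Code-as-Monitor can be sketched as executable constraint predicates evaluated at every control step, one reactive and one proactive. The `ObjState` fields, the monitor names, and the 5 cm margin are assumptions made for illustration, not the paper's interface, which compiles such monitors from constraint elements extracted by 3D perception:

```python
from dataclasses import dataclass

@dataclass
class ObjState:
    pos: tuple           # (x, y, z) position in metres, from perception
    grasped: bool = False

def monitor_no_drop(obj: ObjState) -> bool:
    # Reactive constraint: the object must stay in the gripper mid-transport.
    return obj.grasped

def monitor_clearance(obj: ObjState, obstacle: ObjState, margin: float = 0.05) -> bool:
    # Proactive constraint: flag a violation *before* contact, once the
    # clearance to an obstacle shrinks below the safety margin.
    dist = sum((a - b) ** 2 for a, b in zip(obj.pos, obstacle.pos)) ** 0.5
    return dist > margin

def check(monitors) -> list:
    # Evaluate all constraint monitors for one control step; return the
    # names of violated constraints (empty list = no failure detected).
    return [name for name, ok in monitors if not ok]
```

Because each monitor is a cheap predicate over perceived state, the loop stays real-time while the failure logic remains inspectable code rather than an opaque VLM verdict.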
A dedicated simulation data pipeline produces this training data. RoboRefer enables accurate spatial referring by VLMs, supporting multi-step dynamic reasoning.

Demonstrations:
- A UR5 arm completes pick-and-place while key scene elements change, showing fast scene adaptation and the model's ability to judge object distance and orientation.
- A Unitree G1 humanoid performs mobile manipulation, again showing judgment of object proximity, orientation, and distance.
- A UR5 arm grasps an object at a specified height and places it in a sunlit region, showing recognition of spatial height and of lighting regions.
- A Franka arm performs grasping and placement, showing object referring based on spatial relations and localization of free space in 3D.
- Instruction "I want the drink on the right": a Unitree G1 with dexterous hands shows the top-level model judging relative direction and the dexterous-hand model controlling precisely.
- Instruction "I want the meat burger": a dual-arm AgileX robot with grippers shows the top-level model decomposing and executing the task.

## What Limitations Do Embodied Models Still Face?
- Semantic and spatial perception?
- Reliable long-horizon planning?
- Universally driving multiple specialized controllers for diverse skills?

Thank you! Lu Sheng, Beihang University
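The cross-embodiment brain-cerebellum framework the deck argues for can be caricatured in a few lines: a "brain" planner emits subgoals, and a registry of specialized "cerebellum" controllers executes them, so swapping embodiments only swaps the registry. `CONTROLLERS`, `brain_plan`, and the fixed fetch plan are all hypothetical names invented for this sketch:

```python
# "Cerebellum" side: specialised low-level controllers, one per skill.
CONTROLLERS = {
    "navigate": lambda goal: f"base moved to {goal}",
    "grasp":    lambda goal: f"gripper closed on {goal}",
    "speak":    lambda goal: f"said: {goal}",
}

def brain_plan(instruction: str) -> list:
    # Stand-in for the "brain" (a large VLM): returns a fixed subgoal plan
    # for a fetch task instead of actually reasoning over the instruction.
    return [("navigate", "kitchen"), ("grasp", "mug"),
            ("navigate", "user"), ("speak", "here is your mug")]

def run_task(instruction: str) -> list:
    # The brain plans once; each subgoal is routed to the matching
    # specialised controller. Cross-embodiment reuse means replacing
    # CONTROLLERS while keeping brain_plan unchanged.
    return [CONTROLLERS[skill](goal) for skill, goal in brain_plan(instruction)]
```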