2025 Report: Research and Practice on Brain-Cerebellum Model Collaboration Algorithms for Embodied Intelligence
Research and Practice on Brain-Cerebellum Model Collaboration Algorithms for Embodied Intelligence
Lu Sheng (盛律) | School of Software, 2025-08-23

## Basic Concept of Embodied Intelligence
An embodied intelligent system perceives and acts through a physical body: by interacting with its environment, the agent acquires information, understands problems, makes decisions, and takes action, thereby producing intelligent behavior and adaptivity.

- Traditional AI: passive, abstract acceptance ("someone tells me this is a box")
- Embodied AI: active, concrete experience ("I can open the box and put things in it, so I actively experience what a box is")

Significance: because embodied intelligence autonomously generates intelligent behavior and adaptivity, it is a possible starting point for artificial general intelligence.

## Key Tasks of Embodied Intelligence
Navigation, question answering, and manipulation.

## Core Elements of Embodied Intelligence
- Embodied agent: the physical body
- Embodied model: the intelligent algorithms

Status quo: compared with the increasingly mature embodied hardware, algorithm research on embodied models is still in its infancy and faces many challenges.

## What Capabilities Should an Embodied Model Have?
Skill (skill generalization), Reality (real-world interaction), Embodiment (embodiment extension). Adapted from Jim Fan's talk.

## Types of Embodied Models
Brain-cerebellum collaboration vs. end-to-end. Representative recent work includes end-to-end VLA models (2024.10), brain-cerebellum systems such as hi robot (2025.02), hybrid designs (2025.04), on-device SDKs (2025.03), and embodied-brain plus end-to-end VLA combinations.

## Embodied Foundation Models Are Not Yet Practical
- 2023 and earlier: large models, large data, basic capability; single task, single embodiment, single scene
- 2024: multi-task, but still single embodiment and single scene
- 2025 and beyond: toward general intelligent systems across multiple embodiments and scenes

The scaling law has been validated on large language models and multimodal large models, spanning perception and understanding, decision and planning, execution and collaboration, and evaluation and feedback. Yet current end-to-end multimodal robot systems (hand-eye coordination for perception, manipulation, and navigation) are not usable enough, not easy to use, and not general:
- Model capability is weak; embodied AI has not reached its "ChatGPT moment"
- Adapting brain, cerebellum, and body is difficult
- One model typically fits only one embodiment

What is needed: a "smart" large brain model plus a cross-embodiment brain-cerebellum collaboration framework, enabling cross-embodiment, cross-scene, generalizable embodied intelligence.

## The Brain-Cerebellum Collaboration Route Still Has Opportunities
End-to-end models decide efficiently but are limited in generalization and extensibility, constrained by environment interaction and hardware adaptation, and struggle to adapt to diverse scenes. Modular brain-cerebellum collaboration frameworks, with strong generalization and interpretability, are becoming a research hotspot in both academia and industry:
- Modular: a scalable architecture, efficient development, and strong adaptability
- Generalizable: a VLM-based brain has rich multimodal cognitive capability, independent of the cerebellum model
- Interpretable: more transparent decision processes improve human-robot collaboration efficiency

The brain-cerebellum collaboration framework is currently the more deployable technical route toward embodied agents.

## Can Traditional Multimodal Large Models Serve as the "Brain"?
Traditional VLMs face severe challenges in embodied scenarios such as long-horizon closed-loop manipulation and spatio-temporal intelligence. Take "put the pot into the drawer" as an example: the task involves multi-step, long-duration interaction, including moving, grasping, and placing, with continuous interaction with objects such as the pot and the drawer. GPT-4o performs poorly on such embodied tasks.

## Recap: What Capabilities Should an Embodied Model Have?
- Skill (skill generalization), Reality (real-world interaction), Embodiment (embodiment extension). Adapted from Jim Fan's talk.

## Skill Generalization: Multi-Agent Solving of Long-Horizon Open Embodied Tasks
Example task in an open world (Minecraft): "Gather wood from the forest, craft a stone sword on the plains, and then use it to kill a pig during the daytime near water and grass." Long-horizon embodied tasks combine context dependency with process dependency: intermediate objectives (O1: log, O5: wooden pickaxe, O6: stone, O7: stone sword, O8: pig) must be achieved in order, each conditioned on the current environment.

MP5 (CVPR 2024) tackles this with five (M)LLMs in different roles, communicating for different purposes: a Planner decomposes the task into sub-objectives, a Parser grounds them, a Percipient answers perception queries, a Patroller performs active perception to check whether action preconditions hold, and a Performer executes primitive actions (Equip, Find, Move, Fight, Craft, Mine), with knowledge and performer memories and both single-round and multi-round communication. A condensed exchange for "Kill a pig with a wooden sword during the daytime near the water with grass next to it":

- Planner: "Can you tell me what important environmental information I need to know?"
- Patroller: "I conducted active perception with the Percipient on your current observation; there is no pig in the scene."
- Planner: "1. Equip, 2. Find, 3. Move, 4. Fight."
- Performer: "Start executing 'Equip'." Later: "Having completed a move in the 'Find' action, based on my current view, tell me whether I should continue this action or whether the next action is ready to execute."
- Patroller: "You must continue with the current action, since there is no river near the pig."
- Performer: "Continue executing 'Find'."
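The MP5-style role division can be sketched as a message loop between a Planner, a Patroller performing active perception, and a Performer executing primitives. This is a toy illustration under stated assumptions, not the paper's actual prompts or interfaces: the `Env` class, the hard-coded plan, and the `pig_visible` precondition are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Env:
    """Toy world state: the set of conditions that currently hold."""
    facts: set = field(default_factory=lambda: {"daytime", "near_water"})

def planner(task: str) -> list:
    # Decompose the task into ordered sub-objectives.
    # Hard-coded here; MP5 uses an LLM for this step.
    return ["equip:wooden_sword", "find:pig", "move:pig", "fight:pig"]

def patroller(env: Env, required: set) -> bool:
    # Active perception: check whether the next action's preconditions hold.
    return required <= env.facts

def performer(action: str, env: Env) -> str:
    # Execute one primitive action and update the toy environment.
    if action.startswith("find:"):
        env.facts.add("pig_visible")  # finding the pig makes it visible
    return f"done:{action}"

def run(task: str, env: Env) -> list:
    log = []
    for action in planner(task):
        # Fighting requires the pig to be visible first.
        required = {"pig_visible"} if action.startswith("fight:") else set()
        if not patroller(env, required):
            log.append(f"replan:{action}")  # Patroller veto triggers re-planning
            continue
        log.append(performer(action, env))
    return log
```

The point of the structure is that perception checks sit between planning and execution, so a stale plan is caught before a primitive fails.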
- Patroller: "You can execute the next action, since all conditions are satisfied." (On failure, error feedback triggers re-planning.)

Reference: MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception, CVPR 2024.

Results: MP5 precisely understands environmental context, can solve diamond-level tasks, and can continuously carry out open-ended survival tasks.

## Skill Generalization: Learning Unseen Skills via Composition
RA-P (IROS 2025; NeurIPS 2024 Open-World Agents workshop) builds composable, generalizable agents in the real world: complicated tasks are decomposed into fine-grained primitive skills that generalize to new physical skills. The work provides a baseline agent and a comprehensive primitive-level dataset, RH20T-P.

Reference: RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents, IROS 2025.

More demos of the dataset and RA-P?
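The primitive-skill idea behind RA-P and RH20T-P can be illustrated in a few lines, assuming a hypothetical three-primitive library and a hard-coded decomposition (a real agent would obtain the plan from a VLM, and RH20T-P defines its own primitive vocabulary):

```python
# Each fine-grained primitive skill is a small state transition; complicated
# tasks are compositions of these reusable pieces.
PRIMITIVES = {
    "move_to": lambda state, obj: state | {f"at:{obj}"},
    "grasp":   lambda state, obj: state | {f"holding:{obj}"},
    "release": lambda state, obj: (state - {f"holding:{obj}"}) | {f"placed:{obj}"},
}

def decompose(instruction: str) -> list:
    # Stand-in for the high-level planner; pick-and-place is hard-coded here.
    obj, target = "cup", "shelf"
    return [("move_to", obj), ("grasp", obj), ("move_to", target), ("release", obj)]

def execute(plan, state=frozenset()):
    state = set(state)
    for skill, arg in plan:
        # Unseen tasks reuse the same primitive library, which is the
        # source of compositional generalization.
        state = PRIMITIVES[skill](state, arg)
    return state
```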
See the project page.

## Real-World Interaction: A Chain of Imagination Adapts Execution to Environment Dynamics
MineDreamer (IROS 2025; NeurIPS 2024 Open-World Agents workshop): when facing a hard problem, a reliable strategy is to predict the likely outcome of execution and assess the feasibility of the current action, using that prediction to guide more reliable execution. A Chain-of-Imagination strengthens instruction following during embodied action execution: an imagination-conditioned VPT policy is applied sequentially, supplying visual prompts that are more relevant to the dynamic environment, the language instruction, and the current state, and therefore more precise in effect.

Reference: MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control, IROS 2025.

## Real-World Interaction: Real-Time Monitoring Raises Task Success Rates
How to increase the success rate? Reduce the rate of failure. This requires both reactive and proactive failure detection, combining 3D perception capability with real-time efficiency, which a plain VLM alone cannot provide.

Code-as-Monitor (CVPR 2025) proposes constraint-aware visual programming. It is the first framework to integrate both reactive and proactive failure detection; it simplifies real-time failure detection while keeping high precision; it achieves state-of-the-art performance in both simulated and real-world environments; and it generalizes strongly to unseen scenarios, tasks, and objects.

Reference: Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection, CVPR 2025.

## Improving the Embodied Brain's Basic Capabilities: Spatial Perception + Deep Thinking
RoboRefer targets accurate spatial referring with reasoning in vision-language models for robotics. Capability gains come from large-scale data:
- 2D web images (OpenImages)
- 3D embodied videos (CA-1M)
- Simulation data generated by Infinigen with generative assets

Reference: Zhou E, et al. RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics (in submission).
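The "code as monitor" idea of Code-as-Monitor can be sketched as executable constraint predicates evaluated at every control step, one reactive and one proactive. The `ObjState` fields, the monitor names, and the 5 cm margin are assumptions made for illustration, not the paper's interface, which compiles such monitors from constraint elements extracted by 3D perception:

```python
from dataclasses import dataclass

@dataclass
class ObjState:
    pos: tuple           # (x, y, z) position in metres, from perception
    grasped: bool = False

def monitor_no_drop(obj: ObjState) -> bool:
    # Reactive constraint: the object must stay in the gripper mid-transport.
    return obj.grasped

def monitor_clearance(obj: ObjState, obstacle: ObjState, margin: float = 0.05) -> bool:
    # Proactive constraint: flag a violation *before* contact, once the
    # clearance to an obstacle shrinks below the safety margin.
    dist = sum((a - b) ** 2 for a, b in zip(obj.pos, obstacle.pos)) ** 0.5
    return dist > margin

def check(monitors) -> list:
    # Evaluate all constraint monitors for one control step; return the
    # names of violated constraints (empty list = no failure detected).
    return [name for name, ok in monitors if not ok]
```

Because each monitor is a cheap predicate over perceived state, the loop stays real-time while the failure logic remains inspectable code rather than an opaque VLM verdict.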
A dedicated simulation data pipeline produces this training data. RoboRefer enables accurate spatial referring by VLMs, supporting multi-step dynamic reasoning.

Demonstrations:
- A UR5 arm completes pick-and-place while key scene elements change, showing fast scene adaptation and the model's ability to judge object distance and orientation.
- A Unitree G1 humanoid performs mobile manipulation, again showing judgment of object proximity, orientation, and distance.
- A UR5 arm grasps an object at a specified height and places it in a sunlit region, showing recognition of spatial height and of lighting regions.
- A Franka arm performs grasping and placement, showing object referring based on spatial relations and localization of free space in 3D.
- Instruction "I want the drink on the right": a Unitree G1 with dexterous hands shows the top-level model judging relative direction and the dexterous-hand model controlling precisely.
- Instruction "I want the meat burger": a dual-arm AgileX robot with grippers shows the top-level model decomposing and executing the task.

## What Limitations Do Embodied Models Still Face?
- Semantic and spatial perception?
- Reliable long-horizon planning?
- Universally driving multiple specialized controllers for diverse skills?

Thank you! Lu Sheng, Beihang University
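The cross-embodiment brain-cerebellum framework the deck argues for can be caricatured in a few lines: a "brain" planner emits subgoals, and a registry of specialized "cerebellum" controllers executes them, so swapping embodiments only swaps the registry. `CONTROLLERS`, `brain_plan`, and the fixed fetch plan are all hypothetical names invented for this sketch:

```python
# "Cerebellum" side: specialised low-level controllers, one per skill.
CONTROLLERS = {
    "navigate": lambda goal: f"base moved to {goal}",
    "grasp":    lambda goal: f"gripper closed on {goal}",
    "speak":    lambda goal: f"said: {goal}",
}

def brain_plan(instruction: str) -> list:
    # Stand-in for the "brain" (a large VLM): returns a fixed subgoal plan
    # for a fetch task instead of actually reasoning over the instruction.
    return [("navigate", "kitchen"), ("grasp", "mug"),
            ("navigate", "user"), ("speak", "here is your mug")]

def run_task(instruction: str) -> list:
    # The brain plans once; each subgoal is routed to the matching
    # specialised controller. Cross-embodiment reuse means replacing
    # CONTROLLERS while keeping brain_plan unchanged.
    return [CONTROLLERS[skill](goal) for skill, goal in brain_plan(instruction)]
```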