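The Chinese text in the original digest was machine-translated with googletrans; where it disagrees with the English, the English is authoritative.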

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

  • Title: InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  • Authors: Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo
  • Date: 2025-08-25
  • arXiv page: https://arxiv.org/abs/2508.18265
  • Paper link: https://arxiv.org/pdf/2508.18265
  • Project link: https://chat.intern-ai.org.cn/
  • GitHub repository: https://github.com/OpenGVLab/InternVL

English abstract

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.


AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

English abstract

In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely AgentFly, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/AgentFly.
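The memory write (rewriting) and read (retrieval) operations described above can be sketched as a small non-parametric case memory. This is only an illustration under stated assumptions: the `Case` and `CaseMemory` names, the stored fields, and the Jaccard word-overlap similarity are hypothetical stand-ins, and the paper's neural case-selection policy is replaced here by a fixed similarity ranking.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    task: str      # task description seen in a past episode
    plan: str      # what the agent did (hypothetical field)
    reward: float  # environmental feedback for that episode


@dataclass
class CaseMemory:
    cases: list = field(default_factory=list)

    def write(self, task: str, plan: str, reward: float) -> None:
        """Memory rewriting: store the outcome of a finished episode."""
        self.cases.append(Case(task, plan, reward))

    def read(self, task: str, k: int = 2) -> list:
        """Memory reading: return the k stored cases most similar to `task`."""
        def jaccard(a: str, b: str) -> float:
            # Assumption: word-overlap similarity instead of a learned policy.
            sa, sb = set(a.lower().split()), set(b.lower().split())
            return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

        ranked = sorted(self.cases, key=lambda c: jaccard(task, c.task), reverse=True)
        return ranked[:k]


memory = CaseMemory()
memory.write("find the population of Paris", "search the web, then read the page", 1.0)
memory.write("compute 2 + 2", "call the calculator tool", 1.0)
retrieved = memory.read("find the population of Berlin", k=1)
```

No LLM weights change in this loop: adaptation comes only from what is written to and read from the memory, which is the property the abstract emphasizes.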


VibeVoice Technical Report

English abstract

This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.


Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy’s generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy’s correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
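For readers unfamiliar with the Pass@k metric discussed above, the sketch below uses the unbiased estimator that is common in the code-generation literature: with n samples per problem, c of them correct, pass@k = 1 - C(n-c, k) / C(n, k). The abstract does not say which estimator the authors use, so treat this as an assumption; the sample counts are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 64 samples for one problem, only 2 of them correct.
p1 = pass_at_k(n=64, c=2, k=1)    # single-sample success rate
p32 = pass_at_k(n=64, c=2, k=32)  # success rate with a budget of 32 samples
```

A problem that is rarely solved in one attempt can still score well at large k, which is why Pass@k is sensitive to generation diversity in a way Pass@1 is not.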


rStar2-Agent: Agentic Reasoning Technical Report

English abstract

We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.


Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

English abstract

Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.
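The two mechanisms named above, group normalization amplifying tiny score gaps and win rate used as the reward, can be illustrated with a short sketch. It assumes the usual GRPO-style advantage, (reward minus group mean) divided by group standard deviation; the `prefer` callable is a hypothetical stand-in for a preference reward model, and the scores are made up.

```python
from itertools import combinations
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalization of rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def win_rates(items, prefer):
    """Reward each item with the fraction of pairwise comparisons it wins.

    `prefer(a, b)` is a hypothetical preference model: True if a beats b.
    """
    wins = [0] * len(items)
    for i, j in combinations(range(len(items)), 2):
        if prefer(items[i], items[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (len(items) - 1) for w in wins]


# Pointwise scores that differ only in the third decimal place...
scores = [0.800, 0.801, 0.802, 0.803]
adv = group_advantages(scores)  # ...still become advantages of order 1.

# Toy comparator reusing the scalars, only to keep the example self-contained.
rates = win_rates(scores, prefer=lambda a, b: a > b)
```

Here `adv` spans roughly -1.34 to +1.34 even though the raw scores are nearly identical, which is the amplification the abstract describes. In Pref-GRPO the win rates, not the raw scores, would be fed into the group normalization.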


Beyond Transcription: Mechanistic Interpretability in ASR

English abstract

Interpretability methods have recently gained significant attention, particularly in the context of large language models, enabling insights into linguistic representations, error detection, and model behaviors such as hallucinations and repetitions. However, these techniques remain underexplored in automatic speech recognition (ASR), despite their potential to advance both the performance and interpretability of ASR systems. In this work, we adapt and systematically apply established interpretability methods such as logit lens, linear probing, and activation patching, to examine how acoustic and semantic information evolves across layers in ASR systems. Our experiments reveal previously unknown internal dynamics, including specific encoder-decoder interactions responsible for repetition hallucinations and semantic biases encoded deep within acoustic representations. These insights demonstrate the benefits of extending and applying interpretability techniques to speech recognition, opening promising directions for future research on improving model transparency and robustness.


Self-Rewarding Vision-Language Model via Reasoning Decomposition

English abstract

Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning via reinforcement learning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input to compute the reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
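The two-stage reward described above can be sketched as follows. Everything concrete here is an assumption made for illustration: the `vlm(prompt, image)` callable, the prompt wording, exact-match scoring, and the 0.5 weight on the perception reward are hypothetical and are not taken from the paper.

```python
from typing import Callable, Optional

# Hypothetical interface: vlm(prompt, image) -> text; image may be None.
VLM = Callable[[str, Optional[object]], str]


def self_reward(vlm: VLM, image: object, question: str, answer: str,
                perception_weight: float = 0.5) -> float:
    """Combine a final-answer reward with a perception self-reward."""
    # Stage 1: produce a perception meant to be sufficient on its own.
    perception = vlm(f"Describe what is needed to answer: {question}", image)
    prompt = f"{question}\nPerception: {perception}"
    # Stage 2a: answer with the image available (final-output supervision).
    final = vlm(prompt, image)
    # Stage 2b: re-prompt the same model WITHOUT the image; a correct answer
    # indicates that the perception was self-contained.
    blind = vlm(prompt, None)
    answer_reward = float(final.strip() == answer)
    perception_reward = float(blind.strip() == answer)
    return answer_reward + perception_weight * perception_reward


def toy_vlm(prompt: str, image: Optional[object]) -> str:
    """Stub model used only to exercise the control flow."""
    if prompt.startswith("Describe"):
        return "two red apples on a table"
    return "2" if "two red apples" in prompt else "unsure"


reward = self_reward(toy_vlm, image=object(), question="How many apples?", answer="2")
```

The key design point is that no external annotator appears anywhere: the same model both writes the perception and checks whether it suffices.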


TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

English abstract

Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) analysis on the effectiveness of probability and quality-driven dynamic divergence and fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks, along with GPU-hour savings of 22% to 43% from the sampling design for the trained models, and show reductions in sampling compute of up to 40% at the trajectory level and 35% at the token level for existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The home page is at https://m-a-p.ai/TreePO.
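The idea of amortizing computation across common prefixes can be made concrete with a back-of-the-envelope count of decoded tokens. The sketch assumes a full tree with a fixed branching factor and fixed-length segments, and it ignores pruning, early stopping, and dynamic branching, all of which TreePO uses; the numbers are illustrative and do not reproduce the savings reported in the abstract.

```python
def independent_tokens(branch: int, depth: int, seg_len: int) -> int:
    """Tokens decoded when every leaf trajectory is sampled from scratch."""
    return branch ** depth * depth * seg_len


def tree_tokens(branch: int, depth: int, seg_len: int) -> int:
    """Tokens decoded when trajectories share prefixes in a full tree:
    each level adds one new segment per node at that level."""
    return sum(branch ** level * seg_len for level in range(1, depth + 1))


# Hypothetical setting: binary branching, 4 segments of 256 tokens each.
flat = independent_tokens(branch=2, depth=4, seg_len=256)
tree = tree_tokens(branch=2, depth=4, seg_len=256)
saving = 1 - tree / flat
```

Both schemes yield the same 16 leaf trajectories, but the tree decodes each shared prefix once instead of once per leaf.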


MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

English abstract

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents’ ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.


USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

English abstract

Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model’s performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO


AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

  • Title: AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications
  • Authors: Dawei Gao, Zitao Li, Yuexiang Xie, Weirui Kuang, Liuyi Yao, Bingchen Qian, Zhijian Ma, Yue Cui, Haohao Luo, Shen Li, Lu Yi, Yi Yu, Shiqi He, Zhiling Luo, Wenmeng Zhou, Zhicheng Zhang, Xuguang He, Ziqian Chen, Weikai Liao, Farruh Isakulovich Kushnazarov, Yaliang Li, Bolin Ding, Jingren Zhou
  • Date: 2025-08-22
  • arXiv page: https://arxiv.org/abs/2508.16279
  • Paper link: https://arxiv.org/pdf/2508.16279
  • GitHub repository: https://github.com/agentscope-ai/agentscope

English abstract

Driven by rapid advancements of Large Language Models (LLMs), agents are empowered to combine intrinsic knowledge with dynamic tool use, greatly enhancing their capacity to address real-world tasks. In line with such an evolution, AgentScope introduces major improvements in a new version (1.0), towards comprehensively supporting flexible and efficient tool-based agent-environment interactions for building agentic applications. Specifically, we abstract foundational components essential for agentic applications and provide unified interfaces and extensible modules, enabling developers to easily leverage the latest progress, such as new models and MCPs. Furthermore, we ground agent behaviors in the ReAct paradigm and offer advanced agent-level infrastructure based on a systematic asynchronous design, which enriches both human-agent and agent-agent interaction patterns while improving execution efficiency. Building on this foundation, we integrate several built-in agents tailored to specific practical scenarios. AgentScope also includes robust engineering support for developer-friendly experiences. We provide a scalable evaluation module with a visual studio interface, making the development of long-trajectory agentic applications more manageable and easier to trace. In addition, AgentScope offers a runtime sandbox to ensure safe agent execution and facilitates rapid deployment in production environments. With these enhancements, AgentScope provides a practical foundation for building scalable, adaptive, and effective agentic applications.


CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

  • Title: CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
  • Authors: Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng
  • Date: 2025-08-25
  • arXiv page: https://arxiv.org/abs/2508.18124
  • Paper link: https://arxiv.org/pdf/2508.18124
  • GitHub repository: https://github.com/CMPhysBench/CMPhysBench

English abstract

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, strongly correlated systems, etc. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.


ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks

English abstract

Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have improved spatial reasoning and task planning through semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied. In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains. We further present the first benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system’s generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks. Our project page: https://kaijwang.github.io/odyssey.github.io/


OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation

English abstract

Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character’s authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/


VoxHammer:在原生3D空间中实现免训练的精确且连贯的3D编辑

英文摘要

3D local editing of specified regions is crucial for game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.
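
The replacement step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: latents are flattened to plain lists, `denoise_step` is a hypothetical stand-in for the real denoiser, the inverted trajectory is assumed to be already ordered in denoising order, and the cached key-value tokens are not modeled at all.

```python
from typing import Callable, List, Sequence


def blend_preserved(denoised: Sequence[float],
                    inverted: Sequence[float],
                    keep_mask: Sequence[bool]) -> List[float]:
    """Overwrite preserved positions with the inverted latents."""
    if not (len(denoised) == len(inverted) == len(keep_mask)):
        raise ValueError("all inputs must have the same length")
    return [inv if keep else den
            for den, inv, keep in zip(denoised, inverted, keep_mask)]


def edit_loop(start: Sequence[float],
              inverted_trajectory: Sequence[Sequence[float]],
              keep_mask: Sequence[bool],
              denoise_step: Callable[[List[float]], List[float]]) -> List[float]:
    """Run one denoising step per timestep, then restore preserved regions."""
    latents = list(start)
    for inverted_t in inverted_trajectory:
        latents = denoise_step(latents)            # hypothetical denoiser
        latents = blend_preserved(latents, inverted_t, keep_mask)
    return latents


if __name__ == "__main__":
    halve = lambda z: [v * 0.5 for v in z]         # toy denoiser
    out = edit_loop([8, 8, 8], [[1, 1, 1], [2, 2, 2]], [True, False, True], halve)
    print(out)
```

Positions marked `True` in `keep_mask` always end up equal to the inverted latents of the last timestep, while the remaining positions are driven only by the denoiser.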

中文摘要

对指定区域进行3D局部编辑对于游戏行业和机器人交互至关重要。近期的方法通常先编辑渲染出的多视图图像,再重建3D模型,但它们难以精确保留未编辑区域并保持整体连贯性。受结构化3D生成模型的启发,我们提出了VoxHammer,这是一种新颖的免训练方法,可在3D潜在空间中进行精确且连贯的编辑。给定一个3D模型,VoxHammer首先预测其反演轨迹,并在每个时间步获得其反演潜变量和键值令牌。随后,在去噪与编辑阶段,我们用相应的反演潜变量和缓存的键值令牌替换保留区域的去噪特征。通过保留这些上下文特征,该方法可确保保留区域的一致重建以及编辑部分的连贯融合。为了评估保留区域的一致性,我们构建了Edit3D-Bench,这是一个由人工标注的数据集,包含数百个样本,每个样本都带有精心标注的3D编辑区域。实验表明,VoxHammer在保留区域的3D一致性和整体质量两方面均显著优于现有方法。我们的方法有望用于合成高质量的编辑配对数据,从而为上下文内(in-context)3D生成奠定数据基础。项目页面:https://huanngzh.github.io/VoxHammer-Page/


Visual-CoG:面向文本到图像生成的阶段感知强化学习与引导链

  • 标题: Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
  • 作者: Yaqi Li, Peng Chen, Mingyang Han, Bu Pi, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song
  • 日期: 2025-08-25
  • ArXiv主页: https://arxiv.org/abs/2508.18032
  • 论文链接: https://arxiv.org/pdf/2508.18032

英文摘要

Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.
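
The difference between final-only and stage-aware guidance can be made concrete with a toy sketch. The three stage names come from the abstract; the score values, the equal default weights, and both function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional

# Stage names as listed in the abstract.
STAGES = ("semantic_reasoning", "process_refining", "outcome_evaluation")


def final_only_rewards(stage_scores: Dict[str, float]) -> Dict[str, float]:
    """Only the last stage receives a reward signal; earlier stages get 0."""
    last = STAGES[-1]
    return {s: (stage_scores[s] if s == last else 0.0) for s in STAGES}


def stage_aware_rewards(stage_scores: Dict[str, float],
                        weights: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Every stage receives its own (optionally weighted) reward signal."""
    weights = weights or {s: 1.0 for s in STAGES}  # assumption: equal weights
    return {s: weights[s] * stage_scores[s] for s in STAGES}


if __name__ == "__main__":
    scores = {"semantic_reasoning": 0.2,
              "process_refining": 0.9,
              "outcome_evaluation": 0.5}
    print(final_only_rewards(scores))
    print(stage_aware_rewards(scores))
```

With final-only rewards the weak semantic-reasoning stage is invisible to the learner, whereas the stage-aware variant exposes a separate signal for each stage.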

中文摘要

尽管近期的自回归模型在文本到图像(T2I)生成方面取得了可喜的进展,但它们处理多属性和歧义提示的能力仍然有限。为了解决这些局限,现有工作应用思维链(CoT)来实现阶段感知的视觉合成,并采用强化学习(RL)来提升推理能力。然而,大多数模型仅在生成阶段结束时提供奖励信号。这种单一的,仅在最终给出的引导使得难以确定哪些阶段对最终结果有积极贡献,并可能导致次优策略。为了解决这一问题,我们提出了视觉引导链(Visual-CoG)范式,它由三个阶段组成:语义推理,过程细化和结果评估,并通过阶段感知奖励在整个图像生成流程中提供即时引导。我们进一步构建了视觉认知基准VisCog-Bench,它包含四个子任务,用于评估语义推理的有效性。在GenEval,T2I-CompBench和所提出的VisCog-Bench上的全面评估分别显示出15%,5%和19%的提升,证明了所提出的Visual-CoG的优越性能。我们将尽快发布所有资源。


AWorld:编排智能体AI的训练配方

  • 标题: AWorld: Orchestrating the Training Recipe for Agentic AI
  • 作者: Chengyue Yu, Siyuan Lu, Chenyi Zhuang, Dong Wang, Qintong Wu, Zongyue Li, Runsheng Gan, Chunfeng Wang, Siqi Hou, Gaochi Huang, Wenlong Yan, Lifeng Hong, Aohui Xue, Yanfeng Wang, Jinjie Gu, David Tsai, Tao Lin
  • 日期: 2025-08-28
  • ArXiv主页: https://arxiv.org/abs/2508.20404
  • 论文链接: https://arxiv.org/pdf/2508.20404

英文摘要

The learning from practice paradigm is crucial for developing capable Agentic AI systems, yet it is severely hampered by inefficient experience generation, a bottleneck especially pronounced in complex benchmarks like GAIA. To address this, we introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution. This critical speedup makes extensive reinforcement learning practical and scalable. Leveraging this capability, we trained a Qwen3-32B-based agent that significantly outperforms its base model, increasing its overall GAIA accuracy from 21.59% to 32.23%. On the benchmark’s most challenging levels, our agent achieves a score of 16.33%, surpassing the performance of leading proprietary models. Our open-source system and resulting agent provide a practical blueprint for a complete agentic AI training pipeline, from efficient interaction to demonstrable model improvement.

中文摘要

“从实践中学习”范式对于开发有能力的智能体AI系统至关重要,但它受到经验生成效率低下的严重阻碍,这一瓶颈在GAIA等复杂基准中尤为明显。为了解决这个问题,我们推出了AWorld,这是一个为大规模智能体-环境交互而设计的开源系统。通过将任务分布到集群上,与标准的单节点顺序执行相比,AWorld将经验收集速度提升了14.6倍。这一关键加速使大规模强化学习变得实用且可扩展。利用这一能力,我们训练了一个基于Qwen3-32B的智能体,它显著优于其基础模型,将总体GAIA准确率从21.59%提高到32.23%。在该基准最具挑战性的级别上,我们的智能体取得了16.33%的得分,超过了领先专有模型的表现。我们的开源系统和由此得到的智能体为完整的智能体AI训练流程提供了实用蓝图,涵盖从高效交互到可验证的模型改进。


MV-RAG:检索增强的多视图扩散

英文摘要

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.

中文摘要

文本到3D生成方法借助预训练的2D扩散先验取得了显著进展,能够产生高质量且3D一致的输出。然而,它们往往无法生成域外(OOD)或罕见概念,导致结果不一致或不准确。为此,我们提出了MV-RAG,这是一种新颖的文本到3D流程,它首先从大型真实场景2D数据库中检索相关的2D图像,然后以这些图像为条件驱动多视图扩散模型,合成一致且准确的多视图输出。这种检索条件模型的训练通过一种新颖的混合策略实现,该策略桥接了结构化的多视图数据和多样化的2D图像集合。具体而言,一方面在多视图数据上训练,使用增强的条件视图来模拟检索差异,以进行特定视图的重建;另一方面在检索到的真实世界2D图像集合上训练,采用独特的留出视图预测目标:模型根据其他视图预测被留出的视图,从而从2D数据中推断3D一致性。为了便于进行严格的OOD评估,我们引入了一组新的具有挑战性的OOD提示。与最先进的文本到3D,图像到3D和个性化基线的对比实验表明,我们的方法显著提升了OOD/罕见概念的3D一致性,照片真实感和文本遵循度,同时在标准基准上保持有竞争力的性能。


CODA:通过解耦强化学习协调大脑与小脑的双脑计算机使用智能体

英文摘要

Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains such as scientific computing, where both long-horizon planning and precise execution are required. Existing approaches suffer from a trade-off: generalist agents excel at planning but perform poorly in execution, while specialized agents demonstrate the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a planner and an actor, but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. In the first stage, Specialization, we apply a decoupled GRPO approach to train an expert planner for each scientific application individually, bootstrapping from a small set of task trajectories. In the second stage, Generalization, we aggregate all successful trajectories from the specialized experts to build a consolidated dataset, which is then used for supervised fine-tuning of the final planner. This equips CODA with both robust execution and cross-domain generalization. Evaluated on four challenging applications from the ScienceBoard benchmark, CODA significantly outperforms baselines and establishes a new state of the art among open-source models.

中文摘要

面向图形用户界面(GUI)的自主智能体在科学计算等专业领域面临重大挑战,这些领域既需要长程规划,也需要精确执行。现有方法存在权衡:通用智能体擅长规划但执行表现不佳,而专用智能体则表现出相反的弱点。近期的组合式框架试图通过结合规划器和执行器来弥合这一差距,但它们通常是静态且不可训练的,无法从经验中适应。鉴于科学领域高质量数据的稀缺,这是一个关键局限。为了解决这些局限,我们提出了CODA,这是一个新颖且可训练的组合式框架,它将通用规划器(Cerebrum,大脑)与专用执行器(Cerebellum,小脑)相结合,并通过专门的两阶段流程进行训练。在第一阶段“专业化”中,我们采用解耦的GRPO方法,为每个科学应用分别训练一个专家规划器,并从一小组任务轨迹开始自举。在第二阶段“泛化”中,我们汇总各专家的所有成功轨迹,构建一个整合数据集,然后用于对最终规划器进行监督微调。这使CODA兼具稳健的执行能力和跨领域泛化能力。在ScienceBoard基准的四个具有挑战性的应用上进行评估,CODA显著优于基线,并在开源模型中确立了新的最先进水平。


UltraMemV2:扩展至120B参数,具备出色长上下文学习能力的记忆网络

英文摘要

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameters but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

中文摘要

尽管混合专家(MoE)模型通过仅激活部分参数实现了出色的效率,但它们在推理期间面临很高的内存访问成本。记忆层架构提供了一种有吸引力的替代方案,其内存访问量非常少,但此前的尝试(如UltraMem)仅能达到2专家MoE模型的性能,与最先进的8专家配置相比差距明显。我们提出了UltraMemV2,这是一种重新设计的记忆层架构,弥合了这一性能差距。我们的方法引入了五项关键改进:将记忆层集成到每个Transformer块中,用单个线性投影简化值扩展,采用来自PEER的基于FFN的值处理,实施有原则的参数初始化,以及重新平衡记忆层与FFN的计算比例。通过广泛的评估,我们证明UltraMemV2在相同的计算量和参数量下达到了与8专家MoE模型相当的性能,而内存访问量显著更低。值得注意的是,UltraMemV2在记忆密集型任务上表现更优,在长上下文记忆上提升+1.6分,在多轮记忆上提升+6.2分,在上下文学习上提升+7.9分。我们在大规模上验证了该方法,模型规模最高达到总参数120B,激活参数2.5B,并确认激活密度对性能的影响大于稀疏参数总量。我们的工作使记忆层架构达到了与最先进MoE模型相当的性能,为高效稀疏计算提供了有力的替代方案。


Hermes 4 技术报告

英文摘要

We present Hermes 4, a family of hybrid reasoning models that combine structured, multi-turn reasoning with broad instruction-following ability. We describe the challenges encountered during data curation, synthesis, training, and evaluation, and outline the solutions employed to address these challenges at scale. We comprehensively evaluate across mathematical reasoning, coding, knowledge, comprehension, and alignment benchmarks, and we report both quantitative performance and qualitative behavioral analysis. To support open research, all model weights are published publicly at https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728

中文摘要

我们介绍了Hermes 4,这是一个混合推理模型家族,它将结构化的多轮推理与广泛的指令遵循能力相结合。我们描述了在数据整理,合成,训练和评估过程中遇到的挑战,并概述了为大规模应对这些挑战所采用的解决方案。我们在数学推理,编码,知识,理解和对齐基准上进行了全面评估,并报告了定量性能和定性行为分析。为了支持开放研究,所有模型权重均公开发布于 https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728


PIXIE:从像素中快速且可泛化地监督学习3D物理

英文摘要

Inferring the physical properties of 3D scenes from visual information is a critical yet challenging task for creating interactive and realistic virtual worlds. While humans intuitively grasp material characteristics such as elasticity or stiffness, existing methods often rely on slow, per-scene optimization, limiting their generalizability and application. To address this problem, we introduce PIXIE, a novel method that trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features purely using supervised losses. Once trained, our feed-forward network can perform fast inference of plausible material fields, which coupled with a learned static scene representation like Gaussian Splatting enables realistic physics simulation under external forces. To facilitate this research, we also collected PIXIEVERSE, one of the largest known datasets of paired 3D assets and physical material annotations. Extensive evaluations demonstrate that PIXIE is about 1.46-4.39x better and orders of magnitude faster than test-time optimization methods. By leveraging pretrained visual features like CLIP, our method can also zero-shot generalize to real-world scenes despite only ever being trained on synthetic data. https://pixie-3d.github.io/

中文摘要

从视觉信息中推断3D场景的物理属性,是创建可交互且逼真的虚拟世界的一项关键而又具有挑战性的任务。尽管人类能够直观地把握弹性或刚度等材料特性,但现有方法通常依赖缓慢的逐场景优化,限制了其泛化能力和应用。为了解决这个问题,我们提出了PIXIE,这是一种新颖的方法,它训练一个可泛化的神经网络,仅使用监督损失,根据3D视觉特征预测多个场景的物理属性。训练完成后,我们的前馈网络可以快速推断出合理的材料场,再结合高斯泼溅(Gaussian Splatting)等学习得到的静态场景表示,即可在外力作用下实现逼真的物理模拟。为了推动这项研究,我们还收集了PIXIEVERSE,这是已知最大的3D资产与物理材料标注配对数据集之一。大量评估表明,PIXIE比测试时优化方法好约1.46-4.39倍,并且快若干个数量级。通过利用CLIP等预训练视觉特征,尽管只在合成数据上训练过,我们的方法也能零样本泛化到真实世界场景。https://pixie-3d.github.io/


面向长视频生成的上下文混合(Mixture of Contexts)

英文摘要

Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
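
The routing rule described above (a few top-scoring chunks plus mandatory anchors, restricted to the past) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mean-pooled chunk descriptor, the dot-product score, and all function and parameter names are assumptions made for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(chunk: Sequence[Vector]) -> List[float]:
    """Summarise a chunk of key vectors by their mean (assumed descriptor)."""
    dim = len(chunk[0])
    return [sum(key[d] for key in chunk) / len(chunk) for d in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def route_query(query: Vector,
                chunks: Sequence[Sequence[Vector]],
                query_chunk: int,
                top_k: int,
                anchors: Sequence[int] = (0,)) -> List[int]:
    """Return the sorted chunk indices that one query attends to."""
    # Mandatory links: anchor chunks (e.g. the caption) and the local chunk.
    selected = {a for a in anchors if a <= query_chunk} | {query_chunk}
    # Causal routing: only strictly earlier chunks are candidates.
    candidates = [i for i in range(query_chunk) if i not in selected]
    ranked = sorted(candidates,
                    key=lambda i: dot(query, mean_pool(chunks[i])),
                    reverse=True)
    selected.update(ranked[:top_k])
    return sorted(selected)


if __name__ == "__main__":
    chunks = [[[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]], [[0.5, 0.5]], [[9.0, 9.0]]]
    print(route_query([1.0, 0.0], chunks, query_chunk=3, top_k=1))
```

Because each query attends to at most `top_k` routed chunks plus a fixed number of mandatory ones, the attention cost per query stays bounded as the sequence grows, which is consistent with the near-linear scaling mentioned in the abstract.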

中文摘要

长视频生成从根本上说是一个长上下文记忆问题:模型必须在很长的范围内保留并检索显著事件,而不发生崩溃或漂移。然而,扩展扩散Transformer以生成长上下文视频,从根本上受限于自注意力的二次成本,这使得内存和计算难以承受,也难以针对长序列进行优化。我们将长上下文视频生成重新表述为内部信息检索任务,并提出了一个简单,可学习的稀疏注意力路由模块,即上下文混合(Mixture of Contexts,MoC),作为有效的长期记忆检索引擎。在MoC中,每个查询动态选择少量信息丰富的块以及必选锚点(字幕,局部窗口)进行注意力计算,并采用因果路由来防止形成闭环。随着我们扩大数据规模并逐步使路由稀疏化,模型会将计算分配给显著的历史内容,在长达数分钟的内容中保持身份,动作和场景。效率是检索的副产品(接近线性的扩展),这使得训练和合成切实可行,并使记忆和一致性在分钟级尺度上涌现。


分析思维链动态:主动引导还是不忠实的事后合理化?

英文摘要

Recent work has demonstrated that Chain-of-Thought (CoT) often yields limited gains for soft-reasoning problems such as analytical and commonsense reasoning. CoT can also be unfaithful to a model’s actual reasoning. We investigate the dynamics and faithfulness of CoT in soft-reasoning tasks across instruction-tuned, reasoning and reasoning-distilled models. Our findings reveal differences in how these models rely on CoT, and show that CoT influence and faithfulness are not always aligned.

中文摘要

近期的工作表明,对于分析推理和常识推理等软推理问题,思维链(CoT)带来的收益往往有限。CoT也可能不忠实于模型的实际推理。我们研究了CoT在软推理任务中的动态和忠实性,涵盖指令微调模型,推理模型和推理蒸馏模型。我们的发现揭示了这些模型依赖CoT的方式存在差异,并表明CoT的影响力与忠实性并不总是一致。


理解工具集成推理

英文摘要

We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM’s capabilities. We demonstrate that tools enable a strict expansion of the model’s empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, leveraging a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally-intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool usage behavior with early code invocation and much more interactive turns with ASPO. Overall, our work provides the first principled explanation for TIR’s success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
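
The abstract reports results on the pass@k metric without saying how it is estimated. A common choice is the unbiased estimator popularised by the Codex evaluation work, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below assumes that estimator; whether the paper uses exactly this form is not stated in the abstract.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Equals the probability that at least one of k samples, drawn without
    replacement from the n generated samples, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimate reduces to the plain fraction of correct samples, c / n.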

中文摘要

我们研究了工具集成推理(TIR)为何能让大语言模型(LLM)更强大。尽管与Python代码解释器等工具集成的LLM展现出巨大潜力,但一直缺少解释这一范式为何有效的有原则的理论。这项工作首次给出了TIR从根本上扩展LLM能力的形式化证明。我们证明,工具能够严格扩大模型的经验支撑集和可行支撑集,通过解锁原本不可能实现或冗长到难以处理的问题求解策略,打破纯文本模型的能力上限。为了在不损害训练稳定性和性能的前提下引导模型行为,我们还提出了优势塑造策略优化(ASPO),这是一种直接修改优势函数以引导策略行为的新算法。我们以Python解释器作为外部工具,在具有挑战性的数学基准上进行了全面实验。结果表明,TIR模型在pass@k指标上明显优于其纯文本对应模型。关键的是,这一优势并不局限于计算密集型问题,还延伸到需要大量抽象洞察的问题。我们进一步识别了涌现的认知模式,这些模式说明了模型如何学会借助工具进行思考。最后,我们报告称,使用ASPO后工具使用行为得到改善,表现为更早的代码调用和多得多的交互轮次。总体而言,我们的工作首次为TIR的成功提供了有原则的解释,将关注点从“工具有效”这一事实本身,转向工具为何以及如何实现更强大的推理。


Spacer:迈向工程化的科学灵感

英文摘要

Recent advances in LLMs have made automated scientific research the next frontline in the path to artificial superintelligence. However, these systems are bound either to tasks of narrow scope or the limited creative capabilities of LLMs. We propose Spacer, a scientific discovery system that develops creative and factually grounded concepts without external intervention. Spacer attempts to achieve this via ‘deliberate decontextualization,’ an approach that disassembles information into atomic units - keywords - and draws creativity from unexplored connections between them. Spacer consists of (i) Nuri, an inspiration engine that builds keyword sets, and (ii) the Manifesting Pipeline that refines these sets into elaborate scientific statements. Nuri extracts novel, high-potential keyword sets from a keyword graph built with 180,000 academic publications in biological fields. The Manifesting Pipeline finds links between keywords, analyzes their logical structure, validates their plausibility, and ultimately drafts original scientific concepts. According to our experiments, the evaluation metric of Nuri accurately classifies high-impact publications with an AUROC score of 0.737. Our Manifesting Pipeline also successfully reconstructs core concepts from the latest top-journal articles solely from their keyword sets. An LLM-based scoring system estimates that this reconstruction was sound for over 85% of the cases. Finally, our embedding space analysis shows that outputs from Spacer are significantly more similar to leading publications compared with those from SOTA LLMs.
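
For readers unfamiliar with the AUROC figure quoted above, the sketch below computes it from its pairwise definition: the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels in the example are made up for illustration and have no connection to the paper's data.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC; labels are 1 (positive) or 0 (negative)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical scores: 1 = high-impact publication, 0 = other.
    print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported 0.737 sits between the two.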

中文摘要

LLM的最新进展使自动化科学研究成为通往人工超级智能道路上的下一个前沿。然而,这些系统要么局限于范围狭窄的任务,要么受限于LLM有限的创造能力。我们提出了Spacer,这是一个科学发现系统,它无需外部干预即可发展出有创造性且有事实依据的概念。Spacer试图通过“刻意去语境化”来实现这一点,该方法将信息拆解为原子单元(关键词),并从它们之间尚未被探索的联系中汲取创造力。Spacer由两部分组成:(i)Nuri,一个构建关键词集合的灵感引擎,以及(ii)显化流程(Manifesting Pipeline),它将这些集合提炼为详尽的科学陈述。Nuri从一个由生物领域180,000篇学术出版物构建的关键词图中提取新颖且高潜力的关键词集合。显化流程寻找关键词之间的联系,分析其逻辑结构,验证其合理性,并最终起草原创的科学概念。根据我们的实验,Nuri的评估指标能够准确地对高影响力出版物进行分类,AUROC得分为0.737。我们的显化流程还成功地仅凭关键词集合重建了最新顶级期刊文章的核心概念。一个基于LLM的评分系统估计,超过85%的案例中这种重建是合理的。最后,我们的嵌入空间分析表明,与SOTA LLM的输出相比,Spacer的输出与领先出版物的相似度显著更高。

