
Deep Tech

Latest ArXiv Paper Picks

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Yan Li, Zezi Zeng, Yifan Yang 2026-04-16

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.



Generalization in LLM Problem Solving: The Case of the Shortest Path

Yao Tong, Jiayuan Ye, Anastasia Borovykh 2026-04-16

Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
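For readers unfamiliar with the task, here is a minimal sketch of shortest-path planning on a small grid map using breadth-first search. The grid encoding, the function name `shortest_path`, and the 4-connected movement rule are assumptions made for illustration; the abstract does not describe the paper's actual environment or data format.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    grid is a list of strings; '#' marks a wall (hypothetical encoding).
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}           # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != '#' and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None                      # goal is unreachable

demo_map = ["..#",
            "..#",
            "..."]
path = shortest_path(demo_map, (0, 0), (2, 2))
moves = len(path) - 1                # number of steps on a shortest route
```

In this framing, "spatial transfer" would correspond to solving unseen maps of similar size, and "length scaling" to solving maps whose shortest routes are longer than any seen in training.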



Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Manan Gupta, Dhruv Kumar 2026-04-16

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: (1) a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and (2) split conformal prediction sets over 1-5 Likert scores providing theoretically guaranteed $\geq (1-\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = +0.576$, $N = 1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
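To make the transitivity diagnostic concrete, here is a minimal sketch of counting directed 3-cycles among one judge's pairwise preferences for a single document. The `prefers` dictionary format and the summary names are hypothetical; the abstract does not specify how the authors represent judgments or compute their aggregate violation rate.

```python
from itertools import combinations

def count_directed_3cycles(items, prefers):
    """Count unordered triples whose pairwise preferences form a cycle.

    prefers[(x, y)] is True when the judge ranks x above y (hypothetical format).
    """
    def beats(x, y):
        return prefers.get((x, y), False)

    cycles = 0
    for a, b, c in combinations(items, 3):
        # Intransitive if a > b > c > a, or the reverse orientation.
        forward = beats(a, b) and beats(b, c) and beats(c, a)
        backward = beats(b, a) and beats(c, b) and beats(a, c)
        if forward or backward:
            cycles += 1
    return cycles

summaries = ["s1", "s2", "s3", "s4"]
judgments = {
    ("s1", "s2"): True, ("s2", "s3"): True, ("s3", "s1"): True,  # a cycle
    ("s1", "s4"): True, ("s2", "s4"): True, ("s3", "s4"): True,  # all beat s4
}
violations = count_directed_3cycles(summaries, judgments)
```

A document with `violations > 0` would count toward the share of documents "exhibiting at least one directed 3-cycle", even when cycles are a small fraction of all triples.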



GitHub Trending

Recently Trending AI Projects

First Principles Thinking

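Break a complex problem down into its most basic elements, then rebuild the solution from scratch. Reason from fundamental truths rather than from analogy or prior experience.

Example: When building batteries, Elon Musk refused to accept the assumption that "batteries are just expensive." He analyzed the cost of the raw materials and found that the cost could be cut substantially.

— Aristotle / Elon Musk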

Occam's Razor

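Entities should not be multiplied beyond necessity. Among competing hypotheses, choose the one with the fewest assumptions. Complex explanations often hide errors.

Example: When you hear hoofbeats, think of horses, not zebras, unless there is clear evidence that the rarer case applies.

— William of Ockham (14th century)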

Second-Order Thinking

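Consider not only the direct consequences of an action but also the chain reactions those consequences set off. Ask yourself: "And then what? And after that?"

Example: A price cut boosts short-term sales (first-order effect), but it may damage the brand's image and trigger a price war (second-order effects).

— Howard Marks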

The only way to do great work is to love what you do.


— Steve Jobs

In the middle of difficulty lies opportunity.


— Albert Einstein

The best time to plant a tree was 20 years ago. The second best time is now.


— Chinese Proverb

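Without accumulating small steps, one cannot travel a thousand li; without gathering small streams, one cannot form rivers and seas.

— Xunzi

Read widely and keep only the essentials; accumulate deeply and release sparingly.

— Su Shi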

Example

Stay hungry, stay foolish

The people who are crazy enough to think

they can change the world

are the ones who do

Here's to the crazy ones

The misfits, the rebels

The troublemakers

The round pegs in the square holes

They're not fond of rules

And they have no respect for the status quo

You can quote them, disagree with them

Glorify or vilify them

But the only thing you can't do

is ignore them

Because they change things

They push the human race forward

algorithm

/ˈælɡəˌrɪðəm/

n. a step-by-step procedure for solving a problem or performing a computation

The sorting algorithm runs in O(n log n) time.

We need to optimize this algorithm for better performance.
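As one illustration of a sorting algorithm with O(n log n) running time, here is a minimal sketch of top-down merge sort. The function name `merge_sort` is chosen for this example; the flashcard does not say which sorting algorithm it has in mind.

```python
def merge_sort(values):
    """Return a new sorted list using top-down merge sort."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left = merge_sort(values[:mid])     # sort each half recursively
    right = merge_sort(values[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:         # <= keeps equal items in input order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])             # at most one of these is non-empty
    merged.extend(right[j:])
    return merged

sorted_values = merge_sort([5, 2, 9, 1, 5, 6])
```

The list is halved about log n times and each level does O(n) merging work, which is where the O(n log n) bound comes from.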

recursion

/rɪˈkɜːrʒn/

n. the technique of defining or solving something in terms of smaller instances of itself

Recursion is a method in which the solution depends on solutions to smaller instances of the same problem.

Be careful with recursion to avoid stack overflow.
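The two example sentences can be illustrated with a short sketch: a recursive factorial, an iterative equivalent, and a deliberately deep call that exceeds the interpreter's limit. This assumes CPython, which raises `RecursionError` rather than overflowing the native stack; the function names are chosen for this example.

```python
import sys

def factorial(n):
    """Recursive factorial: n! is defined via the smaller instance (n - 1)!."""
    if n <= 1:                      # base case stops the recursion
        return 1
    return n * factorial(n - 1)

def factorial_iterative(n):
    """Loop-based equivalent that uses constant call depth."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

small = factorial(5)

# Recursing deeper than the interpreter's limit fails.
deep = sys.getrecursionlimit() + 100
try:
    factorial(deep)
    overflowed = False
except RecursionError:
    overflowed = True
```

When the depth can grow with the input, rewriting the recursion as a loop, as in `factorial_iterative`, avoids the problem.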

encapsulation

/ɪnˌkæpsjuˈleɪʃn/

n. bundling data with the methods that operate on it and restricting direct access to that data

Encapsulation hides the internal state of an object from the outside.

Good encapsulation leads to more maintainable code.
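A minimal sketch of encapsulation in Python follows. The `BankAccount` class is a hypothetical example; Python relies on naming convention and properties rather than enforced access modifiers.

```python
class BankAccount:
    """Keeps its balance internal and exposes a small, validated interface."""

    def __init__(self, opening_balance=0):
        self._balance = opening_balance   # leading underscore: internal by convention

    @property
    def balance(self):
        """Read-only view of the internal state."""
        return self._balance

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self._balance += amount

account = BankAccount(100)
account.deposit(50)
current = account.balance
```

Because `balance` has no setter, callers can only change the state through `deposit`, which keeps the validation in one place.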

polymorphism

/ˌpɒliˈmɔːfɪzəm/

n. the ability of a single interface to work with objects of different types

Polymorphism allows objects of different classes to be treated as objects of a common superclass.

Method overriding is a common way to implement polymorphism.
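The sketch below shows polymorphism through method overriding. The `Shape`, `Rectangle`, and `Square` classes are hypothetical names used only for illustration.

```python
class Shape:
    """Common superclass; subclasses override area()."""

    def area(self):
        raise NotImplementedError("subclasses must implement area()")

class Rectangle(Shape):
    def __init__(self, width, height):
        self.width, self.height = width, height

    def area(self):                 # overrides Shape.area
        return self.width * self.height

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):                 # overrides Shape.area
        return self.side ** 2

# The loop treats every object as a Shape and never checks the concrete type.
shapes = [Rectangle(2, 3), Square(4)]
areas = [shape.area() for shape in shapes]
```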

inheritance

/ɪnˈherɪtəns/

n. a mechanism by which a class derives properties and behavior from another class

Inheritance enables new classes to receive the properties and methods of existing classes.

Multiple inheritance can lead to the diamond problem.
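The diamond problem arises when a class inherits from two classes that share a base, so it is ambiguous which override wins. The sketch below shows how Python resolves this with its method resolution order (MRO); other languages handle it differently, and the class names here are hypothetical.

```python
class Base:
    def greet(self):
        return "Base"

class Left(Base):
    def greet(self):
        return "Left"

class Right(Base):
    def greet(self):
        return "Right"

class Child(Left, Right):   # diamond: both parents derive from Base
    pass

# Python linearizes the hierarchy (C3 algorithm); the first match wins.
mro_names = [cls.__name__ for cls in Child.__mro__]
chosen = Child().greet()
```

Here `Left` comes before `Right` in the MRO because it is listed first in the `Child` definition.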

abstraction

/æbˈstrækʃn/

n. exposing only the essential features of something while hiding implementation details

Abstraction reduces complexity by hiding unnecessary details.

An abstract class cannot be instantiated directly.
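A minimal sketch using Python's standard `abc` module shows both points: callers depend on an abstract interface, and instantiating the abstract class itself fails. The `Storage` and `MemoryStorage` names are hypothetical.

```python
from abc import ABC, abstractmethod

class Storage(ABC):
    """Abstract interface: says what can be done, not how."""

    @abstractmethod
    def read(self, key):
        """Return the value stored under key."""

class MemoryStorage(Storage):
    """Concrete implementation backed by a dictionary."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]

try:
    Storage()                       # abstract method is not implemented
    could_instantiate = True
except TypeError:
    could_instantiate = False

store = MemoryStorage()
store.write("greeting", "hello")
```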

concurrency

/kənˈkʌrənsi/

n. the ability of a system to make progress on several tasks during overlapping time periods

Concurrency allows multiple tasks to run in overlapping time periods.

Handling concurrency correctly is crucial for multi-threaded applications.
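One common aspect of handling concurrency correctly is protecting shared state. The sketch below uses Python's `threading` module with a lock around a shared counter; the `SafeCounter` class and the thread counts are assumptions for illustration.

```python
import threading

class SafeCounter:
    """Counter whose updates are serialized by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:            # only one thread updates at a time
            self.value += 1

def worker(counter, times):
    for _ in range(times):
        counter.increment()

counter = SafeCounter()
threads = [threading.Thread(target=worker, args=(counter, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                        # wait for every thread to finish
total = counter.value
```

Without the lock, the read-modify-write in `increment` could interleave across threads and lose updates.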

serialization

/ˌsɪəriəlaɪˈzeɪʃn/

n. the conversion of an object into a format that can be stored or transmitted

Serialization converts an object into a stream of bytes for storage or transmission.

JSON is a popular format for data serialization.
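A short round trip with Python's standard `json` module illustrates both sentences. The `record` contents are made up for this example.

```python
import json

record = {"name": "Ada", "languages": ["Python", "C"], "active": True}

text = json.dumps(record, sort_keys=True)       # object -> JSON text
payload = text.encode("utf-8")                  # text -> bytes for storage
restored = json.loads(payload.decode("utf-8"))  # bytes -> object again
```

JSON covers only basic types (objects, arrays, strings, numbers, booleans, null), so custom classes need an explicit conversion step.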

asynchronous

/eɪˈsɪŋkrənəs/

adj. not blocking the caller while an operation completes

Asynchronous programming allows the program to continue executing while waiting for I/O.

Use async/await syntax for cleaner asynchronous code.
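The following sketch uses Python's `asyncio` to show async/await syntax. The `fetch` coroutine is hypothetical, and `asyncio.sleep` stands in for real I/O such as a network request.

```python
import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)      # yields control while "waiting for I/O"
    return f"{name} done"

async def main():
    # Both coroutines wait concurrently; gather returns results in argument order.
    return await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))

results = asyncio.run(main())
```

The two waits overlap, so the total run time is close to the longer delay, not the sum of both.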

deprecated

/ˈdeprəkeɪtɪd/

adj. still available but discouraged, and typically scheduled for removal

This method is deprecated and will be removed in the next version.

Avoid using deprecated APIs in new projects.
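In Python, a deprecated function is conventionally marked by emitting a `DeprecationWarning` from the standard `warnings` module. The sketch below assumes two hypothetical functions, `old_api` and its replacement `new_api`.

```python
import warnings

def new_api(x):
    return x * 2

def old_api(x):
    """Deprecated wrapper kept for backward compatibility."""
    warnings.warn(
        "old_api is deprecated; use new_api instead",
        DeprecationWarning,
        stacklevel=2,               # attribute the warning to the caller
    )
    return new_api(x)

# Capture warnings so the example can inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_api(21)
```

Keeping the old name as a thin wrapper gives callers a migration period before the function is removed.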