
Deep Tech

Selected recent papers from arXiv

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Fan Shu, Yite Wang, Ruofan Wu 2026-02-27

The fast-growing demand for using Large Language Models (LLMs) to tackle complex multi-step data science tasks creates an urgent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially on machine learning modeling tasks. Fine-tuning on DARE-bench training tasks can substantially improve model performance: supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x, and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These significant improvements confirm the value of DARE-bench both as an accurate evaluation benchmark and as a source of critical training data.



Do LLMs Benefit From Their Own Words?

Jenny Y. Huang, Leshem Choshen, Ramon Astudillo 2026-02-27

Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model. To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10x. To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns. Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.
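The user-turn-only setup can be sketched as a simple filter over an OpenAI-style chat message list. The message format and helper name below are illustrative assumptions, not the paper's actual implementation:

```python
def user_turns_only(messages):
    """Keep system and user turns; omit all prior assistant responses."""
    return [m for m in messages if m["role"] != "assistant"]

conversation = [
    {"role": "user", "content": "Summarize this article about tidal power."},
    {"role": "assistant", "content": "Tidal power is..."},
    {"role": "user", "content": "List three key terms used in the article."},
]

# Prompting with user turns only can sharply cut cumulative context length.
filtered = user_turns_only(conversation)
print([m["role"] for m in filtered])  # ['user', 'user']
```

Note that the follow-up here is answerable from the user turns alone; as the paper observes, that holds for a large fraction of in-the-wild turns, but not all of them.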



CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Weinan Dai, Hanlin Wu, Qiying Yu 2026-02-27

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering faster-than-torch.compile rates of 100%, 100%, and 92% on the Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
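As a rough illustration of the "reliable reward signals" the abstract mentions, a verification-gated speedup reward might look like the sketch below. The function and its exact form are hypothetical; the abstract does not specify the reward design:

```python
def kernel_reward(passed_verification, baseline_ms, kernel_ms):
    """Reward = speedup over the baseline, but only for verified kernels.

    Incorrect kernels get zero reward, so the policy cannot trade
    correctness for speed.
    """
    if not passed_verification or kernel_ms <= 0:
        return 0.0
    return baseline_ms / kernel_ms

# A verified kernel twice as fast as the torch.compile baseline:
print(kernel_reward(True, baseline_ms=4.0, kernel_ms=2.0))   # 2.0
# A faster but incorrect kernel is worth nothing:
print(kernel_reward(False, baseline_ms=4.0, kernel_ms=1.0))  # 0.0
```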



GitHub Trending

Recent trending AI projects

First Principles Thinking

Break complex problems down into their most basic elements, then rebuild the solution from the ground up. Reason from fundamental truths instead of relying on analogy or received experience.

Example: When building batteries, Elon Musk refused to accept the assumption that "batteries are just expensive"; by analyzing their raw-material costs, he found they could be cut dramatically.

— Aristotle / Elon Musk

Occam's Razor

Entities should not be multiplied beyond necessity. Among competing hypotheses, prefer the one with the fewest assumptions. Complicated explanations often conceal errors.

Example: When you hear hoofbeats, think of horses first, not zebras, unless there is clear evidence of the rarer case.

— William of Ockham (14th century)

Second-Order Thinking

Consider not only the direct consequences of an action but also the chain reactions those consequences set off. Ask yourself: "And then what? And then what after that?"

Example: A promotional price cut raises short-term sales (the first-order effect) but may damage the brand and trigger a price war (the second-order effect).

— Howard Marks

The only way to do great work is to love what you do.

— Steve Jobs

In the middle of difficulty lies opportunity.

— Albert Einstein

The best time to plant a tree was 20 years ago. The second best time is now.

— Chinese Proverb

Without accumulating single steps, one cannot travel a thousand li; without gathering small streams, one cannot form rivers and seas.

— Xunzi

Observe widely but select carefully; accumulate richly and release sparingly.

— Su Shi

Example

Stay hungry, stay foolish

The people who are crazy enough to think

they can change the world

are the ones who do

Here's to the crazy ones

The misfits, the rebels

The troublemakers

The round pegs in the square holes

They're not fond of rules

And they have no respect for the status quo

You can quote them, disagree with them

Glorify or vilify them

But the only thing you can't do

is ignore them

Because they change things

They push the human race forward

algorithm

/ˈælɡəˌrɪðəm/

n. a step-by-step procedure for solving a problem or performing a computation

The sorting algorithm runs in O(n log n) time complexity.

We need to optimize this algorithm for better performance.
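A classic O(n log n) algorithm, merge sort, in a short Python sketch:

```python
def merge_sort(items):
    """Divide-and-conquer sorting algorithm with O(n log n) time complexity."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged = []
    while left and right:
        # Take the smaller head element from either half.
        merged.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return merged + left + right

print(merge_sort([5, 2, 9, 1]))  # [1, 2, 5, 9]
```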

recursion

/rɪˈkɜːrʒn/

n. a process in which a function or definition refers to itself

Recursion is a method where the solution depends on solutions to smaller instances.

Be careful with recursion to avoid stack overflow.
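A minimal example: the factorial defined in terms of a smaller instance of itself:

```python
def factorial(n):
    """Recursion: the answer for n depends on the answer for n - 1."""
    if n <= 1:          # base case stops the recursion
        return 1
    return n * factorial(n - 1)

print(factorial(5))  # 120
# Without a base case, deep recursion raises RecursionError,
# Python's guard against stack overflow.
```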

encapsulation

/ɪnˌkæpsjuˈleɪʃn/

n. the bundling of data with the methods that operate on it, hiding internal state

Encapsulation hides the internal state of an object from the outside.

Good encapsulation leads to more maintainable code.
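A short Python sketch: state is kept internal and changed only through methods:

```python
class Counter:
    """Encapsulation: internal state is hidden behind a small public API."""

    def __init__(self):
        self._count = 0  # leading underscore marks this as internal state

    def increment(self):
        self._count += 1

    def value(self):
        return self._count

c = Counter()
c.increment()
c.increment()
print(c.value())  # 2
```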

polymorphism

/ˌpɒliˈmɔːfɪzəm/

n. the ability of objects of different types to be used through a common interface

Polymorphism allows objects of different classes to be treated as objects of a common superclass.

Method overriding is a common way to implement polymorphism.
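A short Python sketch: different subclasses override the same method and are used uniformly:

```python
class Shape:
    def area(self):
        raise NotImplementedError

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):  # overrides Shape.area
        return self.side ** 2

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):  # overrides Shape.area
        return 3.14159 * self.radius ** 2

# Different classes, treated as objects of the common superclass:
for shape in [Square(2), Circle(1)]:
    print(shape.area())
```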

inheritance

/ɪnˈherɪtəns/

n. the mechanism by which one class acquires the properties of another

Inheritance enables new classes to receive the properties of existing classes.

Multiple inheritance can lead to the diamond problem.
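A short Python sketch: the subclass receives the parent's constructor and can refine its behavior:

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"

class Dog(Animal):  # Dog inherits __init__ and may override speak
    def speak(self):
        return f"{self.name} barks"

print(Animal("Generic").speak())  # Generic makes a sound
print(Dog("Rex").speak())         # Rex barks
```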

abstraction

/æbˈstrækʃn/

n. the process of hiding implementation details and exposing only the essentials

Abstraction reduces complexity by hiding unnecessary details.

An abstract class cannot be instantiated directly.
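A short Python sketch using the standard `abc` module: the abstract class specifies what must exist, not how it works:

```python
from abc import ABC, abstractmethod

class Storage(ABC):
    """Abstract class: declares the interface without an implementation."""

    @abstractmethod
    def save(self, data): ...

class MemoryStorage(Storage):
    def __init__(self):
        self.items = []

    def save(self, data):
        self.items.append(data)

# Storage() would raise TypeError: abstract classes cannot be
# instantiated directly.
store = MemoryStorage()
store.save("record")
print(store.items)  # ['record']
```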

concurrency

/kənˈkʌrənsi/

n. the execution of multiple tasks during overlapping time periods

Concurrency allows multiple tasks to run in overlapping time periods.

Handling concurrency correctly is crucial for multi-threaded applications.
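A short Python sketch: several threads run in overlapping time periods, with a lock guarding the shared list:

```python
import threading

results = []
lock = threading.Lock()

def worker(task_id):
    with lock:  # serialize access to the shared state
        results.append(task_id)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread to finish

print(sorted(results))  # [0, 1, 2, 3]
```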

serialization

/ˌsɪəriəlaɪˈzeɪʃn/

n. the conversion of an object into a format that can be stored or transmitted

Serialization converts an object into a stream of bytes for storage.

JSON is a popular format for data serialization.
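A short Python sketch using the standard `json` module: an object is serialized to a string and restored intact:

```python
import json

user = {"name": "Ada", "projects": ["engine", "notes"]}

encoded = json.dumps(user)     # object -> JSON string, ready for storage
decoded = json.loads(encoded)  # JSON string -> object

print(encoded)
print(decoded == user)  # True: the round trip preserves the data
```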

asynchronous

/eɪˈsɪŋkrənəs/

adj. not occurring at the same time; proceeding without blocking on other operations

Asynchronous programming allows the program to continue executing while waiting for I/O.

Use async/await syntax for cleaner asynchronous code.
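A short Python sketch with async/await: two coroutines wait concurrently instead of one after the other:

```python
import asyncio

async def fetch(label, delay):
    await asyncio.sleep(delay)  # yields control while "waiting for I/O"
    return label

async def main():
    # Both waits overlap, so total time is ~max(delays), not their sum.
    results = await asyncio.gather(fetch("a", 0.01), fetch("b", 0.01))
    print(results)  # ['a', 'b']

asyncio.run(main())
```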

deprecated

/ˈdeprəkeɪtɪd/

adj. discouraged from use and likely to be removed in a future version

This method is deprecated and will be removed in the next version.

Avoid using deprecated APIs in new projects.
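A short Python sketch using the standard `warnings` module: the deprecated function warns callers and delegates to its replacement:

```python
import warnings

def new_api():
    return "ok"

def old_api():
    """Deprecated: emits a warning, then delegates to the replacement."""
    warnings.warn("old_api() is deprecated; use new_api() instead",
                  DeprecationWarning, stacklevel=2)
    return new_api()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(old_api())                    # ok
    print(caught[0].category.__name__)  # DeprecationWarning
```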