Kindle Dash - 信息仪表盘

FOCUS

--:--

💡 每日语录

The only way to do great work is to love what you do.

做好工作的唯一方法就是热爱你所做的事情。

— Steve Jobs

In the middle of difficulty lies opportunity.

困难之中蕴藏着机遇。

— Albert Einstein

The best time to plant a tree was 20 years ago. The second best time is now.

种一棵树最好的时间是20年前。第二个最好的时间是现在。

— Chinese Proverb

不积跬步，无以至千里；不积小流，无以成江海。

— 荀子

博观而约取，厚积而薄发。

— 苏轼

🎵 歌词本

Stay hungry, stay foolish

保持饥饿，保持愚昧

The people who are crazy enough to think

那些疯狂到认为自己

they can change the world

能够改变世界的人

are the ones who do

往往正是那些真正改变世界的人

... 还有 27 行

📄 Deep Tech

ArXiv 最新

DARE-bench:^/darebench*/ Evaluating^{/ɪˈvæljuˌeɪtɪŋ/} Modeling^{/ˈmɑdəlɪŋ/} and Instruction^{/ˌɪnˈstrəkʃən/} Fidelity^{/ˌfaɪˈdɛləti/} of LLMs in Data Science

Fan Shu, Yite Wang, Ruofan Wu 2026-02-27 cs.AI | cs.CL

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherenc...

查看中文翻译

使用大型语言模型 (LLM) 来处理复杂的多步骤数据科学任务的需求快速增长，迫切需要准确的基准测试。现有基准存在两个主要差距：(i) 缺乏标准化、流程感知的评估来捕获指令依从性和流程保真度，以及 (ii) 缺乏准确标记的培训数据。为了弥补这些差距，我们引入了 DARE-bench，这是一个专为机器学习建模和数据科学指令遵循而设计的基准。与许多依赖人类或基于模型的评判的现有基准不同，DARE-bench 中的所有任务都有可验证的基本事实，确保客观且可重复的评估。为了涵盖广泛的任务并支持代理工具，DARE-bench 由 6,300 个 Kaggle 衍生任务组成，并提供大规模训练数据和评估集。广泛的评估表明，即使是像 gpt-o4-mini 这样功能强大的模型也很难获得良好的性能，特别是在机器学习建模任务中。使用 DARE-bench 训练任务进行微调可以显着提高模型性能。例如，监督微调将 Qwen3-32B 的准确性提高了 1.83 倍，强化学习将 Qwen3-4B 的准确性提高了 8 倍以上。这些重大改进验证了 DARE-bench 作为准确评估基准和关键训练数据的重要性。

阅读原文 →

Do LLMs Benefit From Their Own Words?

Jenny Y. Huang, Leshem Choshen, Ramon Astudillo 2026-02-27 cs.CL | cs.AI

Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-tur...

查看中文翻译

与大型语言模型的多轮交互通常会在对话历史记录中保留助手自己过去的响应。在这项工作中，我们通过询问大型语言模型是否受益于它们自己先前的响应来重新审视这种设计选择。使用野外多回合对话，我们将标准（全上下文）提示与用户仅回合提示方法进行比较，该方法忽略了所有先前的助理响应，跨越三种开放推理模型和一种最先进的模型。令我们惊讶的是，我们发现删除先前的助理响应并不会影响大部分转弯的响应质量。省略助手端历史记录可以将累积上下文长度减少多达 10 倍。为了解释这个结果，我们发现多轮对话包含很大比例（36.4％）的独立提示，并且许多后续提示提供了足够的指令，仅使用当前用户轮次和先前用户轮次即可回答。在分析仅用户轮次提示明显优于完整上下文的情况时，我们识别了上下文污染的实例，其中模型对其先前的响应进行了过度调节，引入了错误、幻觉或跨轮次传播的风格伪像。受这些发现的启发，我们设计了一种上下文过滤方法，有选择地忽略助手端上下文。我们的研究结果表明，有选择地省略助理历史记录可以提高响应质量，同时减少内存消耗。

阅读原文 →

CUDA Agent: Large-Scale^{/largescale/} Agentic RL for High-Performance^{/highperformance/} CUDA Kernel Generation^{/ˌʤɛnərˈeɪʃən/}

Weinan Dai, Hanlin Wu, Qiying Yu 2026-02-27 cs.LG | cs.AI

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kern...

查看中文翻译

GPU 内核优化是现代深度学习的基础，但仍然是一项高度专业化的任务，需要深厚的硬件专业知识。尽管在通用编程方面具有强大的性能，但大型语言模型 (LLM) 与基于编译器的系统（例如用于 CUDA 内核生成的 torch.compile）相比仍然没有竞争力。现有的 CUDA 代码生成方法要么依赖于免训练细化，要么依赖于固定多轮执行反馈循环内的微调模型，但这两种范式都无法从根本上提高模型内在的 CUDA 优化能力，导致性能提升有限。我们推出了 CUDA Agent，这是一个大规模代理强化学习系统，它通过三个组件开发 CUDA 内核专业知识：可扩展的数据合成管道、具有自动验证和分析功能的技能增强 CUDA 开发环境，以提供可靠的奖励信号，以及支持稳定训练的强化学习算法技术。 CUDA Agent 在 KernelBench 上实现了最先进的结果，在 KernelBench Level-1、Level-2 和 Level-3 分割上比 torch.compile 速度快 100%、100% 和 92%，在最难的 Level-3 设置上比 Claude Opus 4.5 和 Gemini 3 Pro 等最强的专有模型高出约 40%。

阅读原文 →

⭐ GitHub Trending

AI 项目

🧠 思维模型

First Principles Thinking 第一性原理

将复杂问题分解为最基本的元素，然后从头开始重建解决方案。不依赖类比或既有经验，而是从根本真理出发进行推理。

实例：埃隆·马斯克在制造电池时，不接受'电池就是很贵'的假设，而是分析电池的原材料成本，发现可以大幅降低成本。

— 亚里士多德 / 伊隆·马斯克

Occam's Razor 奥卡姆剃刀

如无必要，勿增实体。在多个假设中，选择假设最少、最简洁的那个。复杂的解释往往隐藏着错误。

实例：当你听到马蹄声时，先想到马，而不是斑马。除非有明确证据表明是更罕见的情况。

— 威廉·奥卡姆 (14世纪)

Second-Order Thinking 二阶思维

不仅考虑行动的直接后果，还要思考这些后果带来的连锁反应。问自己：'然后呢？再然后呢？'

实例：降价促销会增加短期销量（一阶效应），但可能损害品牌形象并引发价格战（二阶效应）。

— 霍华德·马克斯

1 / 3

📖 单词记忆

algorithm

/ˈælɡəˌrɪðəm/

n. 算法；运算法则

The sorting algorithm runs in O(n log n) time complexity.

该排序算法的时间复杂度为 O(n log n)。

We need to optimize this algorithm for better performance.

我们需要优化这个算法以获得更好的性能。

recursion

/rɪˈkɜːrʒn/

n. 递归；循环

Recursion is a method where the solution depends on solutions to smaller instances.

递归是一种方法，其解决方案依赖于较小实例的解决方案。

Be careful with recursion to avoid stack overflow.

使用递归时要小心避免栈溢出。

encapsulation

/ɪnˌkæpsjuˈleɪʃn/

n. 封装；包装

Encapsulation hides the internal state of an object from the outside.

封装将对象的内部状态对外部隐藏。

Good encapsulation leads to more maintainable code.

良好的封装能带来更易维护的代码。

polymorphism

/ˌpɒliˈmɔːfɪzəm/

n. 多态性

Polymorphism allows objects of different classes to be treated as objects of a common superclass.

多态性允许不同类的对象被当作共同父类的对象来处理。

Method overriding is a common way to implement polymorphism.

方法重写是实现多态性的常见方式。

inheritance

/ɪnˈherɪtəns/

n. 继承；遗传

Inheritance enables new classes to receive the properties of existing classes.

继承使新类能够接收现有类的属性。

Multiple inheritance can lead to the diamond problem.

多重继承可能导致菱形继承问题。

abstraction

/æbˈstrækʃn/

n. 抽象；提取

Abstraction reduces complexity by hiding unnecessary details.

抽象通过隐藏不必要的细节来降低复杂性。

An abstract class cannot be instantiated directly.

抽象类不能被直接实例化。

concurrency

/kənˈkʌrənsi/

n. 并发；并发性

Concurrency allows multiple tasks to run in overlapping time periods.

并发允许多个任务在重叠的时间段内运行。

Handling concurrency correctly is crucial for multi-threaded applications.

正确处理并发对多线程应用程序至关重要。

serialization

/ˌsɪəriəlaɪˈzeɪʃn/

n. 序列化

Serialization converts an object into a stream of bytes for storage.

序列化将对象转换为字节流以便存储。

JSON is a popular format for data serialization.

JSON 是一种流行的数据序列化格式。

asynchronous

/eɪˈsɪŋkrənəs/

adj. 异步的

Asynchronous programming allows the program to continue executing while waiting for I/O.

异步编程允许程序在等待 I/O 时继续执行。

Use async/await syntax for cleaner asynchronous code.

使用 async/await 语法可以获得更简洁的异步代码。

deprecated

/ˈdeprəkeɪtɪd/

adj. 已弃用的；不推荐的

This method is deprecated and will be removed in the next version.

此方法已弃用，将在下一版本中移除。

Avoid using deprecated APIs in new projects.

避免在新项目中使用已弃用的 API。

1 / 10

📄 Deep Tech

DARE-bench:/darebench*/ Evaluating/ɪˈvæljuˌeɪtɪŋ/ Modeling/ˈmɑdəlɪŋ/ and Instruction/ˌɪnˈstrəkʃən/ Fidelity/ˌfaɪˈdɛləti/ of LLMs in Data Science

Do LLMs Benefit From Their Own Words?

CUDA Agent: Large-Scale/largescale*/ Agentic RL for High-Performance/highperformance*/ CUDA Kernel Generation/ˌʤɛnərˈeɪʃən/

DARE-bench:^/darebench*/ Evaluating^{/ɪˈvæljuˌeɪtɪŋ/} Modeling^{/ˈmɑdəlɪŋ/} and Instruction^{/ˌɪnˈstrəkʃən/} Fidelity^{/ˌfaɪˈdɛləti/} of LLMs in Data Science

CUDA Agent: Large-Scale^{/largescale/} Agentic RL for High-Performance^{/highperformance/} CUDA Kernel Generation^{/ˌʤɛnərˈeɪʃən/}