
Deep Tech

ArXiv Latest Paper Picks

Iterative Refinement Improves Compositional Image Generation

Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj 2026-01-21

Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/


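The contrast the abstract draws between compute-matched parallel sampling and critic-guided iterative refinement can be sketched with toy stand-ins. This is not the authors' implementation: `toy_generate`, `toy_critic`, the 0.5 success probability, and the assumption that a refinement step preserves constraints that were already satisfied are all hypothetical choices made for illustration.

```python
import random

def toy_generate(constraints, rng, keep=None):
    """Hypothetical generator: an "image" is the set of constraints it satisfies.
    Each constraint succeeds with probability 0.5 unless it is in `keep`."""
    keep = keep or set()
    return {c for c in constraints if c in keep or rng.random() < 0.5}

def toy_critic(constraints, image):
    """Hypothetical critic: report the constraints the image fails to satisfy."""
    return set(constraints) - image

def parallel_sampling(constraints, budget, rng):
    """Draw `budget` independent samples and return the one with fewest violations."""
    samples = [toy_generate(constraints, rng) for _ in range(budget)]
    return min(samples, key=lambda img: len(toy_critic(constraints, img)))

def iterative_refinement(constraints, budget, rng):
    """Spend the same budget sequentially, retrying only what the critic flags."""
    image = toy_generate(constraints, rng)
    for _ in range(budget - 1):
        if not toy_critic(constraints, image):
            break  # the critic has no remaining feedback
        # Assumption: satisfied constraints carry over to the next attempt.
        image = toy_generate(constraints, rng, keep=image)
    return image

constraints = ["red cube", "blue sphere", "cube left of sphere", "wooden table"]
best_parallel = parallel_sampling(constraints, 4, random.Random(0))
refined = iterative_refinement(constraints, 4, random.Random(0))
```

Under these assumptions the refined image can only gain satisfied constraints from step to step, whereas each parallel sample must satisfy every constraint in a single draw.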

Rethinking Video Generation Model for the Embodied World

Yufan Deng, Zilin Pan, Hongyu Zhang 2026-01-21

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.


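The abstract reports a Spearman correlation of 0.96 between the benchmark and human evaluations. As a reminder of what that statistic measures, here is a minimal pure-Python sketch: Spearman's coefficient is the Pearson correlation of the ranks. The score lists are invented for illustration and are not data from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores for five models (not taken from the paper).
benchmark_scores = [0.62, 0.48, 0.71, 0.35, 0.55]
human_scores = [3.9, 3.1, 4.0, 2.2, 3.6]
rho = spearman(benchmark_scores, human_scores)
```

Because only the ordering matters, any two score lists that rank the models identically give a coefficient of 1, regardless of scale.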

MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs

Christoph Bartmann, Johannes Schimunek, Mykyta Ielanskyi 2026-01-21

A molecule's properties are fundamentally determined by its composition and structure encoded in its molecular graph. Thus, reasoning about molecular properties requires the ability to parse and understand the molecular graph. Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. Most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions. We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks. MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and reveals capability patterns that localize model failures to specific tasks and molecular structures. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.


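To give a feel for what a symbolically verifiable task on a molecular graph might look like, the sketch below checks a model's answer to "how many rings does this molecule have?" against a value computed directly from the graph. The task choice, the `verify` helper, and the edge-list encoding are assumptions for illustration; the abstract does not list the benchmark's actual tasks.

```python
def count_rings(num_atoms, bonds):
    """Number of independent rings = bonds - atoms + connected components."""
    parent = list(range(num_atoms))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - num_atoms + components

def verify(model_answer, num_atoms, bonds):
    """Hypothetical checker: the answer is correct only if it matches the graph."""
    return model_answer == count_rings(num_atoms, bonds)

# Heavy-atom skeletons as edge lists (hydrogens omitted).
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
propane = [(0, 1), (1, 2)]
```

Because the ground truth is computed from the structure itself, no literature label or multiple-choice key is needed to grade the answer.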

GitHub Trending

Recently trending AI projects

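First Principles Thinking

Break a complex problem down into its most basic elements, then rebuild the solution from the ground up. Reason from fundamental truths rather than from analogy or past experience.

Example: When building batteries, Elon Musk refused to accept the assumption that "batteries are just expensive." He analyzed the cost of the raw materials and found that the total cost could be cut substantially.

— Aristotle / Elon Musk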

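Occam's Razor

Entities should not be multiplied beyond necessity. Among competing hypotheses, prefer the one that makes the fewest assumptions. Complicated explanations often conceal errors.

Example: When you hear hoofbeats, think of horses first, not zebras, unless there is clear evidence pointing to the rarer case.

— William of Ockham (14th century)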

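Second-Order Thinking

Consider not only the direct consequences of an action but also the chain reactions those consequences set off. Ask yourself: "And then what? And after that?"

Example: A price cut increases short-term sales (first-order effect) but may damage the brand's image and trigger a price war (second-order effects).

— Howard Marks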

The only way to do great work is to love what you do.

— Steve Jobs

In the middle of difficulty lies opportunity.

— Albert Einstein

The best time to plant a tree was 20 years ago. The second best time is now.

— Chinese Proverb

Without accumulating small steps, one cannot travel a thousand li; without gathering small streams, no river or sea can form.

— Xunzi

Observe widely and select carefully; accumulate deeply and release sparingly.

— Su Shi

Example

Stay hungry, stay foolish

The people who are crazy enough to think

they can change the world

are the ones who do

Here's to the crazy ones

The misfits, the rebels

The troublemakers

The round pegs in the square holes

They're not fond of rules

And they have no respect for the status quo

You can quote them, disagree with them

Glorify or vilify them

But the only thing you can't do

is ignore them

Because they change things

They push the human race forward

algorithm

/ˈælɡəˌrɪðəm/

n. a finite sequence of well-defined steps for solving a problem or performing a computation

The sorting algorithm runs in O(n log n) time.

We need to optimize this algorithm for better performance.
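As one concrete instance of a sorting algorithm that runs in O(n log n) time, here is a minimal merge sort sketch. It is only an illustration of the example sentence; the page does not say which algorithm it has in mind.

```python
def merge_sort(items):
    """Return a new sorted list using merge sort (O(n log n) comparisons)."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves; `<=` keeps equal elements in original order.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The list is halved about log n times, and each level of halving costs a linear-time merge, which is where the n log n bound comes from.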

recursion

/rɪˈkɜːrʒn/

n. the definition of something in terms of itself; in programming, a function that calls itself

Recursion is a method where the solution depends on solutions to smaller instances of the same problem.

Be careful with recursion to avoid stack overflow.
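Both example sentences can be seen in a few lines of Python. The sketch below defines a recursive factorial, shows that going past the interpreter's recursion limit raises `RecursionError` (Python's guard against overflowing the stack), and gives an iterative alternative. The helper name `overflows` is made up for this illustration.

```python
import sys

def factorial(n):
    """Recursive factorial: the answer for n depends on the answer for n - 1."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)

def factorial_iterative(n):
    """Same result with a loop, so the call depth stays constant."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def overflows(n):
    """Return True if factorial(n) exceeds the interpreter's recursion limit."""
    try:
        factorial(n)
    except RecursionError:
        return True
    return False

too_deep = sys.getrecursionlimit() + 100
```

Calling `overflows(too_deep)` reports the failure, while `factorial_iterative(too_deep)` completes because it never nests calls.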

encapsulation

/ɪnˌkæpsjuˈleɪʃn/

n. the bundling of data with the methods that operate on it, with direct outside access restricted; packaging

Encapsulation hides the internal state of an object from the outside.

Good encapsulation leads to more maintainable code.
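A minimal Python sketch of the idea, using a made-up `Counter` class: the state lives in an underscore-prefixed attribute (private by convention only, since Python does not enforce it), and callers get a read-only property plus a method that controls how the state changes.

```python
class Counter:
    """Counts upward; the stored value can be read but not assigned directly."""

    def __init__(self):
        self._count = 0  # leading underscore marks this as internal by convention

    @property
    def count(self):
        """Read-only view of the internal state."""
        return self._count

    def increment(self):
        """The only supported way to change the state."""
        self._count += 1
```

Because `count` has no setter, `counter.count = 5` raises `AttributeError`, so the class can later change how it stores the value without breaking callers.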

polymorphism

/ˌpɒliˈmɔːfɪzəm/

n. the ability of a single interface to work with objects of different types

Polymorphism allows objects of different classes to be treated as objects of a common superclass.

Method overriding is a common way to implement polymorphism.
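The following sketch illustrates polymorphism through method overriding. The `Shape`, `Rectangle`, and `Circle` classes are invented for this example: `total_area` treats every object as a `Shape` and never checks its concrete type.

```python
import math

class Shape:
    def area(self):
        raise NotImplementedError("subclasses must override area()")

class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):  # overrides Shape.area
        return self.width * self.height

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):  # overrides Shape.area
        return math.pi * self.radius ** 2

def total_area(shapes):
    """Works for any mix of Shape subclasses via the overridden method."""
    return sum(shape.area() for shape in shapes)
```

Adding a new subclass later requires no change to `total_area`, which is the practical payoff of the technique.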

inheritance

/ɪnˈherɪtəns/

n. a mechanism by which a class acquires the members of another class; (biology) heredity

Inheritance enables new classes to receive the properties and behavior of existing classes.

Multiple inheritance can lead to the diamond problem.
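The diamond problem arises when a class inherits from two parents that share a common ancestor, making it ambiguous which inherited method applies. Python resolves the ambiguity with a fixed method resolution order (MRO). The class names below are placeholders.

```python
class Base:
    def who(self):
        return "Base"

class Left(Base):
    def who(self):
        return "Left"

class Right(Base):
    def who(self):
        return "Right"

class Child(Left, Right):
    """Inherits who() from both Left and Right: the diamond shape."""

# Python linearizes the hierarchy; the first class in the MRO that defines
# the method wins, so the order of the listed bases matters.
mro_names = [cls.__name__ for cls in Child.__mro__]
```

Here `Child().who()` returns `"Left"` because `Left` precedes `Right` in the MRO. Other languages handle the same situation differently, for example by forbidding multiple inheritance of classes.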

abstraction

/æbˈstrækʃn/

n. the hiding of detail behind a simpler interface; extraction

Abstraction reduces complexity by hiding unnecessary details.

An abstract class cannot be instantiated directly.
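In Python, the standard `abc` module enforces the second sentence. The sketch below uses hypothetical `Storage` and `MemoryStorage` classes: the abstract base declares what a storage can do, and only a subclass that implements every abstract method can be instantiated.

```python
from abc import ABC, abstractmethod

class Storage(ABC):
    """Abstract interface: callers depend on these methods, not on the details."""

    @abstractmethod
    def write(self, key, value):
        ...

    @abstractmethod
    def read(self, key):
        ...

class MemoryStorage(Storage):
    """Concrete implementation that keeps everything in a dictionary."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]
```

Calling `Storage()` raises `TypeError`, while `MemoryStorage()` works because it provides both methods.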

concurrency

/kənˈkʌrənsi/

n. the property of several computations making progress during the same period

Concurrency allows multiple tasks to run in overlapping time periods.

Handling concurrency correctly is crucial for multi-threaded applications.
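A small sketch of what "handling concurrency correctly" can mean in a multi-threaded program: several threads update one shared counter, and a lock makes each read-modify-write step atomic. The function name and parameters are illustrative.

```python
import threading

def run_counter(num_threads, increments):
    """Increment a shared counter from several threads, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:  # only one thread may update the counter at a time
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every worker before reading the result
    return total
```

Without the lock, `total += 1` is a separate read and write that threads can interleave, so updates may be lost.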

serialization

/ˌsɪəriəlaɪˈzeɪʃn/

n. the conversion of data into a format that can be stored or transmitted

Serialization converts an object into a stream of bytes for storage or transmission.

JSON is a popular format for data serialization.
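A minimal round trip with Python's standard `json` module shows both sentences at work: an object becomes JSON text, the text becomes bytes, and the process is reversed to recover an equal object. The sample record is invented.

```python
import json

def serialize(obj):
    """Convert a JSON-compatible object into UTF-8 bytes."""
    return json.dumps(obj, sort_keys=True).encode("utf-8")

def deserialize(data):
    """Rebuild the object from bytes produced by serialize()."""
    return json.loads(data.decode("utf-8"))

record = {"name": "example", "tags": ["a", "b"], "count": 3}
payload = serialize(record)      # bytes, ready to write to a file or socket
restored = deserialize(payload)  # an equal dictionary, rebuilt from the bytes
```

JSON covers only basic types such as dictionaries, lists, strings, numbers, booleans, and null, so tuples come back as lists and custom classes need explicit conversion.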

asynchronous

/eɪˈsɪŋkrənəs/

adj. not occurring at the same time; (programming) not blocking while an operation completes

Asynchronous programming allows the program to continue executing while waiting for I/O.

Use async/await syntax for cleaner asynchronous code.
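A short sketch with Python's `asyncio` shows the async/await syntax. `asyncio.sleep` stands in for a real I/O wait, and the task names and delays are arbitrary.

```python
import asyncio

async def fetch(name, delay):
    """Pretend to wait on I/O; other tasks can run during the await."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # Both waits overlap, so the total time is close to the longer delay.
    return await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))

def run_demo():
    """Start an event loop, run main(), and return its results."""
    return asyncio.run(main())
```

`asyncio.gather` returns results in the order the awaitables were passed, not the order in which they finished.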

deprecated

/ˈdeprəkeɪtɪd/

adj. discouraged from use and typically scheduled for removal; not recommended

This method is deprecated and will be removed in the next version.

Avoid using deprecated APIs in new projects.
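In Python, a function is commonly marked as deprecated by emitting a `DeprecationWarning` from the standard `warnings` module while it still forwards to its replacement. The function names `old_sum` and `new_sum` are placeholders.

```python
import warnings

def new_sum(values):
    """The supported replacement."""
    return sum(values)

def old_sum(values):
    """Deprecated: still works, but warns callers to migrate."""
    warnings.warn(
        "old_sum is deprecated; use new_sum instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return new_sum(values)
```

Existing callers keep working during the transition, and the warning gives them a clear migration path before the function is removed.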