Iterative Refinement Improves Compositional Image Generation
Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
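The refinement loop described above can be sketched as follows. This is a minimal, hypothetical illustration of the general pattern (generate, critique with a VLM, fold the feedback into the next attempt), not the authors' actual implementation; the `generate` and `critique` callables here are toy stand-ins, and an "image" is represented simply as the set of prompt constraints it satisfies.

```python
# Hedged sketch of an iterative test-time refinement loop with a
# critic in the loop. Names and stubs below are illustrative only.

def iterative_refinement(prompt, constraints, generate, critique, max_steps=5):
    """Regenerate until the critic reports no unmet constraints,
    or until max_steps refinement rounds are exhausted."""
    feedback = []                      # critic feedback; empty on first pass
    image = generate(prompt, feedback)
    for _ in range(max_steps):
        violations = critique(image, constraints)  # unmet constraints
        if not violations:
            break                      # all constraints satisfied
        feedback = violations          # guide the next generation
        image = generate(prompt, feedback)
    return image

# Toy stand-ins: the "image" is the set of constraints it satisfies,
# and the stub generator simply fixes whatever the critic flagged.
def toy_generate(prompt, feedback):
    return set(feedback)

def toy_critique(image, constraints):
    return [c for c in constraints if c not in image]

result = iterative_refinement(
    "a red cube left of a blue sphere",
    ["red cube", "blue sphere", "left-of relation"],
    toy_generate, toy_critique)
```

With these stubs the first generation satisfies nothing, the critic flags all three constraints, and the second pass resolves them, which mirrors how sequential corrections decompose a complex prompt into smaller fixes.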