DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
The fast-growing demand for using Large Language Models (LLMs) to tackle complex, multi-step data science tasks creates an urgent need for accurate benchmarking. Existing benchmarks have two major gaps: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench comprises 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially on machine learning modeling tasks. Fine-tuning on DARE-bench training tasks substantially improves model performance: supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x, and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These significant improvements confirm the value of DARE-bench both as an accurate evaluation benchmark and as critical training data.