MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems with solutions, together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse mathematical domains. In addition to the core dataset, we construct a retrieval benchmark of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models fall well short of ceiling (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5), while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale gains up to 12%, achieving the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and the benchmark at https://mathnet.mit.edu.