<h2 class="title is-3">Data Construction Pipeline</h2>
<img src="static/images/data_construct.jpg" alt="An overview of the data construction, filtering, verification, and quality control processes of Chinese SimpleQA." style="max-width: 100%; height: auto;">
<p>The data construction process for Chinese SimpleQA combines an automated pipeline with manual verification. The automated part involves knowledge content extraction and filtering, automatic generation of question-answer pairs, LLM-based validation against predefined criteria, answer factual-correctness verification based on RAG (Retrieval-Augmented Generation), and question difficulty filtering.</p>
<p>Initially, we collected a large amount of knowledge-rich text from various fields, primarily derived from Wikipedia, and passed it through a quality assessment model to filter out low-quality data. We then guided the LLM to generate question-answer pairs from this high-quality content according to predefined criteria, and used the LLM again for rule-based validation to remove non-conforming pairs. This yielded a large set of initially filtered knowledge question-answer pairs. However, relying on a single data source can lead to inaccurate answers. To mitigate this risk, we deployed external retrieval tools to gather more diverse information and guided the LLM to evaluate the factual correctness of answers against information from different sources, discarding incorrect question-answer pairs. Specifically, we used LlamaIndex as the retrieval method, with search results from Google and Bing as data sources, further enhancing the quality of the dataset.</p>
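<p>To make this verification step concrete, the sketch below illustrates the general idea. The helpers <code>search_web</code> and <code>judge_support</code> are hypothetical placeholders standing in for the LlamaIndex retrieval over Google/Bing search results and the LLM-based judgment used in practice; the exact prompts and retrieval configuration are not shown here.</p>
<pre><code>
# Sketch of RAG-based factual verification for candidate QA pairs.
# search_web and judge_support are hypothetical placeholders; in practice,
# retrieval is done with LlamaIndex over Google/Bing search results and the
# support judgment is produced by an LLM.

from dataclasses import dataclass
from typing import List

@dataclass
class QAPair:
    question: str
    answer: str

def search_web(query: str) -> List[str]:
    """Placeholder: return evidence passages retrieved for the query."""
    return []

def judge_support(question: str, answer: str, passages: List[str]) -> bool:
    """Placeholder: ask an LLM whether the passages support the answer."""
    return bool(passages)

def verify_pairs(pairs: List[QAPair]) -> List[QAPair]:
    verified = []
    for pair in pairs:
        passages = search_web(pair.question)                # gather diverse evidence
        if judge_support(pair.question, pair.answer, passages):
            verified.append(pair)                           # keep factually supported pairs
    return verified
</code></pre>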
<p>In addition, we filtered the dataset for difficulty to better probe the knowledge boundaries of LLMs, removing overly simple questions. Specifically, if a question could be answered correctly by all four powerful filtering models (including Meta-Llama-3-70B-Instruct, Qwen2.5-72B-Instruct, and GLM-4-Plus), it was deemed too simple and discarded. Through this approach, Chinese SimpleQA becomes more challenging.</p>
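<p>The difficulty filter itself is straightforward; the sketch below shows the idea, where <code>answers_correctly</code> is a hypothetical placeholder for running a filtering model and grading its response.</p>
<pre><code>
# Sketch of the difficulty filter: a question is discarded if every strong
# filtering model already answers it correctly. answers_correctly is a
# hypothetical placeholder for querying a model and grading its answer.

FILTER_MODELS = ["Meta-Llama-3-70B-Instruct", "Qwen2.5-72B-Instruct", "GLM-4-Plus"]

def answers_correctly(model_name: str, question: str, reference: str) -> bool:
    """Placeholder: query the model and grade its answer against the reference."""
    raise NotImplementedError

def keep_question(question: str, reference: str) -> bool:
    # Keep the question only if at least one strong model fails it.
    return not all(answers_correctly(m, question, reference) for m in FILTER_MODELS)
</code></pre>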
<ul>
<li style="margin-bottom: 1em;">
<strong>Chinese</strong>: Our Chinese SimpleQA focuses on the Chinese language and provides a comprehensive evaluation of the factuality abilities of existing LLMs in Chinese.
</li>
<li style="margin-bottom: 1em;">
<strong>Diverse</strong>: Chinese SimpleQA covers 6 topics (i.e., "Chinese Culture", "Humanities", "Engineering, Technology, and Applied Sciences", "Life, Art, and Culture", "Society", and "Natural Science"), and these topics include 99 fine-grained subtopics in total, which demonstrates the diversity of our Chinese SimpleQA.
</li>
<li style="margin-bottom: 1em;">
<strong>High-quality</strong>: We conduct a comprehensive and rigorous quality control process to ensure the quality and accuracy of our Chinese SimpleQA.
</li>
<li style="margin-bottom: 1em;">
<strong>Static</strong>: Following SimpleQA, to preserve the evergreen property of Chinese SimpleQA, all reference answers do not change over time.
</li>
<li style="margin-bottom: 1em;">
<strong>Easy-to-evaluate</strong>: Following SimpleQA, as the questions and answers are very short, the grading procedure is fast to run via existing LLMs (e.g., the OpenAI API); a minimal grading sketch is given after the findings list below.
</li>
</ul>
<ul>
<li style="margin-bottom: 1em;">
<strong>Chinese SimpleQA is challenging</strong>. Only o1-preview and Doubao-pro-32k achieve a passing score (63.8% and 61.9% on the correct metric, respectively), and there is still a long way to go for many closed-source and open-source LLMs.
</li>
<li style="margin-bottom: 1em;">
<strong>Larger models lead to better results</strong>. Based on the results of the Qwen2.5, InternLM, and Yi-1.5 series, among others, we observe that larger models obtain better performance.
</li>
<li style="margin-bottom: 1em;">
<strong>Larger models are more calibrated</strong>. We observe that o1-preview is more calibrated than o1-mini, and GPT-4o is more calibrated than GPT-4o-mini.
</li>
<li style="margin-bottom: 1em;">
<strong>RAG matters</strong>. When introducing the RAG strategy into existing LLMs, the performance gaps between different LLMs shrink substantially. For example, the gap between GPT-4o and Qwen2.5-3B decreases from 42.4% to 9.3% when using RAG.
</li>
<li style="margin-bottom: 1em;">
<strong>Alignment tax exists</strong>. Existing alignment or post-training strategies usually decrease the factuality of language models.
</li>
<li style="margin-bottom: 1em;">
<strong>Rankings on SimpleQA and Chinese SimpleQA are different</strong>. The performance of several LLMs focusing on Chinese (Doubao-pro-32k and GLM-4-Plus) is close to that of the high-performing o1-preview. In particular, on the “Chinese Culture” topic, these Chinese community LLMs are significantly better than the GPT and o1 series models.
</li>
</ul>
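<p>As noted under the Easy-to-evaluate property above, grading can be run quickly with an existing LLM. The sketch below shows one possible grader using the OpenAI Python client; the prompt wording and the model name <code>gpt-4o</code> are illustrative choices rather than the exact grading template used for our reported results.</p>
<pre><code>
# Sketch of LLM-based grading: the judge labels a model response as
# CORRECT, INCORRECT, or NOT_ATTEMPTED given the question and reference answer.
# The prompt and model name are illustrative, not the exact grading template.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(question: str, reference: str, response: str) -> str:
    prompt = (
        "Grade the response against the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Response: {response}\n"
        "Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
</code></pre>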
</div>
</div>
</div>
</div>
</section>
<style>
table {
  width: 100%;
  border-collapse: collapse;
}
th, td {
  border: 1px solid #ddd;
  padding: 8px;
  text-align: center;
}
th {
  background-color: #f2f2f2; /* Light color for headers */
}
.merged-row {
  background-color: #e0e0e0; /* Light color for merged row */
}
</style>
<p>There are significant ranking differences for various models between the SimpleQA and Chinese SimpleQA benchmarks. For example, Doubao-pro-32k rises from 12th to 2nd on the Chinese version, while GPT-4 drops from 3rd to 9th. This highlights the importance of evaluating models in multilingual environments. Notably, o1-preview consistently holds the top position across both datasets. Many Chinese community-developed models perform better on Chinese SimpleQA than on SimpleQA.</p>
<h2 class="title is-3">Detailed Results on Subtopics</h2>
<div class="image-container">
<img src="static/images/exp4.jpg" alt="Performance comparison between the o1 models and notable Chinese community models across common subtopic domains." style="max-width: 100%; height: auto;">
<p>As mentioned in our paper, the benchmark covers a total of 99 subtopics, which comprehensively probe a model's knowledge across various fields. The figure above compares the performance of the o1 models and seven notable Chinese community models within several common domains.</p>
<p>Firstly, from an overall perspective, the o1-preview model exhibits the most comprehensive performance across these domains, with the Doubao model following closely. In contrast, the Moonshot model demonstrates the weakest overall performance.
Secondly, when examining specific domains, a significant disparity emerges between the Chinese community models and the o1 model in areas such as Computer Science and Medicine. However, this gap is minimal in domains like Education and Economics. Notably, in Education, some Chinese community models outperform the o1-preview, highlighting their potential for achieving success in specific vertical domains.
Lastly, when examining specific models, the Moonshot model is notably weaker in Mathematics, Law, and Entertainment, while the Baichuan model also underperforms in Entertainment. The Yi-Large model excels in Education, and the o1 model maintains the strongest performance across other domains. </p>
<p>Evaluating the performance of the models across diverse domains within the benchmark dataset enables users to identify the most suitable model for their specific needs.</p>
<p>We analyzed the calibration of different LLMs on Chinese SimpleQA. Models were instructed to provide a confidence level from 0 to 100 when answering questions. Ideally, confidence should match actual accuracy. Results show that GPT-4o aligns better than GPT-4o-mini, and o1-preview aligns better than o1-mini. In the Qwen2.5 series, larger models show better calibration. All models tend to be overconfident, especially when confidence is above 50.</p>
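<p>The calibration analysis can be reproduced by bucketing the self-reported confidence scores and comparing mean confidence with empirical accuracy in each bucket. The sketch below is an illustrative computation, including an expected-calibration-error style summary; it is not the exact script used for our figures.</p>
<pre><code>
# Sketch of the calibration analysis: bucket the self-reported confidence
# (0-100) into bins and compare average confidence with empirical accuracy.

from typing import List, Tuple

def calibration_bins(confidences: List[float], correct: List[bool],
                     n_bins: int = 10) -> Tuple[List[float], List[float], float]:
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf / 100 * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    avg_conf, accuracy, ece, total = [], [], 0.0, len(confidences)
    for bucket in bins:
        if not bucket:
            avg_conf.append(0.0)
            accuracy.append(0.0)
            continue
        c = sum(x for x, _ in bucket) / len(bucket) / 100.0   # mean confidence in [0, 1]
        a = sum(1 for _, ok in bucket if ok) / len(bucket)    # empirical accuracy
        avg_conf.append(c)
        accuracy.append(a)
        ece += abs(c - a) * len(bucket) / total               # size-weighted calibration gap
    return avg_conf, accuracy, ece
</code></pre>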
<p>We evaluated the relationship between increased test-time compute and accuracy. Random samples from Chinese SimpleQA showed that as inference counts increase, response accuracy improves and eventually reaches a ceiling. This aligns with the dataset's purpose to probe model knowledge boundaries.</p>
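<p>A simple way to reproduce this trend, assuming multiple sampled responses per question have already been graded, is sketched below; aggregating by majority vote is an illustrative choice here rather than the exact protocol behind our figures.</p>
<pre><code>
# Sketch of the test-time compute analysis: estimate accuracy when aggregating
# an increasing number of sampled responses per question by majority vote.
# Input: per-question lists of graded samples (True = correct).

import random
from typing import Dict, List

def accuracy_at_n(graded: Dict[str, List[bool]], n: int) -> float:
    hits = 0
    for samples in graded.values():
        chosen = random.sample(samples, min(n, len(samples)))
        if sum(chosen) * 2 >= len(chosen):   # at least half of the drawn samples are correct
            hits += 1
    return hits / len(graded)

# Example: accuracy typically rises with n and then plateaus at the model's
# knowledge ceiling on the sampled questions.
# curve = [accuracy_at_n(graded_samples, n) for n in (1, 2, 4, 8, 16)]
</code></pre>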
This page was built using the <a href="https://github.com/eliahuhorwitz/Academic-project-page-template" target="_blank">Chinese SimpleQA Template</a> which was adopted from the <a href="https://nerfies.github.io" target="_blank">Nerfies</a> project page.
This site is created based on <a href="https://github.com/eliahuhorwitz/Academic-project-page-template" target="_blank">Academic Project Page Template</a> and is licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.