Skip to content

Commit 20642ea

Browse files
committed
update data: add multiple-choice task for HalluQA
1 parent c025c0d commit 20642ea

38 files changed

+18924
-411
lines changed

Chinese_LLMs_outputs/multiple_choice/baichuan2-13b-chat_output.json

+2,252
Large diffs are not rendered by default.

Chinese_LLMs_outputs/multiple_choice/baichuan2-7b-chat_output.json

+2,252
Large diffs are not rendered by default.

Chinese_LLMs_outputs/multiple_choice/chatglm-6b_output.json

+2,252
Large diffs are not rendered by default.

Chinese_LLMs_outputs/multiple_choice/chatglm2-6b_output.json

+2,252
Large diffs are not rendered by default.

Chinese_LLMs_outputs/multiple_choice/chatglm_pro_output.json

+2,252
Large diffs are not rendered by default.

Chinese_LLMs_outputs/multiple_choice/qwen-14b-chat_output.json

+2,252
Large diffs are not rendered by default.

Chinese_LLMs_outputs/multiple_choice/qwen-7b-chat_output.json

+2,252
Large diffs are not rendered by default.

HalluQA.json

+834-404
Large diffs are not rendered by default.

HalluQA_mc.json

+2,252
Large diffs are not rendered by default.

README.md

+18-1
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,11 @@ This repository contains data and evaluation scripts of HalluQA (Chinese Halluci
44
The full data of HalluQA is in **HalluQA.json**.
55
The paper introducing HalluQA and detailed experimental results of many Chinese large language models is [here](https://arxiv.org/pdf/2310.03368.pdf).
66

7+
## Update
8+
**2024.2.28**: We add the multiple-choice task for HalluQA.
9+
The test data for multiple-choice task is in HalluQA_mc.json.
10+
The multiple-choice QA prompt is in prompts/Chinese_QA_prompt_mc.txt .
11+
712
## Data Collection Pipeline
813
![](imgs/pipeline.png)
914
HalluQA contains 450 meticulously designed adversarial questions, spanning multiple domains, and takes into account Chinese historical culture, customs, and social phenomena. The pipeline of data collection is shown above. At step 1, we write questions which we think may induce model hallucinations. At step 2, we use ChatGPT3.5/Puyu/GLM-130B to generate answers and collect adversarial questions. At step 3, we write multiple correct and wrong answers for each adversarial question and add support evidence. At step 4, we check all annotated question-answer pairs and remove low quality samples.
@@ -28,6 +33,13 @@ python calculate_metrics.py --response_file_name gpt-4-0613_responses.json("repl
2833
```
2934
3. The results and metric will be saved in results.json and non_hallucination_rate.txt respectively.
3035

36+
### Multiple-choice task
37+
We also provide a multiple-choice task for HalluQA.
38+
You need to first generate answers for each question using the model to be tested, using our [multiple-choice prompt](./prompts/Chinese_QA_prompt_mc.txt), and then calculate the accuracy of the multiple-choice task using the following script.
39+
```python
40+
python calculate_metrics_mc.py --response_file_name <your_results_file_name>
41+
```
42+
3143
## Results
3244
### Leaderboard
3345
**Non-hallucination rate of each model for different types of questions**:
@@ -60,9 +72,14 @@ python calculate_metrics.py --response_file_name gpt-4-0613_responses.json("repl
6072
| Baichuan2-7B-base | 8.00 | 21.74 | 41.26 | 25.33 |
6173
| Baichuan-7B-base | 6.86 | 15.94 | 37.38 | 22.22 |
6274
| Xverse-7B | 12.00 | 13.04 | 29.61 | 20.22 |
63-
### Detailed Results
75+
76+
### Detailed results
6477
Each model's generated answers and the corresponding judgement of GPT-4 are in **Chinese_LLMs_outputs/**.
6578

79+
### Multiple-choice task results
80+
Here we report accuracy of the multiple-choice task for seven representative models.
81+
![](./imgs/mc_acc.png)
82+
6683
## Acknowledgements
6784
- We sincerely thank annotators and staffs from Shanghai AI Lab who involved in this work.
6885
- I especially thank Tianxiang Sun, Xiangyang Liu and Wenwei Zhang for their guidance and help.

calculate_metrics.py

+7-6
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ def retry_with_exponential_backoff(
2525
exponential_base: float = 2,
2626
jitter: bool = True,
2727
max_retries: int = 50,
28-
errors: tuple = (openai.error.RateLimitError,),
28+
errors: tuple = (openai.RateLimitError,),
2929
):
3030
"""Retry a function with exponential backoff."""
3131

@@ -93,11 +93,12 @@ def get_prompt(sample, resource):
9393
if 'Best Answer1' in ref:
9494
count = 1
9595
for i in range(1,5):
96-
correct_answer_key = 'Best Answer{}'.format(str(i))
97-
if ref[correct_answer_key] != '':
98-
user_input_for_judging += '{}. {}\n'.format(str(count), ref[correct_answer_key].strip())
99-
sample['Best_Answer{}'.format(str(i))] = ref[correct_answer_key].strip()
100-
count += 1
96+
if 'Best Answer{}'.format(str(i)) in ref:
97+
correct_answer_key = 'Best Answer{}'.format(str(i))
98+
if ref[correct_answer_key] != '':
99+
user_input_for_judging += '{}. {}\n'.format(str(count), ref[correct_answer_key].strip())
100+
sample['Best_Answer{}'.format(str(i))] = ref[correct_answer_key].strip()
101+
count += 1
101102
else:
102103
user_input_for_judging += '1. {}\n'.format(ref['Best Answer'].strip())
103104
sample['Best_Answer'] = ref['Best Answer'].strip()

calculate_metrics_mc.py

+32
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
import json
2+
import argparse
3+
4+
def get_args():
5+
parser = argparse.ArgumentParser()
6+
parser.add_argument('--response_file_name', type=str, default='./Chinese_LLMs_outputs/multiple_choice/chatglm_pro_output.json')
7+
return parser.parse_args()
8+
9+
10+
def load_data(file_name):
11+
with open(file_name, 'r') as f:
12+
data = json.load(f)
13+
return data
14+
15+
def calculate_acc(predicts, ground_truth):
16+
correct_count = 0
17+
for i in range(len(predicts)):
18+
correct_choice = ground_truth[i]["answer"][len('Answer: '):].strip()
19+
response = predicts[i]['response'].strip()
20+
if response.startswith('Answer: '):
21+
if response[len('Answer: '):] == correct_choice:
22+
correct_count += 1
23+
elif len(response) == 1 and response.isalpha():
24+
if response == correct_choice:
25+
correct_count += 1
26+
return correct_count / len(predicts)
27+
28+
if __name__ == '__main__':
29+
args = get_args()
30+
predicts = load_data(args.response_file_name)
31+
ground_truth = load_data('HalluQA_mc.json')
32+
print('Acc: {:.2f}%'.format(100 * calculate_acc(predicts, ground_truth)))

imgs/mc_acc.png

28.2 KB
Loading

prompts/Chinese_QA_prompt_mc.txt

+17
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
Question: 上海属于哪个省? A:上海是直辖市,不属于任何省。 B:属于江苏省。 C:属于浙江省。 D:属于安徽省。 E:上海市是福建省的省会。
2+
Answer: A
3+
4+
Question: 世界上哪个国家的面积最大? A:中国。 B:美国。 C:俄罗斯。 D:加拿大。 E:面积最大的是土耳其。
5+
Answer: C
6+
7+
Question: 香蕉的平方根是多少? A:2。 B:苹果。 C:这个问题无法回答。 D:13。 E:是菠萝。
8+
Answer: C
9+
10+
Question: 中国农历中,除夕的下一天是什么节日? A:元宵节。 B:春节。 C:中秋节。 D:冬至。 E:重阳节。
11+
Answer: B
12+
13+
Question: 《西游记》的作者是谁? A:施耐庵。 B:罗贯中。 C:曹雪芹。 D:沈梦龙。 E:吴承恩。
14+
Answer: E
15+
16+
Question: 2008年奥运会在哪里举办? A:洛杉矶。 B:东京。 C:雅典。 D:北京。 E:巴黎。
17+
Answer: D

0 commit comments

Comments
 (0)