Commit 64b6ed9

Add README
1 parent 580acdc commit 64b6ed9

3 files changed: +52 -69 lines changed

README.md

+51
@@ -0,0 +1,51 @@
1+
# Factuality Benchmark
2+
3+
This is my attempt to reproduce results from this article:
4+
5+
https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper
6+
7+
I tried it with a few models and, after tuning the prompt, gained +3% accuracy using the OpenAssistant 70B model:
8+
9+
- **Accuracy:** `84%`
10+
- **Breakdown:**
11+
- AB=179 - consistent and correct combination.
12+
- BA=11 - consistent but incorrect.
13+
- AA=8 - inconsistent, model biased towards option A.
14+
- BB=14 - inconsistent, model biased towards option B.
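
As a quick check, the reported accuracy can be recomputed from the four buckets. This is a minimal sketch, assuming accuracy is the share of pairs that land in the AB bucket (consistent and correct); the function names are illustrative and are not taken from `fact.py` or `anal.ipynb`.

```python
# Counts copied from the breakdown above.
counts = {"AB": 179, "BA": 11, "AA": 8, "BB": 14}


def accuracy(counts):
    """Share of pairs answered consistently and correctly (the AB bucket)."""
    return counts["AB"] / sum(counts.values())


def inconsistency_rate(counts):
    """Share of pairs where the model picked the same position in both orders."""
    return (counts["AA"] + counts["BB"]) / sum(counts.values())


print(f"total pairs:  {sum(counts.values())}")
print(f"accuracy:     {accuracy(counts):.1%}")
print(f"inconsistent: {inconsistency_rate(counts):.1%}")
```

With these counts the total is 212 pairs, so 179 correct pairs round to the 84% quoted above.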
15+
16+
This is just 1 percentage point below the GPT-4 result.
17+
18+
Model used: [Llama2-70B-OASST with Q5_K_M quantisation](https://huggingface.co/TheBloke/Llama2-70B-OASST-SFT-v10-GGUF)
19+
20+
# Prompt Tuning
21+
22+
## Used template
23+
24+
> <|im_start|>system
25+
> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
26+
>
27+
> If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
28+
> <|im_end|>
29+
> <|im_start|>user
30+
> Decide which of the following Summary is more consistent with the Article Sentence.
31+
>
32+
> Note that consistency means all information in the Summary is supported by the Article Sentence.
33+
>
34+
> Article Sentence: {article}
35+
> Summary Y: {option_a}
36+
> Summary X: {option_b}
37+
> <|im_end|>
38+
> <|im_start|>assistant
39+
> The more consistent is Summary
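
The template above could be assembled in Python roughly as follows. This is a sketch under stated assumptions: `build_prompt`, `USER_TMPL`, and `ANSWER_PREFIX` are hypothetical names, and the system message is passed in as a parameter rather than repeated here. Only the placeholders `{article}`, `{option_a}`, and `{option_b}` come from the template itself.

```python
# User turn of the template; placeholders match the README.
USER_TMPL = """\
Decide which of the following Summary is more consistent with the Article Sentence.

Note that consistency means all information in the Summary is supported by the Article Sentence.

Article Sentence: {article}
Summary Y: {option_a}
Summary X: {option_b}"""

# Text pre-filled into the assistant turn so the model only has to emit a label.
ANSWER_PREFIX = "The more consistent is Summary"


def build_prompt(system, article, option_a, option_b):
    """Wrap the system and user turns in <|im_start|>/<|im_end|> markers."""
    user = USER_TMPL.format(article=article, option_a=option_a, option_b=option_b)
    return (
        f"<|im_start|>system\n{system}\n<|im_end|>\n"
        f"<|im_start|>user\n{user}\n<|im_end|>\n"
        f"<|im_start|>assistant\n{ANSWER_PREFIX}"
    )


prompt = build_prompt(
    system="You are a helpful assistant.",
    article="The cat sat on the mat.",
    option_a="A cat sat on a mat.",
    option_b="A dog sat on a mat.",
)
print(prompt)
```

Because the prompt ends with the unfinished assistant turn, the model's continuation is expected to start with the chosen label.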
40+
41+
## Changes summary
42+
43+
1. I used the system-user-assistant prompt structure that was used during the model's fine-tuning.
44+
2. I renamed the option labels from A/B to Y/X to reduce the bias towards "A".
45+
3. I prepopulated the answer with "The more consistent is Summary" to keep the model's reply concise.
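
With the answer pre-filled, the model's continuation should begin with the label itself, so mapping it back to the original options is straightforward. The sketch below assumes the reply starts with `Y` or `X`, possibly after whitespace; `parse_choice` and `LABEL_TO_OPTION` are hypothetical names, not code from `fact.py`.

```python
# Summary Y holds option_a and Summary X holds option_b in the template.
LABEL_TO_OPTION = {"Y": "A", "X": "B"}


def parse_choice(completion):
    """Return 'A' or 'B' from the text generated after the pre-filled answer.

    Returns None when the continuation does not start with a known label.
    """
    text = completion.strip()
    if not text:
        return None
    return LABEL_TO_OPTION.get(text[0].upper())


print(parse_choice(" Y."))
print(parse_choice("X, because it matches the article."))
print(parse_choice("neither"))
```

Returning `None` for unexpected output keeps malformed replies separate from genuine A/B answers.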
46+
47+
# Repo guide
48+
49+
- `fact.py` - script used to run the benchmark; it saves results to `results.jsonl`.
50+
- `anal.ipynb` - Jupyter notebook to analyze the results.
51+
- `results.jsonl` - JSONL with raw model outputs.
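
To illustrate how a JSONL file of outputs could be reduced to the AB/BA/AA/BB buckets, here is a small sketch. The record schema is an assumption made for illustration only: the fields `first` and `second` (the choice in each presentation order) are hypothetical and may not match the actual layout of `results.jsonl`.

```python
import json
from collections import Counter


def tally(jsonl_text):
    """Count two-letter buckets such as 'AB' from JSONL records.

    Assumes each record has hypothetical 'first' and 'second' fields holding
    the option chosen in the original and the swapped order.
    """
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        counts[record["first"] + record["second"]] += 1
    return counts


sample = (
    '{"first": "A", "second": "B"}\n'
    '{"first": "A", "second": "A"}\n'
    '{"first": "A", "second": "B"}\n'
)
print(tally(sample))
```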

fact.py

+1-1
@@ -3,7 +3,7 @@
33

44
MODEL_PATH = 'oasst.gguf'
55
TASKS_PATH = 'fact.json'
6-
SKIP_TO = 76
6+
SKIP_TO = 0
77

88
PROMPT_TMPL = """\
99
Decide which of the following Summary is more consistent with the Article Sentence.

main.py

-68
This file was deleted.
