# GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
> **Note**
> 🌟 Love GuessArena? Star our project on GitHub to get instant updates and show your support!
## Abstract

The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains—finance, healthcare, manufacturing, information technology, and education—demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.

The key contributions of GuessArena are as follows:
- **Interactive, Reasoning-Based, Domain-Adaptive Evaluation Framework:** GuessArena formalizes the mechanics of the Guess Who I Am? game into a two-stage paradigm—dynamic knowledge modeling and progressive reasoning assessment—seamlessly integrating domain knowledge testing and complex reasoning evaluation within a unified framework.
- **Adaptive Card Extraction Algorithm:** GuessArena includes an algorithm that automatically extracts structured evaluation cards from unstructured documents (e.g., PDF, HTML, plain text) relevant to the target domain, significantly reducing the cost and effort of building domain-specific evaluation pipelines (a minimal illustrative sketch of the idea follows this list).
- **Comprehensive Evaluation Across Five Key Industries:** GuessArena demonstrates its applicability by evaluating state-of-the-art LLMs in finance, healthcare, manufacturing, information technology, and education. The entire evaluation framework and benchmark dataset are open-sourced to facilitate future research.
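The repository implements this card extraction end to end; the snippet below is only a minimal sketch of the idea, assuming documents have already been converted to plain text. The `query_llm` helper, prompt wording, and file paths are illustrative assumptions rather than GuessArena's actual API.

```python
# Minimal, illustrative sketch of adaptive card extraction (not the GuessArena implementation).
import json
from pathlib import Path

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a generator model (e.g., GPT-4o)."""
    raise NotImplementedError("Wire this up to your own LLM client.")

def extract_cards(doc_dir: str, max_keywords_per_doc: int = 100) -> list[str]:
    """Turn unstructured domain documents into a flat deck of keyword cards."""
    deck: list[str] = []
    for doc in Path(doc_dir).glob("*.txt"):
        text = doc.read_text(encoding="utf-8")
        prompt = (
            f"Extract up to {max_keywords_per_doc} domain-specific keywords "
            f"from the following document. Return a JSON list of strings.\n\n{text}"
        )
        deck.extend(json.loads(query_llm(prompt)))
    return sorted(set(deck))  # de-duplicate across documents

if __name__ == "__main__":
    cards = extract_cards("data/documents/info_tech")
    Path("data/testsets/info_tech.txt").write_text(json.dumps(cards, indent=2))
```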
- Clone the repository:

  ```bash
  $ git clone https://github.com/IAAR-Shanghai/GuessArena.git
  $ cd GuessArena
  ```
- Create a virtual environment:

  ```bash
  $ conda create -n guessarena -y python=3.10
  $ conda activate guessarena
  ```
- Install the required packages:

  ```bash
  $ pip install -r requirements.txt
  ```
- Set up the models config file:

  ```bash
  $ mv models_example.ini models.ini
  $ vim models.ini  # Edit the file to configure the models you want to use
  ```
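  For a rough idea of what a model entry might look like, here is a hypothetical sketch; the section and key names are assumptions, so follow the fields that `models_example.ini` actually provides.

  ```ini
  ; Hypothetical sketch only -- models_example.ini defines the real fields.
  [GPT-4o]
  api_key = sk-...
  base_url = https://api.openai.com/v1
  model_name = gpt-4o
  ```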
- Download the predefined datasets from Google Drive:

  GuessArena provides predefined datasets for five key domains: finance, healthcare, manufacturing, information technology, and education. The datasets include domain-specific documents, card packages, and test sets that are ready to use for evaluation.

  ```bash
  $ wget https://drive.google.com/uc?id=1ZJdb8UJZRlnceYDkGKv5Hc_LxZ3T8hR_ -O predefined_domains.zip
  $ unzip predefined_domains.zip -d data
  ```

  Place the downloaded datasets in the `data` directory, and edit the `configs/ind_docs.ini` file to point to the correct document directories.
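  As an illustration of what a topic-to-directory mapping could look like, the entry below is only a guess at the layout; check the shipped `configs/ind_docs.ini` for the real key names.

  ```ini
  ; Illustrative guess only -- consult the configs/ind_docs.ini template for the actual keys.
  [info_tech]
  doc_dir = data/documents/info_tech
  ```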
- Run the build script to prepare the evaluation cards:

  ```bash
  $ python cli.py build_deck \
      --gen_model GPT-4o \
      --topic info_tech \
      --gen_max_keywords_per_doc 100
  ```
- Run the evaluation script for predefined domains:

  ```bash
  $ python cli.py eval \
      --tester_model GPT-4o \
      --testee_model GPT-4o \
      --topic info_tech \
      --prompt_strategy basic \
      --verbose \
      --num_cards 30 \
      --random_seed 42
  ```
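  If you want to compare several candidate models against the same tester, one option is a small wrapper around the CLI call above; the loop below uses only the flags shown in this README, and the model names are placeholders.

  ```python
  # Hypothetical batch-evaluation wrapper around the GuessArena CLI.
  # Model names are placeholders; only flags documented above are used.
  import subprocess

  TESTEE_MODELS = ["GPT-4o", "YourModelA", "YourModelB"]  # placeholders

  for testee in TESTEE_MODELS:
      subprocess.run(
          [
              "python", "cli.py", "eval",
              "--tester_model", "GPT-4o",
              "--testee_model", testee,
              "--topic", "info_tech",
              "--prompt_strategy", "basic",
              "--verbose",
              "--num_cards", "30",
              "--random_seed", "42",
          ],
          check=True,  # abort the batch if any single run fails
      )
  ```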
- Run the statistics script to analyze the evaluation results:

  ```bash
  $ python cli.py stats
  ```
GuessArena allows you to evaluate LLMs in custom domains by creating your own evaluation cards based on domain-specific documents. This enables you to assess LLMs in areas that are not covered by the predefined domains.
- Set up your custom domain:

  Place your custom domain documents (PDF, HTML, or text files) in the `data/documents/your_custom_domain` directory, and edit the `configs/ind_docs.ini` file to point to the correct document directory.
- Run the build script to prepare the evaluation cards for your custom domain:

  ```bash
  $ python cli.py build_deck \
      --gen_model GPT-4o \
      --topic your_custom_domain \
      --gen_max_keywords_per_doc 100
  ```
- Alternatively, you can create a custom test set file at `data/testsets/your_custom_domain.txt` with the following format:

  ```json
  [
    "keyword1",
    "keyword2",
    "keyword3",
    "...",
    "keywordN"
  ]
  ```
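  If your keyword list already lives in Python, a few lines are enough to write it out in this format; the keywords below are placeholders for your own domain terms.

  ```python
  # Write a custom keyword list in the test-set format shown above.
  import json
  from pathlib import Path

  keywords = ["keyword1", "keyword2", "keyword3"]  # placeholders for your domain terms

  out_path = Path("data/testsets/your_custom_domain.txt")
  out_path.parent.mkdir(parents=True, exist_ok=True)
  out_path.write_text(json.dumps(keywords, ensure_ascii=False, indent=2), encoding="utf-8")
  ```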
- Run the evaluation script for your custom domain:

  ```bash
  $ python cli.py eval \
      --tester_model GPT-4o \
      --testee_model GPT-4o \
      --topic your_custom_domain \
      --prompt_strategy basic \
      --verbose \
      --num_cards 30 \
      --random_seed 42
  ```
You can also run card generation and evaluation from a YAML configuration file. Refer to `scripts/example.yaml` for an example configuration, then run:

```bash
$ python cli.py run --config scripts/example.yaml
```
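For a rough sense of what such a configuration might contain, the field names below are hypothetical and simply mirror the CLI flags above; `scripts/example.yaml` is the authoritative reference.

```yaml
# Hypothetical sketch only -- see scripts/example.yaml for the real schema.
build_deck:
  gen_model: GPT-4o
  topic: info_tech
  gen_max_keywords_per_doc: 100
eval:
  tester_model: GPT-4o
  testee_model: GPT-4o
  topic: info_tech
  prompt_strategy: basic
  num_cards: 30
  random_seed: 42
```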
If you find GuessArena useful in your research, please cite our paper:

```bibtex
@article{GuessArena,
  title={GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning},
  author={Qingchen Yu and Zifan Zheng and Ding Chen and Simin Niu and Bo Tang and Feiyu Xiong and Zhiyu Li},
  journal={arXiv preprint arXiv:2505.22661},
  year={2025}
}
```