GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

Apache 2.0 License

Note

🌟 Love GuessArena? Star our project on GitHub to get instant updates and show your support!

🗺️ Overview

(Figure: the GuessArena framework.)
Abstract

The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.

The key contributions of GuessArena are as follows:

  • Interactive, Reasoning-Based, Domain-Adaptive Evaluation Framework: GuessArena formalizes the mechanics of the Guess Who I Am? game into a two-stage paradigm—dynamic knowledge modeling and progressive reasoning assessment—seamlessly integrating domain knowledge testing and complex reasoning evaluation within a unified framework.

  • Adaptive Card Extraction Algorithm: GuessArena includes an algorithm that automatically extracts structured evaluation cards from unstructured documents (e.g., PDF, HTML, plain text) relevant to the target domain, significantly reducing the cost and effort of building domain-specific evaluation pipelines (a conceptual sketch follows this list).

  • Comprehensive Evaluation Across Five Key Industries: GuessArena demonstrates its applicability by evaluating state-of-the-art LLMs in finance, healthcare, manufacturing, information technology, and education. The entire evaluation framework and benchmark dataset are open-sourced to facilitate future research.
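To make the card-extraction step concrete, here is a minimal, hypothetical sketch of how structured evaluation cards could be distilled from a domain document with an LLM. It is not the repository's build_deck implementation; the prompt wording, model name, file path, and JSON parsing are assumptions made purely for illustration.

# Hypothetical sketch of adaptive card extraction (not the repo's build_deck code).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import json

from openai import OpenAI

client = OpenAI()

def extract_cards(document_text: str, max_keywords: int = 100) -> list[str]:
    """Ask an LLM to distill domain keywords ("cards") from unstructured text."""
    prompt = (
        f"Extract up to {max_keywords} domain-specific keywords from the text below. "
        "Return them as a JSON array of strings.\n\n" + document_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the --gen_model passed to build_deck
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    # Hypothetical document path; any plain-text domain document works.
    with open("data/documents/info_tech/sample.txt", encoding="utf-8") as f:
        print(extract_cards(f.read())[:10])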

⚙️ Installation

  • Clone the repository:
$ git clone https://github.com/IAAR-Shanghai/GuessArena.git
$ cd GuessArena
  • Create a virtual environment:
$ conda create -n guessarena -y python=3.10
$ conda activate guessarena
  • Install the required packages:
$ pip install -r requirements.txt
  • Set up the models config file (a quick sanity-check snippet follows these steps):
$ mv models_example.ini models.ini
$ vim models.ini # Edit the file to configure the models you want to use
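The keys expected in models.ini come from models_example.ini in the repository, so the exact section names depend on that file. As a quick sanity check after editing, you can list the configured sections with Python's standard library (a small sketch, assuming models.ini sits in the repository root):

# Quick sanity check: list the model sections configured in models.ini.
from configparser import ConfigParser

config = ConfigParser()
read_files = config.read("models.ini")  # path assumed relative to the repo root

if not read_files:
    raise SystemExit("models.ini not found; did you rename models_example.ini?")

print("Configured model sections:", config.sections())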

🚀 Usage

Evaluate with Predefined Domains

  • Download the predefined datasets from Google Drive:

GuessArena provides predefined datasets for five key domains: finance, healthcare, manufacturing, information technology, and education. Each dataset includes domain-specific documents, card packages, and test sets that are ready to use for evaluation.

$ wget https://drive.google.com/uc?id=1ZJdb8UJZRlnceYDkGKv5Hc_LxZ3T8hR_ -O predefined_domains.zip
$ unzip predefined_domains.zip -d data

Place the downloaded datasets in the data directory, and set up the configs/ind_docs.ini file to point to the correct document directories.

  • Run the build script to prepare the evaluation cards:
$ python cli.py build_deck \
    --gen_model GPT-4o \
    --topic info_tech \
    --gen_max_keywords_per_doc 100
  • Run the evaluation script for predefined domains (a conceptual sketch of the guessing game appears after these steps):
$ python cli.py eval \
    --tester_model GPT-4o \
    --testee_model GPT-4o \
    --topic info_tech \
    --prompt_strategy basic \
    --verbose \
    --num_cards 30 \
    --random_seed 42
  • Run the statistics script to analyze the evaluation results:
$ python cli.py stats
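For intuition about what eval runs, here is a heavily simplified, hypothetical sketch of the Guess Who I Am? interaction: a tester model holds a hidden card from the deck, and the testee model narrows it down with yes/no questions before guessing. It only illustrates the protocol described in the paper; the actual loop, prompts, and scoring live in the repository code, and ask_llm below is a hypothetical helper.

# Conceptual sketch of the guessing-game protocol (not the repo's eval loop).
# ask_llm(role, prompt) is a hypothetical helper that queries a model and returns text.
import random

def play_round(cards: list[str], max_turns: int, ask_llm) -> bool:
    """Tester hides one card; testee asks yes/no questions, then guesses."""
    hidden = random.choice(cards)
    history: list[str] = []
    for _ in range(max_turns):
        question = ask_llm(
            "testee",
            f"Candidate cards: {cards}\nDialogue so far: {history}\n"
            "Ask one yes/no question that best narrows down the hidden card.",
        )
        answer = ask_llm(
            "tester",
            f"The hidden card is '{hidden}'. Answer with Yes or No only: {question}",
        )
        history.append(f"Q: {question} A: {answer}")
    guess = ask_llm(
        "testee",
        f"Candidate cards: {cards}\nDialogue so far: {history}\nName the hidden card.",
    )
    return hidden.lower() in guess.lower()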

Evaluate with Custom Domains

GuessArena allows you to evaluate LLMs in custom domains by creating your own evaluation cards based on domain-specific documents. This enables you to assess LLMs in areas that are not covered by the predefined domains.

  • Set up your custom domain:

Place your custom domain documents (PDF, HTML, or text files) in the data/documents/your_custom_domain directory, and set up the configs/ind_docs.ini file to point to the correct document directory.

  • Run the build script to prepare the evaluation cards for your custom domain:
$ python cli.py build_deck \
    --gen_model GPT-4o \
    --topic your_custom_domain \
    --gen_max_keywords_per_doc 100
  • Alternatively, you can create a custom test set file in data/testsets/your_custom_domain.txt with the following format (a small generation sketch follows this list):
[
    "keyword1",
    "keyword2",
    "keyword3",
    "...",
    "keywordN"
]
  • Run the evaluation script for your custom domain:
$ python cli.py eval \
    --tester_model GPT-4o \
    --testee_model GPT-4o \
    --topic your_custom_domain \
    --prompt_strategy basic \
    --verbose \
    --num_cards 30 \
    --random_seed 42
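If you prefer to generate the test set programmatically, the sketch below writes a keyword list in the bracketed JSON-array format shown above; the domain name and keywords are placeholders to replace with your own.

# Write a custom test set in the JSON-array-of-strings format shown above.
import json
from pathlib import Path

keywords = ["keyword1", "keyword2", "keyword3"]  # replace with your domain terms

out_path = Path("data/testsets/your_custom_domain.txt")
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(keywords, indent=4, ensure_ascii=False), encoding="utf-8")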

Run with YAML Configuration

You can also run card generation and evaluation from a YAML configuration file. Refer to scripts/example.yaml for an example, and then run the following command:

$ python cli.py run --config scripts/example.yaml
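The authoritative schema is whatever scripts/example.yaml defines. Purely as an illustration, the sketch below builds a configuration dictionary that mirrors the CLI flags used above and dumps it to YAML; the key names and nesting are assumptions, and PyYAML is assumed to be available.

# Hypothetical helper: mirror the CLI flags above in a YAML config file.
# The real key names and structure are defined by scripts/example.yaml.
import yaml  # PyYAML, assumed to be installed

config = {
    "build_deck": {
        "gen_model": "GPT-4o",
        "topic": "info_tech",
        "gen_max_keywords_per_doc": 100,
    },
    "eval": {
        "tester_model": "GPT-4o",
        "testee_model": "GPT-4o",
        "topic": "info_tech",
        "prompt_strategy": "basic",
        "num_cards": 30,
        "random_seed": 42,
    },
}

with open("scripts/my_run.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)

Once the keys match the example file, you would point the run command at it, e.g. python cli.py run --config scripts/my_run.yaml.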

📊 Results

(Figure: GuessArena evaluation results.)

📄 Citation

@article{GuessArena,
      title={GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning},
      author={Qingchen Yu and Zifan Zheng and Ding Chen and Simin Niu and Bo Tang and Feiyu Xiong and Zhiyu Li},
      journal={arXiv preprint arXiv:2505.22661},
      year={2025},
}
