MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Overview

We propose MDocAgent, a novel multi-modal multi-agent framework for document question answering. It integrates text and image retrieval through five specialized agents (general, critical, text, image, and summarizing agents), enabling collaborative reasoning across modalities. Experiments on five benchmarks show a 12.1% improvement over state-of-the-art methods, demonstrating its effectiveness in handling complex real-world documents.

(Figure: overview of the MDocAgent framework)

Requirements

  1. Clone this repository and navigate to the MDocAgent folder:

    git clone https://github.com/aiming-lab/MDocAgent.git
    cd MDocAgent

  2. Install the package: create and activate a conda environment, then run the install script:

    conda create -n mdocagent python=3.12
    conda activate mdocagent
    bash install.sh

  3. Data Preparation
  • Create a data directory:

    mkdir data
    cd data
  • Download the dataset from Hugging Face and place it in the data directory. The documents of PaperText are the same as those of PaperTab; you can use a symbolic link or make a copy, as shown below.
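
    For example, assuming PaperTab's documents were downloaded to ptab/documents and PaperText expects them at ptext/documents (hypothetical paths; match them to the actual dataset layout), while still inside the data directory:

    mkdir -p ptext
    ln -s ../ptab/documents ptext/documents  # symlink target is resolved relative to the link's location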

  • Return to the project root:

    cd ../
  • Extract the data, where <dataset> is one of mmlb, ldu, ptab, ptext, or feta:

    python scripts/extract.py --config-name <dataset>

The extracted texts and images will be saved in tmp/<dataset>.
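
For example, extracting PaperTab saves its texts and images to tmp/ptab:

python scripts/extract.py --config-name ptab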

Retrieval

  • Text Retrieval

    Set the retrieval type to text in config/base.yaml:

    defaults:
    - retrieval: text

    Then run:

    python scripts/retrieve.py --config-name <dataset>
  • Image Retrieval

    Switch the retrieval type to image in config/base.yaml:

    defaults:
    - retrieval: image

    Run the retrieval process again:

    python scripts/retrieve.py --config-name <dataset>

The retrieval results will be stored in:

data/<dataset>/sample-with-retrieval-results.json
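
If these scripts are built on Hydra (the --config-name flag and the key=value overrides used elsewhere in this README suggest they are), the retrieval group can likely be switched from the command line instead of editing config/base.yaml. This is an assumption about the config layout, not a documented interface:

python scripts/retrieve.py --config-name <dataset> retrieval=text
python scripts/retrieve.py --config-name <dataset> retrieval=image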

Multi-Agent Inference

Run the following command:

python scripts/predict.py --config-name <dataset> run-name=<run-name>

Note: <run-name> is required; it can be any string that uniquely identifies this run.

The inference results will be saved to:

results/<dataset>/<run-name>/<run-time>.json

To use the top 4 retrieval candidates, override dataset.top_k:

python scripts/predict.py --config-name <dataset> run-name=<run-name> dataset.top_k=4

Evaluation

  1. Add your OpenAI API key to config/model/openai.yaml.
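
    The exact fields in config/model/openai.yaml are not documented here; a minimal sketch of what it might contain (the field name is an assumption, so check the shipped file):

    # hypothetical field name: use whatever key the shipped config actually defines
    api_key: <your-openai-api-key>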

  2. Run the evaluation (make sure <run-name> matches your inference run):

    python scripts/eval.py --config-name <dataset> run-name=<run-name>

The evaluation results will be saved in:

results/<dataset>/<run-name>/results.txt

Note: evaluation uses the newest inference result file with the same <run-name>.
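
Putting it all together for one dataset (PaperTab here), the commands from this README run in sequence; run-name is arbitrary but must match between inference and evaluation:

python scripts/extract.py --config-name ptab
# set "retrieval: text" in config/base.yaml, then:
python scripts/retrieve.py --config-name ptab
# set "retrieval: image" in config/base.yaml, then:
python scripts/retrieve.py --config-name ptab
python scripts/predict.py --config-name ptab run-name=demo
python scripts/eval.py --config-name ptab run-name=demo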

Citation

@article{han2025mdocagent,
  title={MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding},
  author={Han, Siwei and Xia, Peng and Zhang, Ruiyi and Sun, Tong and Li, Yun and Zhu, Hongtu and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2503.13964},
  year={2025}
}
