
Fail to reproduce tf2 bert-large f1-score on SQuADv1.1 #18334

Closed
@zhuango

Description


System Info

transformers=4.20.1, python=3.7, tensorflow-gpu=2.9.1.

Who can help?

@Rocketknight1 @sgugger @patil-suraj

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Download bert-large tf2 pretrained models, .h5 file from huggingface.
  2. Run the PyTorch question-answering example for reference. My script is:
python run_qa.py \
  --model_name_or_path $BERT_DIR \
  --dataset_name $SQUAD_DIR \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 128 \
  --doc_stride 48 \
  --output_dir $OUTPUT \
  --save_steps 10000 \
  --overwrite_cache

I got an F1 score of 90.3953%. Note that the pretrained model is loaded from the TF2 .h5 checkpoint.
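For reference, here is a minimal sketch of how the TF2 .h5 weights can be loaded into the PyTorch question-answering model; the local directory name is a stand-in for $BERT_DIR and is assumed to contain config.json and tf_model.h5:

# Minimal sketch: load a TF2 .h5 checkpoint into the PyTorch model.
# "./bert-large" is a placeholder for $BERT_DIR (must contain config.json and tf_model.h5).
from transformers import BertForQuestionAnswering, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("./bert-large")
# from_tf=True tells from_pretrained to convert the TF2 weights to PyTorch tensors.
model = BertForQuestionAnswering.from_pretrained("./bert-large", from_tf=True)
print(model.config.architectures)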

  3. Run the TF2 question-answering example with the same settings as the PyTorch example. My script is:
python run_qa.py \
  --model_name_or_path $BERT_DIR \
  --dataset_name $SQUAD_DIR \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 128 \
  --doc_stride 48 \
  --output_dir $OUTPUT \
  --save_steps 10000 \
  --overwrite_cache

I only got an F1 score of 88.5672%, which is much lower than expected and than the PyTorch result (90.3953%).
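For completeness, this is a minimal sketch of how the SQuAD v1.1 exact-match / F1 numbers are computed with the squad metric from the datasets library; the example id and answer texts below are made-up placeholders:

# Minimal sketch: compute SQuAD v1.1 exact-match and F1 with the datasets metric.
# The id and answer strings are placeholders, not taken from the actual runs.
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = [{"id": "56be4db0acb8001400a502ec", "prediction_text": "Denver Broncos"}]
references = [{
    "id": "56be4db0acb8001400a502ec",
    "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
}]
print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': 100.0, 'f1': 100.0}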

Expected behavior

The TF2 question-answering example should achieve an F1 score similar to that of the corresponding PyTorch example.
Alternatively, you could provide example scripts that achieve the target F1 score. Thanks.
