
Fail to reproduce tf2 bert-large f1-score on SQuADv1.1 #18334

Closed
@zhuango

Description


System Info

transformers=4.20.1, python=3.7, tensorflow-gpu=2.9.1.

Who can help?

@Rocketknight1 @sgugger @patil-suraj

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Download bert-large tf2 pretrained models, .h5 file from huggingface.
  2. Run the PyTorch question-answering example for reference. My script is:
python run_qa.py \
  --model_name_or_path $BERT_DIR \
  --dataset_name $SQUAD_DIR \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 128 \
  --doc_stride 48 \
  --output_dir $OUTPUT \
  --save_steps 10000 \
  --overwrite_cache

I got an F1 score of 90.3953%. Note that the pretrained model is loaded from the TF2 .h5 checkpoint.
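For reference, here is a minimal sketch of how the TF2 .h5 weights can be loaded into the PyTorch question-answering model; the local directory name is a stand-in for $BERT_DIR and is assumed to contain config.json and tf_model.h5:

# Minimal sketch: load a TF2 .h5 checkpoint into the PyTorch model.
# "./bert-large" is a placeholder for $BERT_DIR (must contain config.json and tf_model.h5).
from transformers import BertForQuestionAnswering, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("./bert-large")
# from_tf=True tells from_pretrained to convert the TF2 weights to PyTorch tensors.
model = BertForQuestionAnswering.from_pretrained("./bert-large", from_tf=True)
print(model.config.architectures)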

  3. Run the TF2 question-answering example with the same settings as the PyTorch example. My script is:
python run_qa.py \
  --model_name_or_path $BERT_DIR \
  --dataset_name $SQUAD_DIR \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 128 \
  --doc_stride 48 \
  --output_dir $OUTPUT \
  --save_steps 10000 \
  --overwrite_cache

I only got an F1 score of 88.5672%, which is much lower than expected and than the PyTorch result (90.3953%).
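For completeness, this is a minimal sketch of how the SQuAD v1.1 exact-match / F1 numbers are computed with the squad metric from the datasets library; the example id and answer texts below are made-up placeholders:

# Minimal sketch: compute SQuAD v1.1 exact-match and F1 with the datasets metric.
# The id and answer strings are placeholders, not taken from the actual runs.
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = [{"id": "56be4db0acb8001400a502ec", "prediction_text": "Denver Broncos"}]
references = [{
    "id": "56be4db0acb8001400a502ec",
    "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
}]
print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': 100.0, 'f1': 100.0}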

Expected behavior

The TF2 question-answering example should achieve an F1 score similar to that of the corresponding PyTorch example.
Alternatively, you could provide example scripts that achieve the target F1 score. Thanks.
