Summary of evaluation metrics obtained by submitting each model's predictions on the SQuAD 2.0 hold-out set to the official SQuAD website:

| MODEL | finetuned (by / on) | exact | f1 | total | HasAns_exact | HasAns_f1 | HasAns_total | NoAns_exact | NoAns_f1 | NoAns_total |
|---|---|---|---|---|---|---|---|---|---|---|
| human performance | NA | 86.831 | 89.452 | NA | NA | NA | NA | NA | NA | NA |
| albert-base-v2 | NA | 18.38 | 19.13 | 11873 | 0.05 | 1.55 | 5928 | 36.65 | 36.65 | 5945 |
| albert-xxlarge-v2 | NA | 0.008 | 0.64 | 11873 | 0 | 1.26 | 5928 | 0.017 | 0.017 | 5945 |
| bert_base_uncased | squad | 25.16 | 26.22 | 11873 | 0.84 | 2.96 | 5928 | 49.4 | 49.4 | 5945 |
| distilbert-ba | Travis | 56.77 | 59.17 | 11873 | 46.27 | 51.07 | 5928 | 67.24 | 67.24 | 5945 |
| bert-ba | Travis | 62.58 | 65.52 | 11873 | 48.98 | 54.87 | 5928 | 76.14 | 76.14 | 5945 |
| albert-lg | Travis | 63.56 | 65.12 | 11873 | 34.90 | 38.03 | 5928 | 92.14 | 92.14 | 5945 |
| bert-tiny | mrm8488, squad2 | 35.11 | 35.11 | 11873 | 0.15 | 0.34 | 5928 | 69.97 | 69.97 | 5945 |
| albert-base-v2 | twmkn9, squad2 | 77.92 | 81.38 | 11873 | 72.84 | 79.78 | 5928 | 82.98 | 82.98 | 5945 |
| albert-xlarge-v2 | ktrapeznikov, squad2 | 84.46 | 87.47 | 11873 | 80.01 | 86.04 | 5928 | 88.90 | 88.90 | 5945 |

Note: `total` columns are question counts (11873 questions overall, of which 5928 have an answer and 5945 do not). For no-answer questions exact match and F1 coincide, so `NoAns_exact` always equals `NoAns_f1`.
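The `exact` and `f1` columns follow SQuAD-style evaluation: answers are normalized (lowercased, punctuation and articles stripped), exact match compares the normalized strings, and F1 is the token-overlap harmonic mean of precision and recall. A minimal sketch of those two per-question metrics (an illustration of the scoring logic, not the official evaluation script):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Unanswerable questions: both empty scores 1.0, otherwise 0.0.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

The `HasAns_*` / `NoAns_*` columns are these same scores averaged separately over the answerable and unanswerable subsets; the empty-string case above is why a model that always abstains can still score well on the `NoAns_*` columns while collapsing on `HasAns_*`.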