All evaluations below have been computed with the OpenNMT-py converted models.
The evaluation script is taken from the https://github.com/FranxYao/chain-of-thought-hub repo and modified to use the OpenNMT-py models.
There is one difference from the original MMLU script by Hendrycks.
We do not compare the logprobs of A, B, C, D to determine the answer; instead, we actually decode the next token after the prompt.
When the model is SentencePiece based, the next token can be 'A', 'B', 'C', 'D' or any other token. When the model is BPE based, the tokens will be ' A', ' B', ' C', ' D' because the leading space is encoded with the letter. We strip that space to compute the metric.
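
The scoring rule described above can be sketched as follows. This is a minimal illustration, not the actual evaluation script: the function names `normalize_prediction` and `accuracy` are hypothetical, and the decoded tokens are assumed to be already available as plain strings.

```python
from typing import Iterable


def normalize_prediction(token: str) -> str:
    """Remove surrounding whitespace, e.g. the leading space of a BPE token like ' A'."""
    return token.strip()


def accuracy(predicted_tokens: Iterable[str], gold_labels: Iterable[str]) -> float:
    """Fraction of questions whose decoded next token equals the gold letter.

    Any token other than the gold letter (including non-letter tokens)
    counts as a wrong answer.
    """
    preds = [normalize_prediction(t) for t in predicted_tokens]
    golds = list(gold_labels)
    if len(preds) != len(golds):
        raise ValueError("predictions and gold labels must have the same length")
    if not golds:
        return 0.0
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)


if __name__ == "__main__":
    # Two BPE-style tokens with a leading space, one bare letter, one unrelated token.
    print(accuracy([" A", "B", " C", "the"], ["A", "B", "D", "A"]))
```

Because the answer is decoded rather than chosen among the four letters, a model that emits an unrelated token is simply scored as wrong, which a logprob comparison over A, B, C, D would never do.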
For 7B-parameter models:
- The Llama7B score (35.25) matches both the Llama paper and the score reported by chain-of-thought-hub.
- Falcon7B (0.2765) is a little higher than the score reported by chain-of-thought-hub (0.2641).
- I ran MPT7B with chain-of-thought-hub and found 28.46; again, ours (0.2958) is a little higher.
- There are major discrepancies between these scores and the HF Open LLM Leaderboard for MPT, Falcon and Redpajama, which score much higher on the leaderboard.
For the 13B, 33B and 40B models, we score with the 4-bit loading option, hence Llama13B scores slightly below the paper (44.72 vs. 46.9), and so does Llama33B (57.01 vs. 57.8).
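
To make the comparisons above easier to read, the short sketch below puts the `ACC-all` values from the table and the reference figures quoted in the text on the same percentage scale and prints the gap in points. The dictionary and function names are illustrative only; every number is copied from this page.

```python
# ACC-all values from the table (fractions) and the reference scores quoted
# in the text (percent). Names here are illustrative, not part of any script.
OURS = {
    "Llama7B": 0.3525,
    "Falcon7B": 0.2765,
    "MPT7B": 0.2958,
    "Llama13B": 0.4472,
    "Llama33B": 0.5701,
}
REFERENCE = {
    "Llama7B": 35.25,   # Llama paper and chain-of-thought-hub
    "Falcon7B": 26.41,  # chain-of-thought-hub (quoted as 0.2641)
    "MPT7B": 28.46,     # chain-of-thought-hub run
    "Llama13B": 46.9,   # Llama paper
    "Llama33B": 57.8,   # Llama paper
}


def point_gaps() -> dict:
    """Return (ours - reference) in percentage points for each model."""
    return {name: round(OURS[name] * 100 - REFERENCE[name], 2) for name in OURS}


if __name__ == "__main__":
    for name, gap in point_gaps().items():
        print(f"{name}: {gap:+.2f} points")
```

The 7B gaps are at most about one point upward, while the 4-bit runs sit about 2.2 points (13B) and 0.8 points (33B) below the paper.
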
| | MPT7B | Redpajama7B | Open Llama7B | Falcon7B | xgen7B | Flan-T5-3B | Llama7B | Llama-2-7B | Llama-2-chat-7B | Open Llama13B | Llama13B | Llama-2-13B | Llama-2-chat-13B | Falcon40B | Llama33B | Llama-2-70B |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ACC-all | 0.2958 | 0.2745 | 0.3007 | 0.2765 | 0.3468 | 0.4929 | 0.3525 | 0.4587 | 0.4569 | 0.4148 | 0.4472 | 0.5429 | 0.5217 | 0.5499 | 0.5701 | 0.6875 |
ACC-abstract_algebra | 0.2200 | 0.2500 | 0.3000 | 0.2400 | 0.2900 | 0.2700 | 0.2500 | 0.3000 | 0.3100 | 0.3200 | 0.2800 | 0.3100 | 0.3500 | 0.3200 | 0.3700 | 0.3900 |
ACC-anatomy | 0.2963 | 0.2667 | 0.3333 | 0.2444 | 0.3185 | 0.4296 | 0.3852 | 0.4815 | 0.4222 | 0.4667 | 0.4889 | 0.5037 | 0.5037 | 0.5111 | 0.5185 | 0.6296 |
ACC-astronomy | 0.2961 | 0.2763 | 0.2500 | 0.2434 | 0.3355 | 0.4737 | 0.3487 | 0.4079 | 0.4803 | 0.4737 | 0.4671 | 0.5263 | 0.5461 | 0.5658 | 0.6118 | 0.7895 |
ACC-business_ethics | 0.2900 | 0.2900 | 0.3200 | 0.1900 | 0.3200 | 0.6800 | 0.4100 | 0.5300 | 0.4200 | 0.4100 | 0.4300 | 0.5500 | 0.5000 | 0.5500 | 0.5800 | 0.6900 |
ACC-clinical_knowledge | 0.2943 | 0.3208 | 0.3887 | 0.3019 | 0.3057 | 0.5245 | 0.3585 | 0.4604 | 0.5208 | 0.4113 | 0.4189 | 0.5811 | 0.5698 | 0.6113 | 0.5547 | 0.7019 |
ACC-college_biology | 0.3056 | 0.3125 | 0.3264 | 0.2153 | 0.3958 | 0.4444 | 0.3819 | 0.4722 | 0.5417 | 0.4167 | 0.4722 | 0.5694 | 0.5347 | 0.6319 | 0.5833 | 0.8333 |
ACC-college_chemistry | 0.2800 | 0.2700 | 0.2400 | 0.2300 | 0.2500 | 0.3400 | 0.2900 | 0.3400 | 0.2500 | 0.2800 | 0.2400 | 0.3900 | 0.3600 | 0.4100 | 0.3800 | 0.5200 |
ACC-college_computer_science | 0.3100 | 0.3100 | 0.3100 | 0.3000 | 0.3300 | 0.3600 | 0.2900 | 0.3400 | 0.3600 | 0.4000 | 0.3700 | 0.4600 | 0.5100 | 0.4700 | 0.4400 | 0.6000 |
ACC-college_mathematics | 0.2900 | 0.2500 | 0.2800 | 0.2900 | 0.3200 | 0.2900 | 0.3400 | 0.3800 | 0.3400 | 0.3200 | 0.2500 | 0.3000 | 0.2800 | 0.3500 | 0.3600 | 0.3700 |
ACC-college_medicine | 0.2890 | 0.2659 | 0.3179 | 0.2659 | 0.3410 | 0.4277 | 0.3237 | 0.4220 | 0.4104 | 0.3699 | 0.4220 | 0.5318 | 0.4451 | 0.4798 | 0.5376 | 0.6532 |
ACC-college_physics | 0.2157 | 0.2451 | 0.1863 | 0.2157 | 0.2353 | 0.2941 | 0.2451 | 0.2255 | 0.2451 | 0.2549 | 0.1863 | 0.2647 | 0.3137 | 0.3333 | 0.3137 | 0.3333 |
ACC-computer_security | 0.3100 | 0.3600 | 0.3800 | 0.2800 | 0.3900 | 0.6400 | 0.4500 | 0.6200 | 0.5400 | 0.5400 | 0.6300 | 0.6900 | 0.6700 | 0.6500 | 0.6800 | 0.8100 |
ACC-conceptual_physics | 0.3362 | 0.2723 | 0.3064 | 0.3149 | 0.3489 | 0.4085 | 0.3702 | 0.4170 | 0.3872 | 0.3574 | 0.3915 | 0.4511 | 0.3787 | 0.4170 | 0.4723 | 0.6723 |
ACC-econometrics | 0.2895 | 0.2368 | 0.2895 | 0.2632 | 0.2632 | 0.2807 | 0.2632 | 0.2632 | 0.3333 | 0.3070 | 0.2719 | 0.2895 | 0.3158 | 0.3246 | 0.3333 | 0.4123 |
ACC-electrical_engineering | 0.2897 | 0.3034 | 0.3034 | 0.2828 | 0.3862 | 0.4552 | 0.2483 | 0.4759 | 0.4345 | 0.4966 | 0.3862 | 0.5172 | 0.5103 | 0.5034 | 0.4690 | 0.6276 |
ACC-elementary_mathematics | 0.2698 | 0.2646 | 0.2698 | 0.2593 | 0.2725 | 0.3148 | 0.2646 | 0.2672 | 0.2857 | 0.2487 | 0.2487 | 0.3360 | 0.3333 | 0.3413 | 0.3413 | 0.4180 |
ACC-formal_logic | 0.2540 | 0.4048 | 0.2381 | 0.1905 | 0.2619 | 0.3333 | 0.2619 | 0.2698 | 0.2381 | 0.3016 | 0.3889 | 0.3492 | 0.2857 | 0.3413 | 0.3571 | 0.5000 |
ACC-global_facts | 0.2700 | 0.3200 | 0.3200 | 0.3100 | 0.3300 | 0.3600 | 0.3000 | 0.3200 | 0.3100 | 0.2900 | 0.3400 | 0.3200 | 0.2900 | 0.3300 | 0.3900 | 0.4500 |
ACC-high_school_biology | 0.3097 | 0.2484 | 0.2968 | 0.2645 | 0.3290 | 0.5645 | 0.3387 | 0.5065 | 0.5258 | 0.4290 | 0.5065 | 0.6742 | 0.6194 | 0.6516 | 0.6419 | 0.8194 |
ACC-high_school_chemistry | 0.2020 | 0.2660 | 0.2512 | 0.2512 | 0.2611 | 0.3300 | 0.2956 | 0.3744 | 0.3547 | 0.3350 | 0.2660 | 0.4286 | 0.4138 | 0.4187 | 0.3793 | 0.5468 |
ACC-high_school_computer_science | 0.3400 | 0.2700 | 0.2800 | 0.3200 | 0.3200 | 0.5100 | 0.3300 | 0.4000 | 0.4500 | 0.2700 | 0.4500 | 0.5500 | 0.5800 | 0.6000 | 0.5800 | 0.7700 |
ACC-high_school_european_history | 0.3455 | 0.2848 | 0.3455 | 0.2909 | 0.3879 | 0.7333 | 0.4667 | 0.6121 | 0.5818 | 0.4727 | 0.6121 | 0.6545 | 0.6667 | 0.6667 | 0.7152 | 0.8121 |
ACC-high_school_geography | 0.3737 | 0.3283 | 0.3333 | 0.1667 | 0.3636 | 0.6414 | 0.3333 | 0.4899 | 0.5960 | 0.4899 | 0.5000 | 0.6616 | 0.6616 | 0.7121 | 0.7273 | 0.8636 |
ACC-high_school_government_and_politics | 0.3782 | 0.2124 | 0.3575 | 0.2591 | 0.4352 | 0.6632 | 0.4611 | 0.6736 | 0.6632 | 0.5959 | 0.6425 | 0.8135 | 0.7617 | 0.7927 | 0.8187 | 0.9430 |
ACC-high_school_macroeconomics | 0.3821 | 0.2718 | 0.3564 | 0.2615 | 0.3359 | 0.5359 | 0.3410 | 0.4513 | 0.4103 | 0.4282 | 0.4256 | 0.4923 | 0.4744 | 0.5641 | 0.5590 | 0.7308 |
ACC-high_school_mathematics | 0.2778 | 0.2667 | 0.2407 | 0.2481 | 0.2333 | 0.3074 | 0.2630 | 0.2963 | 0.2556 | 0.2667 | 0.2593 | 0.2889 | 0.3037 | 0.3111 | 0.2741 | 0.3630 |
ACC-high_school_microeconomics | 0.2941 | 0.3067 | 0.2941 | 0.2899 | 0.3697 | 0.5168 | 0.3319 | 0.4412 | 0.4328 | 0.4370 | 0.4454 | 0.5630 | 0.5042 | 0.5504 | 0.5588 | 0.7605 |
ACC-high_school_physics | 0.2583 | 0.2649 | 0.2517 | 0.3179 | 0.2450 | 0.2980 | 0.2649 | 0.3179 | 0.3046 | 0.2980 | 0.2517 | 0.3444 | 0.3245 | 0.2914 | 0.3311 | 0.3907 |
ACC-high_school_psychology | 0.2844 | 0.3229 | 0.3505 | 0.2440 | 0.4752 | 0.6771 | 0.4789 | 0.6312 | 0.6477 | 0.5486 | 0.5835 | 0.7413 | 0.7229 | 0.7541 | 0.7596 | 0.8752 |
ACC-high_school_statistics | 0.4028 | 0.2454 | 0.3981 | 0.1852 | 0.1620 | 0.3657 | 0.3241 | 0.2778 | 0.3241 | 0.2546 | 0.2685 | 0.4722 | 0.3611 | 0.4630 | 0.4676 | 0.6157 |
ACC-high_school_us_history | 0.2892 | 0.2255 | 0.3137 | 0.2892 | 0.4167 | 0.6863 | 0.3284 | 0.5245 | 0.6765 | 0.5490 | 0.5343 | 0.7108 | 0.6863 | 0.7108 | 0.7696 | 0.9069 |
ACC-high_school_world_history | 0.2489 | 0.2785 | 0.2869 | 0.2996 | 0.3966 | 0.6667 | 0.4262 | 0.6245 | 0.6667 | 0.5105 | 0.6287 | 0.7089 | 0.7215 | 0.6835 | 0.7637 | 0.8608 |
ACC-human_aging | 0.3274 | 0.1659 | 0.2870 | 0.4215 | 0.4260 | 0.5650 | 0.3991 | 0.5695 | 0.5695 | 0.5157 | 0.5112 | 0.6502 | 0.6816 | 0.7130 | 0.6861 | 0.7848 |
ACC-human_sexuality | 0.3511 | 0.2519 | 0.2748 | 0.2901 | 0.3359 | 0.5802 | 0.3435 | 0.5649 | 0.4885 | 0.4962 | 0.5649 | 0.6031 | 0.5878 | 0.6794 | 0.6718 | 0.8550 |
ACC-international_law | 0.3802 | 0.2231 | 0.3636 | 0.2479 | 0.5041 | 0.6860 | 0.5207 | 0.6529 | 0.5620 | 0.5207 | 0.6860 | 0.6860 | 0.7851 | 0.6612 | 0.7603 | 0.8595 |
ACC-jurisprudence | 0.3704 | 0.2315 | 0.3426 | 0.3426 | 0.4074 | 0.6204 | 0.4167 | 0.5370 | 0.5833 | 0.4444 | 0.4722 | 0.6852 | 0.7037 | 0.6667 | 0.6574 | 0.8148 |
ACC-logical_fallacies | 0.2945 | 0.2638 | 0.2883 | 0.2638 | 0.3558 | 0.6319 | 0.4172 | 0.5092 | 0.5399 | 0.4847 | 0.5031 | 0.6564 | 0.6319 | 0.6503 | 0.6994 | 0.7975 |
ACC-machine_learning | 0.3125 | 0.2232 | 0.2321 | 0.3750 | 0.2589 | 0.3571 | 0.2768 | 0.3839 | 0.3393 | 0.3571 | 0.3304 | 0.3036 | 0.3482 | 0.3036 | 0.3750 | 0.5089 |
ACC-management | 0.3301 | 0.2816 | 0.2524 | 0.2816 | 0.3010 | 0.6796 | 0.3301 | 0.5631 | 0.6699 | 0.5243 | 0.6311 | 0.7379 | 0.7184 | 0.7184 | 0.7573 | 0.8252 |
ACC-marketing | 0.3120 | 0.2735 | 0.3761 | 0.2949 | 0.5385 | 0.7906 | 0.4615 | 0.6795 | 0.7265 | 0.5897 | 0.7094 | 0.8077 | 0.7821 | 0.7949 | 0.8333 | 0.8932 |
ACC-medical_genetics | 0.3100 | 0.2400 | 0.2700 | 0.2800 | 0.3600 | 0.4800 | 0.3700 | 0.5500 | 0.5000 | 0.5100 | 0.5100 | 0.5500 | 0.5700 | 0.6200 | 0.6100 | 0.7400 |
ACC-miscellaneous | 0.3001 | 0.2899 | 0.3678 | 0.2976 | 0.5326 | 0.6782 | 0.4278 | 0.6450 | 0.6692 | 0.5900 | 0.6296 | 0.7407 | 0.7458 | 0.7471 | 0.7752 | 0.8557 |
ACC-moral_disputes | 0.2977 | 0.2659 | 0.3295 | 0.3092 | 0.3613 | 0.5983 | 0.4133 | 0.5116 | 0.5145 | 0.4798 | 0.4566 | 0.6272 | 0.5809 | 0.6503 | 0.6503 | 0.7572 |
ACC-moral_scenarios | 0.2436 | 0.2469 | 0.2469 | 0.2492 | 0.2425 | 0.2436 | 0.2425 | 0.2380 | 0.2145 | 0.2715 | 0.2480 | 0.3464 | 0.2927 | 0.2615 | 0.3855 | 0.4413 |
ACC-nutrition | 0.2810 | 0.2908 | 0.3301 | 0.2582 | 0.3431 | 0.4804 | 0.3922 | 0.4902 | 0.5098 | 0.3758 | 0.5163 | 0.6144 | 0.5980 | 0.6405 | 0.6471 | 0.7778 |
ACC-philosophy | 0.3183 | 0.2830 | 0.2830 | 0.2830 | 0.3151 | 0.5177 | 0.4051 | 0.6013 | 0.5659 | 0.4662 | 0.5145 | 0.6656 | 0.6077 | 0.6399 | 0.6656 | 0.7781 |
ACC-prehistory | 0.3056 | 0.3210 | 0.3210 | 0.3117 | 0.3488 | 0.5216 | 0.3519 | 0.4907 | 0.5679 | 0.5216 | 0.5093 | 0.6451 | 0.5926 | 0.5988 | 0.6667 | 0.8272 |
ACC-professional_accounting | 0.2447 | 0.2872 | 0.2553 | 0.2979 | 0.3050 | 0.3723 | 0.2730 | 0.3582 | 0.3475 | 0.3050 | 0.3227 | 0.3830 | 0.3759 | 0.4255 | 0.4326 | 0.5780 |
ACC-professional_law | 0.2784 | 0.2705 | 0.2523 | 0.2497 | 0.2647 | 0.3990 | 0.2973 | 0.3553 | 0.3266 | 0.3064 | 0.3566 | 0.4068 | 0.3722 | 0.4296 | 0.4342 | 0.5404 |
ACC-professional_medicine | 0.2206 | 0.2059 | 0.2500 | 0.3125 | 0.4375 | 0.4412 | 0.4265 | 0.5184 | 0.3529 | 0.3860 | 0.5000 | 0.5221 | 0.4706 | 0.6176 | 0.5441 | 0.7390 |
ACC-professional_psychology | 0.2876 | 0.2925 | 0.2696 | 0.2647 | 0.3203 | 0.4526 | 0.3546 | 0.4428 | 0.4739 | 0.3693 | 0.4575 | 0.5392 | 0.5065 | 0.5539 | 0.6144 | 0.7500 |
ACC-public_relations | 0.3455 | 0.3182 | 0.4091 | 0.3364 | 0.4182 | 0.5909 | 0.4091 | 0.5273 | 0.5182 | 0.5273 | 0.5545 | 0.6364 | 0.6091 | 0.6364 | 0.6818 | 0.7273 |
ACC-security_studies | 0.3796 | 0.2816 | 0.2939 | 0.3102 | 0.2531 | 0.6531 | 0.3306 | 0.4980 | 0.4571 | 0.4245 | 0.5224 | 0.6122 | 0.6531 | 0.6735 | 0.6367 | 0.8082 |
ACC-sociology | 0.2239 | 0.2587 | 0.2488 | 0.3532 | 0.4826 | 0.7363 | 0.4726 | 0.6318 | 0.5771 | 0.5473 | 0.6418 | 0.7264 | 0.7214 | 0.7761 | 0.7761 | 0.8955 |
ACC-us_foreign_policy | 0.3500 | 0.3200 | 0.3900 | 0.4200 | 0.5100 | 0.6600 | 0.4300 | 0.6500 | 0.6700 | 0.6100 | 0.7200 | 0.8500 | 0.7700 | 0.8000 | 0.8300 | 0.9100 |
ACC-virology | 0.3494 | 0.2530 | 0.3494 | 0.3554 | 0.3735 | 0.4819 | 0.3253 | 0.4217 | 0.4277 | 0.4398 | 0.4096 | 0.4458 | 0.4940 | 0.4639 | 0.5000 | 0.5361 |
ACC-world_religions | 0.3158 | 0.3041 | 0.4035 | 0.3333 | 0.6140 | 0.5614 | 0.4912 | 0.6842 | 0.6842 | 0.6550 | 0.6491 | 0.7602 | 0.7427 | 0.7719 | 0.7953 | 0.8538 |
| | 0.3022 | 0.2747 | 0.3053 | 0.2818 | 0.3515 | 0.5018 | 0.3569 | 0.4682 | 0.4662 | 0.4258 | 0.4559 | 0.5482 | 0.5340 | 0.5580 | 0.5741 | 0.6932 |