{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T06:23:06Z","timestamp":1770963786479,"version":"3.50.1"},"reference-count":37,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T00:00:00Z","timestamp":1770336000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004423","name":"Waseda University","doi-asserted-by":"publisher","award":["2025E-008"],"award-info":[{"award-number":["2025E-008"]}],"id":[{"id":"10.13039\/501100004423","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Informatics"],"abstract":"<jats:p>Large language models (LLMs) have begun to function as assistants or teammates in language learning, teaching, and research. However, what prerequisites are required for LLMs to reliably play these roles, and how such prerequisites should be measured, remains under-discussed. This study focuses on measuring Pedagogical Grammar Pattern Recognition (P-GPR) and establishes the Chinese Pedagogical Grammar Evaluation (CPG-EVAL), a multi-tiered benchmark designed to evaluate P-GPR within International Chinese Language Education. CPG-EVAL operationalizes grammar\u2013instance correspondence through five task types that progressively increase contextual load and interference. We evaluate multiple proprietary and open-source LLMs as well as human participants. Results show a monotonic ordering across groups (humans &gt; larger-scale models &gt; semi-larger-scale models &gt; smaller-scale models). In comparison with human participants, LLM performance is more sensitive to task-format complexity. In addition, we identify a set of completely failed items that consistently mislead all evaluated LLMs, exposing shared and systematic weaknesses in current models\u2019 pedagogical grammar recognition. Overall, this study provides an operational framework for diagnosing the capabilities and risks of LLMs when they are deployed as assistants or teammates in grammar-related language-education tasks and offers empirical reference for safer and more syllabus-aligned use of LLMs in educational settings.<\/jats:p>","DOI":"10.3390\/informatics13020029","type":"journal-article","created":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T16:32:59Z","timestamp":1770395579000},"page":"29","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["CPG-EVAL: Evaluating the Readiness of Large Language Models as Assistants and Teammates in Language Teaching"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-0410-3694","authenticated-orcid":false,"given":"Dong","family":"Wang","sequence":"first","affiliation":[{"name":"Faculty of Education and Integrated Arts and Sciences, Waseda University, Shinjuku-ku 169-8050, Tokyo, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2026,2,6]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"19343","DOI":"10.1007\/s10639-024-12574-6","article-title":"Incorporating AI in Foreign Language Education: An Investigation into ChatGPT\u2019s Effect on Foreign Language Learners","volume":"29","author":"Abedi","year":"2024","journal-title":"Educ. Inf. 
Technol."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"100266","DOI":"10.1016\/j.caeai.2024.100266","article-title":"A Systematic Review of the First Year of Publications on ChatGPT and Language Education: Examining Research on ChatGPT\u2019s Use in Language Learning and Teaching","volume":"7","author":"Li","year":"2024","journal-title":"Comput. Educ. Artif. Intell."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"125","DOI":"10.1016\/0272-7757(94)90003-5","article-title":"Subject Area Preparation of Secondary Mathematics and Science Teachers and Student Achievement","volume":"13","author":"Monk","year":"1994","journal-title":"Econ. Educ. Rev."},{"key":"ref_4","unstructured":"Medgyes, P. (1994). The Non-Native Teacher, Macmillan."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"100529","DOI":"10.1016\/j.caeai.2025.100529","article-title":"Large Language Models in Education: A Systematic Review of Empirical Applications, Benefits, and Challenges","volume":"10","author":"Shi","year":"2026","journal-title":"Comput. Educ. Artif. Intell."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"2518","DOI":"10.1080\/0144929X.2024.2394886","article-title":"The Promise and Challenges of Generative AI in Education","volume":"44","author":"Giannakos","year":"2025","journal-title":"Behav. Inf. Technol."},{"key":"ref_7","first-page":"457","article-title":"The English Grammar Profile of Learner Competence: Methodology and Key Findings","volume":"22","author":"Mark","year":"2022","journal-title":"Int. J. Corpus Linguist."},{"key":"ref_8","unstructured":"Ying, C., Wang, H., Jin, H., Li, Y., and Liu, Y. (2022). Chinese Proficiency Grading Standards for International Chinese Language Education Grammar Learning Manual (Elementary, Intermediate, Advanced), Beijing Language and Culture University Press."},{"key":"ref_9","unstructured":"Sunakawa, Y. (2015). A Handbook of Japanese Grammar Patterns for Teachers and Learners, Kurosio Publishers."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Odlin, T. (1994). Perspectives on Pedagogical Grammar, Cambridge Applied Linguistics, Cambridge University Press.","DOI":"10.1017\/CBO9781139524605"},{"key":"ref_11","unstructured":"Wang, D. (2025, January 20\u201322). Evaluation of Large Language Models\u2019 Foreign Language Teaching Ability: An Experimental Study Focusing on Pedagogical Grammar. Proceedings of the 103rd Language and Speech Understanding and Dialogue Processing Study Group, Tokyo, Japan."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1080\/09500789908666766","article-title":"Why do L2 teachers need to \u2018know about language\u2019? Teacher metalinguistic awareness and input for learning","volume":"13","author":"Andrews","year":"1999","journal-title":"Lang. Educ."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Andrews, S. (2007). Teacher Language Awareness, Cambridge University Press. [1st ed.].","DOI":"10.1017\/CBO9780511497643"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"10","DOI":"10.54475\/jlt.2024.023","article-title":"A review of teacher language awareness (2015\u20132024): Current trends and future directions","volume":"4","author":"Wang","year":"2024","journal-title":"J. Lang. Teach."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Cenoz, J., Gorter, D., and May, S. (2017). Teacher Language Awareness. 
Language Awareness and Multilingualism, Springer International Publishing.","DOI":"10.1007\/978-3-319-02240-6"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1080\/09658410108667027","article-title":"The language awareness of the L2 teacher: Its impact upon pedagogical practice","volume":"10","author":"Andrews","year":"2001","journal-title":"Lang. Aware."},{"key":"ref_17","unstructured":"Linzen, T., Chrupa\u0142a, G., and Alishahi, A. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018, Association for Computational Linguistics."},{"key":"ref_18","first-page":"3266","article-title":"SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems","volume":"Volume 32","author":"Wang","year":"2019","journal-title":"Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8\u201314 December 2019"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Xu, L., Hu, H., Zhang, X., Li, L., Cao, C., Li, Y., Xu, Y., Sun, K., Yu, D., and Yu, C. (2020, January 13\u201318). CLUE: A Chinese Language Understanding Evaluation Benchmark. Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain.","DOI":"10.18653\/v1\/2020.coling-main.419"},{"key":"ref_20","unstructured":"Xu, L., Li, A., Zhu, L., Xue, H., Zhu, C., Zhao, K., He, H., Zhang, X., Kang, Q., and Lan, Z. (2023). SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark. arXiv."},{"key":"ref_21","unstructured":"Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (May, January 26). Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR), Virtual."},{"key":"ref_22","first-page":"1","article-title":"Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models","volume":"2023","author":"Srivastava","year":"2022","journal-title":"Trans. Mach. Learn. Res."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Jin, Q., Dhingra, B., Liu, Z., Cohen, W.W., and Lu, X. (2019, January 3\u20137). PubMedQA: A Dataset for Biomedical Research Question Answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China.","DOI":"10.18653\/v1\/D19-1259"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Chen, Z., Chen, W., Smiley, C., Shah, S., Borova, I., Langdon, D., Moussa, R., Beane, M., Huang, T.H., and Routledge, B. (2021, January 7\u201311). FINQA: A Dataset of Numerical Reasoning over Financial Data. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic.","DOI":"10.18653\/v1\/2021.emnlp-main.300"},{"key":"ref_25","first-page":"44123","article-title":"LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models","volume":"36","author":"Guha","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Hou, J., Ao, C., Wu, H., Kong, X., Zheng, Z., Tang, D., Li, C., Hu, X., Xu, R., and Ni, S. (2024). 
DOI: https://doi.org/10.3390/informatics13020029
Full text: https://www.mdpi.com/2227-9709/13/2/29
ISSN: 2227-9709 (electronic)
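This record follows the Crossref "work" message schema, in which the title and author fields are lists and the abstract is stored as JATS XML with HTML-escaped entities (e.g. <jats:p> tags and &gt; for ">"). As a minimal sketch — assuming the public Crossref REST API at api.crossref.org and the third-party requests library — the snippet below retrieves the record by its DOI and recovers a plain-text abstract:

```python
import html
import re

import requests  # third-party: pip install requests

# Fetch the Crossref record for this article via the public REST API.
DOI = "10.3390/informatics13020029"
resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]  # the bibliographic payload shown above

# Title and authors are stored as lists in the Crossref schema.
title = work["title"][0]
authors = [f"{a.get('given', '')} {a['family']}".strip()
           for a in work.get("author", [])]

# The abstract arrives as JATS XML (e.g. <jats:p>...</jats:p>) with
# HTML-escaped entities such as &gt;. Strip the tags, then unescape.
raw = work.get("abstract", "")
abstract = html.unescape(re.sub(r"</?jats:[^>]+>", "", raw)).strip()

print(title)
print("; ".join(authors))
print(abstract)
```

Stripping only the jats: tags leaves the abstract's own punctuation intact; html.unescape then restores characters such as ">" that Crossref escapes for XML safety.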