Document Type: Research Paper

Author

Department of English, Faculty of Literature, Alzahra University, Tehran, Iran

Abstract

The evaluation of students' writing and the assignment of scores are traditionally time-intensive and inherently subjective, often resulting in inconsistencies among human raters. Automated essay scoring systems were introduced to address these issues; however, their development has historically been resource-intensive, restricting their application to standardized tests such as TOEFL and IELTS and leaving them largely inaccessible to everyday educators and learners. Recent advances in Artificial Intelligence (AI) have expanded the potential of automated scoring systems, enabling them to analyze written texts and assign scores with greater efficiency and versatility. This study compared the efficacy of an AI-based scoring system, DeepAI, with that of human evaluators. A quantitative approach, grounded in Corder's (1974) Error Analysis framework, was used to analyze approximately 200 essays written by Persian-speaking EFL learners. Paired-samples t-tests and Pearson correlation coefficients were employed to assess the congruence between the errors identified and the scores assigned by the two methods. The findings revealed a moderate correlation between human and AI scores, with the AI system identifying more errors than the human raters. These results underscore the potential of AI to augment writing assessment practices and highlight pedagogical implications for language instructors and learners, particularly in the evaluation of EFL students' essays.
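
As a purely illustrative aid for readers who wish to reproduce the kind of score comparison reported here, the following minimal Python sketch runs a paired-samples t-test and a Pearson correlation on two sets of essay scores using SciPy. The variable names and score values are hypothetical placeholders, not the study's data or its actual analysis code.

import numpy as np
from scipy import stats

# Hypothetical holistic scores assigned to the same essays by human raters
# and by an AI scoring system (placeholder values, not the study's data).
human_scores = np.array([72, 65, 80, 58, 90, 77, 63, 85])
ai_scores = np.array([70, 68, 78, 55, 88, 80, 60, 83])

# Paired-samples t-test: do the two scoring methods differ systematically
# in the scores they assign to the same essays?
t_stat, t_p = stats.ttest_rel(human_scores, ai_scores)

# Pearson correlation: how closely do the two sets of scores covary?
r, r_p = stats.pearsonr(human_scores, ai_scores)

print(f"Paired t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Pearson correlation: r = {r:.2f}, p = {r_p:.3f}")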

References

Al-Ahdal, A. (2020). Using computer software as a tool of error analysis: Giving EFL teachers and learners a much-needed impetus. International Journal of Innovation, Creativity, and Change, 12(2), 418–437.
Alrashidi, O., & Phan, H. (2015). Education context and English teaching and learning in the Kingdom of Saudi Arabia: An overview. English Language Teaching, 8(5), 33–44. https://doi.org/10.5539/elt.v8n5p33
Alshakhi, A. (2019). Revisiting the writing assessment process at a Saudi English language institute: Problems and solutions. English Language Teaching, 12(1), 176–185. https://doi.org/10.5539/elt.v12n1p176
Attali, Y. (2013). Validity and reliability of automated essay scoring. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 181–198). Routledge.
Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99–115. https://doi.org/10.1177/0265532215582283
Attali, Y., Lewis, W., & Steier, M. (2013). Scoring with the computer: Alternative procedures for improving the reliability of holistic essay scoring. Language Testing, 30(1), 125–141. https://doi.org/10.1177/0265532212452396
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3).
Barkaoui, K. (2010a). Do ESL essay raters’ evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1), 31–57. https://doi.org/10.5054/tq.2010.214047
Barkaoui, K. (2010b). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54–74. https://doi.org/10.1080/15434300903464418
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It’s not only the scoring. Educational Measurement: Issues and Practice, 17, 9–17. https://doi.org/10.1002/j.2333-8504.1997.tb01734.x
Burstein, J., & Chodorow, M. (2010). Progress and new directions in technology for automated essay evaluation. In R. B. Kaplan (Ed.), The Oxford handbook of applied linguistics (pp. 529–539). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780195384253.013.0036
Burstein, J., Marcu, D., & Knight, K. (1998). A machine learning approach to recognizing features of coherence in student essays. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics.
Chen, C., Cheng, Y., & Huang, H. (2020). The impact of automated feedback on EFL learners’ writing performance: A meta-analysis. Educational Technology Research and Development, 68(2), 123–145.
Chukharev-Hudilainen, E., & Saricaoglu, A. (2016). Causal discourse analyzer: Improving automated feedback on academic ESL writing. Computer Assisted Language Learning, 29(3), 494–516. https://doi.org/10.1080/09588221.2014.991795
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
Corder, S. P. (1967). The significance of learners’ errors. International Review of Applied Linguistics, 5, 161–170.
Corder, S. P. (1981). Error analysis and interlanguage. Oxford University Press.
Cotos, E. (2015). Automated writing analysis for writing pedagogy. Writing & Pedagogy, 7(2–3), 197–231. https://doi.org/10.1558/wap.v7i2-3.26381
Dembsey, J. M. (2017). Closing the Grammarly® gaps: A study of claims and feedback from an online grammar program. The Writing Center Journal, 36(1), 63–96. https://www.jstor.org/stable/44252638
Dikli, S., & Bleyle, S. (2014). Automated essay scoring feedback for second language writers: How does it compare to instructor feedback? Assessing Writing, 22, 1–17. https://doi.org/10.1016/j.asw.2014.03.006
Douglas, D. (2010). Understanding language testing. Hodder Education.
Farangi, M. R., & Zabbah, M. (2023). Intelligent scoring in an English reading comprehension course using artificial neural networks and neuro-fuzzy systems. Teaching English as a Second Language Quarterly, 42(4), 1–21. https://tesl.shirazu.ac.ir
Gamper, J., & Knapp, J. (2002). A review of intelligent CALL systems. Computer Assisted Language Learning, 15(4), 329–342. https://doi.org/10.1076/call.15.4.329.8270
Ghufron, M. A., & Rosyida, F. (2018). The role of Grammarly in assessing English as a foreign language (EFL) writing. Lingua Cultura, 12(4), 395–403. https://doi.org/10.21512/lc.v12i4.4582
Goh, T. T., Sun, H., & Yang, B. (2020). Microfeatures influencing writing quality: The case of Chinese students’ SAT essays. Computer Assisted Language Learning, 33(4), 455–481. https://doi.org/10.1080/09588221.2019.1572017
Gonzalez, M., Liu, Y., & Zhang, J. (2021). Addressing bias in automated essay scoring: A case study on EFL learners’ essays. Language Testing, 38(4), 495–515.
Higgins, J. J. (1983). Computer-assisted language learning. Language Teaching, 16(2), 102–114.
Huang, S. J. (2001). Error analysis and teaching composition [Unpublished master’s thesis]. National Tsing Hua University.
Huang, S. J. (2014). Automated versus human scoring: A case study in an EFL context. Electronic Journal of Foreign Language Teaching, 11, 149–164.
Jayavalan, K., & Razali, A. B. (2018). Effectiveness of online grammar checkers to improve secondary students’ English narrative essay writing. International Research Journal of Education and Sciences, 2(1), 1–6. http://psasir.upm.edu.my/id/eprint/14442
Ke, Z., & Ng, V. (2019). Automated essay scoring: A survey of the state of the art. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19). https://doi.org/10.24963/ijcai.2019/879
Kenning, M. J., & Kenning, M. M. (1983). Introduction to computer-assisted language teaching. Oxford University Press.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated essay assessment. Assessment & Evaluation in Higher Education, 28(5), 491–505.
Li, J., Link, S., & Hegelheimer, V. (2015). Rethinking the role of automated writing evaluation (AWE) feedback in ESL writing instruction. Journal of Second Language Writing, 27, 1–18. https://doi.org/10.1016/j.jslw.2014.10.004
Li, Z., Link, S., Ma, H., Yang, H., & Hegelheimer, V. (2014). The role of automated writing evaluation holistic scores in the ESL classroom. System, 44, 66–78. https://doi.org/10.1016/j.system.2014.02.007
Lu, M., Deng, Q., & Yang, M. (2019). EFL writing assessment: Peer assessment vs. automated essay scoring. In E. Popescu, H. Tianyong, T.-C. Hsu, H. Xie, & M. Temperini (Eds.), International Symposium on Emerging Technologies for Education (pp. 21–29). Springer. https://doi.org/10.1007/978-3-030-38778-5_3
Morse, J. M. (1991). Strategies for sampling. In J. M. Morse (Ed.), Qualitative nursing research: A contemporary dialogue (pp. 127–145). Sage.
Nova, M. (2018). Utilizing Grammarly in evaluating academic writing: A narrative research on EFL students’ experience. Premise: Journal of English Education, 7(1), 80–97. https://doi.org/10.24127/pj.v7i1.1300
O’Neill, R., & Russell, A. M. T. (2019). Stop! Grammar time: University students’ perceptions of the automated feedback program Grammarly. Australasian Journal of Educational Technology, 35(1), 42–56. https://doi.org/10.14742/ajet.3795
Park, J. (2019). An AI-based English grammar checker vs. human raters in evaluating EFL learners’ writing. Multimedia-Assisted Language Learning, 22(1), 112–131. https://doi.org/10.15702/mall.2019.22.1.112
Perdana, I., & Farida, M. (2019). Online grammar checkers and their use for EFL writing. Journal of English Teaching, Applied Linguistics, and Literatures, 2(2), 67–76. https://doi.org/10.20527/jetall.v2i2.7332
Polit, D. F., & Hungler, B. P. (1993). Study guide for essentials of nursing research: Methods, appraisal, and utilization. Lippincott Williams & Wilkins.
Prinsloo, D., & Bothma, T. (2020). A copulative decision tree as a writing tool for Sepedi. South African Journal of African Languages, 40(1), 85–97. https://doi.org/10.1080/02572117.2020.1733834
Ranalli, J. (2018). Automated written corrective feedback: How well can students make use of it? Computer Assisted Language Learning, 31(7), 653–674. https://doi.org/10.1080/09588221.2018.1428994
Rao, Z., & Li, X. (2017). Native and non-native teachers’ perceptions of error gravity: The effects of cultural and educational factors. The Asia-Pacific Education Researcher, 26(1–2), 51–59. https://doi.org/10.1007/s40299-017-0326-5
Reilly, E. D., Stafford, R. E., Williams, K. M., & Corliss, S. B. (2014). Evaluating the validity and applicability of automated essay scoring in two massive open online courses. International Review of Research in Open and Distributed Learning, 15(5), 83–98. https://doi.org/10.19173/irrodl.v15i5.1857
Seker, M. (2018). Intervention in teachers’ differential scoring judgments in assessing L2 writing through communities of assessment practice. Studies in Educational Evaluation, 59, 209–217. https://doi.org/10.1016/j.stueduc.2018.08.003
Sharma, C., Bishnoi, A., Sachan, A. K., & Verma, A. (2019). Automated essay evaluation using natural language processing. International Research Journal of Engineering and Technology, 6(5), 2055–2058. https://www.irjet.net/archives/V6/i5/IRJET-V6I5398.pdf
Shermis, M., & Burstein, J. (2003). Automated essay scoring: A cross-disciplinary perspective. Routledge.
Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art essay scorers: A preliminary report on the first automated essay scoring challenge. Proceedings of the 2012 Conference on Computer-Based Test Development.
Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76. https://doi.org/10.1016/j.asw.2013.04.001
Shermis, M. D., Burstein, J., Higgins, D., & Zechner, K. (2010). Automated essay scoring: Writing assessment and instruction. In E. Baker, B. McGaw, & N. S. Petersen (Eds.), International encyclopedia of education (3rd ed., pp. 75–80). Elsevier.
Sparks, J. R., Song, Y., Brantley, W., & Liu, O. L. (2014). Assessing written communication in higher education: Review and recommendations for next-generation assessment (Issue No. 2). ETS Research Report Series. https://doi.org/10.1002/ets2.12035
Vojak, C., Kline, S., Cope, B., McCarthey, S., & Kalantzis, M. (2011). New spaces and old places: An analysis of writing assessment software. Computers and Composition, 28(2), 97–111. https://doi.org/10.1016/j.compcom.2011.04.004
Wali, F. A., & Huijser, H. (2018). Write to improve: Exploring the impact of an automated feedback tool on Bahraini learners of English. Learning & Teaching in Higher Education: Gulf Perspectives, 15(1). https://doi.org/10.18538/lthe.v15.n1.293
Wang, J., & Brown, M. S. (2007). Automated essay scoring versus human scoring: A comparative study. Journal of Technology, Learning, and Assessment, 6(2). https://files.eric.ed.gov/fulltext/EJ838612.pdf
Wilson, J., & Roscoe, R. (2019). Automated writing evaluation and feedback: Multiple metrics of efficacy. Journal of Educational Computing Research, 58(1), 87–125. https://doi.org/10.1177/0735633119830764
Wiseman, C. S. (2012). Rater effects: Ego engagement in rater decision-making. Assessing Writing, 17(3), 150–173. https://doi.org/10.1016/j.asw.2011.12.001
Wind, S. A., & Engelhard, G. Jr. (2013). How invariant and accurate are domain ratings in writing assessment? Assessing Writing, 18(4), 278–299. https://doi.org/10.1016/j.asw.2013.09.002
Wu, H., & Garza, E. V. (2014). Types and attributes of English writing errors in the EFL context: A study of error analysis. Journal of Language Teaching & Research, 5(6), 1256–1262. https://doi.org/10.4304/jltr.5.6.1256-1262
Zhang, Y., Wang, L., & Li, X. (2019). The effectiveness of automated essay scoring: A systematic review and meta-analysis. Computers & Education, 139(1), 56–68.