Capstone Phase 1 Report
In the Department of
Machine Learning
(UG Program: B.E. in Artificial Intelligence and Machine Learning)
Submitted by
Basem Kaunain 1BM22AI026
Bhavana S 1BM22AI029
Faiza Khanum 1BM22AI047
6th Semester
TABLE OF CONTENTS
Declaration
Acknowledgement
Abstract
Table of Contents
1 INTRODUCTION
1.1 Overview
2 PROBLEM STATEMENT
2.2 Motivation
3 LITERATURE SURVEY
5 SYSTEM DESIGN
5.3 Detailed Design
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
The automation of grading handwritten, long-format responses has garnered considerable research
interest. Recent studies integrate optical character recognition (OCR) with natural language
processing (NLP) to convert student-written answers into digital form and assess them against
model responses. Approaches range from straightforward keyword matching and cosine similarity
over TF-IDF vectors to more sophisticated techniques such as neural network classifiers,
BERT-based assessments of semantic similarity, fuzzy logic systems that award partial credit
based on the relevance and comprehensiveness of a response, and fine-tuned large language models
(LLMs) such as LLaMA or GPT-based architectures that can grasp the essence of a student's answer
and deliver a nuanced assessment. Together, these methods enhance fairness and consistency in
scoring, mitigate human bias, significantly reduce the time required to evaluate detailed answers,
lessen the burden on teachers, and offer students prompt, more objective feedback. They represent
a substantial step towards making the evaluation of descriptive answers more scalable, precise,
and dependable.
Chapter 1
INTRODUCTION
No matter how skilled or dedicated, human evaluators are vulnerable to fatigue, stress, and
personal biases. These elements can cause variations in scoring, where the same answer might
receive different grades based on the time or the evaluators’ mindset. Such discrepancies
jeopardize the fairness and objectivity of the examination system, raising significant concerns in
high-stakes assessments like university semester examinations, board exams, or standardized
national tests. As a result, there is a growing need for dependable, scalable, and impartial
alternatives to manual grading.
Objective questions, such as multiple-choice or true/false formats, have long been suitable for
automation. Optical Mark Recognition (OMR) systems have transformed the grading of these
types, enabling precise and efficient evaluations on a large scale. However, automating the
grading of open-ended, handwritten responses—particularly those that require essay-type
answers, detailed explanations, or sequential derivations—presents a much more intricate
challenge. These answers demand not only recognition of the written text but also a deep
understanding of the semantic content, logical structure, and thoroughness of the explanation.
Recent developments in Artificial Intelligence (AI), especially in Machine Learning (ML) and
Natural Language Processing (NLP), present promising solutions to this issue. By training
intelligent systems on extensive datasets that include scanned answer sheets along with their
associated human-assigned scores, machines can learn to recognize important characteristics of
high-quality answers. These characteristics might encompass the inclusion of certain keywords,
compliance with expected formats, grammatical accuracy, and semantic relevance to the posed
questions. Significantly, machine learning models can uphold consistent grading standards across
vast datasets, thereby removing the subjectivity linked to human grading.
The primary benefit of automated grading systems is their reliability and speed. Once trained,
these systems can evaluate thousands of answer sheets in a tiny fraction of the time required by
human graders. Additionally, they uniformly apply the same grading criteria to all submissions,
thereby ensuring fairness in the evaluations. This consistency is particularly advantageous in
large-scale testing environments, where variability in human assessment frequently results in
student dissatisfaction and an increase in requests for re-evaluation.
The COVID-19 pandemic hastened the shift towards online and hybrid learning formats,
highlighting the need for scalable digital assessment solutions. In online testing scenarios, the
logistical challenges of distributing, collecting, and manually grading handwritten responses
became even more evident. Consequently, the creation of automated grading tools for descriptive
answers has experienced significant growth within academic and educational technology circles.
A standard system for automated evaluation of descriptive answers consists of several essential
components. Initially, Optical Character Recognition (OCR) technologies are utilized to convert
text from scanned handwritten answer sheets into digital format. Historically, OCR has primarily
focused on printed text; however, recent advancements, particularly in neural network-based
OCR models, have greatly enhanced the precision of handwriting recognition across various
styles and languages.
After the handwritten material is digitized, Natural Language Processing (NLP) techniques are
applied. These algorithms examine the extracted text for grammatical accuracy, relevance to the
subject matter, keyword alignment, sentence construction, and coverage of topics. Depending on
the answer's complexity and the scoring methodology, different NLP approaches might be
utilized, ranging from basic rule-based matching and bag-of-words techniques to more advanced
deep learning models such as Recurrent Neural Networks (RNNs), Transformers, and pre-trained
Large Language Models (LLMs) like BERT and GPT.
Furthermore, some systems integrate fuzzy logic and probabilistic frameworks to address unclear
or ambiguous responses, allowing for partial credit when warranted. Others utilize semantic
similarity measures like cosine similarity, BLEU scores, or tailored scoring algorithms to
evaluate student answers against model responses. Recent studies are also investigating the
application of graph-based models and knowledge representation systems to evaluate conceptual
understanding and coherence in student writing.
Another exciting avenue is the adoption of generative models that mimic human grading. These
models are trained not only to evaluate response content but also to provide feedback—assisting
students in recognizing their errors and enhancing their future performance. This educational
aspect of automated grading significantly enriches the learning experience by transforming
assessments into opportunities for learning rather than merely evaluative tasks.
Deploying such systems does come with its challenges. Errors in OCR can result in imprecise
text extraction, particularly when dealing with poor handwriting or low-resolution scans. Bias in
training datasets can distort grading results if the system encounters a limited variety of answer
styles or student backgrounds. Additionally, issues of interpretability and transparency remain
vital concerns. Educators need to comprehend and trust the decision-making processes of these
systems to fully integrate them into practical educational environments.
Despite these obstacles, the potential advantages of automated grading for descriptive answers
are considerable. It has the capacity to lessen the workload for educators, speed up the feedback
process for learners, and harmonize assessment methods across various institutions and regions.
With ongoing developments in AI, computer vision, and natural language processing (NLP),
along with thorough testing and validation in educational settings, these technologies are poised
to become essential tools in the realm of assessment.
This document seeks to deliver a thorough overview of this emerging field. It starts by
examining the reasons and challenges that drive the creation of automated grading solutions for
handwritten descriptive answers. It then assesses existing systems and their fundamental
methodologies, emphasizing advancements in optical character recognition (OCR), NLP, deep
learning, and hybrid scoring algorithms. Lastly, it discusses the present limitations of these
systems and highlights promising avenues for future research and practical applications.
Chapter 2
PROBLEM STATEMENT
2.2 Motivation
2.3 Scope and Objectives
Scope:
Automated evaluation systems for descriptive answers aim to simplify the grading of handwritten
exam responses of short to medium length, typically seen in schools and universities. These
systems are specifically designed to assess theoretical or conceptual responses worth between 2
and 10 marks, provided a clear reference answer or grading rubric is available. The emphasis is
on analyzing scanned handwritten exam sheets, where the primary
challenge involves accurate recognition of handwriting through OCR, followed by semantic
content analysis using advanced NLP and machine learning methodologies. Although they were
initially implemented in fields like computer science and engineering, these solutions are not
limited to any particular content area and can be adapted to other academic disciplines with the
appropriate training data. The systems are designed for legible handwriting in English and are
capable of addressing moderate noise in images, but they do not currently accommodate
multilingual inputs or highly subjective or creative answers. Ultimately, these systems are aimed
at structured examination environments to provide scalable, consistent, and equitable evaluation
results that can match or surpass the accuracy of human grading.
Objectives:
2. To reduce the time, labor, and subjectivity involved in manual grading by
providing a fast, unbiased, and scalable automated evaluation mechanism.
Reducing the human role in the scoring process lets teachers concentrate on qualitative
educational tasks, while students obtain faster and more uniform feedback.
Impact on Society/Academics/Industry:
The effective implementation of automated systems for evaluating descriptive answers can
revolutionize educational assessment by providing prompt and unbiased grading, particularly in
extensive examinations. For educators, it alleviates their workload and improves the speed of
feedback. For learners, it guarantees fairness and openness. In the corporate sphere, such systems
can be modified for training evaluations or certification tests, showcasing their versatility across
different fields. In summary, this technology promotes educational fairness, enhances operational
efficiency, and drives digital advancements in evaluation methods.
Chapter 3
LITERATURE SURVEY
Evaluating descriptive answer scripts by hand is a time-consuming task that can be influenced by
human bias and exhaustion. To mitigate these issues, various automated systems have been
created that utilize advancements in Natural Language Processing (NLP), Optical Character
Recognition (OCR), machine learning, and fuzzy logic to assess lengthy handwritten answers.
These technologies strive to replicate or enhance human assessment regarding efficiency,
impartiality, and uniformity.
Early attempts at automatic grading were basic and mainly depended on matching keywords.
Bharambe et al. (2021) introduced a system that compared a student’s response to a list of key
terms supplied by the examiner. The model would tally the matches and assign scores according
to established criteria. Although efficient, these systems struggled with semantic variation and
could be easily misled by superficial inclusion of keywords without true comprehension of the
content.
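As an illustration of this first generation of systems, the following minimal sketch tallies examiner-supplied keywords and scales the match count into a mark. The keyword list, mark scheme, and sample answer are illustrative assumptions for demonstration, not details taken from Bharambe et al. (2021).

import re

def keyword_score(answer: str, keywords: list[str], max_marks: float) -> float:
    """Score an answer by the fraction of examiner-supplied keywords it contains."""
    text = answer.lower()
    # Count keywords that appear as whole words in the answer.
    hits = sum(1 for kw in keywords
               if re.search(r"\b" + re.escape(kw.lower()) + r"\b", text))
    return round(max_marks * hits / len(keywords), 2) if keywords else 0.0

# Illustrative example (keywords chosen arbitrarily for demonstration)
keywords = ["overfitting", "regularization", "validation set"]
student_answer = "Overfitting occurs when a model memorises training data; a validation set helps detect it."
print(keyword_score(student_answer, keywords, max_marks=5))  # 2 of 3 keywords -> 3.33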
Later research shifted focus to similarity metrics based on vectors. Kulkarni et al. (2022) applied
Term Frequency–Inverse Document Frequency (TF–IDF) to transform both student and model
answers into vector representations, subsequently employing cosine similarity to evaluate
content alignment. This marked an advancement, allowing for the identification of answers
despite variations in phrasing.
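A minimal sketch of this TF-IDF and cosine-similarity comparison is given below, using scikit-learn (part of the tool stack listed in Chapter 4); the sample answers are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(student_answer: str, model_answer: str) -> float:
    """Return the cosine similarity between TF-IDF vectors of the two answers."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([model_answer, student_answer])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

model = "Photosynthesis converts light energy into chemical energy stored in glucose."
student = "Plants use light energy to make glucose, storing chemical energy."
print(round(tfidf_similarity(student, model), 3))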
Sanuvala and Fatima (2021) developed the Handwritten Answer Evaluation System (HAES),
which is a supervised machine learning model trained on graded examples. Each sentence in the
answer was evaluated using similarity metrics, and the sentence-level results were aggregated to
produce a final score. Dheerays et al. (2024) and Bharambe et al. (2021) investigated the use of artificial
neural networks (ANNs), which incorporated features like answer length, frequency of
keywords, and semantic relevance. These networks were designed to predict grades based on the
training data, enhancing accuracy across various answer formats.
Acknowledging that human grading often relies on subjective and vague assessments,
Nandwalkar et al. (2023) utilized fuzzy logic for automated scoring. Their approach generated
scores based on keyword and semantic similarities, and then applied fuzzy inference rules to
determine grades. This allowed for the awarding of partial credit for answers that were somewhat
correct, resembling the qualitative reasoning of human graders. The system classified responses
into categories such as “Very Good,” “Fair,” and others according to graded membership
functions, enhancing the flexibility of the evaluation process.
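The simplified sketch below illustrates the general idea of fuzzy grading with triangular membership functions over a combined evidence score. The breakpoints, feature weights, and grade labels are assumptions for illustration and do not reproduce Nandwalkar et al.'s actual rule base.

def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b over the interval [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_grade(similarity: float, keyword_coverage: float) -> str:
    """Map similarity and keyword coverage (both in [0, 1]) to a qualitative grade."""
    # Combine the two pieces of evidence with illustrative weights.
    evidence = 0.6 * similarity + 0.4 * keyword_coverage
    memberships = {
        "Poor":      triangular(evidence, -0.01, 0.0, 0.4),
        "Fair":      triangular(evidence, 0.2, 0.5, 0.7),
        "Good":      triangular(evidence, 0.5, 0.7, 0.9),
        "Very Good": triangular(evidence, 0.7, 1.0, 1.01),
    }
    # Simple defuzzification: pick the label with the highest membership degree.
    return max(memberships, key=memberships.get)

print(fuzzy_grade(similarity=0.82, keyword_coverage=0.6))  # -> "Good"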
OCR-Enabled Handwritten Input
OCR is essential for converting handwritten documents into digital formats. Tesseract, a popular
open-source optical character recognition engine, continues to be widely utilized. Approaches
developed by Zhang et al. (2021) and Deepak et al. (2022) improve Tesseract by employing
pre-processing techniques like binarization, noise elimination, and skew adjustment to enhance
the accuracy of OCR for answer sheets.
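A minimal sketch of such a pipeline using OpenCV and pytesseract (both appear in the tool stack in Chapter 4) is shown below; the image path is a placeholder, skew correction is omitted for brevity, and the Tesseract engine must be installed separately.

import cv2
import pytesseract

def extract_text(image_path: str) -> str:
    """Denoise and binarise a scanned answer sheet, then run Tesseract OCR."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.medianBlur(image, 3)  # light noise removal
    # Otsu binarisation, as described above.
    _, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)

# Example (placeholder path): print(extract_text("scans/answer_sheet_01.png"))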
LLM-Based Evaluation
Recent research by Agarwal et al. (2024) incorporates Large Language Models (LLMs) for
assessment purposes. A customized LLaMA-2 model, enhanced with retrieval-augmented
generation (RAG), was utilized to evaluate comprehensive answers based on textbook content.
This system demonstrated accuracy and consistency similar to that of human graders, effectively
pinpointing essential concepts and evaluating relevance. Implemented via AWS SageMaker, it
serves as a state-of-the-art, scalable solution.
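The exact LLaMA-2 and retrieval-augmented pipeline of Agarwal et al. is not reproduced here; the sketch below only illustrates generic prompt-based grading with the HuggingFace transformers pipeline, where the model identifier is a placeholder (a gated LLaMA-2 chat checkpoint) and the rubric wording is an assumption. In a full RAG setup, retrieved textbook passages would be prepended to the prompt.

from transformers import pipeline

# Placeholder model identifier; any instruction-tuned chat model could be substituted.
grader = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def llm_grade(question: str, reference: str, answer: str) -> str:
    """Ask the model to grade a student answer against a reference answer."""
    prompt = (
        "You are an examiner. Using the reference answer, grade the student answer "
        "out of 10 and justify the score in one sentence.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {answer}\n"
        "Grade:"
    )
    return grader(prompt, max_new_tokens=150)[0]["generated_text"]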
Commercial Solutions
Tools such as Gradescope, created by Turnitin, employ AI-supported grading for scanned
student responses. Although not fully automated, these systems enable semi-automated
assessment by grouping similar answers and permitting batch grading. Likewise, Mettl and
ExamSoft offer online proctoring and limited answer evaluation through AI, though restricted to
structured or semi-structured responses.
3.2 Research Gaps
Despite significant progress, several gaps persist in the automated evaluation of handwritten
descriptive answers:
OCR systems such as Tesseract struggle when dealing with irregular handwriting, different ink
types, and varying scan qualities. Many systems rely on high-quality digital inputs. It is essential
for research to tackle the variability of handwriting specific to different domains, particularly
concerning age groups and regional scripts.
4. Feedback Generation
Although systems can allocate scores, few offer meaningful feedback. Feedback plays a vital
role in the learning process. Adding generative features for in-depth, question-oriented
explanations would greatly improve educational effectiveness.
5. Subject-Specific Evaluation Challenges
Responses in areas such as mathematics, physics, or law typically demand a precise logical
framework or references. General natural language processing methods might struggle to assess
these answers correctly. Tailoring to specific domains is essential.
6. Multilingual and Code-Mixed Responses
In both Indian and international educational environments, responses are frequently composed in
various languages or feature a blend of English and local expressions. Only a limited number of
systems manage code-mixed content or multilingual writing well.
7. Bias and Transparency
Models built on biased data could perpetuate inequalities in grading. It is essential that
transparent assessment methods and de-biasing techniques are incorporated into any
implemented system.
Chapter 4
4.1 Functional Requirements
Functional requirements outline the fundamental actions and characteristics of the system. They
concentrate on how the system reacts to user inputs, manages its data, and executes its primary
tasks.
1. Answer Script Upload
● The system must allow users (e.g., teachers or examiners) to upload scanned images or
PDFs of student answer scripts.
● Support for bulk uploads should be included.
2. Text Extraction (OCR)
● The system shall extract text from handwritten answer images using OCR (e.g., Tesseract
or an equivalent engine).
● It must support preprocessing steps such as grayscale conversion, noise removal, and
thresholding to enhance OCR accuracy.
3. Text Evaluation
● The extracted answer shall be assessed against a reference response using semantic
similarity methods (such as BERT embeddings with cosine similarity); a minimal sketch of
this comparison is given at the end of this requirements list.
● Keyword presence, length checks, and the overall structural clarity of the response shall
also be evaluated.
4. Score Generation
● Based on the degree of similarity and the relevance of the response, the system should
produce a score (for instance, out of 5 or 10 marks).
● The grading mechanism may incorporate fuzzy logic, predictions from machine learning
models, or outputs generated by large language models.
5. Feedback Generation
● The system will provide concise feedback highlighting the strengths and weaknesses of
the response to aid in learning.
7. Admin Panel
● The system must feature an admin panel that allows examiners to upload reference
responses, modify evaluation criteria, and retrain models when needed.
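A minimal sketch of the semantic-similarity comparison and score generation described in requirements 3 and 4 is given below, using a small sentence-embedding model from HuggingFace; the model choice, pooling strategy, and linear mark scaling are assumptions rather than fixed design decisions.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # assumed lightweight embedding model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(text: str) -> torch.Tensor:
    """Mean-pooled token embeddings as a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def semantic_score(student: str, reference: str, max_marks: int = 10) -> float:
    """Scale the cosine similarity between answer embeddings to a mark."""
    sim = float(cosine_similarity(embed(student).numpy(), embed(reference).numpy())[0, 0])
    return round(max(sim, 0.0) * max_marks, 1)

# Example: semantic_score("Plants make glucose using light.", "Photosynthesis produces glucose from light energy.")

In a full implementation, the linear scaling shown here could be replaced by the fuzzy-logic or LLM-based scoring mentioned in requirement 4.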
4.2 Non-functional Requirements
1. Accuracy
● The system must achieve high OCR accuracy, exceeding 90% for clear handwriting, and
maintain consistency in evaluation.
● NLP similarity measures need to be refined to guarantee that the scores accurately
represent the quality of the answers.
2. Performance
● Each answer should be evaluated within a reasonable time, targeting less than five
seconds per answer.
● The system must be able to handle the simultaneous processing of at least 10 answer
scripts.
3. Scalability
● The system needs to be able to scale in order to accommodate various exams and manage
numerous scripts with each batch upload.
● It must support deployment on cloud platforms for elastic resource allocation.
4. Usability
● The user interface must be easy to navigate, necessitating little training for instructors.
● Error messages and feedback from the system should be straightforward and offer clear
next steps.
5. Reliability
6. Security
● Answer scripts and results that are uploaded need to be stored in a secure manner, with
encryption both during transmission and while at rest.
● Access to evaluation data should be limited to authorized users only.
7. Maintainability
● The system ought to be modular and adhere to clean coding practices to facilitate
straightforward updates, debugging, and enhancements.
● Each module should be accompanied by appropriate documentation.
The following outlines the essential and recommended hardware resources needed for the system's
efficient deployment and operation.
Minimum Hardware Requirements
Component Specification
RAM 8 GB
Recommended Hardware Requirements (for production/cloud deployment)
Component Specification
RAM 16–32 GB
Software Requirements
Operating system
Backend Technologies
● Python 3.9+
● Flask or FastAPI (for API endpoints)
● Node.js (for backend services, optional)
● MongoDB / PostgreSQL (for data storage)
● Tesseract OCR engine
● TensorFlow / PyTorch (for ML/ANN models)
● HuggingFace Transformers (for BERT/RoBERTa)
Frontend Technologies
Machine Learning and NLP Libraries
● Scikit-learn
● Transformers (HuggingFace)
● SpaCy / NLTK (for text preprocessing)
● OpenCV (for image processing)
● FuzzyWuzzy (for fuzzy matching)
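As a brief illustration of how FuzzyWuzzy could support keyword checks despite OCR misspellings, the sketch below flags keywords whose best partial match exceeds a similarity threshold; the threshold and sample text are assumptions.

from fuzzywuzzy import fuzz

def matched_keywords(answer: str, keywords: list[str], threshold: int = 80) -> list[str]:
    """Return keywords considered present despite minor OCR misspellings."""
    return [kw for kw in keywords
            if fuzz.partial_ratio(kw.lower(), answer.lower()) >= threshold]

# OCR output with typical character-level errors (illustrative)
ocr_text = "Normalisaton and tokenizaton prepare the text for analysis."
print(matched_keywords(ocr_text, ["normalization", "tokenization", "stemming"]))
# expected: ['normalization', 'tokenization']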
Chapter 5
SYSTEM DESIGN
The phase of system design is essential for establishing the framework, functionality, and
interactions of software elements. This part outlines the architecture, approach, and
comprehensive design of the suggested solution.
The system follows a modular multi-layered architecture comprising four main components:
Integrates:
4. Database layer
● User data
● Uploaded scripts
● Reference answers
● Evaluation results
The suggested system streamlines the process of assessing handwritten descriptive responses
through a combined method that utilizes Optical Character Recognition (OCR), Natural
Language Processing (NLP), semantic similarity assessment, and rule-based marking. The aim is
to mimic the actions a human evaluator would perform.
1. Answer Upload and Text Extraction (OCR)
Educators start by submitting scanned images of handwritten exam papers via the online platform.
These images are processed using the Tesseract OCR engine, which retrieves the text content. To
enhance accuracy, preprocessing techniques like binarization and noise reduction can be utilized.
The result is unrefined machine-readable text that reflects the student's response.
2. Text Preprocessing
The extracted text is subjected to normalization, which involves converting everything to lowercase,
removing punctuation, and tidying up whitespace. The refined text is then tokenized—divided
into significant units like words or phrases—to facilitate analysis and comparison. This step
guarantees that unnecessary differences in formatting do not affect the assessment.
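A small sketch of this normalization and tokenization step is shown below; it uses plain Python string handling, although SpaCy or NLTK from the planned tool stack could perform the same preprocessing.

import re

def normalize_and_tokenize(raw_text: str) -> list[str]:
    """Lower-case, strip punctuation and extra whitespace, then split into word tokens."""
    text = raw_text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text.split()

print(normalize_and_tokenize("  Photosynthesis, in plants,\nconverts LIGHT energy!  "))
# ['photosynthesis', 'in', 'plants', 'converts', 'light', 'energy']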
3. Reference Answer Integration
A benchmark or sample response for each question serves as a basis for comparison. This can be
provided by the instructor or generated automatically using a Large Language Model (LLM). At
this point, key terms, expected length, and evaluation criteria (such as required points) can be
established. These elements assist in guiding the assessment and ensuring consistency with the
desired learning outcomes.
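One possible shape for such a per-question reference record is sketched below; the field names and example values are illustrative assumptions, not a fixed schema.

from dataclasses import dataclass, field

@dataclass
class ReferenceAnswer:
    """Per-question benchmark used to guide evaluation (illustrative fields)."""
    question_id: str
    model_answer: str
    keywords: list[str] = field(default_factory=list)
    expected_length: int = 100          # approximate word count
    max_marks: int = 10
    required_points: list[str] = field(default_factory=list)

ref = ReferenceAnswer(
    question_id="Q3",
    model_answer="Overfitting occurs when a model fits noise in the training data...",
    keywords=["overfitting", "variance", "regularization"],
    expected_length=120,
    max_marks=5,
    required_points=["definition", "cause", "one mitigation technique"],
)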
4. Semantic and Rule-Based Evaluation
This is the core evaluation step and involves both advanced NLP techniques and rule-based
checks:
This dual-layered evaluation ensures both semantic understanding and structural completeness of
the answer.
5. Scoring and Feedback Generation
Based on the results of the semantic similarity evaluation and rule-compliance checks, the system
assigns a numerical score to each answer. The scoring mechanism relies on set thresholds (for
instance, a cosine similarity greater than 0.8 equates to full marks) or a
combination of various weighted features. Feedback is generated automatically, emphasizing the
strengths and pinpointing overlooked keywords or ideas, which assists the student in grasping
their performance.
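A minimal sketch of this scoring and feedback step is shown below, combining the threshold rule mentioned above with a weighted fallback; the weights, the 0.8 cut-off, and the feedback wording are illustrative assumptions.

def score_and_feedback(similarity: float, keyword_coverage: float,
                       missing_keywords: list[str], max_marks: int = 10) -> tuple[float, str]:
    """Combine weighted features into a mark and point out missing concepts."""
    if similarity >= 0.8:                     # high-similarity shortcut, as described above
        marks = float(max_marks)
    else:
        weighted = 0.7 * similarity + 0.3 * keyword_coverage
        marks = round(weighted * max_marks, 1)
    feedback = "Answer covers the expected concepts well."
    if missing_keywords:
        feedback = "Consider addressing: " + ", ".join(missing_keywords) + "."
    return marks, feedback

print(score_and_feedback(0.72, 0.5, ["backpropagation"], max_marks=10))
# (6.5, 'Consider addressing: backpropagation.')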
6. Result Storage and Display
The final results and comments are kept in a backend database and displayed to the teacher
through the UI dashboard. This allows for straightforward review, manual adjustments (when
necessary), and tracking of performance over the long term. Educators have the option to
download or export outcomes for additional examination.
5.3 Detailed Design
The detailed design stage deconstructs the system architecture into separate components,
specifying their internal structure, functions, responsibilities, and interactions. This part
encompasses the logic of each component, the functions of modules, and the flows of data.
The class diagram illustrates the primary elements of the system and their interactions. It features
classes such as OCRProcessor, SimilarityEvaluator, and ScoreCalculator, each tasked with
distinct functions like extracting text, comparing content, and calculating scores.
5.3.2 Activity Diagram
The activity diagram demonstrates the sequential flow of the automated assessment process for
handwritten descriptive responses. It starts with the user submitting a scanned answer paper,
which is subsequently processed by an OCR module to retrieve the text. The extracted text is
then sanitized and assessed against a model answer using techniques for semantic similarity and
keyword matching. Following these assessments, a cumulative score is computed, and feedback
is created. The results of the evaluation are then either displayed or stored. This diagram aids in
visualizing the complete workflow of the system, illustrating how the different components
interact from the initial input to the final output.
5.3.3 Use Case Diagram
The use case diagram illustrates the relationships between users, such as students and evaluators,
and the system. It describes essential tasks including submitting answer scripts, conducting
evaluations, producing scores, and accessing results. This diagram aids in defining the system's
functions from the viewpoint of the user.
REFERENCES
1. P. Deepak, R. Rohan, R. Rohith, and R. Roopa, "NLP and OCR Based Automatic
Answer Script Evaluation System," International Journal of Computer Applications, vol.
186, no. 42, pp. 22–27, Sep. 2024.
2. A. Rayate, S. Ghumare, F. Sayyed, H. Sapkal, and S. Rokade, "Automated Grading
System for Subjective Answers Evaluation Using Machine Learning and NLP,"
International Journal of Innovative Research in Management, Engineering and
Technology, vol. 12, no. 2, pp. 45–50, Mar.–Apr. 2024.
3. M. A. Rahaman and H. Mahmud, "Automated Evaluation of Handwritten Answer Script
Using Deep Learning Approach," Transactions on Engineering and Computing Sciences,
vol. 10, no. 4, pp. 35–42, 2022.
4. G. Sanuvala and S. S. Fatima, "A Pedagogical Approach with NLP and Deep Learning,"
Ain Shams Engineering Journal, vol. 12, no. 3, pp. 1234–1240, 2021.
5. K. Sharma, A. Verma, and R. Singh, "Eval - Automatic Evaluation of Answer Scripts
Using Deep Learning and NLP," International Journal of Intelligent Systems and
Applications in Engineering, vol. 9, no. 2, pp. 78–84, 2023.
6. S. Kumar and M. Patel, "Automatic Grading of Answer Sheets Using Machine Learning
Techniques," International Journal of Computer Applications, vol. 175, no. 8, pp. 15–20,
2023.
7. M. Kulkarni, A. Desai, and R. Mehta, "Digital Handwritten Answer Sheet Evaluation
System," International Journal of Computer Applications, vol. 186, no. 16, pp. 30–35,
2024.
8. P. Kudi and A. Manekar, "Automated Descriptive Answer Evaluation System Using
Machine Learning," International Journal of Advance Research and Innovative Ideas in
Education, vol. 7, no. 2, pp. 60–65, 2023.
9. S. Gupta and R. Sharma, "AI-Powered Framework for Automated, Objective Evaluation
of Handwritten Responses Using OCR & NLP," ReadyTensor Publications, vol. 1, no. 1,
pp. 10–15, 2025.
10. J. Osaka, A. Maeda, and H. Oka, "Reliable and Efficient Automated Short-Answer
Scoring for a Large Dataset Using Active Learning and Deep Learning," Interactive
Learning Environments, vol. 33, no. 2, pp. 200–215, 2025.
11. M. Haller, S. Müller, and T. Schmidt, "Survey on Automated Short Answer Grading with
Deep Learning: From Word Embeddings to Transformers," arXiv preprint
arXiv:2204.03503, 2022.
12. S. Saha, R. Ghosh, and A. Banerjee, "Joint Multi-Domain Learning for Automatic Short
Answer Grading," arXiv preprint arXiv:1902.09183, 2019.
13. S. Chauhan, A. Kumar, and P. Gupta, "Explanation Based Handwriting Verification,"
arXiv preprint arXiv:1909.02548, 2019.
14. I. Aggarwal, P. Gautam, and G. Parashar, "Automated Subjective Answer Evaluation
Using Machine Learning," SSRN Electronic Journal, vol. 1, no. 1, pp. 1–10, Jan.
2024.
15. N. Bharambe, S. Patil, and R. Deshmukh, "Automatic Answer Evaluation Using Machine
Learning," International Journal of Innovative Technology, vol. 7, no. 2, pp. 25–30,
2021.
16. B. Nandwalkar, A. Joshi, and M. Kale, "Descriptive Handwritten Paper Grading System
Using NLP and Fuzzy Logic," International Journal of Pattern Recognition and
Artificial Intelligence, vol. 37, no. 4, pp. 273–282, 2023.
17. S. Kulkarni, R. Pawar, and A. Deshmukh, "Digital Handwritten Answer Sheet Evaluation
System," International Journal of Scientific Research in Engineering and Management,
vol. 9, no. 3, pp. 50–55, 2024.
18. P. S. Preetha, R. Nair, and S. Kumar, "Descriptive Answers Evaluation Using Natural
Language Processing Approaches," International Journal of Computer Applications, vol.
186, no. 20, pp. 40–45, 2024.
19. S. Mahapatra, A. Das, and R. Roy, "Automated Exam Paper Checking System Using
NLP," Devpost Projects, vol. 1, no. 1, pp. 1–5, 2023.