
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belagavi - 590 018, Karnataka

Capstone Project Phase – 1


(Course code: 24AM6PWPW1)

PROJECT WORK REPORT

EvalAI - Automated Answer Script Evaluation

In the Department of
Machine Learning
(UG Program: B.E. in Artificial Intelligence and Machine Learning)

Submitted by
Basem Kaunain 1BM22AI026
Bhavana S 1BM22AI029
Faiza Khanum 1BM22AI047

6th Semester

Under the Guidance of


Prof. Chethana V
Assistant Professor
Dept. of MEL, BMSCE, Bengaluru – 19

DEPARTMENT OF MACHINE LEARNING


B.M.S. COLLEGE OF ENGINEERING
(An Autonomous Institute, Affiliated to VTU)
P.O. Box No. 1908, Bull Temple Road, Bengaluru - 560 019

TABLE OF CONTENTS

Declaration

Acknowledgement

Abstract

Table of Contents

1 INTRODUCTION

1.1 Overview

1.2 Organization of the Report

2 PROBLEM STATEMENT

2.1 Problem Statement

2.2 Motivation

2.3 Scope and Objectives

3 LITERATURE SURVEY

3.1 Existing System

3.2 Research Gaps

4 SYSTEM REQUIREMENT SPECIFICATION

4.1 Functional Requirements

4.2 Non-functional Requirements

4.3 Hardware Requirements

4.4 Software Requirements

5 SYSTEM DESIGN

5.1 System Architecture

5.2 Proposed Methodology

5.3 Detailed Design

5.3.1 Class Diagram

5.3.2 Activity Diagram

5.3.3 Use Case Diagram

REFERENCES

APPENDIX: Plagiarism and AI Writing Report

LIST OF FIGURES

Figure No. Description

5.1 Workflow of Proposed Methodology

5.2 Class Diagram of System Design

5.3 Activity Diagram of System Design

5.4 Use Case Diagram of System Design

LIST OF TABLES

Table No. Description

4.1 Minimum Hardware Requirements

4.2 Recommended Hardware Requirements

ABSTRACT

The automation of grading handwritten, long-format responses has attracted considerable research interest. Recent studies integrate optical character recognition (OCR) with natural language processing (NLP) to convert student-written answers into digital form and assess them against model responses. Approaches range from basic keyword matching and cosine similarity over TF-IDF vectors to more sophisticated techniques such as neural network classifiers, BERT-based assessments of semantic similarity, fuzzy logic for awarding partial credit based on the relevance and completeness of a response, and fine-tuned large language models (LLMs) such as LLaMA- or GPT-based architectures that can grasp the essence of a student's answer and deliver a nuanced appraisal. Together, these strategies enhance fairness and consistency in scoring, mitigate human bias, and significantly reduce the time required to evaluate detailed answers. They lessen the burden on teachers, guarantee uniform grading across large sets of answer sheets, and give students prompt, more objective feedback, marking a substantial advancement towards scalable, precise, and fair evaluation of descriptive answers.

Chapter 1
INTRODUCTION

In today’s educational environment, manually grading descriptive exam responses continues to be a significant obstacle in the assessment process. This approach, while conventional and
widely practiced, is both labor-intensive and prone to various human errors. Educators must
carefully analyze each answer, often deciphering different handwriting styles, judging the
relevance and clarity of the material, and making challenging decisions regarding partial
correctness. These activities are not only time-consuming but also mentally taxing, particularly
when graders are confronted with hundreds of papers to evaluate within a tight timeframe.

No matter how skilled or dedicated, human evaluators are vulnerable to fatigue, stress, and
personal biases. These elements can cause variations in scoring, where the same answer might
receive different grades based on the time or the evaluators’ mindset. Such discrepancies
jeopardize the fairness and objectivity of the examination system, raising significant concerns in
high-stakes assessments like university semester examinations, board exams, or standardized
national tests. As a result, there is a growing need for dependable, scalable, and impartial
alternatives to manual grading.

Objective questions, such as multiple-choice or true/false formats, have long been suitable for
automation. Optical Mark Recognition (OMR) systems have transformed the grading of these
types, enabling precise and efficient evaluations on a large scale. However, automating the
grading of open-ended, handwritten responses—particularly those that require essay-type
answers, detailed explanations, or sequential derivations—presents a much more intricate
challenge. These answers demand not only recognition of the written text but also a deep
understanding of the semantic content, logical structure, and thoroughness of the explanation.

Recent developments in Artificial Intelligence (AI), especially in Machine Learning (ML) and
Natural Language Processing (NLP), present promising solutions to this issue. By training
intelligent systems on extensive datasets that include scanned answer sheets along with their
associated human-assigned scores, machines can learn to recognize important characteristics of
high-quality answers. These characteristics might encompass the inclusion of certain keywords, compliance with expected formats, grammatical accuracy, and semantic relevance to the posed
questions. Significantly, machine learning models can uphold consistent grading standards across
vast datasets, thereby removing the subjectivity linked to human grading.

The primary benefit of automated grading systems is their reliability and speed. Once trained,
these systems can evaluate thousands of answer sheets in a tiny fraction of the time required by
human graders. Additionally, they uniformly apply the same grading criteria to all submissions,
thereby ensuring fairness in the evaluations. This consistency is particularly advantageous in
large-scale testing environments, where variability in human assessment frequently results in
student dissatisfaction and an increase in requests for re-evaluation.

The COVID-19 pandemic hastened the shift towards online and hybrid learning formats,
highlighting the need for scalable digital assessment solutions. In online testing scenarios, the
logistical challenges of distributing, collecting, and manually grading handwritten responses
became even more evident. Consequently, the creation of automated grading tools for descriptive
answers has experienced significant growth within academic and educational technology circles.

A standard system for automated evaluation of descriptive answers consists of several essential
components. Initially, Optical Character Recognition (OCR) technologies are utilized to convert
text from scanned handwritten answer sheets into digital format. Historically, OCR has primarily
focused on printed text; however, recent advancements, particularly in neural network-based
OCR models, have greatly enhanced the precision of handwriting recognition across various
styles and languages.

After the handwritten material is digitized, Natural Language Processing (NLP) techniques are
applied. These algorithms examine the extracted text for grammatical accuracy, relevance to the
subject matter, keyword alignment, sentence construction, and coverage of topics. Depending on
the answer's complexity and the scoring methodology, different NLP approaches might be
utilized, ranging from basic rule-based matching and bag-of-words techniques to more advanced
deep learning models such as Recurrent Neural Networks (RNNs), Transformers, and pre-trained
Large Language Models (LLMs) like BERT and GPT.

Furthermore, some systems integrate fuzzy logic and probabilistic frameworks to address unclear
or ambiguous responses, allowing for partial credit when warranted. Others utilize semantic
similarity measures like cosine similarity, BLEU scores, or tailored scoring algorithms to
evaluate student answers against model responses. Recent studies are also investigating the
application of graph-based models and knowledge representation systems to evaluate conceptual
understanding and coherence in student writing.

Another exciting avenue is the adoption of generative models that mimic human grading. These
models are trained not only to evaluate response content but also to provide feedback—assisting
students in recognizing their errors and enhancing their future performance. This educational
aspect of automated grading significantly enriches the learning experience by transforming
assessments into opportunities for learning rather than merely evaluative tasks.

Deploying such systems does come with its challenges. Errors in OCR can result in imprecise
text extraction, particularly when dealing with poor handwriting or low-resolution scans. Bias in
training datasets can distort grading results if the system encounters a limited variety of answer
styles or student backgrounds. Additionally, issues of interpretability and transparency remain
vital concerns. Educators need to comprehend and trust the decision-making processes of these
systems to fully integrate them into practical educational environments.

Despite these obstacles, the potential advantages of automated grading for descriptive answers
are considerable. It has the capacity to lessen the workload for educators, speed up the feedback
process for learners, and harmonize assessment methods across various institutions and regions.
With ongoing developments in AI, computer vision, and natural language processing (NLP),
along with thorough testing and validation in educational settings, these technologies are poised
to become essential tools in the realm of assessment.

This document seeks to deliver a thorough overview of this emerging field. It starts by
examining the reasons and challenges that drive the creation of automated grading solutions for
handwritten descriptive answers. It then assesses existing systems and their fundamental
methodologies, emphasizing advancements in optical character recognition (OCR), NLP, deep
learning, and hybrid scoring algorithms. Lastly, it discusses the present limitations of these
systems and highlights promising avenues for future research and practical applications.

Chapter 2

PROBLEM STATEMENT

2.1 Problem Statement

Evaluating handwritten descriptive exam answers manually is a lengthy, error-prone, and subjective task that heavily burdens educators, especially when handling a large number of
answer sheets. Conventional assessment methods frequently encounter problems such as
inconsistencies, mistakes from fatigue, and delayed feedback, all of which can adversely affect
the learning experience. Although automated systems have demonstrated potential in grading
objective questions and typed submissions, the challenges of recognizing various handwriting
styles, grasping natural language meanings, and offering constructive feedback on descriptive
responses remain significant obstacles. There is an increasing demand for reliable, scalable, and
precise AI-driven systems capable of processing handwritten answers, interpreting their
meaning, and evaluating them fairly and consistently, thus enhancing both efficiency and quality
in educational assessments.

2.2 Motivation

Manually evaluating handwritten theoretical responses is an arduous and subjective task, especially when processing a large volume of exam papers in educational environments.
Educators often face pressure to complete grading by tight deadlines, which can lead to fatigue,
inconsistency, and unintentional bias in their assessments. Furthermore, students may encounter
delays or insufficient feedback, which limits their opportunities for learning and development. In
an era where automation and artificial intelligence are rapidly transforming traditional practices,
there is a critical need to adopt intelligent systems capable of evaluating descriptive responses
accurately and fairly. Thanks to progress in optical character recognition, natural language
processing, and transformer-based models, it is now feasible to create systems that can grasp the
semantic meaning of handwritten responses. This project intends to alleviate the workload for
teachers, improve fairness in grading, and provide students with timely and constructive
feedback—ultimately enhancing the overall quality and scalability of academic assessments.

2.3 Scope and Objectives

Scope:
The aim of automated evaluation systems for descriptive answers is to simplify the grading process for handwritten exam responses of short to medium length, as typically seen in schools and universities. These systems are specifically crafted to assess theoretical or conceptual responses worth between 2 and 10 marks, provided there is a clear reference answer or grading rubric
available. The emphasis is on analyzing scanned handwritten exam sheets, where the primary
challenge involves accurate recognition of handwriting through OCR, followed by semantic
content analysis using advanced NLP and machine learning methodologies. Although they were
initially implemented in fields like computer science and engineering, these solutions are not
limited to any particular content area and can be adapted to other academic disciplines with the
appropriate training data. The systems are designed for legible handwriting in English and are
capable of addressing moderate noise in images, but they do not currently accommodate
multilingual inputs or highly subjective or creative answers. Ultimately, these systems are aimed
at structured examination environments to provide scalable, consistent, and equitable evaluation
results that can match or surpass the accuracy of human grading.

Objectives:

1. To develop an AI-based system capable of automatically evaluating handwritten descriptive exam answers with accuracy and consistency comparable to human graders.
This requires incorporating OCR to recognize handwriting, NLP to understand content, and machine learning models to predict scores based on semantic similarity and answer relevance.

2. To reduce the time, labor, and subjectivity involved in manual grading by
providing a fast, unbiased, and scalable automated evaluation mechanism.
By reducing the role of humans in the scoring process, teachers can concentrate on
qualitative educational tasks, while students can obtain faster and more uniform
feedback.

3. To demonstrate the practical applicability of the proposed system across academic settings and subject domains, while maintaining fairness and transparency in the scoring process.
The system seeks to improve the education sector by increasing the efficiency of grading
in large examinations and providing data-driven insights into trends in student
performance.

Impact on Society/Academics/Industry:
The effective implementation of automated systems for evaluating descriptive answers can
revolutionize educational assessment by providing prompt and unbiased grading, particularly in
extensive examinations. For educators, it alleviates their workload and improves the speed of
feedback. For learners, it guarantees fairness and openness. In the corporate sphere, such systems
can be modified for training evaluations or certification tests, showcasing their versatility across
different fields. In summary, this technology promotes educational fairness, enhances operational
efficiency, and drives digital advancements in evaluation methods.

Chapter 3

LITERATURE SURVEY

The automated assessment of descriptive responses, particularly handwritten ones, is a challenging task that merges computer vision, natural language processing, and machine
learning. Numerous studies have been undertaken over the last ten years to tackle the
shortcomings of manual grading—like inconsistency, subjectivity, and inefficiency. This section
provides an overview of significant existing systems and technological methods, as well as
highlights the deficiencies in the present state of the art.

3.1 Existing System

Evaluating descriptive answer scripts by hand is a time-consuming task that can be influenced by
human bias and exhaustion. To mitigate these issues, various automated systems have been
created that utilize advancements in Natural Language Processing (NLP), Optical Character
Recognition (OCR), machine learning, and fuzzy logic to assess lengthy handwritten answers.
These technologies strive to replicate or enhance human assessment regarding efficiency,
impartiality, and uniformity.

Early Keyword-Based Approaches

Early attempts at automatic grading were basic and mainly depended on matching keywords.
Bharambe et al. (2021) introduced a system that compared a student’s response to a list of key
terms supplied by the examiner. The model would tally the matches and assign scores according
to established criteria. Although efficient, these systems struggled with semantic variation and
could be easily misled by superficial inclusion of keywords without true comprehension of the
content.
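As an illustration, a minimal keyword-matching scorer of the kind these early systems relied on might look like the sketch below; the function name, example text, and five-mark scale are illustrative assumptions rather than details from the cited work.

```python
def keyword_score(answer: str, keywords: list[str], max_marks: float = 5.0) -> float:
    """Award marks in proportion to how many examiner-supplied keywords appear."""
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return round(max_marks * hits / max(len(keywords), 1), 1)

# Example: two of three keywords present earns roughly two-thirds of the marks.
print(keyword_score("Overfitting hurts generalisation.",
                    ["overfitting", "generalisation", "regularisation"]))  # 3.3
```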

TF-IDF and Vector Similarity

Later research shifted focus to similarity metrics based on vectors. Kulkarni et al. (2022) applied
Term Frequency–Inverse Document Frequency (TF–IDF) to transform both student and model

7
answers into vector representations, subsequently employing cosine similarity to evaluate
content alignment. This marked an advancement, allowing for the identification of answers
despite variations in phrasing.
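A minimal sketch of this TF-IDF and cosine-similarity approach, using scikit-learn (listed later among the software requirements); the function name and ten-mark scale are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity_score(student_answer: str, model_answer: str,
                           max_marks: float = 10.0) -> float:
    """Score a student answer by TF-IDF cosine similarity with the model answer."""
    vectors = TfidfVectorizer().fit_transform([model_answer, student_answer])
    similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
    return round(similarity * max_marks, 1)
```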

Transformer-Based Embedding Models

To achieve a greater understanding of semantic similarity, researchers such as Kulkarni et al. (2024) utilized transformer models like BERT and RoBERTa. These models converted both
student and reference answers into contextualized vector representations, enhancing the
assessment of paraphrased or conceptually related responses. In contrast to conventional TF-IDF
models, transformers grasp the meaning of words based on context, allowing for more refined
grading.

Machine Learning Models

Sanuvala and Fatima (2021) developed the Handwritten Answer Evaluation System (HAES),
which is a supervised machine learning model trained on graded examples. Each sentence in the
answer was evaluated using similarity metrics, and the overall response was compiled to produce
a final score. Dheerays et al. (2024) and Bharambe et al. (2021) investigated the use of artificial
neural networks (ANNs), which incorporated features like answer length, frequency of
keywords, and semantic relevance. These networks were designed to predict grades based on the
training data, enhancing accuracy across various answer formats.

Fuzzy Logic Systems

Acknowledging that human grading often relies on subjective and vague assessments,
Nandwalkar et al. (2023) utilized fuzzy logic for automated scoring. Their approach generated
scores based on keyword and semantic similarities, and then applied fuzzy inference rules to
determine grades. This allowed for the awarding of partial credit for answers that were somewhat
correct, resembling the qualitative reasoning of human graders. The system classified responses
into categories such as “Very Good,” “Fair,” and others according to graded membership
functions, enhancing the flexibility of the evaluation process.
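The idea can be sketched with simple triangular membership functions in plain Python; the grade labels, membership ranges, and equal weighting below are illustrative assumptions rather than the exact rules of the cited system:

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set rising from a to b and falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_grade(keyword_sim: float, semantic_sim: float) -> str:
    """Map combined similarity onto linguistic grade categories and pick the strongest."""
    s = (keyword_sim + semantic_sim) / 2
    memberships = {
        "Poor": triangular(s, -0.01, 0.0, 0.45),
        "Fair": triangular(s, 0.30, 0.55, 0.80),
        "Very Good": triangular(s, 0.65, 1.0, 1.01),
    }
    return max(memberships, key=memberships.get)
```

An answer whose combined similarity falls between two sets keeps non-zero membership in both, which is how graded partial credit arises.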

OCR-Enabled Handwritten Input

OCR is essential for converting handwritten documents into digital formats. Tesseract, a popular
open-source optical character recognition engine, continues to be widely utilized. Approaches
developed by Zhang et al. (2021) and Deepak et al. (2022) improve Tesseract by employing
pre-processing techniques like binarization, noise elimination, and skew adjustment to enhance
the accuracy of OCR for answer sheets.

LLM-Based Evaluation

Recent research by Agarwal et al. (2024) incorporates Large Language Models (LLMs) for
assessment purposes. A customized LLaMA-2 model, enhanced with retrieval-augmented
generation (RAG), was utilized to evaluate comprehensive answers based on textbook content.
This system demonstrated accuracy and consistency similar to that of human graders, effectively
pinpointing essential concepts and evaluating relevance. Implemented via AWS SageMaker, it
serves as a state-of-the-art, scalable solution.

Commercial Solutions

Tools such as Gradescope, created by Turnitin, employ AI-supported grading for scanned
student responses. Although not fully automated, these systems enable semi-automated
assessment by grouping similar answers and permitting batch grading. Likewise, Mettl and
ExamSoft offer online proctoring and limited answer evaluation through AI, though restricted to
structured or semi-structured responses.

3.2 Research Gaps

Despite significant progress, several gaps persist in the automated evaluation of handwritten
descriptive answers:

1. OCR Limitations with Diverse Handwriting

OCR systems such as Tesseract struggle when dealing with irregular handwriting, different ink
types, and varying scan qualities. Many systems rely on high-quality digital inputs. It is essential
for research to tackle the variability of handwriting specific to different domains, particularly
concerning age groups and regional scripts.

2. Lack of Publicly Available Datasets

Numerous studies depend on proprietary datasets, which complicates reproducibility and benchmarking. There is an urgent requirement for extensive, publicly available datasets of
handwritten, evaluated exam responses to promote standardization and competition.

3. Contextual Understanding Beyond Keywords

Many conventional systems emphasize superficial characteristics such as keywords or predetermined similarity thresholds. Some embedding-based models also face challenges in
grasping subtle nuances (for instance, sarcasm, logical coherence, and contradictory reasoning).
Large language models show potential in this area, but they need fine-tuning and thorough
assessment to guarantee their dependability in educational contexts.

4. Feedback Generation

Although systems can allocate scores, few offer meaningful feedback. Feedback plays a vital
role in the learning process. Adding generative features for in-depth, question-oriented
explanations would greatly improve educational effectiveness.

5. Subject-Specific Evaluation Challenges

Responses in areas such as mathematics, physics, or law typically demand a precise logical
framework or references. General natural language processing methods might struggle to assess
these answers correctly. Tailoring to specific domains is essential.

6. Multi-Language and Code-Mixed Inputs

In both Indian and international educational environments, responses are frequently composed in
various languages or feature a blend of English and local expressions. Only a limited number of
systems manage code-mixed content or multilingual writing well.

7. Ethical Concerns and Bias in ML Models

Models built on biased data could perpetuate inequalities in grading. It is essential that
transparent assessment methods and de-biasing techniques are incorporated into any
implemented system.

Chapter 4

SYSTEM REQUIREMENT SPECIFICATION

Automated assessment of handwritten descriptive responses requires a system that combines Optical Character Recognition (OCR), Natural Language Processing (NLP), and Machine
Learning (ML) components within a dependable backend framework and an intuitive user
interface. The subsequent specification details the functional and non-functional requirements of
the system, in addition to the necessary hardware and software specifications.

4.1 Functional Requirements

Functional requirements outline the fundamental actions and characteristics of the system. These requirements concentrate on the system's reactions to user inputs, its data management, and its execution of primary tasks.

1. Upload Answer Scripts

● The system must allow users (e.g., teachers or examiners) to upload scanned images or
PDFs of student answer scripts.
● Support for bulk uploads should be included.

2. OCR and Preprocessing

● The system shall extract text from handwritten answer images using OCR (e.g., Tesseract
or equivalent).
● It must accommodate preprocessing procedures like converting to grayscale, eliminating
noise, and applying thresholding to enhance OCR precision.

3. Text Evaluation

● The extracted answer shall be assessed by comparing it to a reference response using semantic similarity methods (such as BERT embeddings and cosine similarity).
● The presence of keywords, length checks, and the overall structural clarity of the response will also be evaluated.

4. Score Generation

● Based on the degree of similarity and the relevance of the response, the system shall produce a score (for instance, out of 5 or 10 marks).
● The grading mechanism may incorporate fuzzy logic, predictions from machine learning
models, or outputs generated by large language models.

5. Feedback Generation

● The system will provide concise feedback highlighting the strengths and weaknesses of
the response to aid in learning.

6. Result Display and Export

● Evaluated scores and feedback must be displayed in a structured format.


● The system should allow exporting results in CSV, PDF, or JSON format (a minimal export sketch follows this list).
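For CSV and JSON, the export can be handled with Python's standard library as in the sketch below (PDF export would require an additional library and is omitted); the column names are simply taken from whatever result dictionaries the evaluation layer produces:

```python
import csv
import json

def export_results(results: list[dict], csv_path: str, json_path: str) -> None:
    """Write evaluation results (one dict per answer) to JSON and CSV files."""
    if not results:
        return
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
```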

7. Admin Panel

● The system must feature an admin panel that allows examiners to upload reference
responses, modify evaluation criteria, and retrain models when needed.

8. Authentication and Access Control

● Users must log in with valid credentials.


● Role-based access must be implemented (e.g., Admin, Teacher, Reviewer).

4.2 Non-functional Requirements

Non-functional requirements describe the system’s performance, scalability, usability, and security characteristics.

1. Accuracy

● The system must achieve a high level of OCR precision, exceeding 90% for clear
handwriting, and maintain consistency in evaluation.
● NLP similarity measures need to be refined to guarantee that the scores accurately
represent the quality of the answers.

2. Performance

● Each answer script must be evaluated within a reasonable timeframe, ideally under five seconds per answer.
● The system must be able to handle the simultaneous processing of at least 10 answer
scripts.

3. Scalability

● The system must scale to accommodate multiple exams and manage numerous scripts in each batch upload.
● It must support deployment on cloud platforms for elastic resource allocation.

4. Usability

● The user interface must be easy to navigate, necessitating little training for instructors.
● Error messages and feedback from the system should be straightforward and offer clear
next steps.

5. Reliability

● The system must be resilient to faulty uploads or erroneous data.


● Retry mechanisms should be put in place for unsuccessful OCR or scoring tasks.

6. Security

● Uploaded answer scripts and results must be stored securely, with encryption in transit and at rest.
● Access to evaluation data should be limited to authorized users only.

7. Maintainability

● The system ought to be modular and adhere to clean coding practices to facilitate
straightforward updates, debugging, and enhancements.
● Each module should be accompanied by appropriate documentation.

4.3 Hardware Requirements

The following outlines the minimum and recommended hardware resources needed for the system's efficient deployment and operation.

Minimum Hardware Requirements (for local testing/development)

Component Specification

Processor Intel Core i5 or equivalent

RAM 8 GB

Storage 256 GB SSD

GPU (Optional) Integrated or basic GPU

Table.4.1. Minimum Hardware Requirements

Recommended Hardware Requirements (for production/cloud deployment)

Component Specification

Processor Multi-core CPU (e.g., 4 vCPUs or more)

RAM 16–32 GB

Storage 500 GB SSD or scalable cloud storage

GPU NVIDIA Tesla T4 or better (for LLMs/ML)

Table.4.2. Recommended Hardware Requirements

4.4 Software Requirements

Operating system

● Ubuntu 20.04+ / Windows 10 / macOS (for development)


● Docker-based Linux environment (for deployment)

Backend Technologies

● Python 3.9+
● Flask or FastAPI (for API endpoints)
● Node.js (for backend services, optional)
● MongoDB / PostgreSQL (for data storage)
● Tesseract OCR engine
● TensorFlow / PyTorch (for ML/ANN models)
● HuggingFace Transformers (for BERT/RoBERTa)

Frontend Technologies

● Angular / React.js (for UI)


● Tailwind CSS / Bootstrap (for styling)

Machine Learning and NLP Libraries

● Scikit-learn
● Transformers (HuggingFace)
● SpaCy / NLTK (for text preprocessing)
● OpenCV (for image processing)
● FuzzyWuzzy (for fuzzy matching)

DevOps & Deployment Tools

● Docker & Docker Compose


● Kubernetes (for orchestration)
● AWS EC2/S3/SageMaker or equivalent cloud platform
● GitHub Actions / Jenkins (for CI/CD)

Chapter 5

SYSTEM DESIGN

The phase of system design is essential for establishing the framework, functionality, and
interactions of software elements. This part outlines the architecture, approach, and
comprehensive design of the suggested solution.

5.1 System Architecture

The system follows a modular multi-layered architecture comprising four main components:

1. Frontend (User Interface)

● Developed using Angular or React.


● Allows teachers to upload answer scripts, view evaluation results, and download reports.

2. Backend API Layer

● Developed using Flask or FastAPI.


● Handles incoming HTTP requests, file uploads, authentication, and coordination between services (a minimal upload endpoint is sketched after this list).
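A minimal sketch of such an upload endpoint, assuming FastAPI is the chosen framework; the route name and response fields are illustrative:

```python
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/answer-scripts/upload")
async def upload_answer_script(file: UploadFile = File(...)):
    """Accept a scanned answer script; a real deployment would persist the file
    and enqueue OCR and evaluation as a background job."""
    contents = await file.read()
    return {"filename": file.filename, "size_bytes": len(contents)}
```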

3. Processing & Evaluation Layer

Integrates:

● OCR Module (Tesseract + OpenCV): Extracts text from images.


● Text Preprocessing Module (SpaCy/NLTK): Cleans and tokenizes text.
● Similarity Module (BERT + cosine similarity): Compares student answers to model
answers.
● Scoring Engine: Assigns marks based on semantic similarity and keyword matches.

4. Database Layer

MongoDB or PostgreSQL stores:

● User data
● Uploaded scripts
● Reference answers
● Evaluation results

5.2 Proposed Methodology

The suggested system streamlines the process of assessing handwritten descriptive responses
through a combined method that utilizes Optical Character Recognition (OCR), Natural
Language Processing (NLP), semantic similarity assessment, and rule-based marking. The aim is
to mimic the steps a human evaluator would perform.

1. Image Upload and OCR Extraction

Educators start by submitting scanned images of handwritten exam papers via the online platform.
These images are processed using the Tesseract OCR engine, which retrieves the text content. To
enhance accuracy, preprocessing techniques like binarization and noise reduction can be utilized.
The result is unrefined machine-readable text that reflects the student's response.
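A minimal sketch of this step, combining OpenCV preprocessing with Tesseract through the pytesseract wrapper (both appear in the software requirements; the exact preprocessing parameters are assumptions):

```python
import cv2
import pytesseract

def extract_answer_text(image_path: str) -> str:
    """Grayscale, denoise, and binarize a scanned page, then run Tesseract OCR."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray, None, 30)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)
```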

2. Text Cleaning and Tokenization

The extracted text is subjected to normalization, which involves converting everything to lowercase,
removing punctuation, and tidying up whitespace. The refined text is then tokenized—divided
into significant units like words or phrases—to facilitate analysis and comparison. This step
guarantees that unnecessary differences in formatting do not affect the assessment.
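A minimal normalization and tokenization sketch in plain Python is shown below; SpaCy or NLTK, both listed in the software requirements, could replace the simple whitespace split with proper tokenization:

```python
import re

def normalize_and_tokenize(raw_text: str) -> list[str]:
    """Lowercase, strip punctuation, collapse whitespace, and split into word tokens."""
    text = raw_text.lower()
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()   # tidy whitespace
    return text.split()
```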

3. Reference Answer Integration

A benchmark or sample response for each question serves as a basis for comparison. This can be
provided by the instructor or generated automatically using a Large Language Model (LLM). At
this point, key terms, expected length, and evaluation criteria (such as required points) can be
established. These elements assist in guiding the assessment and ensuring consistency with the
desired learning outcomes.
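One plausible way to represent the reference answer and its criteria is a small structured record such as the dictionary below; all field names and values are hypothetical:

```python
reference_answer = {
    "question_id": "Q3",
    "model_answer": "Overfitting occurs when a model memorises training data and fails to generalise ...",
    "keywords": ["overfitting", "generalisation", "regularisation"],
    "min_words": 40,   # expected minimum length of the response
    "max_marks": 5,    # marks allotted to this question
}
```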

4. Semantic Similarity Evaluation

This is the core evaluation step and involves both advanced NLP techniques and rule-based
checks:

● A pre-trained sentence transformer (such as BERT or RoBERTa) converts both the student’s and the reference answers into dense semantic embeddings.
● Cosine similarity between the embeddings measures the degree of semantic overlap,
capturing meaning even when wording differs.
● Additional rule-based checks are applied:
○ Keyword presence: Ensures critical concepts are addressed.
○ Length requirement: Verifies that the student’s response meets a
minimum expected length.
○ Grammar/coherence: Basic checks to ensure the text is understandable.

This dual-layered evaluation ensures both semantic understanding and structural completeness of
the answer.
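A condensed sketch of this dual-layered check is given below. It assumes the sentence-transformers companion library to HuggingFace Transformers, an off-the-shelf embedding model, and a reference record like the one sketched earlier; the model choice and field names are assumptions:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed lightweight embedding model

def evaluate_answer(student_text: str, reference: dict) -> dict:
    """Combine semantic similarity with keyword and length checks."""
    embeddings = encoder.encode([student_text, reference["model_answer"]])
    semantic = float(util.cos_sim(embeddings[0], embeddings[1]))
    keyword_coverage = sum(
        kw.lower() in student_text.lower() for kw in reference["keywords"]
    ) / len(reference["keywords"])
    return {
        "semantic_similarity": semantic,
        "keyword_coverage": keyword_coverage,
        "length_ok": len(student_text.split()) >= reference["min_words"],
    }
```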

5. Scoring and Feedback Generation

According to the results of semantic similarity evaluations and rule compliance checks, a
numerical score is assigned to every answer by the system. The scoring mechanism relies on set
thresholds (for instance, a cosine similarity greater than 0.8 equates to full points) or a
combination of various weighted features. Feedback is generated automatically, emphasizing the
strengths and pinpointing overlooked keywords or ideas, which assists the student in grasping
their performance.
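A hedged sketch of this step, reusing the metrics produced by the evaluation sketch above; the weights, thresholds, and feedback messages are illustrative and would need tuning against human-graded data:

```python
def score_and_feedback(metrics: dict, max_marks: float = 10.0) -> tuple[float, str]:
    """Turn similarity metrics into marks and a short feedback message."""
    combined = 0.7 * metrics["semantic_similarity"] + 0.3 * metrics["keyword_coverage"]
    if not metrics["length_ok"]:
        combined *= 0.9  # mild penalty for answers below the expected length
    marks = round(combined * max_marks, 1)
    if combined > 0.8:
        feedback = "Covers the expected concepts well."
    elif combined > 0.5:
        feedback = "Partially correct; some key concepts are missing or underdeveloped."
    else:
        feedback = "Largely incomplete or off-topic; revisit the core concepts."
    return marks, feedback
```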

6. Result Storage and Display

The final results and comments are kept in a backend database and displayed to the teacher
through the UI dashboard. This allows for straightforward review, manual adjustments (when
necessary), and tracking of performance over the long term. Educators have the option to
download or export outcomes for additional examination.

Fig.5.1. Workflow of Proposed Methodology

5.3 Detailed Design

The detailed design stage deconstructs the system architecture into separate components,
specifying their internal structure, functions, responsibilities, and interactions. This part
encompasses the logic of each component, the functions of modules, and the flows of data.

5.3.1 Class Diagram

Fig.5.2. Class Diagram of System Design

The class diagram illustrates the primary elements of the system and their interactions. It features
classes such as OCRProcessor, SimilarityEvaluator, and ScoreCalculator, each tasked with
distinct functions like extracting text, comparing content, and calculating scores.

5.3.2 Activity Diagram

Fig.5.3. Activity Diagram of System Design

The activity diagram demonstrates the sequential flow of the automated assessment process for
handwritten descriptive responses. It starts with the user submitting a scanned answer paper,
which is subsequently processed by an OCR module to retrieve the text. The extracted text is
then sanitized and assessed against a model answer using techniques for semantic similarity and
keyword matching. Following these assessments, a cumulative score is computed, and feedback
is created. The results of the evaluation are then either displayed or stored. This diagram aids in
visualizing the complete workflow of the system, illustrating how the different components
interact from the initial input to the final output.

5.3.3 Use Case Diagram

Fig.5.4. Use Case Diagram of System Design

The use case diagram illustrates the relationships between users, such as students and evaluators,
and the system. It describes essential tasks including submitting answer scripts, conducting
evaluations, producing scores, and accessing results. This diagram aids in defining the system's
functions from the viewpoint of the user.

REFERENCES

1. P. Deepak, R. Rohan, R. Rohith, and R. Roopa, "NLP and OCR Based Automatic
Answer Script Evaluation System," International Journal of Computer Applications, vol.
186, no. 42, pp. 22–27, Sep. 2024.
2. A. Rayate, S. Ghumare, F. Sayyed, H. Sapkal, and S. Rokade, "Automated Grading
System for Subjective Answers Evaluation Using Machine Learning and NLP,"
International Journal of Innovative Research in Management, Engineering and
Technology, vol. 12, no. 2, pp. 45–50, Mar.–Apr. 2024.
3. M. A. Rahaman and H. Mahmud, "Automated Evaluation of Handwritten Answer Script
Using Deep Learning Approach," Transactions on Engineering and Computing Sciences,
vol. 10, no. 4, pp. 35–42, 2022.
4. G. Sanuvala and S. S. Fatima, "A Pedagogical Approach with NLP and Deep Learning,"
Ain Shams Engineering Journal, vol. 12, no. 3, pp. 1234–1240, 2021.
5. K. Sharma, A. Verma, and R. Singh, "Eval - Automatic Evaluation of Answer Scripts
Using Deep Learning and NLP," International Journal of Intelligent Systems and
Applications in Engineering, vol. 9, no. 2, pp. 78–84, 2023.
6. S. Kumar and M. Patel, "Automatic Grading of Answer Sheets Using Machine Learning
Techniques," International Journal of Computer Applications, vol. 175, no. 8, pp. 15–20,
2023.
7. M. Kulkarni, A. Desai, and R. Mehta, "Digital Handwritten Answer Sheet Evaluation
System," International Journal of Computer Applications, vol. 186, no. 16, pp. 30–35,
2024.
8. P. Kudi and A. Manekar, "Automated Descriptive Answer Evaluation System Using
Machine Learning," International Journal of Advance Research and Innovative Ideas in
Education, vol. 7, no. 2, pp. 60–65, 2023.
9. S. Gupta and R. Sharma, "AI-Powered Framework for Automated, Objective Evaluation
of Handwritten Responses Using OCR & NLP," ReadyTensor Publications, vol. 1, no. 1,
pp. 10–15, 2025.

10. J. Osaka, A. Maeda, and H. Oka, "Reliable and Efficient Automated Short-Answer
Scoring for a Large Dataset Using Active Learning and Deep Learning," Interactive
Learning Environments, vol. 33, no. 2, pp. 200–215, 2025.
11. M. Haller, S. Müller, and T. Schmidt, "Survey on Automated Short Answer Grading with
Deep Learning: From Word Embeddings to Transformers," arXiv preprint
arXiv:2204.03503, 2022.
12. S. Saha, R. Ghosh, and A. Banerjee, "Joint Multi-Domain Learning for Automatic Short
Answer Grading," arXiv preprint arXiv:1902.09183, 2019.
13. S. Chauhan, A. Kumar, and P. Gupta, "Explanation Based Handwriting Verification,"
arXiv preprint arXiv:1909.02548, 2019.
14. I. Aggarwal, P. Gautam, and G. Parashar, "Automated Subjective Answer Evaluation
Using Machine Learning," SSRN Electronic Journal, vol. 1, no. 1, pp. 1–10, Jan.
2024.
15. N. Bharambe, S. Patil, and R. Deshmukh, "Automatic Answer Evaluation Using Machine
Learning," International Journal of Innovative Technology, vol. 7, no. 2, pp. 25–30,
2021.
16. B. Nandwalkar, A. Joshi, and M. Kale, "Descriptive Handwritten Paper Grading System
Using NLP and Fuzzy Logic," International Journal of Pattern Recognition and
Artificial Intelligence, vol. 37, no. 4, pp. 273–282, 2023.
17. S. Kulkarni, R. Pawar, and A. Deshmukh, "Digital Handwritten Answer Sheet Evaluation
System," International Journal of Scientific Research in Engineering and Management,
vol. 9, no. 3, pp. 50–55, 2024.
18. P. S. Preetha, R. Nair, and S. Kumar, "Descriptive Answers Evaluation Using Natural
Language Processing Approaches," International Journal of Computer Applications, vol.
186, no. 20, pp. 40–45, 2024.
19. S. Mahapatra, A. Das, and R. Roy, "Automated Exam Paper Checking System Using
NLP," Devpost Projects, vol. 1, no. 1, pp. 1–5, 2023.
