Miniproject
A Project Report
Submitted to the APJ Abdul Kalam Technological University
in partial fulfillment of requirements for the award of degree
Bachelor of Technology
in
Computer Science and Engineering (Artificial Intelligence and Machine
Learning)
by
Adithyan S Pillai (SCT22AM007)
Adrija A (SCT22AM008)
Gowri P N (SCT22AM033)
CERTIFICATE
This is to certify that the report entitled AURA: Academic Utility and Resource
Assistant submitted by Adithyan S Pillai (SCT22AM007), Adrija A (SCT22AM008) & Gowri
P N (SCT22AM033) to the APJ Abdul Kalam Technological University in partial fulfillment of
the B.Tech. degree in Computer Science and Engineering is a bonafide record of the project
work carried out by them under our guidance and supervision. This report in any form has not
been submitted to any other University or Institute for any purpose.
DECLARATION
We hereby declare that the project report AURA: Academic Utility and Resource Assistant,
submitted for partial fulfillment of the requirements for the award of degree of Bachelor of Tech-
nology of the APJ Abdul Kalam Technological University, Kerala is a bonafide work done by us
under supervision of Prof. Sreepriya S L.
This submission represents our ideas in our own words and where ideas or words of others
have been included, we have adequately and accurately cited and referenced the original sources.
We also declare that we have adhered to ethics of academic honesty and integrity and have not
misrepresented or fabricated any data or idea or fact or source in our submission. We understand
that any violation of the above will be a cause for disciplinary action by the institute and/or the
University and can also evoke penal action from the sources which have thus not been properly
cited or from whom proper permission has not been obtained. This report has not previously
formed the basis for the award of any degree, diploma or similar title of any other University.
TRIVANDRUM
09-05-2024

Adithyan S Pillai
Adrija A
Gowri P N
Abstract
Acknowledgement
We take this opportunity to express our deepest sense of gratitude and sincere thanks to
everyone who helped us to complete this work successfully. We express our sincere thanks to Dr.
HOD Name, Head of Department, COMPUTER SCIENCE & ENGINEERING, SREE CHITRA
THIRUNAL COLLEGE OF ENGINEERING for providing us with all the necessary facilities and
support. We gratefully acknowledge the contributions of Prof. Syama R, our coordinator, whose
support and assistance were instrumental in the completion of this project.
We would like to place on record our sincere gratitude to our project guide Prof. Sreepriya S L,
Assistant Professor, COMPUTER SCIENCE & ENGINEERING, SREE CHITRA THIRUNAL
COLLEGE OF ENGINEERING for the guidance and mentorship throughout this work.
Adithyan S Pillai
Adrija A
Gowri P N
Contents
Abstract
Acknowledgement
1 Introduction
1.1 Motivation
1.2 AURA - Academic Utility and Resource Assistant
1.3 Intelligent Document Management
1.4 AI-Powered Assistance
1.5 Real-Time Notifications and User Engagement
1.6 Objectives
1.7 Scope and Limitations
2 Literature Review
2.1 AI-Driven Academic Achievement Tracking
2.2 Learning Management System Utilization for Deadlines
2.3 AI-Powered Assessment Tools
2.4 Real-Time Data Integration and Analytics
3 Methodology
3.1 System Architecture
3.2 Data Processing
3.3 AI Model Selection and Training
3.4 Implementation and Features
3.5 Evaluation and Refinement
4 Implementation
4.1 Development Tools and Libraries
4.1.1 Faiss (v1.6.1)
4.1.2 Pickle
4.1.3 Pandas (v1.1.2)
4.1.4 Streamlit (v0.62.0)
4.1.5 Sentence Transformers (v0.3.8)
4.1.6 Transformers (v3.3.1)
4.1.7 NumPy (v1.19.2)
4.1.8 Torch (v1.8.1)
4.1.9 Folium (v0.2.1)
4.1.10 Setuptools
4.1.11 find_namespace_packages
4.2 Development Process
4.2.1 Data Collection and Cleaning
4.2.2 Vectorizing the Text Data
4.2.3 Building an Index with Faiss
4.2.4 Searching with User Queries
5 Results
6 Conclusion
References
Chapter 1
Introduction
1.1 Motivation
In academic environments, students and faculty often struggle with managing multiple academic
resources, deadlines, and performance tracking across various platforms. Traditional academic
management systems are often fragmented, requiring users to rely on multiple tools for assignment
tracking, attendance monitoring, and resource sharing. This inefficiency leads to disorganization,
missed deadlines, and a lack of real-time insights into academic progress. Consequently, there
is a growing need for an integrated platform that streamlines academic utilities and enhances the
overall learning experience.
1.3 Intelligent Document Management
AURA incorporates automated certificate classification and a structured repository for notes and
documents. Users can upload certificates, which are then classified and assigned activity points
based on predefined academic criteria. Additionally, AURA integrates with GitHub to provide
a centralized repository for storing and sharing academic materials, ensuring easy access and
organization of important documents.
1.6 Objectives
The primary objective of AURA is to develop an integrated academic management platform that
enhances productivity, organization, and academic performance. Key goals include:
• Centralized Academic Management: Providing a unified platform for students and faculty
to manage assignments, attendance, notes, and deadlines efficiently.
• Real-Time Academic Insights: Implementing a notification system to ensure users stay
updated with essential academic information.
• User-Friendly Interface: Designing an intuitive and engaging user experience that simplifies
academic management.
1.7 Scope and Limitations
• Scalability Challenges: While AURA is designed for efficiency, scaling to support larger
user bases may necessitate additional optimization and infrastructure enhancements.
• AI Accuracy: The effectiveness of AURA’s AI-driven tools depends on training data and
fine-tuning, which may require ongoing refinement to ensure optimal performance.
Chapter 2
Literature Review
2.1 AI-Driven Academic Achievement Tracking
The study utilizes machine learning models trained on historical academic data to identify patterns
in student performance. The process involves data preprocessing, where raw student records are
cleaned and structured. Feature selection is applied to extract relevant variables that influence
academic outcomes. Predictive modeling techniques, including regression analysis and neural
networks, are employed to forecast student performance. Additionally, reinforcement learning
strategies are used to optimize recommendations, ensuring that students receive the most effective
interventions based on their learning behavior.
While AI-driven tracking improves efficiency, it may face challenges related to data privacy,
algorithmic bias, and the accuracy of predictive models. Errors in prediction could lead to incor-
rect assessments, potentially affecting student outcomes if not properly validated. Additionally,
the reliance on historical data may not fully capture real-time changes in student performance,
necessitating periodic recalibration of the models.
2.2 Learning Management System Utilization for Deadlines
S. Lopez, A. Pham, J. Hsu, and P. Halpin explored how students bypass traditional syllabus
structures and utilize alternate LMS locations to track assignment deadlines [2]. The research
reveals that students often rely on informal sources such as discussion forums, peer interactions,
and notification-based reminders instead of the official syllabus.
The study employs surveys and system analytics to track student interactions with LMS plat-
forms. Researchers collect data on login frequency, navigation patterns, and user engagement
with assignment-related content. Statistical analysis is conducted to identify common behaviors
among students who rely on alternative tracking methods. The findings are used to develop recom-
mendations for optimizing LMS interfaces and ensuring deadline information is more prominently
displayed.
A major limitation is the inconsistency in student behavior across different LMS platforms, mak-
ing it challenging to standardize solutions. Additionally, reliance on informal tracking methods
could lead to misinformation if students do not verify deadlines with official sources. The study
also highlights the cognitive overload students experience due to fragmented information sources,
which can negatively impact their ability to manage deadlines effectively.
2.3 AI-Powered Assessment Tools
The study analyzes AI-powered calculators that utilize symbolic computation and machine learning
to assist students with problem-solving, enhancing their understanding of mathematical concepts.
The tools integrate natural language processing (NLP)
to interpret mathematical queries and generate step-by-step solutions. Data is collected on student
interactions, error rates, and time spent per problem. Performance metrics are analyzed to evaluate
how AI assistance influences student learning and retention.
One major concern is the potential for academic dishonesty, as students may become overly reliant
on AI tools rather than developing problem-solving skills. Additionally, the effectiveness of AI-
powered assessment tools varies depending on the subject complexity and the accuracy of AI
interpretations. Ethical concerns arise regarding the extent to which AI should be integrated into
assessments, as excessive dependence on automated tools could diminish critical thinking abilities.
2.4 Real-Time Data Integration and Analytics
A. Ambasht explored real-time data integration and analytics as a means to empower data-driven
decision-making in academic settings [4]. The study discusses how integrating real-time analytics
into educational platforms enhances decision-making processes for both students and educators.
By leveraging automated data pipelines, institutions can improve academic tracking, performance
analysis, and personalized recommendations.
This research focuses on the integration of cloud-based analytics platforms that collect and process
student performance data in real-time. The system employs streaming data frameworks to capture
continuous academic activity, which is then processed through machine learning algorithms for
trend analysis. Advanced visualization techniques are used to present insights to educators,
enabling proactive interventions for student support.
Challenges include the high computational costs associated with real-time data processing and
concerns over data security. Ensuring the accuracy and reliability of real-time analytics remains a
critical issue in large-scale implementations. Additionally, real-time data systems must handle vast
volumes of information while maintaining system performance, requiring robust infrastructure
and optimization strategies.
Chapter 3
Methodology
This chapter presents the methodology used in developing AURA (Academic Utility and Re-
source Assistant). The system is designed to improve academic management through assignment
tracking, deadline reminders, internal marks monitoring, AI-driven figure generation, real-time
notifications, activity tracking, and grade calculation. This chapter details the system architecture,
data processing pipeline, AI model training, implementation, and evaluation methods.
3.1 System Architecture
Frontend
The frontend of AURA is developed using React.js, providing an intuitive and responsive user
interface. It features dashboards, progress visualizations, and real-time notifications to enhance
the user experience. The use of modular components ensures that new features can be added
without disrupting the existing functionality.
Backend
The backend is implemented using Node.js and Express.js, which handle business logic, data
processing, and API endpoints. The backend ensures efficient management of academic records
and interactions. Secure API communication protocols are integrated to prevent unauthorized data
access.
Database
AURA utilizes GitHub as its database for storing structured academic data, including student
records, assignment details, and activity logs. The system employs GitHub repositories to manage
data efficiently, leveraging version control for tracking changes and updates. This approach
ensures data integrity, accessibility, and easy synchronization across multiple devices and users.
AI Modules
To ensure data privacy and integrity, AURA enforces security measures such as role-based access
control (RBAC), encrypted data storage, and multi-factor authentication (MFA). Regular security
audits are performed to identify vulnerabilities and strengthen protection mechanisms.
3.2 Data Processing
Data Collection
Users upload assignments, certificates, and academic records through the frontend interface. These
inputs are securely transmitted and stored in the backend database, ensuring data integrity and
security. File validation checks are implemented to prevent corrupted or malicious uploads.
Preprocessing
Raw academic data undergoes preprocessing, including cleaning, structuring, and categorization.
AI-based techniques remove inconsistencies, standardize formats, and classify data according to
predefined academic categories. This process improves the accuracy and efficiency of downstream
processing.
Feature Extraction
Key academic parameters, such as deadlines, grades, and activity points, are extracted using NLP-
based parsing and classification models. These extracted features are used to generate insights for
students and faculty, enhancing decision-making and academic performance tracking.
The structured data is indexed for rapid retrieval, reducing latency in queries and improving per-
formance. Optimized search algorithms allow users to quickly access relevant academic records,
making the system highly efficient.
Data Synchronization
AURA ensures real-time synchronization of academic records across multiple users and devices.
Automated update mechanisms periodically refresh all modules, ensuring that users always have
access to the latest information.
3.3 AI Model Selection and Training
Certificate Classification
AI-driven certificate classification employs NLP-based deep learning models to analyze and cat-
egorize certificates based on predefined university criteria. Transformer-based architectures such
as BERT are used to ensure high accuracy in classification tasks.
AURA uses AI to automate the allocation of activity points based on uploaded certificates. The
system processes certificate data, matches it against predefined university guidelines, and assigns
corresponding activity points without requiring manual intervention.
Each model is trained using academic datasets, validated through multiple test cases, and opti-
mized for accuracy, efficiency, and minimal computational overhead.
3.4 Implementation and Features
The Activity Points Tracker predicts certificate points using a keyword-based classification system.
Uploaded certificates are categorized into predefined types, each assigned weighted points. A
React-based UI allows seamless file uploads, validation, and point tracking. User profiles dynamically
update with awarded points, ensuring an intuitive and automated recognition system for achievements.
Assignment Assistant
The Assignment Assistant utilizes React.js for a dynamic UI and Cohere’s AI API for intelligent
academic guidance. User inputs (topic, subject, code, pages) are processed to generate structured
recommendations. Axios handles API requests, and state management ensures seamless interaction.
The UI integrates Lucide icons for enhanced usability.
Attendance
The Attendance Tracker utilizes React.js for an interactive UI, managing subject-wise and daily
attendance. useState controls visibility, while useMemo extracts subject mappings. Data is processed
to compute attendance percentages and required classes. Conditional rendering enhances usability,
and Lucide icons improve accessibility, ensuring a seamless user experience.
Chatbot
The AURA Chatbot integrates React.js for the UI, Firebase for data management, and Gemini AI
for intelligent responses. It processes user queries using local logic for academic data (timetable,
attendance) and an AI API for broader queries. The chatbot employs state management, API
handling, and UI animations for an interactive experience.
Dashboard
The Dashboard Component in AURA dynamically retrieves and displays the user’s daily schedule
using React.js. It processes timetable data, formats subject names, and renders an interactive UI
with Tailwind CSS. State management and conditional rendering ensure accurate, real-time academic
updates, enhancing user experience through structured data visualization and accessibility.
College Events
The Events Component in AURA dynamically fetches, parses, and displays college events using
React.js, Axios, and PapaParse. It retrieves CSV data, filters, and formats events while handling
errors and loading states. Tailwind CSS enhances UI responsiveness, ensuring seamless event
management with real-time updates and interactive registration links.
Notes
The Notes Component in AURA fetches and displays academic notes from a GitHub repository
using React.js. It enables dynamic browsing, searching, and filtering of subjects while allowing file
downloads. Efficient state management, API handling, and an intuitive UI with Tailwind CSS ensure
a seamless user experience and optimized content retrieval.
Smartboard
The SmartBoard Component in AURA provides quick access to an interactive online whiteboard.
Built using React.js, it integrates seamless navigation by opening the SmartBoard in a new tab upon
user interaction. The minimalistic UI, enhanced with Tailwind CSS, ensures an intuitive and
accessible digital collaboration experience.
Timetable
The TimeTable Component in AURA dynamically displays a structured weekly schedule using
React.js. It features collapsible sections for each day, enabling intuitive navigation. The integration
of Tailwind CSS ensures a responsive, visually appealing UI. State management via React hooks
allows seamless interaction, enhancing accessibility and user experience.
Sidebar
The Sidebar Component in AURA enhances navigation using React.js and Tailwind CSS. It
dynamically toggles between expanded and collapsed states for an adaptive UI. Interactive elements,
including animated color-changing logos and intuitive menu selection, improve user experience.
React hooks efficiently manage state, ensuring smooth and responsive interactions.
Splashscreen
The Splash Screen in AURA uses React.js to create a smooth loading animation. It dynamically
controls opacity and progress using state and requestAnimationFrame(), ensuring a seamless
transition. Tailwind CSS enhances styling, while timed animations progressively reveal the AURA
logo and tagline, creating an engaging and professional introduction.
3.5 Evaluation and Refinement
User Testing
Performance Metrics
System performance is measured based on key metrics, including response time, AI model ac-
curacy, and database query efficiency. Benchmarking ensures that the system meets performance
expectations.
Scalability Testing
AURA undergoes load testing to verify its ability to handle a growing number of users and large
volumes of academic data. Stress tests simulate peak usage conditions to ensure reliability and
stability.
Security Assessments
Periodic security audits are performed to detect vulnerabilities and implement the latest security
protocols. Encryption, authentication mechanisms, and access control policies are continuously
updated to safeguard user data.
User feedback is continuously collected and analyzed to refine system interactions and optimize
features. Iterative updates ensure that AURA evolves to meet the dynamic needs of students and
faculty.
Through iterative refinements, AURA continuously evolves into a more advanced and efficient
academic management platform, ensuring an optimized experience for students and faculty alike.
Chapter 4
Implementation
4.1 Development Tools and Libraries
4.1.1 Faiss (v1.6.1)
Faiss is a library developed by Facebook AI Research that provides highly efficient algorithms for
similarity search and clustering of dense vectors. The CPU version (faiss-cpu) is used for indexing
and querying the dense embeddings generated by Sentence Transformers.
Faiss enhances the capabilities of the Vector-based Search Engine by providing efficient algo-
rithms for similarity search and clustering of dense vectors. Its integration with Sentence Trans-
formers allows for the creation of a powerful and scalable search system capable of delivering
accurate and relevant search results. Faiss is utilized because of its:
Efficient Similarity Search: Faiss offers highly efficient algorithms for similarity search, allowing
the search engine to quickly retrieve documents that are similar to a given query. This is partic-
ularly important in applications where real-time or near-real-time search responses are required,
such as web search engines or recommendation systems.
Clustering of Dense Vectors: In addition to similarity search, Faiss also provides algorithms
for clustering dense vectors. Clustering is useful for organizing large collections of documents
into groups based on their semantic similarities. This can facilitate tasks such as topic modeling,
document categorization, or recommendation system personalization.
Indexing and Querying: Faiss provides functionalities for indexing and querying dense embed-
dings efficiently. The CPU version of Faiss, known as faiss-cpu, is utilized for these operations in
scenarios where GPU resources are not available or feasible. This ensures that the search engine
can be deployed on a wide range of hardware configurations, including servers without GPU
support or cloud platforms with limited GPU availability.
Integration with Sentence Transformers: Faiss seamlessly integrates with Sentence Transform-
ers, a library used for generating dense embeddings for sentences and text documents. After
generating embeddings for documents using Sentence Transformers, Faiss is employed to index
and query these embeddings for similarity search. This integration enables the search engine to
leverage the semantic information encoded in the dense embeddings to retrieve relevant documents
efficiently.
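To make these two capabilities concrete, the following minimal sketch (illustrative only, not code from the project repository; the dimensionality and vectors are random placeholders) builds a flat L2 index for similarity search and clusters the same vectors with Faiss's built-in k-means:

import faiss
import numpy as np

d = 768                                              # embedding dimensionality (placeholder)
vectors = np.random.rand(1000, d).astype("float32")  # stand-in for document embeddings

# Similarity search: exact (flat) L2 index over all vectors
index = faiss.IndexFlatL2(d)
index.add(vectors)                                   # add every document vector
distances, ids = index.search(vectors[:1], 5)        # 5 nearest neighbours of the first vector

# Clustering: group the same vectors into 10 clusters
kmeans = faiss.Kmeans(d, 10, niter=20)
kmeans.train(vectors)
_, assignments = kmeans.index.search(vectors, 1)     # cluster id for every vector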
4.1.2 Pickle
Pickle is a standard Python library used for serializing and deserializing Python objects. It is
used here to save and load data structures such as Faiss indexes or trained Sentence Transformer
models. Here’s how Pickle is utilized:
Serialization: Pickle facilitates the conversion of Python objects into a byte stream, which can
be stored in a file or transmitted over a network. When a Faiss index or a trained Sentence
Transformer model is created or modified during the development process, Pickle is employed
to serialize these objects, preserving their state and structure.
Storage: Serialized objects can be stored in files on disk or databases for long-term storage. In the
case of the Vector-based Search Engine, Pickle is utilized to save the Faiss indexes, which contain
the dense embeddings of documents, and the trained Sentence Transformer models, which encode
semantic information into fixed-length vectors.
Loading: Pickle also enables the deserialization of serialized objects, allowing them to be recon-
structed into their original Python objects. During the execution of the search engine, Pickle is
utilized to load the previously serialized Faiss indexes and trained Sentence Transformer models
into memory, restoring them to their original state.
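A minimal sketch of this save/load cycle is shown below; the file name and the objects being pickled (an embedding matrix and an id-to-title lookup) are illustrative stand-ins rather than the project's actual artifacts. Note that Faiss also offers faiss.write_index for persisting index objects directly, so pickling is typically reserved for plain Python objects such as metadata mappings.

import pickle
import numpy as np

# Illustrative objects: document embeddings and an id -> title lookup table
embeddings = np.random.rand(100, 768).astype("float32")
id_to_title = {i: f"Article {i}" for i in range(100)}

# Serialization: write both objects to disk as a single pickle file
with open("search_engine_state.pkl", "wb") as f:
    pickle.dump({"embeddings": embeddings, "id_to_title": id_to_title}, f)

# Loading: restore the objects in a later run of the application
with open("search_engine_state.pkl", "rb") as f:
    state = pickle.load(f)
embeddings, id_to_title = state["embeddings"], state["id_to_title"]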
4.1.3 Pandas (v1.1.2)
Pandas is a powerful library for data manipulation and analysis in Python. It is used for handling
structured data, such as loading datasets of academic articles, preprocessing, and data explo-
ration. Here’s how Pandas is utilized:
Structured Data Handling: Pandas excels in handling structured data, providing versatile data
structures like DataFrame that allow for easy manipulation and analysis of tabular data. In the
context of the search engine, Pandas is used to load datasets of academic articles, which typically
come in structured formats such as CSV or Excel files.
Data Exploration: Understanding the characteristics of the dataset is crucial for developing an
effective search engine. Pandas provides extensive capabilities for data exploration, including
descriptive statistics, data visualization, and summarization. These functionalities enable devel-
opers to gain insights into the dataset, identify patterns, and make informed decisions about the
preprocessing and modeling steps.
Integration with Other Libraries: Pandas seamlessly integrates with other Python libraries
commonly used in data science and machine learning, such as NumPy, Scikit-learn, and Mat-
plotlib. This interoperability allows for smooth data exchange and collaboration between different
components of the search engine, facilitating the development of a cohesive and efficient solution.
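The sketch below illustrates this loading, exploration and cleaning flow; the CSV file name and the column names (abstract, year) are assumptions for illustration rather than the dataset's actual schema:

import pandas as pd

# Load the article dataset (file name and column names are assumed for illustration)
df = pd.read_csv("misinformation_articles.csv")

# Basic exploration: shape, column types and summary statistics
print(df.shape)
print(df.dtypes)
print(df["year"].describe())

# Cleaning: drop entries without an abstract, since abstracts drive the embeddings
df = df.dropna(subset=["abstract"]).reset_index(drop=True)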
4.1.4 Streamlit (v0.62.0)
Streamlit is a Python library used for building interactive web applications for machine learning
and data science. It is used here to create the user interface for the vector-based search engine,
allowing users to interact with the system through a web browser.
Streamlit plays a crucial role in the development of the Vector-based Search Engine with Sentence
Transformers and Faiss by enabling the creation of an interactive and user-friendly interface. Its
simplicity, versatility, and seamless integration with Python libraries make it an ideal choice for
building web applications for machine learning and data science tasks. Here’s how Streamlit is
leveraged:
Interactive Web Applications: Streamlit enables the creation of interactive web applications
directly from Python scripts, without the need for HTML, CSS, or JavaScript. This simplifies the
development process and allows for rapid prototyping of user interfaces for machine learning and
data science applications.
User Interface Development: Streamlit provides a simple and intuitive API for building user
interfaces using familiar Python syntax. Developers can easily create various UI components such
as buttons, sliders, text inputs, and plots to facilitate user interaction with the search engine.
Real-time Updates: Streamlit offers automatic reactive updates, allowing the user interface to
dynamically update in response to user inputs or changes in the underlying data. This enables real-
time exploration and visualization of search results, enhancing the user experience and providing
immediate feedback.
Integration with Data Processing: Streamlit seamlessly integrates with data processing libraries
such as Pandas, allowing developers to incorporate data analysis and visualization directly into
the user interface. This facilitates data exploration and interpretation within the search engine
interface, empowering users to gain insights from the search results.
Deployment: Streamlit simplifies the deployment of web applications, providing built-in support
for deploying applications to various platforms such as Streamlit Sharing, Heroku, or Docker
containers. This ensures that the vector-based search engine can be easily deployed and accessed
by users via a web browser, without the need for complex setup or configuration.
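The following minimal sketch shows the kind of interface described above: a text input for the query and a slider for the publication-year filter. The widget labels and the placeholder table are illustrative; in the actual application the query would be routed to the Faiss index described later.

import pandas as pd
import streamlit as st

st.title("Vector-based Search Engine")

# Illustrative stand-in for the indexed article metadata
articles = pd.DataFrame({
    "title": ["Misinformation on Twitter", "Fake news detection survey"],
    "year": [2020, 2018],
})

# User inputs: free-text query and a publication-year filter
query = st.text_input("Search query")
year_range = st.slider("Publication year", 2010, 2020, (2010, 2020))

if query:
    # In the real application the query would be encoded and sent to the Faiss index;
    # here we simply filter the placeholder table by year to show the UI flow.
    mask = articles["year"].between(year_range[0], year_range[1])
    st.write(articles[mask])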
4.1.5 Sentence Transformers (v0.3.8)
Sentence Transformers is a Python library for generating dense embeddings for sentences and
text documents. It provides access to pre-trained Transformer models for encoding text into fixed-
length vectors. In this case, Sentence Transformers is used to generate embeddings for documents,
which are then indexed and queried using Faiss.
Pre-trained Transformer Models: The library offers access to a variety of pre-trained Trans-
former models, such as BERT, RoBERTa, and DistilBERT, which are renowned for their effective-
ness in natural language processing tasks. These models have been trained on large-scale corpora
of text data and can encode the semantic meaning of sentences into dense vector representations.
Efficient Integration with Faiss: After generating embeddings for documents using Sentence
Transformers, Faiss is leveraged to index and query these embeddings efficiently. Faiss provides
highly efficient algorithms for similarity search, allowing the search engine to retrieve relevant
documents quickly and accurately based on their semantic similarity.
Scalability and Performance: Sentence Transformers offers scalable solutions for generating
embeddings, allowing for the processing of large volumes of text data efficiently. This ensures that
the search engine can handle diverse datasets and deliver high-performance search capabilities to
users.
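As a small illustration of how such embeddings are produced and compared, the sketch below encodes two sentences with the same pre-trained model used in this project and computes their cosine similarity; the example sentences are arbitrary.

import numpy as np
from sentence_transformers import SentenceTransformer

# Load the pre-trained model used throughout this project
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

sentences = [
    "Misinformation spreads quickly on social media.",
    "False information propagates rapidly through online platforms.",
]
embeddings = model.encode(sentences)          # one fixed-length vector per sentence

# Cosine similarity between the two sentence vectors
a, b = embeddings
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)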
4.1.6 Transformers (v3.3.1)
Transformers is a Python library developed by Hugging Face that provides access to a wide range
of state-of-the-art pre-trained Transformer models for natural language processing tasks. It is
used alongside Sentence Transformers for accessing pre-trained models like BERT, RoBERTa,
and DistilBERT. Here’s how the Transformers library is utilized in this project:
Fine-tuning for Specific Tasks: Transformers library offers capabilities for fine-tuning pre-trained
models on domain-specific datasets. This allows developers to adapt the pre-trained models to the
specific requirements of the search engine, enhancing their ability to generate embeddings that are
tailored to the semantics of the document collection.
Integration with Sentence Transformers: Transformers library seamlessly integrates with Sen-
tence Transformers, enabling the search engine to access and utilize pre-trained Transformer
models for generating embeddings. This integration ensures that the search engine can leverage
the latest advancements in NLP research to enhance the quality and effectiveness of its semantic
representations.
Versatility and Flexibility: Transformers library provides a flexible and versatile API that allows
developers to easily load and utilize pre-trained models for various NLP tasks. This flexibility
enables the search engine to experiment with different models and configurations to find the most
suitable approach for generating embeddings and performing similarity search.
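For illustration, the following sketch loads a pre-trained DistilBERT checkpoint and its tokenizer through the Transformers API and inspects the contextual embeddings it produces; the checkpoint name and example sentence are generic assumptions rather than project-specific choices.

from transformers import AutoModel, AutoTokenizer

# Load a pre-trained DistilBERT checkpoint and its matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Tokenize a sentence and obtain its contextual token embeddings
inputs = tokenizer("Fake news detection is an active research area.", return_tensors="pt")
outputs = model(**inputs)
print(outputs[0].shape)   # (batch size, sequence length, hidden size)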
4.1.7 NumPy (v1.19.2)
NumPy is a fundamental package for scientific computing in Python. It is used for handling nu-
merical operations efficiently, particularly during the vectorization and indexing processes. NumPy’s
role here is:
Efficient Numerical Operations: NumPy provides a powerful array object that allows for effi-
cient storage and manipulation of large datasets. It offers a wide range of mathematical functions
and operations optimized for performance, making it ideal for handling numerical computations
required during various stages of building the search engine.
Indexing and Data Manipulation: NumPy’s array indexing capabilities are crucial for accessing
and manipulating data stored in arrays. It provides intuitive syntax for slicing, indexing, and
reshaping arrays, allowing for seamless data manipulation tasks such as filtering documents,
extracting embeddings, or performing similarity calculations.
Interoperability with Faiss: NumPy arrays serve as the primary data structure for storing dense
embeddings generated by Sentence Transformers and indexed by Faiss. Faiss seamlessly inte-
grates with NumPy arrays, enabling efficient indexing and querying of dense vector representa-
tions for similarity search operations.
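A brief sketch of these operations, using randomly generated stand-ins for the real embeddings, shows the stacking, type conversion and slicing that precede indexing:

import numpy as np

# Stack per-document embedding vectors into one matrix and cast to float32 for Faiss
doc_vectors = [np.random.rand(768) for _ in range(5)]   # stand-ins for real embeddings
matrix = np.vstack(doc_vectors).astype("float32")

# Slice out a subset of documents and compute query-document dot products
subset = matrix[:3]
query = matrix[0]
scores = subset @ query          # similarity scores against the query vector
print(matrix.dtype, matrix.shape, scores.shape)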
4.1.8 Torch (v1.8.1)
Torch is a machine learning library primarily used for deep learning tasks. It serves as the
backbone for libraries like Transformers and Sentence Transformers, providing GPU acceleration
and tensor computation capabilities. Here’s how Torch is leveraged:
Deep Learning Framework: Torch is a powerful deep learning framework that provides essen-
tial tools and functionalities for building and training neural network models. It offers a wide
range of modules and utilities for constructing various types of neural architectures, including
convolutional networks, recurrent networks, and transformer-based models.
GPU Acceleration: Torch seamlessly integrates with GPU hardware, allowing for accelerated
computation of tensor operations on CUDA-enabled devices. This GPU acceleration significantly
speeds up the training and inference processes, especially when dealing with large-scale datasets
and complex models.
Tensor Computation: Torch provides efficient tensor computation capabilities, enabling the ma-
nipulation and transformation of multi-dimensional arrays commonly used in deep learning tasks.
These tensor operations are fundamental for tasks such as data preprocessing, model training, and
inference, providing a versatile and efficient foundation for building machine learning pipelines.
Integration with Transformers and Sentence Transformers: Torch serves as the backbone
for libraries like Transformers and Sentence Transformers, which rely on its tensor computation
capabilities and GPU acceleration to implement advanced natural language processing models.
These libraries utilize Torch’s deep learning primitives to construct and train transformer-based
architectures for tasks such as text encoding, semantic representation learning, and similarity
search.
Scalability and Performance: Torch’s efficient implementation and GPU support make it well-
suited for handling large-scale datasets and complex models. Its scalability and performance
optimizations ensure that the Vector-based Search Engine can efficiently process and analyze text
data, delivering high-quality search results with minimal latency.
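A minimal sketch of the device handling described above; it assumes Sentence Transformers is installed and simply falls back to the CPU when no GPU is present:

import torch
from sentence_transformers import SentenceTransformer

# Detect whether a CUDA-capable GPU is available and pick the device accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"

# Sentence Transformers runs on top of Torch, so the model can be moved to the GPU
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
model.to(device)

# A small tensor computation on the selected device
x = torch.rand(3, 768, device=device)
print(x.mean().item(), device)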
4.1.9 Folium (v0.2.1)
Folium is a Python library for creating interactive maps. While not directly related to the vector-based
search engine, it might be used for visualization purposes or integrating location-based
features into the search engine interface. Although Folium is not central to the core functionality
of the Vector-based Search Engine with Sentence Transformers and Faiss, it can enhance the
search engine’s capabilities by providing visualizations of geographic data, integrating location-based
features, and improving the overall user experience. Its flexibility and ease of use make it a
valuable tool for incorporating spatial context into the search engine interface.
Visualization of Geographic Data: Folium enables the creation of interactive maps with cus-
tomizable features such as markers, polygons, and heatmaps. While the search engine primarily
deals with text data, Folium can be employed to visualize geographical information associated
with documents or search results. For instance, if the documents contain location-based metadata,
Folium can be used to plot these locations on an interactive map for visual exploration.
Integration with External Data Sources: Folium allows for the integration of external data
sources such as GeoJSON files or spatial databases. This functionality can be leveraged to
incorporate additional geographic context into the search engine interface. For example, if the
search engine retrieves documents related to specific geographic regions or landmarks, Folium
can be used to display these regions or landmarks on an interactive map alongside the search
results.
Enhanced User Experience: By incorporating interactive maps into the search engine interface,
Folium can enhance the user experience by providing visual context to the search results. Users
can visually explore the spatial distribution of documents or search results on a map, gaining
insights into geographical patterns or correlations that may not be apparent from textual represen-
tations alone.
Location-based Search Features: Folium can also be used to integrate location-based search
features into the search engine interface. For example, users may be able to specify a geographic
area of interest and retrieve documents or search results that are relevant to that location. Folium
can facilitate the visualization of search results within the specified geographic region, enabling
users to explore content based on geographical proximity.
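As an illustration of how location metadata could be plotted, the sketch below (written against the current Folium API, with arbitrary coordinates and labels) creates a map, adds a marker, and saves it as an HTML file that could be embedded in the interface:

import folium

# Create an interactive map centred on Port-au-Prince, Haiti (illustrative coordinates)
m = folium.Map(location=[18.5944, -72.3074], zoom_start=8)

# Mark a location mentioned in a retrieved article
folium.Marker(
    location=[18.5944, -72.3074],
    popup="Articles on misinformation during the 2010 Haiti earthquake",
).add_to(m)

# Save the map to an HTML file that can be embedded in the search interface
m.save("search_results_map.html")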
4.1.10 Setuptools
Setuptools is a package development library for Python that facilitates the packaging, distribution,
and installation of Python packages. It is often used in conjunction with the setup.py file to define
package metadata and dependencies. Setuptools is utilized in this context:
The setup.py file includes information such as the project name, version, author, and dependencies
required for installation.
Installation: Setuptools automates the installation process of Python packages by providing the
setup.py install command. This command installs the package along with its dependencies into the
Python environment, ensuring that the necessary dependencies are resolved and installed correctly.
Versioning: Setuptools allows developers to manage package versions effectively, ensuring con-
sistency and compatibility across different environments. By specifying version constraints in the
setup.py file, developers can ensure that their package is compatible with specific versions of its
dependencies, preventing potential conflicts or compatibility issues during installation.
Integration with Package Indexes: Setuptools seamlessly integrates with Python package in-
dexes such as PyPI (Python Package Index), allowing developers to publish their packages and
make them available for installation by others. By registering their packages on PyPI and using
tools like twine, developers can easily upload their package distributions for others to discover
and install.
4.1.11 find_namespace_packages
find_namespace_packages is a function provided by Setuptools that facilitates the discovery of
Python packages within a namespace. It is useful when organizing code into hierarchical namespaces,
allowing for cleaner and more modular package structures. Here’s an elaboration on how
find_namespace_packages is relevant to this project:
The function automatically discovers packages and sub-packages, making it easier to navigate and
manage the codebase.
Facilitates Modular Package Structures: By using namespace packages, developers can break
down the codebase into smaller, self-contained modules and sub-packages, each focusing on a
specific aspect of functionality. This modular approach promotes code reusability, extensibility,
and maintainability, as different components of the system can be developed and maintained
independently.
Cleaner Code Organization: Namespace packages help maintain a clean and organized codebase
by avoiding naming conflicts and providing clear separation between different components of
the system. This enhances readability and comprehension, as developers can easily locate and
understand the purpose of each module or sub-package within the namespace hierarchy.
Integration with Setuptools: Setuptools provides the find_namespace_packages function as
part of its API, allowing developers to specify namespace packages within their project’s setup
configuration. This enables Setuptools to correctly identify and include namespace packages
during packaging, distribution, and installation processes, ensuring that the project’s modular
structure is preserved.
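A minimal setup.py sketch combining Setuptools and find_namespace_packages is shown below; the package name, namespace pattern, and dependency pins are illustrative rather than the project's actual configuration:

from setuptools import find_namespace_packages, setup

# Minimal setup.py sketch; names and pins below are assumptions for illustration
setup(
    name="vector-search-engine",
    version="0.1.0",
    author="Project Team",
    packages=find_namespace_packages(include=["search_engine", "search_engine.*"]),
    install_requires=[
        "faiss-cpu==1.6.1",
        "pandas==1.1.2",
        "streamlit==0.62.0",
        "sentence-transformers==0.3.8",
        "transformers==3.3.1",
        "numpy==1.19.2",
    ],
)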
4.2 Development Process
4.2.1 Data Collection and Cleaning
For this project, we aimed to build a vector-based search engine capable of retrieving relevant
academic articles on misinformation, disinformation, and fake news. To achieve this, we acquired
a real-world dataset of research papers.
The dataset was compiled by querying the Microsoft Academic Graph (MAG) using Orion, a tool
that facilitates large-scale data access from MAG. This approach allowed us to efficiently gather a
comprehensive collection of relevant academic literature.
3. Dataset Description
The retrieved dataset comprised 8,430 articles published between 2010 and 2020. Each article
entry included the following information:
• Title: The article’s main title, providing a high-level overview of the research topic.
• Citations: References made by the article to other scholarly works. (This information might
not be used in the current iteration of the search engine but could be valuable for future
enhancements).
• Publication Year: The year the article was published, which can be helpful for tracking trends
in misinformation research.
4. Data Cleaning
We performed minimal data cleaning to ensure the quality of the dataset for the search engine’s de-
velopment. This involved removing entries lacking abstracts, as abstracts are crucial for capturing
the core content and meaning of the research.
The resulting dataset provided a solid foundation for building the vector-based search engine,
enabling it to process and analyze the semantic content of academic articles related to misinfor-
mation.
4.2.2 Vectorizing the Text Data
Once you have your documents, you need to convert them into a format that a computer can un-
derstand. This process is called vectorization. In this case, you will use a pre-trained model called
Sentence Transformers to encode the text data into numerical vectors. Sentence Transformers are
a type of deep learning model that can learn how to represent the meaning of a sentence as a
vector.
After acquiring and cleaning our dataset of research articles, we embarked on the process of
vectorization. This crucial step transforms the textual content of the abstracts into numerical
representations that computers can comprehend and manipulate. Essentially, we are translating
the meaning of each abstract into a unique mathematical code.
To achieve this feat, we leverage the power of Sentence Transformers, a type of deep learning
model specifically trained for semantic tasks. These models possess the remarkable ability to
capture the essence of a sentence and express it as a vector – a multidimensional array of numbers.
This vector encapsulates the semantic relationships between words within the sentence, allowing
us to compare and analyze the meaning of different pieces of text.
For our project, we selected the ”distilbert-base-nli-stsb-mean-tokens” model from the vast library
of pre-trained Sentence Transformers. This particular model excels in tasks involving Semantic
Textual Similarity (STS), which aligns perfectly with our objective of finding articles with similar
meanings. Additionally, it boasts a significant advantage over the original BERT model – its
smaller size translates to faster processing times, making it computationally efficient for our
project.
2. Harnessing GPU Power: If a Graphics Processing Unit (GPU) is available on the system, we
can leverage its superior computational capabilities to accelerate the vectorization process.
3. Encoding the Abstracts: This is where the magic happens! We employ the .encode()
method of the Sentence Transformer. This method takes each abstract from our dataset
as input and generates a corresponding high-dimensional vector. These vectors encapsulate
the semantic meaning of the abstracts, allowing us to establish relationships between articles
based on their content, rather than just keywords.
By successfully vectorizing our text data, we unlock a powerful tool for building our search
engine. These document vectors will serve as the foundation for the next stage – constructing
an efficient index using Faiss.
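The following sketch summarizes the vectorization steps above; the CSV file name and the abstract column name are assumptions for illustration rather than the project's actual file layout.

import pandas as pd
import torch
from sentence_transformers import SentenceTransformer

# Load the cleaned dataset (file and column names are assumed for illustration)
df = pd.read_csv("misinformation_articles_clean.csv")
abstracts = df["abstract"].tolist()

# 1. Load the pre-trained model
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# 2. Harness GPU power when a CUDA device is available
if torch.cuda.is_available():
    model = model.to("cuda")

# 3. Encode the abstracts into dense vectors
embeddings = model.encode(abstracts, show_progress_bar=True)
print(len(embeddings), len(embeddings[0]))   # number of documents x vector dimension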
4.2.3 Building an Index with Faiss
Faiss is a library that allows you to efficiently search for similar vectors in a large dataset. To use
Faiss, you first need to create an index that contains all of the document vectors. You can then use
this index to search for documents that are similar to a query vector.
Having transformed our text data into meaningful vectors, we now enter the realm of Faiss
(Facebook AI Similarity Search). Faiss is a powerful library that empowers us to efficiently search
for similar vectors within a vast dataset. This is precisely what we need to build our search engine
– the ability to find articles with content closely related to a user’s query.
Faiss operates by constructing an index, essentially a data structure that meticulously organizes
the document vectors. This index allows Faiss to rapidly locate vectors similar to a new query
vector, enabling us to retrieve relevant articles based on their semantic meaning. The beauty of
Faiss lies in its ability to handle datasets of any size, even those exceeding the limitations of a
computer’s RAM.
1. Data Type Conversion: As Faiss works with 32-bit floating-point matrices, we need to
convert the document vectors from their current format (likely float64) to this specific data
type. This ensures compatibility with Faiss’ internal operations.
2. Index Creation: We create a Faiss index object. This object serves as the central hub for
storing and searching the document vectors. We specify the dimensionality of the vectors
(the number of elements in each vector) to tailor the index for our data.
3. Unique Identification (Optional): Faiss employs the ”IndexIDMap” object to enable as-
signing custom IDs to the indexed vectors. In our case, we can leverage the paper IDs
retrieved from Microsoft Academic Graph to uniquely identify each document vector within
the index.
4. Populating the Index: Finally, we populate the Faiss index with the transformed document
vectors and their corresponding IDs. This process essentially builds a comprehensive map
of semantic relationships between the articles in our dataset.
5. Verification : To ensure the index functions as intended, we can test it with a sample vector.
By querying the index with an already indexed vector, we expect the first retrieved document
(along with its distance) to be the query itself (distance of zero). This confirms the index is
functioning correctly and ready to handle user queries.
By constructing this Faiss index, we’ve laid the groundwork for the final stage of our search engine
development: implementing the search functionality itself. The index allows us to efficiently
navigate the semantic landscape of our document collection, paving the way for retrieving articles
that resonate with a user’s search intent.
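The steps above translate into only a few lines of code. The sketch below continues from the vectorization sketch (reusing its embeddings and df) and assumes, for illustration, that the MAG paper IDs live in an id column:

import faiss
import numpy as np

# embeddings: one vector per abstract (from the previous step); ids: paper IDs from MAG
embeddings = np.asarray(embeddings)          # shape: (num_documents, dimension)
ids = df["id"].values.astype("int64")        # column name assumed for illustration

# 1. Convert vectors to 32-bit floats, the format Faiss expects
vectors = embeddings.astype("float32")

# 2. Create a flat L2 index sized to the embedding dimensionality
index = faiss.IndexFlatL2(vectors.shape[1])

# 3. Wrap the index so each vector can carry its own paper ID
index = faiss.IndexIDMap(index)

# 4. Populate the index with the vectors and their IDs
index.add_with_ids(vectors, ids)

# 5. Verify: searching with an indexed vector should return itself at distance 0
distances, result_ids = index.search(vectors[:1], 5)
print(distances[0][0], result_ids[0][0])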
4.2.4 Searching with User Queries
Once you have built your index, you can use it to search for relevant documents for a given
query. To do this, you first need to encode the query text into a vector using the same Sentence
Transformers model that you used to encode the document vectors. Then, you can search the
index for documents that have similar vectors to the query vector. The documents with the most
similar vectors will be the most relevant to the query.
The process begins with understanding the user’s query. We take the user’s search text and encode
it using the same Sentence Transformer model (”distilbert-base-nli-stsb-mean-tokens”) employed
for the document vectors. This ensures consistency in how we represent both the user’s intent and
the content of the articles. The encoding process creates a query vector, a numerical representation
capturing the semantic meaning of the user’s search.
As with document vectors, the query vector needs to conform to Faiss’ data type requirements.
Therefore, we convert the query vector from its current format (likely float64) to a 32-bit floating-
point representation (float32). This ensures seamless interaction with the Faiss index.
Now comes the moment of truth! We unleash the power of the Faiss index. We feed the encoded
query vector into the index, prompting it to search for document vectors with the closest semantic
resemblance. Imagine the index meticulously sifting through the vast collection of document
vectors, identifying those that share similar meaning with the user’s query.
The documents associated with the most similar vectors are considered the most relevant to
the user’s search. These documents will be presented to the user as potential answers to their
information need.
By seamlessly integrating user queries with the Faiss index, we empower users to navigate the vast
ocean of academic knowledge and efficiently discover articles that address their specific informa-
tion needs. This is the culmination of our efforts, a search engine that prioritizes semantic meaning
over simple keyword matching, ultimately fostering a more insightful research experience.
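Putting the pieces together, a query function might look like the sketch below; it reuses the model, index, and df objects from the earlier sketches and assumes illustrative id, title, and year columns.

import numpy as np

def search(query, top_k=5):
    """Encode a user query and retrieve the most similar articles from the Faiss index."""
    # Encode the query with the same Sentence Transformer model used for the documents
    query_vector = model.encode([query])
    query_vector = np.asarray(query_vector).astype("float32")

    # Search the index for the top_k closest document vectors
    distances, result_ids = index.search(query_vector, top_k)

    # Map the returned paper IDs back to article metadata (column names assumed)
    hits = df.set_index("id").loc[result_ids[0]]
    return hits[["title", "year"]].assign(distance=distances[0])

print(search("covid-19 misinformation and social media"))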
Chapter 5
Results
In this chapter, we present the outcomes of the implemented search engine, showcasing its ef-
fectiveness in retrieving relevant information for various queries. Through a series of sample
search results, we demonstrate the engine’s capability in addressing specific informational needs.
Each result is accompanied by an analysis of the retrieved articles, providing insights into their
relevance and significance.
Search Result 1
Figure 5.1: Search sample for COVID-19 news, output news articles
The image shows a sample search result for the query ”covid-19 misinformation and social me-
dia”. Here’s a breakdown of the results:
• Search bar: The user has entered the query ”covid-19 misinformation and social media”
• Filters: There’s a filter by publication year, ranging from 2010 to 2021. You can refine your
search based on the year of publication.
• Sample Result: The first retrieved article is titled ”A first look at COVID-19 information
and misinformation sharing on Twitter” published in 2020. The citation count is 20.
Overall, the image suggests a promising initial implementation of a search engine focused on
COVID-19 misinformation on social media. By leveraging semantic search, the engine can
retrieve articles that are conceptually relevant to the user’s query, even if they don’t contain the
exact keywords.
Search Result 2
Figure 5.2: Search Sample for Haiti Earthquake fake news, output relevant articles
The image shows a sample search result for the query ”What are the effects of misinformation
on social media during extreme events like the Haiti earthquake?”. Here’s a breakdown of the
results:
• Search bar: The user has entered the query ”What are the effects of misinformation on social
media during extreme events like the Haiti earthquake?”
• Filters: Publication year can be filtered by a slider ranging from 2010 to 2021.
• Sample Result: The first retrieved article is titled ”Human and Algorithmic Contributions to
Misinformation Online - Identifying the Culprit” (2019). Its abstract discusses who is to
blame for the spread of misinformation online: humans or algorithms.
Search Result 3
Figure 5.3: Search Sample for doubts about Brain death criteria and organ donation, output
relevant articles
The image shows a sample search result for the query ”What are some ethical considerations
around expressing doubts about brain death criteria and organ donation?”. Here’s a breakdown
of the results:
• Search bar: The user has entered the query ”What are some ethical considerations around
expressing doubts about brain death criteria and organ donation?”
• Filters: Publication year can be filtered by a slider ranging from 2010 to 2021.
• Sample Result: The retrieved articles address ethical debates surrounding brain death criteria
and organ donation, and are listed together with their publication years.
Chapter 6
Conclusion
The AURA (Academic Utility and Resource Assistant) project aims to enhance academic man-
agement by integrating automation, real-time tracking, and AI-driven classification. Traditional
academic systems often face challenges in tracking student activities, managing certificates, and
ensuring seamless communication between faculty and students. AURA overcomes these lim-
itations by offering a structured platform that automates certificate classification, activity point
tracking, and real-time notifications.
By leveraging technologies such as React.js for a dynamic frontend, Node.js and Express.js for a
robust backend, and GitHub as a version-controlled database, AURA ensures a seamless academic
experience. The platform integrates WebSockets for real-time updates, enhancing engagement
and responsiveness. Additionally, AI-based certificate classification streamlines the verification
process, reducing manual workload while ensuring accuracy.
AURA revolutionizes how academic records are managed by providing a unified system that en-
sures transparency, efficiency, and automation. With its modular design and scalable architecture,
the platform can be expanded with additional features in the future. By automating essential
academic processes, AURA empowers students and faculty with a smarter, more efficient way
to track progress, manage activities, and stay informed. This project lays the foundation for
an improved academic ecosystem that prioritizes accessibility, ease of use, and technological
advancement.
References