SMART INDIA HACKATHON 2024
• Organisation/Ministry - Ministry of Electronics and Information Technology
• Problem Statement ID - SIH1669
• Problem Statement Title - Transformo Docs Application: Empowering Machine-Readable Document Management System
• Theme - Smart Automation
• Team ID - 8471
• Institute - G.L. Bajaj Institute of Technology and Management, Greater Noida, UP
• Institute code (AISHE) - C-46239
• Team Name - CodeXplorers
• Team leader name - Shubhanshu Omer
Transformo Docs
CodeXplorers IDEA/APPROACH DETAILS
IDEA/SOLUTION:
TransformoDocs is a comprehensive document transformation application designed to restrict non-machine-readable documents and automatically generate machine-readable formats, enhancing automation, accessibility, and data extraction [1].
❖ Document Ingestion Restriction: Blocks non-machine-readable formats (e.g., PDFs, DOCs) from being processed by the system [1,3].
❖ Automatic Conversion: Automatically converts any document (scanned, generated by software, or uploaded) into a machine-readable format [2,3].
❖ Enhanced Searchability: Allows documents to be indexed and searched effectively, enhancing accessibility and efficiency [5,6].
Problem Resolution:
❖ Apply OCR and NLP to turn scanned images and text into structured formats (XML, JSON, CSV) for automated data processing and integration [4,6].
❖ Use a validation engine to filter out non-compliant formats and accept only machine-readable documents.
Unique Value Propositions (UVP):
❖ Dual Functionality: This not only blocks non-machine-readable formats but also converts them into readable documents seamlessly [3,4].
❖ Universal Compatibility: This handles scanned images, PDFs, and software documents, easily integrating into any workflow [2,6].
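The validation engine described above could be sketched as follows. This is a minimal illustration, not the team's implementation: the function names `detect_machine_readable` and `route_upload` are hypothetical, and the CSV check is a simple heuristic (prose that happens to contain commas could pass it), so a real engine would need stricter rules.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from typing import Optional


def detect_machine_readable(payload: str) -> Optional[str]:
    """Return 'json', 'xml', or 'csv' if the text parses as that format, else None."""
    text = payload.strip()
    if not text:
        return None
    # JSON: accept only objects and arrays, not bare scalars such as "123".
    try:
        if isinstance(json.loads(text), (dict, list)):
            return "json"
    except json.JSONDecodeError:
        pass
    # XML: the text must be a well-formed document.
    try:
        ET.fromstring(text)
        return "xml"
    except ET.ParseError:
        pass
    # CSV (heuristic): at least two rows, more than one column, consistent width.
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) >= 2 and len(rows[0]) > 1 and all(len(r) == len(rows[0]) for r in rows):
        return "csv"
    return None


def route_upload(payload: str) -> str:
    """Hypothetical routing rule: accept structured input, send the rest to conversion."""
    fmt = detect_machine_readable(payload)
    return f"accept:{fmt}" if fmt else "convert"
```

In this sketch a rejected upload is not discarded; it is routed to the conversion step, which matches the Dual Functionality point above.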
@SIH Idea submission- Template
CodeXplorers TECHNICAL APPROACH
Algorithm Development :
● Natural Language Processing (NLP)
● Optical Character Recognition (OCR)
● Search Algorithms
● Document Layout Analysis
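As one possible reading of the "Search Algorithms" item, the sketch below builds an inverted index over converted documents and answers AND queries. The deck does not say which search technique is used, so this is an assumption for illustration; `tokenize`, `build_index`, and `search` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the set of document IDs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A production system would more likely delegate this to the full-text search features of the chosen database, but the data structure is the same idea.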
Frontend:
● ReactJs
● TypeScript
● Tailwind CSS
Database:
● PostgreSQL
● MongoDB
Backend:
● Tesseract OCR
● Django
Cloud Services:
● AWS
● Azure Blob Storage
[Figure: Document Management Process Flowchart]
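To illustrate the conversion step in this stack, the sketch below takes text that an OCR engine such as Tesseract has already produced and exports the "Key: value" lines it finds as JSON, XML, and CSV. The OCR call itself is left out, and the field pattern and function names are assumptions made for the example; real documents would need layout analysis and NLP rather than a single regular expression.

```python
import csv
import io
import json
import re
import xml.etree.ElementTree as ET
from typing import Dict

# Assumed pattern: a label made of letters and spaces, a colon, then the value.
FIELD_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.+?)\s*$")


def extract_fields(ocr_text: str) -> Dict[str, str]:
    """Collect 'Key: value' lines into a dict with snake_case keys."""
    fields: Dict[str, str] = {}
    for line in ocr_text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            key = match.group(1).lower().replace(" ", "_")
            fields[key] = match.group(2)
    return fields


def to_json(fields: Dict[str, str]) -> str:
    return json.dumps(fields, indent=2)


def to_xml(fields: Dict[str, str]) -> str:
    # Keys contain only letters and underscores, so they are valid tag names.
    root = ET.Element("document")
    for key, value in fields.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def to_csv(fields: Dict[str, str]) -> str:
    # One header row followed by one data row.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields))
    writer.writeheader()
    writer.writerow(fields)
    return buffer.getvalue()
```

Lines without a recognizable label, such as a company name printed at the top of a page, are skipped in this sketch instead of being guessed at.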
CodeXplorers FEASIBILITY AND VIABILITY
FEASIBILITY:
❖ Technical Feasibility: Utilizes OCR and NLP tools like Tesseract, Google Cloud Vision, and AWS Textract for fast document conversion [2,4].
❖ Operational Feasibility: Scales with high document volumes using cloud infrastructure and microservices [5].
❖ Regulatory & Compliance Feasibility: Ensures security of sensitive documents with encryption, access control, and data anonymization [6].
❖ Market Feasibility: Meets growing demand for efficient document management by handling non-machine-readable formats [2].
POTENTIAL CHALLENGES:
❖ Non-Machine-Readable Documents: PDFs and DOCs contain data but are hard to process automatically due to their unstructured nature [5].
❖ Limited Searchability and Insights: Extracting and analyzing information from these documents is often manual, time-consuming, and inefficient [6].
❖ Barrier to Automation: Non-machine-readable documents hinder automated workflows [1].
STRATEGIES:
❖ Utilize established OCR and NLP tools like Tesseract, Google Cloud Vision, or AWS Textract [4].
❖ Develop an intuitive user interface with input from end-users to ensure ease of use [5].
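The data anonymization mentioned under compliance feasibility could start from something like the sketch below, which masks email addresses and long digit runs in extracted text before storage. The two patterns and the `anonymize` name are assumptions for illustration only; they do not cover names, addresses, or other identifiers, and they do not replace encryption or access control.

```python
import re

# Assumed patterns: a simple email shape, and standalone runs of 10 to 16 digits
# (for example phone or account numbers).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{10,16}\b")


def anonymize(text: str) -> str:
    """Replace matched emails and long numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)
```

Short numbers such as quantities or page references are left untouched, so the structured output stays usable for search and analysis.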
CodeXplorers IMPACT AND BENEFITS
POTENTIAL IMPACT:
❖ Positive Impact:
● Enhanced Efficiency: Automates conversion and validation, reducing manual effort [3].
● Cost Savings: Cuts labor costs by reducing manual data entry [6].
● Improved Data Accessibility: Enables faster information retrieval through machine-readable formats [4,5].
❖ Negative Impact:
● Learning Curve: Users may face challenges adapting to new automated processes [5].
● Data Security Risks: Automation could expose sensitive data to vulnerabilities.
BENEFITS:
❖ Improved Efficiency: Automatic conversion of non-machine-readable documents saves time, reducing the need for manual intervention [6].
❖ Enhanced Data Accessibility: Organizations can access the data contained in their documents quickly and efficiently [5].
❖ Greater Compliance: Document conversion ensures that data follows the required formats for regulatory and accessibility standards [4].
❖ Scalability: The application is designed to scale across multiple departments or even organizations, handling large volumes of documents seamlessly [6].
CodeXplorers RESEARCH AND REFERENCES
REFERENCES:
1. Pandey, M., Arora, M., Arora, S., Goyal, C., Gera, V. K., & Yadav, H. (2023). AI-based Integrated Approach for the Development of Intelligent Document Management System (IDMS). Procedia Computer Science, 230, 725-736.
2. Parikh, A. (2023). Information Extraction from Unstructured data using Augmented-AI and Computer Vision. arXiv preprint arXiv:2312.09880.
3. Zhu, M., & Cole, J. M. (2022). PDFDataExtractor: A tool for reading scientific text and interpreting metadata from the typeset literature in the portable document format. Journal of Chemical Information and Modeling, 62(7), 1633-1643.
4. Pudasaini, S., Shakya, S., Lamichhane, S., Adhikari, S., Tamang, A., & Adhikari, S. (2022). Application of NLP for information extraction from unstructured documents. In Expert Clouds and Applications: Proceedings of ICOECA 2021 (pp. 695-704). Springer Singapore.
5. Sage, C., Douzon, T., Aussem, A., Eglin, V., Elghazel, H., Duffner, S., ... & Espinas, J. (2021). Data-efficient information extraction from documents with pre-trained language models. Springer International Publishing.
6. Adnan, K., & Akbar, R. (2019). An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data, 6(1), 1-38.