A
MINI PROJECT REPORT
on
HEALTH CARE ANALYTICS USING BIG DATA
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
Submitted by
CH. Likhil Kumar Goud
(197Y1A0521)
Y. Navadeep Reddy
(197Y1A0526)
Under the Guidance of
Mrs. K. Jaysri (Assistant Professor)
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
MARRI LAXMAN REDDY
INSTITUTE OF TECHNOLOGY AND MANAGEMENT
(AUTONOMOUS)
(Affiliated to JNTU-H, Approved by AICTE New Delhi and Accredited by
NBA & NAAC with ‘A’ Grade)
CERTIFICATE
This is to certify that the project report titled “Health Care Analytics using Big Data”, being submitted by CH. Likhil Kumar Goud (197Y1A0521) in IV B.Tech I Semester, Computer Science & Engineering, is a record of bonafide work carried out by him. The results embodied in this report have not been submitted to any other university for the award of any degree.

Internal Guide                                HOD

Principal                                     External Examiner
DECLARATION
We hereby declare that the Minor Project Report entitled “Health Care Analytics using Big Data”, submitted for the B.Tech degree, is entirely our own work, and all ideas and references have been duly acknowledged. It does not contain any work submitted for the award of any other degree.
Date:
CH. Likhil Kumar Goud
(197Y1A0521)
Y. Navadeep Reddy
(197Y1A0526)
ACKNOWLEDGEMENT
I am happy to express my deep sense of gratitude to the principal of the college, Dr. K. Venkateswara Reddy, Professor, Department of Computer Science and Engineering, Marri Laxman Reddy Institute of Technology & Management, for having provided me with adequate facilities to pursue my project.
I would like to thank Mr. Abdul Basith Khateeb, Associate Professor and Head, Department of Computer Science and Engineering, Marri Laxman Reddy Institute of Technology & Management, for having provided the freedom to use all the facilities available in the department, especially the laboratories and the library.
I am very grateful to my project guide, Mrs. K. Jaysri, Assistant Professor, Department of Computer Science and Engineering, Marri Laxman Reddy Institute of Technology & Management, for her extensive patience and guidance throughout my project work.
I sincerely thank my seniors and all the teaching and non-teaching staff of the Department of
Computer Science for their timely suggestions, healthy criticism and motivation during the
course of this work.
I would also like to thank my classmates for always being there whenever I needed help or
moral support. With great respect and obedience, I thank my parents and brother who were the
backbone behind my deeds.
Finally, I express my immense gratitude to the other individuals who have directly or indirectly contributed, at the right time, to the development and success of this work.
CONTENTS

Certificate
Declaration
Acknowledgement
Abstract

1. INTRODUCTION
   1.1 The 3 Vs of Big Data
   1.2 Ecosystem
       HDFS
       MapReduce
       Pig
       Hive
       Sqoop
       Impala
   1.3 Applications of Big Data
   1.4 Cloudera
   1.5 Hue
2. LITERATURE SURVEY
   2.1 Existing system
   2.2 Proposed system
3. REQUIREMENT ANALYSIS
   3.1 Hardware requirements
   3.2 Software requirements
4. IMPLEMENTATION
   4.1 Problem Definition
   4.2 System Architecture
       Get to the Source
       Ingestion Strategy and Acquisition
       Storage
       Data Processing
       Export Data Sets
       Reporting and Visualization
       Data Exploration
       Ad hoc Querying
5. METHODOLOGY
   5.1 How HDFS is used in our project
   5.2 How Hive is used
   5.3 How Cloudera is used
   5.4 How Hue is used
   5.5 How Sqoop is used
6. SCREENSHOTS
   To create a database
   To create a table
   To display fields
   Loading data into MySQL
   Importing data from MySQL to HDFS
   Compilation Time
LIST OF FIGURES
4.2 System Architecture
5.1 How HDFS is used in our project
LIST OF TABLES
6. Screenshots
ABSTRACT
In today's modern world, healthcare also needs to be modernized. This means that healthcare data should be properly analyzed so that it can be categorized into groups by gender, disease, city, symptoms, and treatment.

Big data is used to predict epidemics, cure diseases, improve quality of life, and avoid preventable deaths. With the world's population increasing and everyone living longer, models of treatment delivery are rapidly changing, and many of the decisions behind those changes are being driven by data.

The drive now is to understand as much about a patient as possible, as early in their life as possible, hopefully picking up warning signs of serious illness at an early enough stage that treatment is far simpler and less expensive than if it had not been spotted until later. Analytics of this gigantic size needs large-scale computation, which can be done with the help of the distributed processing framework Hadoop.

The framework used will provide multipurpose, beneficial outputs, which include presenting the healthcare data analysis in various forms. The groups made by the system would be symptom-wise, age-wise, gender-wise, season-wise, disease-wise, and so on. As the system displays the data group-wise, it is helpful for getting a clear idea about diseases and their rates of spread, so that appropriate treatment can be given at the proper time.
1. INTRODUCTION

1.1 The 3 Vs of Big Data:
The 3 Vs that define Big Data are Variety, Velocity, and Volume.
Volume
We currently see exponential growth in data storage, as data is now much more than text. We can find data in the format of videos, music, and large images on our social media channels. It is very common for enterprises to have storage systems of terabytes and even petabytes. As the data grows, the applications and architecture built to support it need to be re-evaluated quite often. Sometimes the same data is re-evaluated from multiple angles, and even though the original data is the same, the newly found intelligence creates an explosion of data. This big volume indeed represents Big Data.
Velocity
Data growth and the social media explosion have changed how we look at data. There was a time when we used to believe that yesterday's data was recent; as a matter of fact, newspapers still follow that logic. However, news channels and radio have changed how fast we receive news. Today, people rely on social media to keep them updated with the latest happenings. They often discard old messages and pay attention to recent updates. Data movement is now almost real time, and the update window has reduced to fractions of a second. This high-velocity data represents Big Data.
Variety
Data can be stored in multiple formats: for example, in a database, Excel, CSV, or Access, or, for that matter, in a simple text file. Sometimes the data is not even in a traditional format as we assume; it may be in the form of video, SMS, PDF, or something we have not thought about yet. It is the organization's need to arrange the data and make it meaningful. This would be easy if all the data were in the same format, but that is not the case most of the time. The real world has data in many different formats, and that is the challenge we need to overcome with Big Data. This variety of data represents Big Data.
1.2 Ecosystem

HDFS:
HDFS is built to support applications with large data sets, including individual files that reach into the terabytes. It uses a master/slave architecture, with each cluster consisting of a single NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes.
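A minimal sketch of inspecting this architecture from the command line (the path below is an assumption for illustration):

    # List the DataNodes registered with the NameNode and basic capacity figures
    hdfs dfsadmin -report

    # Check the health and replication of the blocks under a directory
    hdfs fsck /user/cloudera -files -blocks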
MapReduce:
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. The MapReduce model consists of two separate routines, namely the Map function and the Reduce function. The computation on an input in the MapReduce model occurs in three stages:

In the map stage, the mapper takes a single (key, value) pair as input and produces any number of (key, value) pairs as output.

The shuffle stage is handled automatically by the MapReduce framework. The underlying system implementing MapReduce routes all of the values that are associated with an individual key to the same reducer.

In the reduce stage, the reducer takes all of the values associated with a single key k and outputs any number of (key, value) pairs.
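As a small, hedged illustration of the three stages, Hadoop Streaming lets ordinary shell commands play the mapper and reducer roles. The sketch below counts records per disease, assuming a tab-separated data set whose third column is the disease (the paths and column position are assumptions for illustration):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input  /user/cloudera/health_data \
        -output /user/cloudera/disease_counts \
        -mapper 'cut -f 3' \
        -reducer 'uniq -c'
    # Map stage:     'cut -f 3' emits the disease column as the key.
    # Shuffle stage: the framework sorts and groups identical keys together.
    # Reduce stage:  'uniq -c' counts the now-adjacent identical keys.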
Pig:
Pig is a high-level platform for creating programs that run
on Apache Hadoop. The language for this platform is called Pig
Latin. Pig Latin abstracts the programming from
the Java MapReduce idiom into a notation which makes
MapReduce programming high level, similar to that
of SQL for RDBMS.
Hive:
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analysis easy. Hive gives an SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop. Without it, traditional SQL queries would have to be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.
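A hedged sketch of what this SQL-like interface looks like in practice (the table name, columns, and HDFS path are assumptions for illustration):

    -- Define a table over comma-delimited files already in HDFS
    CREATE EXTERNAL TABLE patients (
        id INT, name STRING, gender STRING, city STRING, disease STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/cloudera/health_data';

    -- An ordinary SQL-style query; Hive compiles it into MapReduce jobs
    SELECT disease, COUNT(*) FROM patients GROUP BY disease;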
Sqoop:
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It is a big data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop, and then load the data into HDFS. This process is called ETL, for Extract, Transform, and Load.
Impala:
Cloudera Impala is Cloudera's open source massively
parallel processing (MPP) SQL query engine for data stored in
a computer cluster running Apache Hadoop. Impala brings
scalable parallel database technology to Hadoop, enabling users
to issue low-latency SQL queries to data stored
in HDFS and Apache HBase without requiring data movement
or transformation. Impala is integrated with Hadoop to use the
same file and data formats, metadata, security and resource
management frameworks used by MapReduce, Apache
Hive, Apache Pig and other Hadoop software.
1.3 Applications of Big Data:
Healthcare contributions
Banking sectors and fraud detection
The private sector uses big data in traffic management, route planning, intelligent transportation systems, and congestion management.
The private sector also uses big data in revenue management, manufacturing improvements, logistics, and for competitive advantage.
1.4 Cloudera:
Cloudera's open-source Apache Hadoop distribution, CDH
(Cloudera Distribution Including Apache Hadoop), targets
enterprise-class deployments of that technology. Cloudera says
that more than 50% of its engineering output is donated
upstream to the various Apache-licensed open source projects
(Apache Hive, Apache Avro, Apache HBase, and so on) that
combine to form the Hadoop platform.
1.5 Hue:
Hue is an open-source web interface for analyzing data with Apache Hadoop. Hue allows technical and non-technical users to take advantage of Hive, Pig, and many of the other tools that are part of the Hadoop ecosystem.
You can load your data, run interactive Hive queries, develop and run Pig scripts, work with HDFS, check on the status of your jobs, and more. Hue's File Browser allows you to browse Amazon Simple Storage Service (S3) buckets, and you can use the Hive editor to run queries against data stored in S3.
2. LITERATURE SURVEY
2.1 Existing system:
The existing systems are built using an RDBMS, which stores data in the form of tables. An RDBMS allows only structured data to be stored.

When users want basic information about a disease, they have to contact the concerned hospital, and if they want an appointment they have to go to the hospital in person to fix it. If a user is unable to reach the hospital at the particular time, he or she is unable to take an appointment instantly.
2.2 Proposed system:
The proposed system will group together the disease and symptom data and analyze it to provide cumulative information. After the analysis, algorithms can be applied to the result, and groupings can be made to show a clear picture of the analysis.
3. REQUIREMENT ANALYSIS
3.1 Hardware requirements
Processor
16 GB Memory
4 TB Disk
3.2 Software requirements
VMware
Linux OS
4. IMPLEMENTATION

4.1 Problem Definition:
Health care analytics using big data and Hadoop.
4.2 System Architecture

Get to the Source!
Source profiling is one of the most important steps in deciding the architecture. It involves identifying the different source systems and categorizing them based on their nature and type.

Points to be considered while profiling the data sources:
Identify the internal and external source systems
Make a high-level estimate of the amount of data ingested from each source
Identify the mechanism used to get the data: push or pull
Determine the type of data source: database, file, web service, streams, etc.
Determine the type of data: structured, semi-structured, or unstructured
Ingestion Strategy and Acquisition
Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to ETL (Extract, Transform and Load) in traditional warehouses.

Points to be considered:
Determine the frequency at which data would be ingested from each source
Is there a need to change the semantics of the data (append, replace, etc.)?
Is there any data validation or transformation required before ingestion (pre-processing)?
Segregate the data sources based on the mode of ingestion: batch or real-time
Storage:
The Hadoop Distributed File System is the most commonly used storage framework in the Big Data world; others are NoSQL data stores such as MongoDB, HBase, and Cassandra. One of the salient features of Hadoop storage is its capability to scale, self-manage, and self-heal.

Things to consider while planning the storage methodology (a small sketch follows this list):
Type of data (historical or incremental)
Format of data (structured, semi-structured, or unstructured)
Compression requirements
Frequency of incoming data
Query pattern on the data
Consumers of the data
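As a hedged sketch of how several of these considerations translate into a storage definition, the HiveQL below declares a compressed, columnar, date-partitioned table (the table, column, and partition names are assumptions for illustration):

    -- ORC is a columnar format with built-in compression; partitioning by
    -- admission date suits incremental loads and date-filtered query patterns
    CREATE TABLE health_records (
        patient_id INT, gender STRING, city STRING, disease STRING
    )
    PARTITIONED BY (admit_date STRING)
    STORED AS ORC;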
Data processing:
Earlier, frequently accessed data was stored in dynamic RAM, but now, due to the sheer volume, it is stored on multiple disks on a number of machines connected via the network. Instead of bringing the data to the processing, in the new way the processing is taken closer to the data, which significantly reduces network I/O. The processing methodology is driven by business requirements and can be categorized into batch, real-time, or hybrid based on the SLA.

Batch processing – Batch processing collects the input for a specified interval of time and runs transformations on it in a scheduled way. A historical data load is a typical batch operation.
Technology used: MapReduce, Hive, Pig

Real-time processing – Real-time processing involves running transformations as and when data is acquired.
Technology used: Impala, Spark, Spark SQL

Hybrid processing – A combination of both batch and real-time processing needs.
Data consumption:
Different users, such as administrators, business users, vendors, and partners, can consume the data in different formats. The output of the analysis can be consumed by a recommendation engine, or business processes can be triggered based on the analysis.

Different forms of data consumption are:
Export data sets – There can be requirements for third-party data set generation. Data sets can be generated using a Hive export or directly from HDFS.
Reporting and visualization – Different reporting and visualization tools can connect to Hadoop using JDBC/ODBC connectivity to Hive.
Data exploration – Data scientists can build models and perform deep exploration in a sandbox environment. The sandbox can be a separate cluster (the recommended approach) or a separate schema within the same cluster that contains a subset of the actual data.
Ad hoc querying – Ad hoc or interactive querying can be supported using Hive, Impala, or Spark SQL, as in the sketch below.
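A hedged sketch of an interactive query issued through Impala's shell (the host, table, and column names are assumptions for illustration):

    # -i names the impalad host to connect to; -q runs a single query and exits
    impala-shell -i localhost -q 'SELECT city, COUNT(*) FROM health_records GROUP BY city;'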
5. METHODOLOGY
5.1 How HDFS is used in our project:
HDFS holds a very large amount of data and provides easy access to it. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure.

HDFS also makes the data available for parallel processing. HDFS mainly consists of two kinds of nodes:
NameNode
DataNode
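A hedged sketch of the HDFS commands this involves (the local file name and HDFS paths are assumptions for illustration):

    # Create a project directory in HDFS and copy the data set into it
    hdfs dfs -mkdir -p /user/cloudera/health_data
    hdfs dfs -put healthcare.csv /user/cloudera/health_data/

    # Verify the upload
    hdfs dfs -ls /user/cloudera/health_data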
5.2 How Hive is used:
Hive gives an SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop.
Hive supports easy portability of SQL-based applications to Hadoop.
SQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
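As a hedged sketch, the group-wise analysis described in the abstract could be expressed in HiveQL as follows (the table and column names are assumptions carried over from the earlier sketches):

    -- Count patients per disease and gender; Hive turns this into MapReduce jobs
    SELECT disease, gender, COUNT(*) AS patient_count
    FROM patients
    GROUP BY disease, gender;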
5.3 How Cloudera is used:

5.4 How Hue is used:

5.5 How Sqoop is used:
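A hedged sketch of the kind of Sqoop import described in the screenshots section, moving a table from MySQL into HDFS (the connection string, credentials, and table name are assumptions for illustration):

    sqoop import \
        --connect jdbc:mysql://localhost/healthcare \
        --username cloudera --password cloudera \
        --table patients \
        --target-dir /user/cloudera/health_data/patients \
        -m 1    # use a single mapper, suitable for a small table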
6. SCREENSHOTS
To create a database
To create a table
To display fields
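A hedged sketch of the MySQL statements the first three screenshots illustrate (the database, table, and column names are assumptions for illustration):

    CREATE DATABASE healthcare;
    USE healthcare;
    CREATE TABLE patients (
        id INT, name VARCHAR(50), gender VARCHAR(10),
        city VARCHAR(50), disease VARCHAR(50)
    );
    -- Display the table's fields
    DESCRIBE patients;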
Loading data into MySQL
Importing data from MySQL to HDFS
COMPILATION TIME