Big Data Analytics For Healthcare Industry: Impact, Applications, and Tools
ISSN 2096-0654 05/06 pp48–57
Volume 2, Number 1, March 2019
DOI: 10.26599/BDMA.2018.9020031
     Abstract: In recent years, huge amounts of structured, unstructured, and semi-structured data have been generated
     by various institutions around the world and, collectively, this heterogeneous data is referred to as big data. The
     health industry sector has been confronted by the need to manage the big data being produced by various sources,
     which are well known for producing high volumes of heterogeneous data. Various big-data analytics tools and
     techniques have been developed for handling these massive amounts of data in the healthcare sector. In this
     paper, we discuss the impact of big data in healthcare and the various tools available in the Hadoop ecosystem for
     handling it. We also explore the conceptual architecture of big data analytics for healthcare, which involves the data
     gathering history of different branches, the genome database, electronic health records, text/imagery, and clinical
     decision support systems.
variety as a result of the linking of a diverse range of biomedical data sources including, for example,
sensor data, imagery, gene arrays, laboratory tests, free text, and demographics[5]. Most data in the
healthcare system (e.g., doctors’ notes, lab test results, and clinical data) is unstructured and is not
stored electronically, i.e., it exists only in hard copies, and its volume is increasing very rapidly.
Currently, there is a major focus on the digitization of these vast stores of hard copy data. The rapid
growth in data size is making this goal harder to achieve[6]. The various terminologies and models that
have been developed to resolve the problems associated with big data focus on four issues known as the
four Vs, namely: volume, variety, velocity, and veracity. The various classes of data in healthcare
applications include Electronic Health Records (EHRs), machine generated/sensor data, health information
exchanges, patient registries, portals, genetic databases, and public records. Public records are major
sources of big data in the healthcare industry and require efficient data analytics to resolve their
associated healthcare problems. According to a survey conducted in 2012, healthcare data totaled nearly
550 petabytes and will reach nearly 26 000 petabytes in 2020[5]. In light of the heterogeneous data
formats, huge volume, and related uncertainties in the big-data sources, the task of transforming raw
data into actionable information is daunting. Because this task is so complex, the identification of
health features in medical data and the selection of class attributes for health analytics demand highly
sophisticated and architecturally specific techniques and tools.

2     Big Data Analytics in Health Informatics

The main difference between traditional health analysis and big-data health analytics is the execution of
computer programming. In the traditional system, the healthcare industry depended on other industries for
big data analysis. Many healthcare stakeholders trust information technology because of its meaningful
outcomes: their operating systems are functional and they can process the data into standardized forms.
Today, the healthcare industry is faced with the challenge of handling rapidly developing big healthcare
data. The field of big data analytics is growing and has the potential to provide useful insights for the
healthcare system. As noted above, most of the massive amounts of data generated by this system are saved
in hard copies, which must then be digitized[7]. Big data can improve healthcare delivery and reduce its
cost, while supporting advanced patient care, improving patient outcomes, and avoiding unnecessary
costs[8]. Big data analytics is currently used to predict the outcomes of decisions made by physicians,
such as the outcome of a heart operation for a condition, based on the patient’s age, current condition,
and health status. Essentially, we can say that the role of big data in the health sector is to manage
data sets related to healthcare, which are complex and difficult to manage using current hardware,
software, and management tools. In addition to the burgeoning volume of healthcare data, reimbursement
methods are also changing[9]. Therefore, purposeful use and pay based on performance have emerged as
important factors in the healthcare sector. In 2011, organizations working in the field of healthcare had
produced more than 150 exabytes of data[10], all of which must be efficiently analyzed to be at all useful
to the healthcare system[11]. The storage of healthcare related data in EHRs occurs in a variety of forms.
A sudden increase in data related to healthcare informatics has also been observed in the field of
bioinformatics, where many terabytes of data are generated by genomic sequencing[11]. There are a variety
of analytical techniques available for interpreting medical data, which can then be used for patient
care[12]. The diverse origins and forms of big data are challenging the healthcare informatics community
to develop methods for data processing. There is a big demand for techniques that combine dissimilar data
sources[13].
    A number of conceptual approaches can be employed to recognize irregularities in vast amounts of data
from different datasets. The frameworks available for the analysis of healthcare data are as follows:
    Predictive Analytics in Healthcare: For the past two years, predictive analysis has been recognized
as one of the major business intelligence approaches, but its real world applications extend far beyond
the business context. Big data analytics includes various methods, including text analytics and multimedia
analytics[14]. However, one of the most crucial categories is predictive analytics, which includes
statistical methods like data mining and machine learning that examine current and historical facts to
predict the future. Predictive methods are being used today in the hospital context to determine whether
a patient may be at risk for readmission[15]. This data can help doctors to make important patient care
decisions. Predictive analysis requires an understanding and use of machine learning, which is widely
applied in this
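
To make the idea of predicting readmission risk from historical records more concrete, the following is a minimal Python sketch. It is not the method of any hospital system cited here: the choice of logistic regression, the three features, and every record in `HISTORY` are assumptions invented purely for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log-loss with respect to the score
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def readmission_risk(model, x):
    """Return the estimated probability of readmission for one patient."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical historical records (invented for illustration):
# (age / 100, number of prior admissions, length of stay in days / 10)
HISTORY = [
    (0.30, 0, 0.2), (0.45, 0, 0.3), (0.50, 1, 0.2), (0.35, 0, 0.1),
    (0.70, 2, 0.8), (0.82, 3, 1.0), (0.65, 2, 0.6), (0.78, 4, 0.9),
]
READMITTED = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = patient was readmitted

MODEL = train_logistic(HISTORY, READMITTED)
low = readmission_risk(MODEL, (0.32, 0, 0.2))   # young, no prior admissions
high = readmission_risk(MODEL, (0.80, 3, 0.9))  # older, several prior admissions
print(f"low-risk patient: {low:.2f}, high-risk patient: {high:.2f}")
```

In practice such a model would be trained on far larger data sets with a dedicated library and validated clinically; the sketch only shows how current and historical facts can be turned into a risk score.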
Big Data Mining and Analytics, March 2019, 2(1): 48-57
must simply be collected, stored, and processed by a particular device. Structured data comprises just 5%
to 10% of healthcare data. Unstructured or semi-structured data includes e-mails, photos, videos, audio,
and other health related data such as hospital medical reports, physicians’ notes, paper prescriptions,
and radiograph films[13].
    Veracity: The veracity of data is the degree of assurance that the meaning of data is consistent.
Different data sources vary in their levels of data credibility and reliability[9]. The outcomes of
big-data analytics must be credible and error-free, but in healthcare, unsupervised machine learning
algorithms make decisions that are used by automated machines based on data that may be worthless or
misleading[4]. Healthcare analytics is tasked with extracting useful insights from this data to treat
patients and make the best possible decisions.

4     Impact of Big Data on the Healthcare System

The potential of big data is that it could revolutionize outcomes regarding the most suitable or accurate
patient diagnosis and the accuracy of the information used in the health informatics system[15]. As such,
the investigation of huge amounts of information will have a powerful effect on the healthcare framework
in five respects, or “pathways” (shown in Fig. 2). Improving outcomes for patients with respect to these
pathways, as described below, will be the focus of the healthcare system and will directly impact the
patient.
    Right Living: Right living refers to the patient living a better and healthier life[15]. By right
living, patients could manage themselves by making the best decisions for themselves, using information
mining to make better choices and enhance their wellbeing. By choosing the right path for their daily
health, regarding their diet, preventive care, exercise, and other activities of daily living, patients
can play an active role in realizing a healthy life[16].
    Right Care: This pathway ensures that patients receive the most appropriate treatment available and
that all providers obtain the same data and have the same objectives to avoid redundancy of planning and
effort[17]. This aspect has become more viable in the era of big data.
    Right Provider: Healthcare providers in this pathway can obtain an overall view of their patients by
combining data from various sources such as medical equipment, public health statistics, and socioeconomic
data[15]. The accessibility of this information enables healthcare providers to conduct targeted
investigations and develop the skills and abilities to identify and provide better treatment options to
patients[18].
    Right Innovation: This pathway recognizes that new disease conditions, new treatments, and new
medical will continue to evolve[15]. Likewise, advancements in the provision of patient services, for
example, upgrading medications and the efficiency of research and development efforts, will enable new
ways to promote wellbeing and patient health via the national social insurance system[17]. The
availability of early trial data is important for stakeholders. This data can be used to explore
high-potential targets and identify techniques for improving traditional clinical treatment methods.
    Right Value: To improve the quality and value of health-related services, providers must pay careful
and ongoing attention to their patients. Patients must obtain the most beneficial results identified by
their social insurance system[18]. Measures that could be taken to ensure the intelligent use of data
include, for example, identifying and eliminating data misrepresentation, manipulation, and waste, and
improving resources[19].

5     Hadoop-Based Applications for Health Industry

In light of the fact that healthcare data exists primarily in printed form, there is a need for the active
digitization of print form data. The majority of this data is also unstructured, so it is a major
challenge for this industry to extract meaningful information regarding patient care, clinical operations,
and research. The collection of software utilities known as the Hadoop ecosystem can help the healthcare
sector to manage this vast amount of data. The various applications of the Hadoop ecosystem in the
healthcare sector are as follows:
    Treatment of Cancer and Genomics: We know that human DNA contains three billion base pairs. To fight
cancer, it is vital that large amounts of data are efficiently organized. The patterns of cancer mutations
and their reactions vary based on individual genetics, which explains the non-curability of some cancers.
Oncologists have determined that in recognizing the patterns of cancer, it is important to provide
specific treatment for specific cancers, based on the patient’s genetic makeup. The Hadoop technology
MapReduce facilitates the mapping of three billion DNA base pairs to determine the appropriate cancer
treatment for each particular patient. Arizona State University is working on a project to develop a
healthcare model that takes individual genomic data and selects a treatment based on identification of the
patient’s cancer gene. This model provides a basis for treatment through big data analysis to improve the
chances of saving patients’ lives.
    Monitoring of Patient Vitals: Hospital staff throughout the world connect their work output using
big-data technology. Various hospitals around the globe use Hadoop-based components in the Hadoop
Distributed File System (HDFS), including the Impala, HBase, Hive, Spark, and Flume frameworks, to convert
the huge amount of unstructured data generated by sensors that take patient vital signs, such as
heartbeats per minute, blood pressure, blood sugar level, and respiratory rate. Without Hadoop, these
healthcare staff could not analyze the unstructured data being generated by patient healthcare systems. In
Atlanta, Georgia, there are 6200 Intensive Care Units (ICUs) for pediatric healthcare, where children can
stay for more than one month depending on their problem. These ICUs are equipped with a sensor technology
that tracks the child’s health status with respect to heartbeat, blood pressure, and other vital signs. If
any problem occurs, an alert is automatically sent to medical staff to ensure the child’s safety.
    Hospital Network: Several hospitals use the Hadoop ecosystem’s NoSQL database to collect and manage
their huge amounts of real-time data from diverse sources related to patient care, finances, and payroll,
which helps them identify high-risk patients while also reducing day-to-day expenditures.
    Healthcare Intelligence: Hadoop technology also supports the healthcare intelligence applications
used by hospitals and insurance companies. The Hadoop ecosystem’s Pig, Hive, and MapReduce technologies
process large datasets related to medicines, diseases, symptoms, opinions, geographic regions, and other
factors to extract meaningful information (e.g., desired age) for insurance companies.
    Prevention and Detection of Frauds: In the early phases of big data analytics, health-based insurance
groups utilized multiple paths to identify fraud activity and establish methods to prevent medical fraud.
With Hadoop, companies use applications based on a prediction model to identify those committing fraud via
data regarding their previous health claims, voice recordings, wages, and demographics. Hadoop’s NoSQL
database is also helpful in preventing fraud related to medical claims at an early stage by the use of
real-time Hadoop based health applications, authentic medical claim bills, weather forecasting data, voice
data recordings, and other data sources.

6     Big Data Analytics Architecture for Health Informatics

Currently, the main focus in big-data analytics is
to gain an in-depth insight and understanding of big data rather than to collect it[20]. Data analytics
involves the development and application of algorithms for analyzing various complex data sets to extract
meaningful knowledge, patterns, and information. In recent years, researchers have begun to consider the
appropriate architectural framework for healthcare systems that utilize big-data analytics, one of which
uses a four-layer architecture that comprises a transformation layer, data-source layer, big data platform
layer, and analytical layer[14]. In this layered system, data originates from different sources and has
various formats and storage systems. Each layer has a specific data-processing functionality for
performing specific tasks on the HDFS, using the MapReduce processing model. The other layers perform
other tasks, i.e., report generation, query passing, data mining processing, and online analytical
processing.
    The main requirement in big-data analytical processing is to bundle the data at high speed to minimize
the bundling time. The next priority in big-data analytical processing is to efficiently update and
transform queries at a constant time[21]. The third requirement in big-data analytical processing is to
utilize and efficiently manage the storage area space. The last specification of big-data analytics is to
efficiently become familiar with the rapidly progressing workload notations. Big-data analytics frameworks
differ from traditional healthcare processing systems with respect to how they process big data[22]. In
the current health care system, data is processed using traditional tools installed in a single
stand-alone system like a desktop computer. In contrast, big data is processed by clustering, with the
processing scanning multiple nodes of clusters in the network[23]. This processing is based on the concept
of parallelism to handle large medical data sets[24]. Freely available frameworks, such as Hadoop,
MapReduce, Pig, Sqoop, Hive, HBase, and Avro, all have the ability to process the health related data sets
for healthcare systems.
    Big-data technologies broadly refer to scientific innovations that mimic those used for large
datasets[25]. The first component is the requirement for big data sources for processing. In the second
component, clusters with a centralized big-data processing infrastructure are at the peak of high
performance[24]. It has been observed that the tools mainly available for big-data analytics processing
provide data security, scalability, and manageability with the help of the MapReduce paradigm. In the
third component, big data analytics applications have a storage domain to integrate accessed databases
that use different applications[26]. The fourth component comprises the most popular big-data analytics
applications in healthcare systems, which include reports, Online Analytical Processing (OLAP), queries,
and data mining.
    As shown in Fig. 3, healthcare data come from a range of sources including EHRs, genome databases,
genome data files, text and imagery (unstructured data sources), clinical decision support systems,
government related sources, medical test labs and pharmacies, and health insurance companies. These data
are frequently available in different schema tables, are in ASCII/text, and are stored at various
locations.
    In the next section, we describe the various big-data Hadoop-based processing tools that support the
development of health-based applications for the health industry.

7     Hadoop’s Tools and Techniques for Big Data

To manage unstructured big data that does not fit into any database, special tools are needed. To examine
this type of big dataset, the IT sector uses the Hadoop platform for a wide variety of methods that have
been developed to record, organize, and analyze this type of data[27, 28]. More efficient tools are needed
to extract meaningful output from big data. Most of the tools are implemented in the Apache Hadoop
architecture including MapReduce, Mahout, Hive, and others[29]. Below, we discuss the various tools used
in processing healthcare big datasets.
    Apache Hadoop: The name Hadoop has evolved to mean many different things[23]. In 2002, it was
established as a single software project to support a web search engine. Since that time, it has grown
into an ecosystem of tools and applications that are used to analyze large amounts and types of data[30].
Hadoop can no longer be considered to be a monolithic single project, but rather an approach to data
processing that radically differs from the traditional relational database model[23]. A more practical
definition of the Hadoop ecosystem and framework is the following: open source tools, libraries, and
methodologies for “big data” analysis in which a number of data sets are collected from different sources,
i.e., Internet images, audios, videos, and sensor records as both structured and unstructured data to be
processed[22]. Figure 4
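
As a rough illustration of the MapReduce processing model referred to in this section, the following Python sketch runs a map step, a grouping (shuffle) step, and a reduce step in a single process. It is only a sketch under stated assumptions: the function names, the record fields, and the patient records in `RECORDS` are all invented, and a real Hadoop job would distribute these steps across the nodes of a cluster rather than run them in one interpreter.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit one (diagnosis, 1) key-value pair per patient record."""
    yield record["diagnosis"], 1


def shuffle(pairs):
    """Group all values that share the same key, as done between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the grouped values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical patient records (invented for illustration).
RECORDS = [
    {"patient": "p1", "diagnosis": "diabetes"},
    {"patient": "p2", "diagnosis": "asthma"},
    {"patient": "p3", "diagnosis": "diabetes"},
    {"patient": "p4", "diagnosis": "hypertension"},
    {"patient": "p5", "diagnosis": "diabetes"},
]

COUNTS = run_job(RECORDS)
print(COUNTS)
```

Because each map call depends only on its own record and each reduce call only on the values of one key, both steps can be spread over many machines, which is the property that lets this style of processing scale to large medical data sets.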
is complete[26]. The MapReduce programming phase also has two stages: a mapping stage that accepts input
in key-value pairs and generates output in key-value pairs, and a second reducing stage; each phase takes
key-value pairs as input and produces key-value pairs as output[12]. There is a fixed size data segment
division step in Hadoop, the results of which are called input splits[20]. The Map function generates the
keys and value pairs, which are stored in the mapper. Any keys that are the same are merged. A simplified
view of MapReduce is shown in Fig. 5.
    Apache Hive: Hive is a data warehousing layer on top of Hadoop, in which analyses and queries can be
performed using an SQL-like query language (HiveQL)[32]. Apache Hive can be used to perform ad-hoc
queries, summarization, and data analysis. Hive is considered to be a de facto standard for SQL based
queries over petabytes of data using Hadoop and offers the features of easy data extraction,
transformation, and access to the HDFS comprising data files or other storage systems such as HBase[33].
    Apache Pig: Apache Pig is one of the available open-source platforms being used to better analyze
big data. Pig is an alternative to the MapReduce programming tool[34]. First developed by the Yahoo web
service provider as a research project, Pig allows users to develop their own user-defined functions and
supports many traditional data operations such as join, sort, filter, etc.
    Apache HBase: HBase is a column-oriented NoSQL database used in Hadoop[35], in which users can store
large numbers of rows and columns. HBase has the functionality of random read/write operations. It also
supports record level updates, which is not possible using HDFS[36]. HBase provides parallel data storage
via the underlying distributed file systems across commodity servers. The file system of choice is
typically HDFS, due to the tight integration of HBase and HDFS[33]. If there is a need for a structured
low-latency view of the high-scale data stored via Hadoop, then HBase is the correct choice. Its
open-source code scales linearly to handle petabytes of data on thousands of nodes.
    Apache Oozie: To run a complex system or tight system design, or if there are a number of
interconnected stations with data dependencies between them, there is a need for a sophisticated technique
called Apache Oozie. Apache Oozie can handle and run multiple jobs related to Hadoop. Oozie has two
portions: a workflow engine that stores and executes workflow collections of Hadoop-based jobs and a
coordinator engine that processes workflow jobs based on how they are designed in the process schedule.
Oozie is designed to construct and manage Hadoop jobs as a workflow in which the output of one job serves
as the input for a subsequent job[37]. Oozie is not a substitute for the Yarn scheduler. Oozie workflow
jobs are represented as Directed Acyclic Graphs (DAGs) of actions[28]. Oozie plays the role of a service in
the cluster and clients submit their jobs for proactive or reactive execution.
    Apache Avro: Avro is a serialization format that makes it possible for data to be exchanged between
programs written in any language[38]. It is often used to connect Flume data flows. The Avro system is
schema-based, where the role of a schema is to perform the read and write operations independently of the
language. Avro serializes the data that have a built-in schema[33]. It is a framework for the
serialization of persistent data and remote procedure calls between Hadoop nodes and between client
programs and Hadoop services.
    Apache Zookeeper: Zookeeper is a centralized system used by applications to maintain a healthcare
system and provide organizing and other elements
on and between nodes[39]. It maintains the common objects needed in large cluster environments, including
configuration information and the hierarchical naming space. These services can be used by different
applications to coordinate the distributed processing of Hadoop clusters. Zookeeper also ensures
application reliability[40]. If an application master dies, Zookeeper generates a new application master
to resume the tasks.
    Apache Yarn: Hadoop Yarn is a distributed shell application and is an example of a Hadoop
non-MapReduce application built on top of Yarn[41]. Yarn has two components: a Resource Manager (RM) that
handles all the resources within a cluster that are required for the tasks, and a Node Manager (NM), which
is located on every host in a cluster and handles the available resources on that individual host. Both
components handle the scheduling of jobs and manage the containers, memory, CPU throughput, and I/O system
that run the dedicated application code.
    Apache Sqoop: Apache Sqoop is a powerful tool that extracts data from a Relational Database
Management System (RDBMS) and inputs it into the Hadoop architecture for query processing. To do so, this
process uses the MapReduce paradigm or other standard level tools, e.g., Hive[42]. Once placed in HDFS,
the data can be used by Hadoop applications.
    Apache Flume: Apache Flume is a highly reliable service for accurately collecting data and moving
large volumes of data from independent machines to HDFS[43]. Often data transport involves a number of
Flume agents that may traverse a series of machines and locations. Flume is often used for log files, data
generated by social media, and email messages.

8     Conclusion

In this paper, we have provided an overview of big data in general and in the healthcare system, where it
plays a significant role in healthcare informatics and greatly influences the healthcare system, and of
the four Vs of big data in healthcare. We also proposed the use of a conceptual architecture for solving
healthcare problems in big data using Hadoop-based terminologies, which involves the utilization of the
big data generated by different levels of medical data and the development of methods for analyzing this
data and obtaining answers to medical questions. The combination of big data and healthcare analytics can
lead to treatments that are effective for specific patients by providing the ability to prescribe
appropriate medications for each individual, rather than those that work for most people. As we know, big
data analytics is in the early stage of development and current tools and methods cannot solve the
problems associated with big data. Big data may be viewed as big systems, which present huge challenges.
Therefore, a great deal of research in this field will be required to solve the issues faced by the
healthcare system.

References
[1]  A. Gandomi and M. Haider, Beyond the hype: Big data concepts, methods and analytics, International
     Journal of Information Management, vol. 35, no. 2, pp. 137–144, 2015.
[2]  A. O’Driscoll, J. Daugelaite, and R. D. Sleator, “Big Data”, Hadoop and cloud computing in genomics,
     Journal of Biomedical Informatics, vol. 46, no. 5, pp. 774–781, 2013.
[3]  C. L. P. Chen and C. Y. Zhang, Data-intensive applications, challenges, techniques and technologies:
     A survey on big data, Information Sciences, vol. 275, pp. 314–347, 2014.
[4]  M. Herland, T. M. Khoshgoftaar, and R. Wald, A review of data mining using big data in health
     informatics, Journal of Big Data, vol. 1, no. 1, p. 2, 2014.
[5]  D. H. Shin and M. J. Choi, Ecological views of big data: Perspective and issues, Telematics and
     Informatics, vol. 32, no. 2, pp. 311–320, 2015.
[6]  B. Saraladevi, N. Pazhaniraja, P. V. Paul, M. S. Basha, and P. Dhavachelvan, Big data and Hadoop-A
     study in security perspective, Procedia Computer Science, vol. 50, pp. 596–601, 2015.
[7]  X. Wu, X. Zhu, G. Q. Wu, and W. Ding, Data mining with big data, IEEE Transactions on Knowledge and
     Data Engineering, vol. 26, no. 1, pp. 97–107, 2014.
[8]  S. Sharma and V. Mangat, Technology and trends to handle big data: Survey, in Proc. 5th International
     Conference on Advanced Computing & Communication Technologies, 2015, pp. 266–271.
[9]  R. Mehmood and G. Graham, Big data logistics: A health-care transport capacity sharing model,
     Procedia Computer Science, vol. 64, pp. 1107–1114, 2015.
[10] D. P. Augustine, Leveraging big data analytics and Hadoop in developing India healthcare services,
     International Journal of Computer Applications, vol. 89, no. 16, pp. 44–50, 2014.
[11] J. A. Patel and P. Sharma, Big data for better health planning, in Proc. International Conference on
     Advances in Engineering and Technology Research, 2014, pp. 1–5.
[12] A. E. Youssef, A framework for secure healthcare systems based on big data analytics in mobile cloud
     computing environments, International Journal of Ambient Systems and Applications, vol. 2, no. 2,
     pp. 1–11, 2014.
[13] MAPR, Healthcare and life science use cases, https://mapr.com/solutions/industry/healthcare-and-lifescience-use-cases/, 2018.
[14] W. Raghupathi and V. Raghupathi, Big data analytics in healthcare: Promise and potential, Health Information Science and Systems, vol. 2, no. 1, p. 3, 2014.
[15] J. Sun and C. K. Reddy, Big data analytics for healthcare, in Proc. 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 1525–1525.
[16] C. Mike, W. Hoover, T. Strome, and S. Kanwal, Transforming health care through big data strategies for leveraging big data in the health care industry, http://ihealthtran.com/iHT2 BigData 2013.pdf, 2013.
[17] J. Anuradha, A brief introduction on big data 5Vs characteristics and Hadoop technology, Procedia Computer Science, vol. 48, pp. 319–324, 2015.
[18] M. Viceconti, P. J. Hunter, and R. D. Hose, Big data, big knowledge: Big data for personalized healthcare, IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 4, pp. 1209–1215, 2015.
[19] Y. Sun, H. Song, A. J. Jara, and R. Bie, Internet of things and big data analytics for smart and connected communities, IEEE Access, vol. 4, pp. 766–773, 2016.
[20] A. Jain and V. Bhatnagar, Crime data analysis using Pig with Hadoop, Procedia Computer Science, vol. 78, pp. 571–578, 2016.
[21] T. Jach, E. Magiera, and W. Froelich, Application of Hadoop to store and process big data gathered from an urban water distribution system, Procedia Engineering, vol. 119, pp. 1375–1380, 2015.
[22] C. Uzunkaya, T. Ensari, and Y. Kavurucu, Hadoop ecosystem and its analysis on tweets, Procedia-Social and Behavioral Sciences, vol. 195, pp. 1890–1897, 2015.
[23] S. G. Manikandan and S. Ravi, Big data analysis using Apache Hadoop, in Proc. International Conference on IT Convergence and Security, 2014, pp. 1–4.
[24] V. Ubarhande, A. M. Popescu, and H. Gonzalez-Velez, Novel data-distribution technique for Hadoop in heterogeneous cloud environment, in Proc. 9th International Conference on Complex, Intelligent, and Software Intensive Systems, 2015, pp. 217–224.
[25] S. Maitrey and C. K. Jha, Handling big data efficiently by using map reduce technique, in Proc. International Conference on Computational Intelligence & Communication Technology, 2015, pp. 703–708.
[26] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[27] Cloudera, Whole genome research drives healthcare to Hadoop, https://www.cloudera.com/content/dam/www/marketing/resources/solution-briefs/whole-genome-research-inhealthcare.pdf.landing.html, 2018.
[28] R. Misra, B. Panda, and M. Tiwary, Big data and ICT applications: A study, in Proc. 2nd International Conference on Information and Communication Technology for Competitive Strategies, 2016, p. 41.
[29] A. G. Picciano, The evolution of big data and learning analytics in American higher education, Journal of Asynchronous Learning Networks, vol. 16, no. 3, pp. 9–20, 2012.
[30] Apache Hadoop, http://hadoop.apache.org/, 2018.
[31] A. Katal, M. Wazid, R. H. Goudar, and T. Noel, Big data: Issues, challenges, tools and good practices, in Proc. 6th International Conference on Contemporary Computing, 2013, pp. 404–409.
[32] Apache Hive, https://hive.apache.org/, 2018.
[33] K. K. Y. Lee, W. C. Tang, and K. S. Choi, Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage, Computer Methods and Programs in Biomedicine, vol. 110, no. 1, pp. 99–109, 2013.
[34] Apache Pig, https://pig.apache.org/, 2018.
[35] E. Dede, B. Sendir, P. Kuzlu, J. Weachock, M. Govindaraju, and L. Ramakrishnan, Processing Cassandra datasets with Hadoop-streaming based approaches, IEEE Transactions on Services Computing, vol. 9, no. 1, pp. 46–58, 2016.
[36] Apache HBase, http://hbase.apache.org/, 2018.
[37] Apache Oozie, https://oozie.apache.org/, 2018.
[38] Apache Avro, https://avro.apache.org/, 2018.
[39] Apache Zookeeper, https://zookeeper.apache.org/, 2018.
[40] Apache Zookeeper, https://www.ibm.com/analytics/hadoop/zookeeper, 2018.
[41] Apache Yarn, https://yarn.apache.org/, 2018.
[42] Apache Sqoop, https://sqoop.apache.org/, 2018.
[43] Apache Flume, https://flume.apache.org/, 2018.