UNIT – II
INTRODUCTION TO ANALYTICS
2.1 Introduction to Analytics
As enormous amounts of data are generated, extracting useful insights from that data has become a must for any business enterprise. Data Analytics plays a key role in improving your business.
Here are four main factors that signify the need for Data Analytics:
• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with
respect to business requirements.
• Generate Reports – Reports are generated from the data and passed on to the respective
teams and individuals so that they can take further action to grow the business.
• Perform Market Analysis – Market Analysis can be performed to understand the strengths
and the weaknesses of competitors.
• Improve Business Requirements – Analysis of data helps a business improve its response to
customer requirements and the overall customer experience.
Data Analytics refers to the techniques to analyze data to enhance productivity and business
gain. Data is extracted from various sources and is cleaned and categorized to analyze
different behavioral patterns. The techniques and the tools used vary according to the
organization or individual.
Data analysts translate numbers into plain English. A Data Analyst delivers value to their
companies by taking information about specific topics and then interpreting, analyzing,
and presenting findings in comprehensive reports. So, if you have the capability to collect
data from various sources, analyze the data, gather hidden insights and generate reports, then
you can become a Data Analyst.
In general, data analytics also involves a degree of human knowledge, as discussed below in
figure 2.2: under each type of analytics a certain amount of human input is required for
prediction. Descriptive analytics requires the highest human input, predictive analytics
requires less, and in the case of prescriptive analytics no human input is required since all the
data is predicted.
Types of Data Analytics
There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Fig 2.3 Data and Human work
2.2 Introduction to Tools and Environment
In general data analytics deals with three main parts, subject knowledge, statistics and person
with computer knowledge to work on a tool to give insight in to the business. In the mainly
used tool is Rand Python as shown in figure 2.3
With the increasing demand for Data Analytics in the market, many tools have emerged with
various functionalities for this purpose. Some open-source and some commercial, the top
tools in the data analytics market are as follows.
• R programming – This tool is the leading analytics tool used for statistics and data
modelling. R compiles and runs on various platforms such as UNIX, Windows, and macOS.
It also provides tools to automatically install all packages as per user requirements.
• Python – Python is an open-source, object-oriented programming language which is easy to
read, write and maintain. It provides various machine learning and visualization libraries such
as Scikit-learn, TensorFlow, Matplotlib, Pandas and Keras. It can also connect to data sources
such as a SQL Server database, a MongoDB database, or JSON files. (A short example of a
Python analytics workflow appears after this list of tools.)
• Tableau Public – This is free software that connects to data sources such as Excel files and
corporate data warehouses. It then creates visualizations, maps, dashboards etc. with real-
time updates on the web.
• QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
•SAS – A programming language and environment for data manipulation and analytics, this
tool is easily accessible and can analyze data from different sources.
• Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly
used for clients’ internal data, it analyzes and summarizes the data, with pivot tables
providing a quick preview of the results.
• RapidMiner – A powerful, integrated platform that can integrate with any data source
type, such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase etc. This tool is
mostly used for predictive analytics, such as data mining, text analytics and machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its modular
data pipeline concept.
• OpenRefine – Formerly known as Google Refine, this data cleaning software helps you clean
up data for analysis. It is used for cleaning messy data, transforming data, and parsing
data from websites.
• Apache Spark – One of the most widely used large-scale data processing engines, this tool executes
applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. It
is also popular for data pipelines and machine learning model development.
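To make this concrete, here is a minimal Python sketch of the kind of workflow these tools support, using the pandas and Matplotlib libraries mentioned above. The file name sales.csv and its columns (region, revenue) are invented for the example.

# A minimal sketch: extract, clean, categorize and summarize a small dataset.
# The file "sales.csv" and its columns (region, revenue) are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                     # extract data from a source
df = df.dropna(subset=["revenue"])                # basic cleaning: drop rows missing revenue
summary = df.groupby("region")["revenue"].sum()   # categorize and aggregate

print(summary)                                    # report the findings
summary.plot(kind="bar", title="Revenue by region")
plt.show()                                        # simple visualization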
Apart from the above-mentioned capabilities, a Data Analyst should also possess skills such
as Statistics, Data Cleaning, Exploratory Data Analysis, and Data Visualization. Also, if you
have knowledge of Machine Learning, then that would make you stand out from the crowd.
2.3 Application of modelling a business & Need for business Modelling
Data analytics is mainly involved in field of business in various concerns for the following
purpose and it varies according to business needs and it is discussed below in detail.
Nowadays majority of the business deals with prediction with large amount of data to work
with.
Using big data as a fundamental factor in decision making requires new capabilities, and most
firms are still far from being able to access all of their data resources. Companies in various
sectors have acquired crucial insights from the structured data collected through different
enterprise systems and analyzed by commercial database management systems. Eg:
1.) Facebook and Twitter are used to gauge the instantaneous influence of a campaign and to
examine consumer opinion about products.
2.) Some companies, such as Amazon, eBay and Google, considered early leaders, examine
the factors that control performance to determine what raises sales revenue and user
interactivity.
2.3.1 Utilizing Hadoop in Big Data Analytics.
Hadoop is an open-source software platform that enables processing of large data sets in a
distributed computing environment. Work in this area discusses concepts related to big data,
the rules for building, organizing and analyzing huge data sets in the business environment,
proposes a three-layer architecture, and indicates graphical tools to explore and represent
unstructured data, showing how well-known companies can improve their business. Eg:
Google, Twitter and Facebook have shown their interest in processing big data within
cloud environments.
Fig 2.4: Working of Hadoop – With Map Reduce Concept
The Map() step: Each worker node applies the Map() function to the local data and writes the
output to a temporary storage space. The Map() code is run exactly once for each K1 key
value, generating output that is organized by key values K2. A master node arranges it so that
for redundant copies of input data only one is processed.
The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key
value that each processor should work on, and provide that processor with all of the map-
generated data associated with that key value, such that all data belonging to one key are
located on the same worker node.
The Reduce() step: Worker nodes process each group of output data (per key) in parallel,
executing the user-provided Reduce() code; each function is run exactly once for each K2 key
value produced by the map step.
Produce the final output: The Map Reduce system collects all of the reduce outputs and sorts
them by K2 to produce the final outcome.
Fig.2.4 shows the classical “word count problem” using the Map Reduce paradigm. As
shown in Fig.2.4, initially a process will split the data into a subset of chunks that will later
be processed by the mappers. Once the key/values are generated by mappers, a shuffling
process is used to mix (combine) these key values (combining the same keys in the same
worker node). Finally, the reduce functions count the words and generate a common output
as the result of the algorithm. After the mappers and reducers have executed, the output is a
sorted list of word counts from the original text input.
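The word-count flow in Fig 2.4 can be imitated with a short, single-machine Python sketch; this is not Hadoop code, only a simplified illustration of the Map(), Shuffle() and Reduce() steps described above.

# Single-machine sketch of the Map Reduce word-count flow in Fig 2.4.
from collections import defaultdict

def map_step(chunk):
    # Map(): emit (word, 1) pairs; the K2 key is the word itself
    return [(word, 1) for word in chunk.split()]

def shuffle_step(mapped_pairs):
    # Shuffle(): group all values belonging to the same key together
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    # Reduce(): run once per K2 key, summing the counts
    return {key: sum(values) for key, values in groups.items()}

chunks = ["deer bear river", "car car river", "deer car bear"]   # split input
mapped = [pair for chunk in chunks for pair in map_step(chunk)]
counts = reduce_step(shuffle_step(mapped))
print(sorted(counts.items()))   # final output sorted by key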
2.3.2 The Employment of Big Data Analytics on IBM.
IBM and Microsoft are prominent representatives. IBM offers many big data options that
enable users to store, manage and analyze data through various resources; it performs well in
the business-intelligence and healthcare areas. Compared with IBM, Microsoft has shown
powerful work in the area of cloud computing activities and techniques. Another example is
Facebook and Twitter, which collect various data from users' profiles and use it to increase
their revenue.
2.3.3 The Performance of Data Driven Companies.
Big data analytics and business intelligence are closely related fields that have become widely
significant in both the business and academic areas; companies are constantly trying to draw
insight from the ever-extending three V's (variety, volume and velocity) to support decision
making.
2.4 Databases
A database is an organized collection of structured information, or data, typically stored
electronically in a computer system. A database is usually controlled by a database
management system (DBMS).
Databases can be divided into various categories such as text databases, desktop database
programs, relational database management systems (RDBMS), NoSQL databases and object-
oriented databases.
A text database is a system that maintains a (usually large) text collection and provides fast
and accurate access to it. Eg: textbooks, magazines, journals, manuals, etc.
A desktop database is a database system that is made to run on a single computer or PC.
These simpler solutions for data storage are much more limited and constrained than larger
data center or data warehouse systems, where primitive database software is replaced by
sophisticated hardware and networking setups. Eg: Microsoft Excel, open access, etc.
A relational database (RDB) is a collective set of multiple data sets organized by tables,
records and columns. RDBs establish a well-defined relationship between database tables.
Tables communicate and share information, which facilitates data searchability, organization
and reporting. Eg: SQL, Oracle, DB2, DBaaS, etc.
NoSQL databases are non-tabular and store data differently than relational tables. NoSQL
databases come in a variety of types based on their data model. The main types are document,
key-value, wide-column and graph. Eg: JSON, MongoDB, CouchDB, etc.
Object-oriented databases (OODB) are databases that represent data in the form of objects
and classes. In object-oriented terminology, an object is a real-world entity, and a class is a
collection of objects. Object-oriented databases follow the fundamental principles of object-
oriented programming (OOP). Eg: databases built on C++, Java, C#, Smalltalk, LISP, etc.
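To make the relational versus NoSQL distinction concrete, the short Python sketch below stores the same record both as a relational row (using the built-in sqlite3 module) and as a document-style object of the kind MongoDB holds. The table, field names and values are invented for illustration.

# The same hypothetical customer stored relationally and as a document.
import sqlite3, json

# Relational: fixed columns, one row per record in a table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Chennai')")
print(conn.execute("SELECT * FROM customer").fetchall())

# Document (NoSQL style): a self-describing object; fields can vary per record
customer_doc = {"_id": 1, "name": "Asha", "city": "Chennai",
                "orders": [{"item": "book", "qty": 2}]}   # nested data, no fixed schema
print(json.dumps(customer_doc))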
2.5 Types of Data and Variables
In any database we work with data to perform analysis and prediction. In a relational
database management system we normally use rows to represent records and columns to
represent attributes.
In big data terms, a column from an RDBMS is referred to as an attribute or a variable.
Variables can be divided into two types: categorical (qualitative) data, and continuous or
discrete (quantitative) data, as shown below in figure 2.5.
Qualitative or categorical data is normally represented by a variable that holds characters,
and it is divided into two types: nominal data and ordinal data.
In nominal data there is no natural ordering of the values of the attribute in the dataset. Eg:
colour, gender, nouns (name, place, animal, thing). These categories cannot be given a
predefined order; for example, there is no specific way to arrange the gender of 50 students in
a class. The first student can be male or female, and similarly for all 50 students, so no
ordering is valid.
In ordinal data there is a natural ordering of the values of the attribute in the dataset. Eg: size
(S, M, L, XL, XXL), rating (excellent, good, better, worst). In these examples we can
quantify the data after ordering it, which gives valuable insights into the data.
Fig 2.5: Types of Data Variables
Quantitative data (discrete or continuous data) can be further divided into two types: discrete
attributes and continuous attributes.
A discrete attribute takes only a finite number of numerical values (integers). Eg: number of
buttons, number of days for product delivery, etc. Such data can be represented at specific
intervals in time-series data mining, or as ratio-based entries.
A continuous attribute takes fractional (real-valued) values. Eg: price, discount, height,
weight, length, temperature, speed, etc. Such data can likewise be represented at specific
intervals in time-series data mining, or as ratio-based entries.
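A small pandas sketch of these variable types is given below; the column names and values are invented. The nominal and ordinal columns are stored as categorical data (the ordinal one with an explicit order), while the discrete and continuous columns stay numeric.

# Hypothetical product data illustrating the variable types in Fig 2.5.
import pandas as pd

df = pd.DataFrame({
    "colour":  ["red", "blue", "red"],      # nominal: no natural order
    "size":    ["S", "XL", "M"],            # ordinal: S < M < L < XL
    "buttons": [2, 4, 3],                   # discrete: whole numbers only
    "price":   [199.99, 349.50, 249.00],    # continuous: fractional values
})

df["colour"] = pd.Categorical(df["colour"])                      # unordered categories
df["size"] = pd.Categorical(df["size"],
                            categories=["S", "M", "L", "XL"],
                            ordered=True)                        # ordered categories
print(df.dtypes)
print(df.sort_values("size"))   # ordering is meaningful only for the ordinal column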
2.6 Data Modelling Techniques
Data modelling is the process through which data is structured and stored in a database. Data
modelling is important because it enables organizations to make data-driven decisions and
meet varied business goals.
The entire process of data modelling, however, is not as easy as it seems. You are required to
have a deep understanding of the structure of an organization and then propose a solution
that aligns with its end goals and helps it achieve the desired objectives.
Types of Data Models
Data modeling can be achieved in various ways. However, the basic concept of each of them
remains the same. Let’s have a look at the commonly used data modeling methods:
Hierarchical model
As the name indicates, this data model makes use of hierarchy to structure the data in a tree-
like format as shown in figure 2.6. However, retrieving and accessing data is difficult in a
hierarchical database. This is why it is rarely used now.
Fig 2.6: Hierarchical Model Structure
Relational model
Proposed by an IBM researcher as an alternative to the hierarchical model, this model
represents data in the form of tables. It reduces complexity and provides a clear overview of
the data as shown below in figure 2.7.
Fig 2.7: Relational Model Structure
Network model
The network model is inspired by the hierarchical model. However, unlike the hierarchical
model, this model makes it easier to convey complex relationships as each record can be
linked with multiple parent records as shown in figure 2.8. In this model data can be shared
easily and the computation becomes easier.
Fig 2.8: Network Model Structure
Object-oriented model
This database model consists of a collection of objects, each with its own features and
methods. This type of database model is also called the post-relational database model, as
shown in figure 2.9.
Fig 2.9: Object-Oriented Model Structure
Entity-relationship model
Entity-relationship model, also known as ER model, represents entities and their relationships
in a graphical format. An entity could be anything – a concept, a piece of data, or an object.
Fig 2.10: Entity Relationship Diagram
The entity-relationship diagram explains the relationships between entities, together with
their primary keys and foreign keys, as shown in figure 2.10. It also shows the multiple
instances (cardinality) of the relationships between tables.
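A minimal sketch of such an entity-relationship design, using Python's built-in sqlite3 module, is shown below. The customer and orders tables and their columns are invented: the primary key of customer appears as a foreign key in orders, giving a one-to-many relationship.

# Hypothetical two-table design: one customer can have many orders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),  -- foreign key
    amount      REAL
);
INSERT INTO customer VALUES (1, 'Asha');
INSERT INTO orders VALUES (10, 1, 450.0), (11, 1, 120.0);
""")

# The primary/foreign key relationship lets us join the two entities back together.
query = """SELECT c.name, o.order_id, o.amount
           FROM customer c JOIN orders o ON c.customer_id = o.customer_id"""
for row in conn.execute(query):
    print(row)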
Now that we have a basic understanding of data modeling, let’s see why it is important.
Importance of Data Modeling
• A clear representation of data makes it easier to analyze the data properly. It provides a
quick overview of the data which can then be used by the developers in varied applications.
• Data modeling represents the data properly in a model. It rules out any chances of data
redundancy and omission. This helps in clear analysis and processing.
• Data modeling improves data quality and enables the concerned stakeholders to make data-
driven decisions.
Since a lot of business processes depend on successful data modeling, it is necessary to adopt
the right data modeling techniques for the best results.
Best Data Modeling Practices to Drive Your Key Business Decisions
Have a clear understanding of your end-goals and results
You will agree with us that the main goal behind data modeling is to equip your business and
contribute to its functioning. As a data modeler, you can achieve this objective only when
you know the needs of your enterprise correctly.
It is essential to make yourself familiar with the varied needs of your business so that you can
prioritize and discard the data depending on the situation.
Key takeaway: Have a clear understanding of your organization’s requirements and organize
your data properly.
Keep it sweet and simple and scale as you grow
Things will be sweet initially, but they can become complex in no time. This is why it is
highly recommended to keep your data models small and simple, to begin with.
Once you are sure of your initial models in terms of accuracy, you can gradually introduce
more datasets. This helps you in two ways. First, you are able to spot any inconsistencies in
the initial stages. Second, you can eliminate them on the go.
Key takeaway: Keep your data models simple. The best data modeling practice here is to use
a tool which can start small and scale up as needed.
Organize your data based on facts, dimensions, filters, and order
You can find answers to most business questions by organizing your data in terms of four
elements – facts, dimensions, filters, and order.
Let’s understand this better with the help of an example. Let’s assume that you run four e-
commerce stores in four different locations of the world. It is the year-end, and you want to
analyze which e-commerce store made the most sales.
In such a scenario, you can organize your data over the last year. The facts will be the overall
sales data of the last year, the dimension will be the store location, the filter will be the last 12
months, and the order will be the top stores in decreasing order of sales (a small pandas
sketch of this follows the key takeaway below).
This way, you can organize all your data properly and position yourself to answer an array of
business intelligence questions without breaking a sweat.
Key takeaway: It is highly recommended to organize your data properly using individual
tables for facts and dimensions to enable quick analysis.
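A small pandas sketch of the store example is given below; the data is invented. The fact is the sales amount, the dimension is the store location, the filter keeps the last 12 months, and the order is descending total sales.

# Hypothetical sales facts for stores in different locations.
import pandas as pd

sales = pd.DataFrame({
    "store":  ["Delhi", "London", "Tokyo", "Delhi", "London", "Sydney"],
    "date":   pd.to_datetime(["2023-02-01", "2023-05-10", "2023-07-15",
                              "2023-09-20", "2023-11-05", "2022-01-30"]),
    "amount": [1200.0, 900.0, 1500.0, 700.0, 1100.0, 400.0],
})

cutoff = pd.Timestamp("2023-12-31") - pd.DateOffset(months=12)
last_year = sales[sales["date"] > cutoff]               # filter: last 12 months
totals = (last_year.groupby("store")["amount"]          # dimension: store location
                   .sum()                               # fact: total sales
                   .sort_values(ascending=False))       # order: top stores first
print(totals)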
Keep as much as is needed
While you might be tempted to keep all the data with you, do not ever fall for this trap!
Although storage is not a problem in this digital age, you might end up taking a toll over your
machines’ performance.
More often than not, just a small yet useful amount of data is enough to answer all the
business-related questions. Spending heavily on hosting enormous amounts of data only leads
to performance issues, sooner or later.
Key takeaway: Have a clear opinion on how much data you want to keep. Maintaining more
than what is actually required wastes your data modelling effort and leads to performance
issues.
Keep crosschecking before continuing
Data modeling is a big project, especially when you are dealing with huge amounts of data.
Thus, you need to be cautious enough. Keep checking your data model before continuing to
the next step.
For example, if you need to choose a primary key to identify each record in the dataset
properly, make sure that you pick the right attribute. Product ID could be one such attribute:
even if two counts match, their product IDs can help you distinguish the records. Keep
checking that you are on the right track. Are the product IDs the same too? In such cases, you
will need to look for another dataset to establish the relationship (a small check of this kind is
sketched after the key takeaway below).
Key takeaway: It is the best practice to maintain one-to-one or one-to-many relationships.
The many-to-many relationship only introduces complexity in the system.
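A quick check of a candidate primary key, as described above, might look like the pandas sketch below; the product_id column and its values are hypothetical.

# Crosscheck a candidate primary key before relying on it.
import pandas as pd

df = pd.DataFrame({"product_id": [101, 102, 103, 103],   # hypothetical records
                   "count":      [5, 5, 7, 9]})

if df["product_id"].is_unique:
    print("product_id can serve as the primary key")
else:
    # duplicated keys mean another attribute (or a composite key) is needed
    print("duplicate keys found:")
    print(df[df["product_id"].duplicated(keep=False)])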
Let them evolve
Data models are never written in stone. As your business evolves, it is essential to adapt your
data modelling accordingly, so keep your models updated over time. The best practice here is
to store your data models in an easy-to-manage repository so that you can make adjustments
on the go.
Key takeaway: Data models become outdated quicker than you expect. It is necessary that
you keep them updated from time to time.
The Wrap Up
Data modeling plays a crucial role in the growth of businesses, especially when you want
your organization to base its decisions on facts and figures. To achieve the varied business
intelligence insights and goals, it is recommended to model your data correctly and use
appropriate tools to ensure the simplicity of the system.
2.7 Missing Imputations
In statistics, imputation is the process of replacing missing data with substituted values.
Because missing data can create problems for analyzing data, imputation is seen as a way
to avoid the pitfalls involved with list-wise deletion of cases that have missing values.
I. Do nothing to the missing data.
II. Fill the missing values in the dataset using the mean or median.
Eg: for the sample dataset given below:
SNO Column 1 Column 2 Column 3
1 3 6 NAN
2 5 10 12
3 6 11 15
4 NAN 12 14
5 6 NAN NAN
6 10 13 16
The missing values can be replaced with the column means (computed over the observed
values) as follows; a short pandas sketch appears after the advantages and disadvantages
below.
SNO Column 1 Column 2 Column 3
1 3 6 14.25
2 5 10 12
3 6 11 15
4 6 12 14
5 6 10.4 14.25
6 10 13 16
Advantages:
• Works well with numerical datasets.
• Very fast and reliable.
Disadvantages:
• Does not work with categorical attributes.
• Does not account for correlations between columns.
• Not very accurate.
• Does not account for any uncertainty in the data.
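A minimal pandas sketch of the column-mean imputation shown in the table above (pandas computes each column mean over the observed values and fills the gaps with it):

# Replace each missing value with the mean of the observed values in its column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Column 1": [3, 5, 6, np.nan, 6, 10],
                   "Column 2": [6, 10, 11, 12, np.nan, 13],
                   "Column 3": [np.nan, 12, 15, 14, np.nan, 16]})

filled = df.fillna(df.mean())   # df.mean() skips NaNs when computing each column mean
print(filled)

Replacing df.mean() with df.median() gives median imputation in the same way.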
III. Imputation using the most frequent value or a zero/constant value
This can be used for categorical attributes (a short scikit-learn sketch follows the
disadvantages below).
Disadvantages:
• Does not account for correlations between columns.
• Introduces bias into the data.
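A minimal sketch of this approach using scikit-learn's SimpleImputer; the colour values are invented for illustration.

# Fill missing categorical values with the most frequent category.
import numpy as np
from sklearn.impute import SimpleImputer

colours = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
imputer = SimpleImputer(strategy="most_frequent")   # or strategy="constant" with a fill_value
print(imputer.fit_transform(colours))               # the missing entry becomes "red"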
IV. Imputation using KNN
It first creates a basic mean imputation, then uses the resulting complete data to construct a
KD-tree. It then uses the KD-tree to compute the nearest neighbours (NN), and after finding
the k nearest neighbours it takes their weighted average.
k-nearest neighbours is an algorithm used for simple classification. The algorithm uses
‘feature similarity’ to predict the values of any new data points: a new point is assigned a
value based on how closely it resembles the points in the training set. This can be very useful
for predicting missing values, by finding the k closest neighbours of the observation with
missing data and then imputing the missing entries based on the non-missing values in the
neighbourhood.
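A minimal sketch with scikit-learn's KNNImputer, applied to the small numeric table used earlier; the choice of k = 2 and distance weighting is only illustrative.

# Impute each missing value from the weighted average of its k nearest rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[3.0,    6.0,    np.nan],
              [5.0,    10.0,   12.0],
              [6.0,    11.0,   15.0],
              [np.nan, 12.0,   14.0],
              [6.0,    np.nan, np.nan],
              [10.0,   13.0,   16.0]])

imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))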
Advantage:
• This method is more accurate than mean, median or mode imputation.
Disadvantage:
• Sensitive to outliers.