Introduction to Analytics
Predictive analytics is the art of predicting the future on the basis of past trends.
It is a branch of statistics that comprises modeling techniques, machine learning,
and data mining.
Predictive analytics is primarily used in decision making.
What and why of analytics:
Analytics is a journey that involves a combination of skills, advanced
technologies, applications, and processes used by a firm to gain business insights from data
and statistics.
These insights are then used for business planning.
Places where Analytics is used:
Reporting vs. Analytics:
Reporting is presenting the results of data analysis.
Analytics is the process or system involved in analyzing data to obtain a desired output.
Introduction to tools and environment:
Analytics is nowadays used in all fields, ranging from medical science to aerospace science to
government activities.
Data science and analytics are used by manufacturing companies as well as real
estate firms to develop their business and solve various issues with the help of historical
data.
Tools are the software that can be used for analytics, such as SAS or R.
Techniques are the procedures to be followed to reach a solution.
Various steps involved in Analytics:
1. Access
2. Manage
3. Analyze
4. Report
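The four steps above can be sketched in R as follows. This is only an illustrative sketch: it assumes the built-in mtcars data set stands in for real business data, and the variable names cars_data and avg_mpg are hypothetical.

```r
# 1. Access: load a data set (mtcars ships with base R)
cars_data <- mtcars

# 2. Manage: keep the relevant columns and store cylinders as a factor
cars_data <- cars_data[, c("mpg", "wt", "cyl")]
cars_data$cyl <- factor(cars_data$cyl)

# 3. Analyze: average miles per gallon for each cylinder group
avg_mpg <- aggregate(mpg ~ cyl, data = cars_data, FUN = mean)

# 4. Report: print the summary table
print(avg_mpg)
```

In practice the Access step would read from a file or database, and the Report step would feed a dashboard or document rather than the console.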
Various Analytics techniques are:
1. Data Preparation
2. Reporting, Dashboards & Visualization
3. Segmentation
4. Forecasting
5. Descriptive Modeling
6. Predictive Modeling
7. Optimization
Application of Modeling in Business
A statistical model embodies a set of assumptions concerning the generation of the
observed data, and similar data from a larger population.
A model represents, often in considerably idealized form, the data-generating process.
Signal processing is an enabling technology that encompasses the fundamental theory,
applications, algorithms, and implementations of processing or transferring information
contained in many different physical, symbolic, or abstract formats broadly designated as
signals.
It uses mathematical, statistical, computational, heuristic, and linguistic representations,
formalisms, and techniques for representation, modeling, analysis, synthesis, discovery,
recovery, sensing, acquisition, extraction, learning, security, or forensics.
In manufacturing, statistical models are used to define warranty policies, solve various
conveyor-related issues, support statistical process control, etc.
Databases & Types of Data and Variables
Data dictionary, or metadata repository:
A "centralized repository of information about data such as meaning, relationships
to other data, origin, usage, and format", as defined in the IBM Dictionary of
Computing.
The term can refer to any of the following:
1. A document describing a database or collection of databases.
2. An integral component of a DBMS that is required to determine its structure.
3. A piece of middleware that extends or supplants the native data dictionary of a
DBMS.
Category of Data
Data can be categorized on various parameters, such as whether it is categorical, its type, etc.
Types of Data
There are 2 basic types:
Numeric
Character
Numeric data can be further divided into 2 subgroups:
Discrete
Continuous
Categorical data can be divided into 2 categories:
Nominal
Ordinal
Based on usage, data is also divided into 2 categories:
Quantitative
Qualitative
Data in the manufacturing industry also falls into the groups discussed above.
For example, production quantity is discrete data,
while production rate is continuous data.
Similarly, a quality parameter can be given ratings, which are ordinal data.
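These data types map onto R's basic vector types. The sketch below uses made-up manufacturing values; the variable names (production_qty, production_rate, machine_id, quality) are hypothetical and chosen only for illustration.

```r
# Discrete numeric data: whole-number counts, stored as integers
production_qty <- c(120L, 135L, 128L)

# Continuous numeric data: can take any value within a range
production_rate <- c(14.2, 15.8, 15.1)

# Nominal data: categories with no natural order
machine_id <- factor(c("A", "B", "A"))

# Ordinal data: categories with a meaningful order
quality <- factor(c("low", "high", "medium"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)

is.integer(production_qty)    # TRUE
is.double(production_rate)    # TRUE
is.ordered(machine_id)        # FALSE
is.ordered(quality)           # TRUE

# Ordered factors support comparisons; unordered factors do not
quality[1] < quality[2]       # TRUE, because "low" ranks below "high"
```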
Data Modeling Techniques Overview
Regression analysis mainly focuses on finding a relationship between a dependent
variable and one or more independent variables.
It predicts the value of a dependent variable based on the value of at least one independent
variable.
It explains the impact of changes in an independent variable on the dependent variable.
Y = f(X, β), where Y is the dependent variable, X is the independent variable, and β is the
unknown coefficient.
It is widely used in prediction and forecasting.
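As a minimal sketch of Y = f(X, β), the example below fits a simple linear regression with R's lm() function. It assumes the built-in cars data set, where stopping distance (dist) plays the role of Y and speed plays the role of X; the names fit and new_obs are hypothetical.

```r
# Fit dist = b0 + b1 * speed; lm() estimates the unknown coefficients
fit <- lm(dist ~ speed, data = cars)

# Estimated coefficients (intercept and slope)
coef(fit)

# The slope shows how dist changes for a one-unit change in speed
summary(fit)$coefficients

# Predict the dependent variable for a new value of the independent variable
new_obs <- data.frame(speed = 15)
predict(fit, newdata = new_obs)
```

A linear form is only one possible choice of f; other forms, such as polynomial or logistic, follow the same fit-then-predict pattern.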
Missing Value Imputation
In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing zero by zero) are represented by the symbol NaN (not a
number). Unlike SAS, R uses the same missing-value symbol, NA, for character and numeric data.
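The difference between NA and NaN can be checked directly at the R console. The short sketch below uses hypothetical vectors nums and chars; note that a nonzero number divided by zero gives Inf in R, while 0/0 gives NaN.

```r
# NA marks a missing value; the same symbol works for any vector type
nums  <- c(1, NA, 3)
chars <- c("a", NA, "c")
is.na(nums)    # FALSE  TRUE FALSE
is.na(chars)   # FALSE  TRUE FALSE

# NaN marks an undefined numeric result
0 / 0          # NaN
1 / 0          # Inf (not NaN)

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
is.na(0 / 0)   # TRUE
is.nan(NA)     # FALSE
```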
To test whether a dataset contains missing values, we use the is.na() function.
For example, we define y and then check it for missing values;
TRUE means that the value is missing.
y <- c(1,2,3,NA)
is.na(y) # returns the vector FALSE FALSE FALSE TRUE
Arithmetic functions on missing values yield missing values.
For example:
x <- c(1,2,NA,3)
mean(x) # returns NA
To remove missing values from our dataset, we use the na.omit() function.
For example, we can create a new dataset without missing data as below:
newdata <- na.omit(mydata)
We can also pass na.rm=TRUE as an argument to the function.
Applying na.rm to the example above gives the desired result:
x <- c(1,2,NA,3)
mean(x, na.rm=TRUE) # returns 2
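The same functions apply to a whole data frame. The sketch below builds a small hypothetical mydata (the column names and values are invented for illustration) and shows how to count, flag, and drop missing values.

```r
# Hypothetical data set with missing values in two columns
mydata <- data.frame(
  id     = 1:5,
  weight = c(60.5, NA, 72.0, 68.3, NA),
  grade  = c("A", "B", NA, "A", "C")
)

# Count missing values per column
colSums(is.na(mydata))

# Flag rows that have no missing values
complete.cases(mydata)

# Keep only the complete rows
newdata <- na.omit(mydata)
nrow(newdata)   # 2 rows remain (ids 1 and 4)

# Column mean that ignores missing values
mean(mydata$weight, na.rm = TRUE)
```

Note that na.omit() drops a row if any of its columns is missing, so it can discard a lot of data when missing values are spread across many columns.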
MICE package: Multiple Imputation by Chained Equations. By default, MICE uses PMM to
impute missing values in numeric variables.
PMM: Predictive Mean Matching is a semi-parametric imputation approach.
It is similar to the regression method, except that for each missing value it fills in a value
drawn randomly from the observed values of donor observations whose regression-predicted
values are closest to the regression-predicted value for the missing value under
the simulated regression model.
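A minimal sketch of PMM imputation with the mice package is shown below. It assumes the package is installed (install.packages("mice")) and uses nhanes, a small example data set shipped with mice; the values chosen for m and seed, and the names imp and completed, are arbitrary choices for illustration.

```r
# Assumes the mice package is installed
library(mice)

# nhanes is an example data set from mice that contains missing values
colSums(is.na(nhanes))

# Create m = 5 imputed data sets using predictive mean matching;
# seed makes the random donor draws reproducible
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Extract the first completed data set
completed <- complete(imp, 1)

# Check that no missing values remain
anyNA(completed)
```

Because PMM copies values from observed donors, every imputed value is one that actually occurs in the data, which keeps imputations within a plausible range.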