Build the Models
To build a model, the data should be clean and its content properly understood.
The components of model building are as follows:
a) Selection of model and variable
b) Execution of model
c) Model diagnostics and model comparison
Model and Variable Selection
Consider model performance and whether the project meets all of its requirements:
1. Must the model be easy to implement?
2. Does the model need to be easy to explain?
3. How difficult is the model to build and maintain?
Commercial tools
1. SAS Enterprise Miner
2. SPSS Modeler
3. MATLAB
4. Alpine Miner
Open Source tools:
1. R and PL/R
2. Octave
3. WEKA: WEKA can be executed within Java code.
4. Python
5. SQL in-database analytics
Model Execution
Various programming languages are used to implement models.
For model execution, Python provides libraries such as StatsModels and Scikit-learn, which implement several of the most popular techniques.
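As a minimal illustration of the kind of technique these libraries implement, the following plain-Python sketch fits an ordinary least squares line to invented data points (a real project would call StatsModels or Scikit-learn instead):

```python
# Simple ordinary least squares for y = a + b*x with one predictor.
# Illustrates the kind of model fitting StatsModels or Scikit-learn
# perform; the data points below are invented for the example.

def fit_ols(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = covariance(x, y) / variance(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x
a, b = fit_ols(xs, ys)
print(round(a, 2), round(b, 2))
```

The fitted slope comes out close to 2, matching the way the sample data was generated.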
Model Diagnostics and Model Comparison
Try to build multiple models and then select the best one based on multiple criteria.
Working with a holdout sample helps the user pick the best-performing model.
• In the holdout method, the data is split into two different datasets, labelled a
training and a testing dataset. This can be a 60/40, 70/30 or 80/20 split. This
technique is called the hold-out validation technique.
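A minimal sketch of the hold-out split in plain Python; the 70/30 ratio and the seed are arbitrary choices for the example:

```python
import random

def holdout_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows and split them into a training and a testing set.

    A 70/30 split is used by default; 60/40 or 80/20 work the same
    way. The seed makes the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = holdout_split(data)
print(len(train), len(test))    # 70 30
```

The model is then fitted on the training set and scored on the testing set it has never seen.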
PRESENTING FINDINGS AND BUILDING APPLICATIONS
• The team delivers final reports, briefings, code and technical documents.
In addition, the team may run a pilot project to implement the models in a
production environment.
• The last stage of the data science process is where the user's soft skills will be
most useful.
• This stage involves presenting your results to the stakeholders and industrializing
your analysis process for repetitive reuse and integration with other tools.
Data Mining
• Data mining refers to extracting or mining knowledge from large amounts of
data.
• It is a process of discovering interesting patterns or knowledge from a large
amount of data stored in databases, data warehouses or other information
repositories.
Reasons for using data mining:
1. Knowledge discovery: To identify invisible correlations and patterns in the
database.
2. Data visualization: To find a sensible way of displaying data.
3. Data correction: To identify and correct incomplete and inconsistent data.
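A small sketch of the data-correction idea; the record layout and correction rules below are invented for illustration:

```python
# Hypothetical example: flag incomplete records and normalise an
# inconsistently spelled field.
records = [
    {"name": "Asha", "dept": "CS", "year": 1},
    {"name": "Ravi", "dept": "cs", "year": 1},    # inconsistent case
    {"name": "Meena", "dept": None, "year": 2},   # incomplete
]

def correct(records, default_dept="UNKNOWN"):
    cleaned = []
    for rec in records:
        rec = dict(rec)  # copy, leave the source record untouched
        # Data correction: fill missing values and unify spelling.
        rec["dept"] = (rec["dept"] or default_dept).upper()
        cleaned.append(rec)
    return cleaned

print(correct(records))
```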
Functions of Data Mining
• Different functions of data mining are characterization, association and
correlation analysis, classification, prediction, clustering analysis and
evolution analysis.
1. Characterization is a summarization of the general characteristics or
features of a target class of data. For example, the characteristics of students
can be summarized to produce a profile of all first-year engineering students
at the university.
2. Association is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data.
3. Classification differs from prediction. Classification constructs a set of
models that describe and distinguish data classes, whereas prediction builds a
model to estimate missing data values.
4. Clustering can also support taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group similar events together.
5. Data evolution analysis describes and models regularities for objects whose
behaviour changes over time.
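The association function above (item 2) can be sketched with a toy example: counting attribute values that frequently occur together in a set of market baskets. The baskets and the support threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy market-basket data: each set is one transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

# Count how often each pair of items appears together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of baskets containing the pair; keep the
# frequent pairs only.
min_support = 0.5
frequent = {p: c / len(baskets)
            for p, c in pair_counts.items()
            if c / len(baskets) >= min_support}
print(frequent)
```

Here only the pair (bread, milk) reaches the 0.5 support threshold, since it appears in three of the five baskets.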
Data mining tasks can be classified into two categories: descriptive and
predictive.
Predictive Mining Tasks
To make predictions, predictive mining tasks perform inference on the current
data. Predictive analysis provides answers to queries about the future.
Descriptive Mining Tasks
Descriptive mining tasks provide a depiction or "summary view" of facts and
figures in an understandable format, to either inform the user or prepare the
data for further analysis.
Architecture of a Typical Data Mining System
• Based on the user's data request, the data warehouse server is responsible for
fetching the relevant data.
• Knowledge base is helpful in the whole data mining process. It might be
useful for guiding the search or evaluating the interestingness of the result
patterns. The knowledge base might even contain user beliefs and data from
user experiences that can be useful in the process of data mining.
• The data mining engine is the core component of any data mining system. It
consists of a number of modules for performing data mining tasks including
association, classification, characterization, clustering, prediction, time-series
analysis etc.
• The pattern evaluation module is mainly responsible for the measure of
interestingness of the pattern by using a threshold value. It interacts with the
data mining engine to focus the search towards interesting patterns.
• The graphical user interface module communicates between the user and the
data mining system. This module helps the user use the system easily and
efficiently without knowing the real complexity behind the process.
• When the user specifies a query or a task, this module interacts with the data
mining system and displays the result in an easily understandable manner.
Data Warehousing
Data warehousing is the process of constructing and using a data warehouse.
A data warehouse is constructed by integrating data from multiple
heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries and decision making.
Data warehousing involves data cleaning, data integration and data
consolidation.
A data warehouse usually stores many months or years of data to support
historical analysis. The data in a data warehouse is typically loaded through an
extraction, transformation and loading (ETL) process from multiple data
sources.
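A minimal sketch of the ETL process described above, assuming a SQLite database stands in for the warehouse; the source rows and the table layout are invented:

```python
import sqlite3

# Toy ETL run: extract rows from two hypothetical sources with
# different shapes, transform them into one consistent layout, and
# load them into a warehouse table (SQLite plays the warehouse).
source_a = [("2023-01-05", "widget", 10)]                       # tuples
source_b = [{"day": "2023-01-06", "item": "gadget", "qty": 4}]  # dicts

def etl(conn):
    conn.execute("CREATE TABLE sales (day TEXT, item TEXT, qty INTEGER)")
    rows = list(source_a)                                         # extract
    rows += [(r["day"], r["item"], r["qty"]) for r in source_b]   # transform
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)  # load
    conn.commit()

conn = sqlite3.connect(":memory:")
etl(conn)
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 2
```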
• Databases and data warehouses are related but not the same.
• A database is a way to record and access information from a single source.
• A data warehouse is a way to store historical information from multiple
sources, allowing you to analyse and report on related data.
Goals of data warehousing:
1. To help reporting as well as analysis.
2. Maintain the organization's historical information.
3. Be the foundation for decision making.
Characteristics of Data Warehouse
1. Subject-oriented: data are organized around the major subjects of the
enterprise.
2. Integrated: data from multiple heterogeneous sources are combined
consistently.
3. Non-volatile: once loaded, data is read but not updated or deleted.
4. Time-variant: data is kept over long periods to support historical analysis.
Multitier Architecture of Data Warehouse
a) Single-tier architecture.
b) Two-tier architecture.
c) Three-tier architecture (Multi-tier architecture).
• Single-tier warehouse architecture focuses on creating a compact data set and
minimizing the amount of data stored.
• Two-tier warehouse structures separate the physically available resources from
the warehouse itself. This architecture is most commonly used in small
organizations.
• The bottom tier is the database of the warehouse, where the cleansed and
transformed data is loaded. The bottom tier is a warehouse database server.
• The middle tier is the application layer giving an abstracted view of the
database. It arranges the data to make it more suitable for analysis. This is
done with an OLAP server, implemented using the ROLAP or MOLAP model.
• The top tier is the front-end of an organization's overall business intelligence
suite. The top-tier is where the user accesses and interacts with data via
queries, data visualizations and data analytics tools.
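The middle-tier roll-up that an OLAP server performs can be sketched as a relational GROUP BY, which is essentially what a ROLAP server issues against the warehouse database. SQLite stands in for the warehouse here, and the table and figures are invented:

```python
import sqlite3

# ROLAP-style roll-up: aggregate relational warehouse rows with
# GROUP BY. Table layout and sales figures are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2022, 100.0), ("East", 2023, 150.0), ("West", 2022, 80.0)],
)

# Roll sales up by region, across all years.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 250.0), ('West', 80.0)]
```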