
CM5107

Unit 5. Open Source Data Mining Tools


5.1 Data mining tool introduction
5.2 Installation
5.3 Load data
5.4 File formats
5.5 Preprocessing data
5.6 Classifiers
5.7 Clustering
5.8 Association
5.9 Feature Selection
***************************************************************************
Why do we need Data Mining Tools?
 Data mining tools are needed because they enable us to efficiently and effectively analyze
large and complex data sets.
 Here are some reasons why we need data mining tools:
o Extracting useful information: Extract information from large data sets that may
not be easily visible or apparent through manual analysis.
o Identifying patterns and trends: Identify patterns and trends in the data that may be
useful for making informed decisions.
o Improving decision-making: Make better decisions based on insights gained from the
data.
o Increasing efficiency: Automate the process of analyzing data, making it faster and
more efficient than manual analysis.
o Reducing costs: By identifying inefficiencies and opportunities for improvement, data
mining tools can help reduce costs and increase profitability.

Rapid Miner DM Tool


 It is developed by a company of the same name, RapidMiner.
 It is written in Java, so it is platform independent.
 RapidMiner offers its server both on-premise & in public/private cloud
infrastructures. It has a client/server model as its base.
 It is really fast in reading all kinds of databases.
 Extensions can be installed from the RapidMiner Marketplace.
 A visual, code-free environment, so no programming is needed
 Design of analysis processes
 Data loading, data transformation and data modelling
 Data visualization (with lots of visualizations)
 Allows you to work with different types and sizes of data sources
 It is one of the best predictive analysis tools (with pre-made templates)
 It provides an integrated environment for deep learning, text mining, machine learning
and predictive analysis
 It is an open-source data mining tool. Link: https://rapidminer.com/


 The tool can be used for a vast range of applications, including business
applications, commercial applications, training, education, research, application
development, and machine learning.
 RapidMiner comes with template-based frameworks that enable speedy delivery with a
reduced number of errors (which are quite common in a manual coding process).
 It acts as a powerful scripting language engine along with a graphical user interface.
 Modular operator concept.
 Multi-layered data view.
 RapidMiner consists of three modules, namely
1. RapidMiner Studio: This module is for workflow design, prototyping, validation etc.
2. RapidMiner Server: To operate predictive data models created in Studio
3. RapidMiner Hadoop: Executes processes directly in the Hadoop cluster to simplify
predictive analysis.

Orange Data Mining Tools


 It is open-source software written in Python.
 It was developed at the University of Ljubljana, Slovenia
 It also supports data visualization in 3D, 2D and 1D formats.
 It is component-based software, with a large collection of pre-built ML algorithms.
 Non-programmers can perform data mining tasks using its visual programming
drag-and-drop interface.
 It provides numerous graphics like silhouette plots and sieve diagrams.


SAS Data Mining Tool


 SAS stands for Statistical Analysis System. Availability: Proprietary License
 It is written in the C language.
 It is a product of the SAS Institute created for analytics & data management
 SAS can mine data, transform it, manage information from various sources and
perform statistical analysis.
 For non-technical users, it offers an interactive UI.
 SAS Data Miner enables users to analyze big data and derive accurate insights to make
timely decisions.
 SAS has a distributed memory processing architecture which is highly scalable. It is well
suited for data mining, text mining & optimization.

KNIME Data Mining Tool


 It is written in Java.
 It is open source.
 It operates on the concept of the modular data pipeline and embeds various
machine learning and data mining components together.
 KNIME is a leading integration platform for data analytics and reporting, developed by
KNIME.com AG.


 KNIME has been used widely for pharmaceutical research. In addition, it performs
excellently for customer data analysis, financial data analysis, and business intelligence.
 KNIME has some brilliant features like quick deployment and scaling efficiency.
 Users get familiar with KNIME in relatively little time, and it has made predictive analysis
accessible even to naive users.
 KNIME utilizes an assembly of nodes to pre-process data for analytics and
visualization.

H2O Data Mining Tool


 It is written in Java.
 It is open source.
 It offers AutoML functions to help users build and deploy ML models in a fast and
simple way.
 Distributed in-memory computing is supported.
 Huge datasets can be handled properly.
 It can be integrated through APIs available in all major programming languages.

Rattle Data Mining Tool


 Rattle is a GUI-based data mining tool.
 It uses the R statistical programming language.
 Rattle exposes the statistical power of R by offering significant data mining features.
 While Rattle has a comprehensive and well-developed user interface, it also has an
integrated log tab that records the R code equivalent of any GUI operation.
 The data set produced by Rattle can be viewed and edited.
 Rattle also gives the facility to review the code, use it for many purposes, and extend the
code without any restriction.


DataMelt Data Mining Tool


 DataMelt is a computation and visualization environment which offers an interactive
structure for data analysis and visualization.
 It is primarily designed for students, engineers, and scientists.
 It is also known as DMelt. DMelt is a multi-platform utility written in Java. It can run on
any operating system which is compatible with the JVM (Java Virtual Machine).
 It consists of science and mathematics libraries.
 Scientific libraries are used for drawing 2D/3D plots.
 Mathematical libraries are used for random number generation, algorithms, curve fitting,
etc.
 DMelt can be used for the analysis of large volumes of data, data mining, and
statistical analysis. It is extensively used in natural sciences, financial markets, and
engineering.

Data mining tool introduction


 Data mining techniques draw on domain knowledge from statistical
analysis, artificial intelligence, and database systems in order to analyze data
from different dimensions and perspectives.
 Data mining tools discover patterns or trends from large sets of data and
transform the data into useful information for making decisions.
 Several open-source and proprietary data mining and data visualization tools exist
which are used for information extraction from large data repositories and for data
analysis.
 Some of the data mining tools which exist in the market are Weka, RapidMiner, Orange,
R, KNIME, ELKI, GNU Octave, Apache Mahout, SCaViS, Natural Language Toolkit,
Tableau, SAS, Rattle Data Mining, DataMelt Data Mining, Oracle BI etc.
 Weka has a GUI that facilitates easy access to all its features.
 Data Mining tools have the objective of discovering patterns/trends/groupings among
large sets of data and transforming data into more refined information.
 The key data mining tool is the data mining workbench, which has several characteristics:
 A visual programming interface - which enables users to define workflow by stringing
together icons representing the steps of the process
 Design aimed at business users, rather than programmers or statisticians
 A wide range of functions, from data import and exploration, to preparation, to modeling
and export of results.


 The original data mining workbench was developed in the 1990s & called Clementine.
Two decades, two corporate acquisitions and much product evolution later, Clementine
has become today’s IBM SPSS Modeler.
 Some other data mining workbenches now available include SAS Enterprise Miner,
KNIME, Alteryx Designer, the University of Ljubljana’s Orange, and the University of
Waikato’s Weka.
 Data mining is the process of extracting meaningful patterns and insights from large
datasets using a variety of techniques and tools. The process typically consists of five
steps:
1. Data Collection: Collecting data from multiple sources, such as databases, web
sources, text files, and surveys.
2. Data Preparation: Cleaning and preparing the data for mining by transforming,
combining, and aggregating the data.
3. Data Exploration: Analyzing the data to identify patterns and relationships.
4. Modeling: Constructing models to explain the data and identify useful insights.
5. Interpretation and Evaluation: Interpreting the results and evaluating their accuracy.
 There are many tools available for data mining, including open source tools like R and
Python, as well as commercial tools such as SAS, SPSS, and RapidMiner. Each tool
offers different features and capabilities, so it is important to choose the right tool based
on the specific requirements of the data mining project.

Weka (GUI-based) Data Mining tool


 Weka contains a collection of visualization tools and algorithms for data analysis and
predictive modelling, together with graphical user interfaces for easy access to these
functions.
 The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party)
modelling algorithms implemented in other programming languages, plus data
preprocessing utilities in C and a makefile-based system for running machine learning
experiments.
 This original version was primarily designed as a tool for analyzing data from agricultural
domains. It is also used in many different application areas, particularly for educational
purposes and research.
 Weka has the following advantages:
1. Free availability under the GNU General Public License.
2. Portability, since it is fully implemented in the Java programming language and thus
runs on almost any modern computing platform.
3. A comprehensive collection of data preprocessing and modelling techniques.
4. Ease of use due to its graphical user interfaces.
 Weka supports several standard data mining tasks, specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection.
 Input to Weka is expected to be formatted according to the Attribute-Relation File
Format (ARFF), in files with the .arff extension.
 Weka is an open-source/free tool designed & developed by scientists /researchers at the
University of Waikato, New Zealand.
 WEKA stands for Waikato Environment for Knowledge Analysis.


 WEKA is fully developed in Java. It is easy to connect to a database using a JDBC
driver; a JDBC driver is required to connect to a database.
 It is a collection of machine learning algorithms to implement data mining tasks.
 These algorithms can either be used directly through the WEKA tool or can be called/
imported from your own Java code (see the sketch below).

 It is platform independent
 Using WEKA, users can develop custom code for ML.
 WEKA supports 2D representation of data, 3D visualization with rotation, and 1D
representation of single attributes.
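
A minimal sketch of that second route, calling WEKA from your own Java code. It assumes
weka.jar is on the classpath and a local iris.arff whose class is the last attribute; the
file name and the choice of NaiveBayes are illustrative, not prescribed by WEKA:

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaFromJava {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file (file name assumed for illustration)
        Instances data = new DataSource("iris.arff").getDataSet();
        // Tell WEKA which attribute is the class (here: the last one)
        data.setClassIndex(data.numAttributes() - 1);

        // Train a classifier and label the first instance
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        double label = nb.classifyInstance(data.instance(0));
        System.out.println("Predicted: " + data.classAttribute().value((int) label));
    }
}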

Tabs in WEKA Tool


1. Explorer-
 The WEKA Explorer window shows the following tabs:
 Preprocess: Choose & modify the loaded data.
 Classify: Apply training & testing algorithms to the data in order to classify it.
 Cluster: Form clusters from the data.
 Associate: Mine out association rules from the data.
 Select Attributes: Attribute selection measures are applied here.
 Visualize: 2D representation of data is shown here.
 Log Button: It stores a log of all actions in WEKA with timestamps
 Status Bar: Shows the status of the current task.
2. Experimenter-
 The WEKA Experimenter window allows the user to create, run & modify different
schemes in one experiment on a dataset.
 The Experimenter has 2 types of configuration: Simple and Advanced.
3. Knowledge Flow-
 This window shows a graphical representation of WEKA algorithms.
 The data can be handled batch-wise or incrementally.
 Threads can be created for different workflows.
4. Workbench:
 WEKA has a Workbench module, which combines all the GUIs in a single window
5. Simple CLI:
 Simple CLI is the WEKA shell with a command line & output.
 Simple CLI offers access to all classes such as classifiers, clusterers, filters, etc.
WEKA contains:
a) 49 data preprocessing tools.
b) 76 classification + regression algorithms.
c) 8 clustering algorithms.
d) 15 attribute/subset evaluators.
e) 10 search algorithms for feature selection.
f) 3 algorithms for association rule mining.

Steps to install WEKA


1. Download the stable version of WEKA from www.cs.Waikato.ac.nz
2. Now open the downloaded file & it will prompt for confirmation to make changes to your
system. Click on Yes.
3. The Setup Wizard/Screen will appear; click on “Next”.
4. The License Agreement terms will open; after reading, click on “I Agree”
5. According to your requirements, select the components to be installed & click on “Next”.
6. Select the destination folder and click on “Next”.
7. Once the installation process is complete, click on the Finish button to finish the
installation process.
8. The user is now ready to work with the WEKA tool

Requirements and Installation of Weka


 We can install WEKA on Windows, macOS, and Linux.
 The minimum requirement is Java 8 or above for the latest stable versions of Weka.
 Five options are available in the Applications category of the WEKA GUI Chooser.
 The Explorer is the central panel where most data mining tasks are performed. We will
further explore this panel in upcoming sections.
 The tool provides an Experimenter panel, in which we can design and run experiments.
 WEKA provides the KnowledgeFlow panel.
 It provides an interface to drag and drop components, connect them to form a knowledge
flow and analyze the data and results.
 The Simple CLI panel provides the command line powers to run WEKA.
 For example, to fire up the ZeroR classifier on the arff data, we'll run from the command line:
java weka.classifiers.rules.ZeroR -t iris.arff

Load data
 Data loading is the process of copying and loading data or data sets from a source file,
folder or application into a database or similar application.
 It is usually implemented by copying digital data from a source and posting or loading the
data into a data storage or processing utility.
 For example: when data is copied from a word-processing file to a database application,
the data format is changed from .doc or .txt to .csv or .dat. Usually, this process is
performed during, or as, the last phase of the Extract, Transform and Load (ETL) process.
 WEKA allows you to load data from four types of sources:
1. The local file system
2. A public URL
3. Query to a database

4. Generate artificial data to run models.


 Load data files in these formats: ARFF, CSV, C4.5, binary
 Import from a URL or from an SQL database (using JDBC); a loading sketch follows.
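
As a minimal loading sketch, WEKA's DataSource accepts either a local path or a URL
(both locations below are illustrative assumptions):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDataExample {
    public static void main(String[] args) throws Exception {
        // 1. From the local file system (path assumed for illustration)
        Instances fromFile = new DataSource("iris.arff").getDataSet();

        // 2. From a public URL (hypothetical address)
        Instances fromUrl =
            new DataSource("https://example.com/data/iris.arff").getDataSet();

        // Print a per-attribute summary of the loaded relation
        System.out.println(fromFile.toSummaryString());
        System.out.println(fromUrl.numInstances() + " instances loaded from the URL");
    }
}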
 Once data is loaded from a source, the next step is to preprocess it. For this
purpose, we can choose any suitable filter technique.
 All the methods come with default settings that are configurable by clicking on the
name.
 If there are errors or outliers in one of the attributes, such as sepal length, we can
remove or update it from the Attributes section.

Types of Algorithms by Weka


 WEKA provides many algorithms for machine learning tasks. Based on their core
nature, the algorithms are divided into several groups. These are available under the
Explorer tab of WEKA.
 Let's look at those groups and their core nature:
1. Bayes: consists of algorithms based on Bayes theorem like Naive Bayes
2. functions: comprises the algorithms that estimate a function, including Linear
Regression
3. lazy: covers all algorithms that use lazy learning similar to KStar, LWL
4. meta: consists of those algorithms that use or integrate multiple algorithms for their
work like Stacking, Bagging
5. misc: miscellaneous algorithms that do not fit any of the given categories
6. rules: combines algorithms that use rules such as OneR, ZeroR
7. trees: contains algorithms that use decision trees, such as J48, RandomForest
 Each algorithm has configuration parameters such as batchSize, debug, etc. Some
configuration parameters are common across all the algorithms, while some are specific.
These configurations become editable once the algorithm is selected for use; a
programmatic sketch follows.
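
As a minimal sketch, the same parameters can also be set from code; the -C and -M
values below are J48's familiar defaults, quoted here only as an example:

import weka.classifiers.trees.J48;
import weka.core.Utils;

public class ConfigureClassifier {
    public static void main(String[] args) throws Exception {
        J48 tree = new J48();
        // Equivalent to editing the parameters in the GUI:
        // -C = pruning confidence factor, -M = minimum instances per leaf
        tree.setOptions(Utils.splitOptions("-C 0.25 -M 2"));
        System.out.println(String.join(" ", tree.getOptions()));
    }
}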

File formats
 File formats are designed to store specific types of information, such as CSV, XLSX, etc.
 The file format also tells the computer how to display or process its content.
 Common file formats include CSV, XLSX, ZIP, TXT, etc.
 Different types of file formats-
1. CSV - CSV stands for comma-separated values; as the name suggests, a CSV file
uses commas to separate values. In a CSV file, each line is a data record. Each record
consists of one or more data fields, and the fields are separated by commas.
2. ZIP - ZIP files are used as data containers; they store one or more files in
compressed form, and they are widely used on the internet. After you download a
ZIP file, you need to unpack its content in order to use it.
3. XLSX - An XLSX file is a Microsoft Excel Open XML Format spreadsheet file. It
can store any type of data but is mainly used to store financial data and to create
mathematical models etc.
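
Since WEKA's native format is ARFF, a common chore is converting a CSV file. A minimal
sketch using WEKA's CSVLoader and ArffSaver (both file names are assumptions for
illustration):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read a CSV file (name assumed for illustration)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv"));
        Instances data = loader.getDataSet();

        // Write the same data out in WEKA's native ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("data.arff"));
        saver.writeBatch();
    }
}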


Features of Weka Tool:


1. Preprocessing data
2. Classifiers
3. Clustering
4. Association
5. Select Attributes/ Feature Selection
6. Visualize

Preprocessing data
 The preprocessing of data is a crucial task in data mining. Because most data is
raw, it may contain empty or duplicate values, garbage values, outliers, extra columns,
or inconsistent naming conventions. All these things degrade the results.
 To make data cleaner, better and comprehensive, WEKA comes up with a comprehensive
set of options under the filter category. Here, the tool provides both supervised &
unsupervised types of operations.
 Data preprocessing is a step in the data mining and data analysis process that takes raw
data and transforms it into a format that can be understood and analyzed by computers
and machine learning algorithms.
 Here is a list of some operations for preprocessing (a filtering sketch follows the list):
1. ReplaceMissingWithUserConstant: to fix empty or null values.
2. ReservoirSample: to generate a random subset of sample data.
3. NominalToBinary: to convert the data from nominal to binary.
4. RemovePercentage: to remove a given percentage of data.
5. RemoveRange: to remove a given range of data.
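
A minimal sketch of applying one of these filters in code, here RemovePercentage (the
dataset name is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemovePercentage;

public class PreprocessExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();

        // Remove 20% of the instances, as the RemovePercentage
        // filter does in the Explorer's Preprocess tab
        RemovePercentage rp = new RemovePercentage();
        rp.setPercentage(20.0);
        rp.setInputFormat(data);   // a filter must see the input format first
        Instances reduced = Filter.useFilter(data, rp);

        System.out.println(data.numInstances() + " -> " + reduced.numInstances());
    }
}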

Classifiers
 Classification is a supervised function (the learned attribute is categorical) used to
classify data.
 It is used after the learning process to classify new records.
 Click the Open file button to open a dataset.
 Click the Classify tab. This is the area for running algorithms against the loaded
dataset in Weka.
 Click the Start button to run an algorithm.
 By default, an algorithm is run with 10-fold cross-validation: it is given the
opportunity to make a prediction for each instance of the dataset, and the predictions
are presented.
 Note: in this way you can use different classifier groups like bayes, functions, lazy,
meta, misc, rules, trees, etc.
 Classification is one of the essential functions in machine learning, where we assign
classes or categories to items.
 The classic examples of classification are: declaring a brain tumour as "malignant" or
"benign" or assigning an email to a "spam" or "not_spam" class.
 After selecting the desired classifier, we select test options for the training set. Some of
the options are:
 Use training set: the classifier will be tested on the same training set.
 A supplied test set: evaluates the classifier based on a separate test set.

 Cross-validation Folds: assessment of the classifier based on cross-validation using the
number of provided folds.
 Percentage split: the classifier will be judged on a specific percentage of data.
 Other than these, we can also use more test options such as Preserve order for % split,
Output source code, etc.
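
A minimal sketch of the Classify tab's default workflow in code: train J48 and judge it
with 10-fold cross-validation (the dataset name and random seed are assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifyExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);  // class = last attribute

        // 10-fold cross-validation, mirroring the default test option
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());    // confusion matrix
    }
}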

Clustering
 In clustering, a dataset is arranged in different groups/clusters based on some similarities.
In other words, clustering allows a user to make groups of data to determine patterns
from the data.
 Clustering is an unsupervised machine-learning technique that groups data points into
clusters so that similar objects belong to the same group. Clustering helps to split data
into several subsets.
 In this case, the items within the same cluster are similar to one another but different
from the items in other clusters.
 Examples of clustering include identifying customers with similar behaviors and
organizing the regions according to homogenous land use.
 Clustering has its advantages when the data set is defined and a general pattern needs to
be determined from the data.
 We can create a specific number of groups, depending on business needs.
 Clustering Methods-
1. Model-Based Method
2. Hierarchical Method
3. Constraint-Based Method
4. Density-Based Method
5. Partitioning Method
6. Grid-Based Method
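
A minimal sketch using SimpleKMeans, one of WEKA's partitioning clusterers. The dataset
name and the choice of 3 clusters are assumptions, and the class attribute is dropped
because clustering is unsupervised:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();

        // Drop the class attribute (assumed to be last) before clustering
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);      // number of groups the analysis needs
        kmeans.buildClusterer(unlabeled);
        System.out.println(kmeans);    // centroids and cluster sizes
    }
}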

Association
 Association is a data mining function/technique that discovers the probability of the
co-occurrence of items in a collection. The relationships between co-occurring items are
expressed as association rules.
 Association rules are often used to analyze sales transactions. Association is another
prevalent data mining task.
 The goal of the association task is as follows:
o Finding frequent item-sets
o Finding association rules.
 There are a few association rule algorithms implemented in WEKA.
 They try to find associations between different attributes instead of trying to predict the
value of the class attribute.
 Association rules highlight all the associations and correlations between items of a
dataset. In short, it is an if-then statement that depicts the probability of relationships
between data items.
 A classic example of association refers to a connection between the sale of milk and
bread. Association is also termed market basket analysis.


 The tool provides Apriori, FilteredAssociator, and FPGrowth algorithms for association
rules mining in this category.
 Ex: If A, then B. Association rule learning is based on the principle of if-then
statements.
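
A minimal sketch running Apriori from code. Apriori expects nominal attributes, so a
market-basket-style dataset such as WEKA's bundled weather.nominal.arff is assumed here:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AssociationExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);        // report the 10 best rules
        apriori.buildAssociations(data);
        System.out.println(apriori);    // prints the discovered if-then rules
    }
}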

Feature Selection
 Feature selection refers to the process of reducing the inputs for processing and analysis,
or of finding the most meaningful inputs.
 It intends to select a subset of features that makes the most meaningful
contribution to a machine learning activity.
 For example: predict the weight of students based on past information about similar
students, which is captured inside a student weight data set.
 The data set has 4 features: Roll No (enrollment number), Age, Height, Weight. Roll No
has no effect on weight, so we eliminate this feature; the new data set has only 3 features.
 This subset of the data set is expected to give better results than the full set. For
example, the reduced data set might contain only the Age values: 12, 11, 13, 11, 14, 12.
 Every dataset contains a lot of attributes, but several of them may not be significantly
valuable.
 Therefore, removing the unnecessary and keeping the relevant details are very important
for building a good model.
 WEKA provides many attribute evaluators and search methods, including BestFirst,
GreedyStepwise, and Ranker; a selection sketch follows.
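
A minimal selection sketch pairing the CfsSubsetEval evaluator with a BestFirst search,
as in the Select Attributes tab (the dataset name is an assumption):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureSelectionExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearchMethod(new BestFirst());
        selector.SelectAttributes(data);   // note: capital S in WEKA's API

        // Names of the kept attributes (the class attribute is included)
        for (int idx : selector.selectedAttributes()) {
            System.out.println(data.attribute(idx).name());
        }
    }
}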

Visualize
 In the visualize tab, different plot matrices and graphs are available to show the trends
and errors identified by the model.
 Visualization is for the depiction of data and to gain intuition about the data being
observed.
 It assists the analysts in selecting display formats, viewer perspectives, and data
representation schema.
 Data visualization aims to communicate data clearly and effectively through graphical
representation.
 Data visualization has been used extensively in many applications—for example, at work
for reporting, managing business operations, and tracking progress of tasks.
 The basic concept of data visualization has several representative approaches, including
pixel-oriented techniques, geometric projection techniques, icon-based techniques, &
hierarchical & graph-based techniques.
 WEKA visualizes a single dimension (1D) for single attributes & two dimensions (2D)
for pairs of attributes, allowing you to visualize the current relation in 2D plots.
***************************************************************************
