CM5107
Unit 5. Open Source Data Mining Tools
5.1 Data mining tool introduction
5.2 Installation
5.3 Load data
5.4 File formats
5.5 Preprocessing data
5.6 Classifiers
5.7 Clustering
5.8 Association
5.9 Feature Selection
***************************************************************************
Why do we need Data Mining Tools?
   Data mining tools are needed because they enable us to efficiently and effectively analyze
    large and complex data sets.
   Here are some reasons why we need data mining tools:
    o Extracting useful information: Extract the information from large data sets that may
        not be easily visible or apparent through manual analysis.
    o Identifying patterns and trends: Identify patterns and trends in the data that may be
        useful for making informed decisions.
    o Improving decision-making: Make better decisions based on insights gained from the
        data.
    o Increasing efficiency: Automate the process of analyzing data, making it faster and
        more efficient than manual analysis.
    o Reducing costs: By identifying inefficiencies and opportunities for improvement, data
        mining tools can help reduce costs and increase profitability.
Rapid Miner DM Tool
 It is developed by a company of the same name, RapidMiner.
 It is written in Java, so it is platform independent.
 RapidMiner offers its server both on-premise & in public/private cloud
   infrastructures. It has a client/server model as its base.
 It is very fast in reading all kinds of databases.
 Extensions can be installed from the RapidMiner Marketplace.
 A visual, code-free environment, so no programming is needed.
 Design of analysis processes.
 Data loading, data transformation and data modelling.
 Data visualization (with lots of visualizations).
 Allows you to work with different types and sizes of data sources.
 It is one of the best predictive analysis tools (with pre-made templates).
 It provides an integrated environment for deep learning, text mining, machine learning
   and predictive analysis
 It is Open source Data Mining tool. Link: https://rapidminer.com/
 The tool can be used for a vast range of applications, including business and
  commercial applications, training, education, research, application development, and
  machine learning.
 RapidMiner comes with template-based frameworks that enable speedy delivery with
  a reduced number of errors (which are common in manual code writing).
 It acts as a powerful scripting language engine along with a graphical user interface.
   Modular operator concept.
   Multi-layered data view.
 RapidMiner consists of three modules, namely:
  1. RapidMiner Studio: This module is for workflow design, prototyping, validation, etc.
  2. RapidMiner Server: To operate predictive data models created in Studio.
  3. RapidMiner Hadoop: Executes processes directly in the Hadoop cluster to simplify
     predictive analysis.
Orange Data Mining Tools
 It is open-source software written in Python.
 It was developed at the University of Ljubljana, Slovenia.
 It also supports data visualization in 3D, 2D and 1D formats.
 It is component-based software with a large collection of pre-built ML algorithms.
 Non-programmers can perform data mining tasks using a visual drag-and-drop
  interface.
 It provides numerous graphics like silhouette plots and sieve diagrams.
SAS Data Mining Tool
   SAS stands for Statistical Analysis System. Availability: Proprietary License
 It is written in the C language.
 It is a product of the SAS Institute, created for analytics & data management.
 SAS can mine data, alter it, and manage data from various sources, and it can
  perform statistical analysis.
 For non-technical users, it offers an interactive UI.
 SAS Data Miner enables users to analyze big data and derive accurate insights to
  make timely decisions.
   SAS has a distributed memory processing architecture which is highly scalable. It is well
    suited for data mining, text mining & optimization.
KNIME Data Mining Tool
   It is written in Java.
 It is open source.
 It operates on the concept of the modular data pipeline.
 It contains various ML and data mining components embedded together.
 KNIME is an integration platform for data analytics and reporting, developed by
  KNIME.com AG.
   KNIME has been used widely for pharmaceutical research. In addition, it performs
    excellently for customer data analysis, financial data analysis, and business intelligence.
   KNIME has some brilliant features like quick deployment and scaling efficiency.
 Users become familiar with KNIME quickly, and it has made predictive analysis
  accessible even to novice users.
   KNIME utilizes the assembly of nodes to pre-process the data for analytics and
    visualization.
H2O Data Mining Tool
   It is written in Java.
 It is open source.
 It offers AutoML functions to help users build and deploy ML models in a fast and
  simple way.
 Distributed in-memory computing is supported.
 Huge datasets can be handled properly.
 It can be integrated through APIs available in all major programming languages.
Rattle Data Mining Tool
 Rattle is a GUI-based data mining tool.
 It uses the R statistical programming language.
 Rattle exposes the statistical power of R by offering significant data mining features.
 Rattle has a comprehensive and well-developed user interface, with an integrated
  log tab that records the R code for every GUI operation.
 The data set produced by Rattle can be viewed and edited.
 Rattle also offers the facility to review the code, use it for many purposes, and
  extend the code without any restriction.
DataMelt Data Mining Tool
   DataMelt is a computation and visualization environment which offers an interactive
    structure for data analysis and visualization.
   It is primarily designed for students, engineers, and scientists.
 It is also known as DMelt. DMelt is a multi-platform utility written in Java. It can run
  on any operating system compatible with the JVM (Java Virtual Machine).
   It consists of Science and mathematics libraries.
   Scientific libraries are used for drawing the 2D/3D plots.
   Mathematical libraries are used for random number generation, algorithms, curve fitting,
    etc.
 DMelt can be used for the analysis of large volumes of data, data mining, and
  statistical analysis. It is extensively used in the natural sciences, financial markets,
  and engineering.
Data mining tool introduction
 Data mining techniques draw on domain knowledge from statistical analysis,
  artificial intelligence, and database systems in order to analyze data properly across
  different dimensions and perspectives.
 Data mining tools discover patterns or trends in large data sets and transform the
  data into useful information for making decisions.
   Several open-source and proprietary-based data mining and data visualization tools exist
    which are used for information extraction from large data repositories and for data
    analysis.
 Some of the data mining tools which exist in the market are Weka, RapidMiner, Orange,
  R, KNIME, ELKI, GNU Octave, Apache Mahout, SCaViS, Natural Language Toolkit,
  Tableau, SAS, Rattle, DataMelt, Oracle BI, etc.
   Weka has a GUI that facilitates easy access to all its features.
   Data Mining tools have the objective of discovering patterns/trends/groupings among
    large sets of data and transforming data into more refined information.
   The key data mining tool is the data mining workbench, which has several characteristics:
   A visual programming interface - which enables users to define workflow by stringing
    together icons representing the steps of the process
   Design aimed at business users, rather than programmers or statisticians
   A wide range of functions, from data import and exploration, to preparation, to modeling
    and export of results.
   The original data mining workbench was developed in the 1990s & called Clementine.
    Two decades, two corporate acquisitions and much product evolution later, Clementine
    has become today’s IBM SPSS Modeler.
 Some other data mining workbenches now available include SAS Enterprise Miner,
  KNIME, Alteryx Designer, the University of Ljubljana's Orange, and the University
  of Waikato's Weka.
   Data mining is the process of extracting meaningful patterns and insights from large
    datasets using a variety of techniques and tools. The process typically consists of five
    steps:
    1. Data Collection: Collecting data from multiple sources, such as databases, web
        sources, text files, and surveys.
    2. Data Preparation: Cleaning and preparing the data for mining by transforming,
        combining, and aggregating the data.
    3. Data Exploration: Analyzing the data to identify patterns and relationships.
    4. Modeling: Constructing models to explain the data and identify useful insights.
    5. Interpretation and Evaluation: Interpreting the results and evaluating their accuracy.
   There are many tools available for data mining, including open source tools like R and
    Python, as well as commercial tools such as SAS, SPSS, and RapidMiner. Each tool
    offers different features and capabilities, so it is important to choose the right tool based
    on the specific requirements of the data mining project.
Weka (GUI-based) Data Mining tool
   Weka contains a collection of visualization tools and algorithms for data analysis and
    predictive modelling, together with graphical user interfaces for easy access to these
    functions.
 The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party)
  modelling algorithms implemented in other programming languages, plus data
  preprocessing utilities in C and a makefile-based system for running machine learning
  experiments.
 This original version was primarily designed as a tool for analyzing data from
  agricultural domains. It is now also used in many other application areas, particularly
  for educational purposes and research.
   Weka has the following advantages, such as:
    1. Free availability under the GNU General Public License.
    2. Portability, since it is fully implemented in the Java programming language and thus
        runs on almost any modern computing platform.
    3. A comprehensive collection of data preprocessing and modelling techniques.
    4. Ease of use due to its graphical user interfaces.
   Weka supports several standard data mining tasks, specifically, data preprocessing,
    clustering, classification, regression, visualization, and feature selection.
 Input to Weka is expected to be formatted according to the Attribute-Relation File
  Format (ARFF), in files with the .arff extension.
   Weka is an open-source/free tool designed & developed by scientists /researchers at the
    University of Waikato, New Zealand.
 WEKA stands for Waikato Environment for Knowledge Analysis.
 WEKA is fully developed in Java. It is easy to connect to a database using a JDBC
  driver; a JDBC driver is required to connect to a database.
 It is a collection of machine learning algorithms to implement data mining tasks.
 These algorithms can either be used directly through the WEKA tool or be called/
  imported from your own Java code (a minimal sketch follows this list).
 It is platform independent.
 Using WEKA, users can develop custom code for ML.
 WEKA supports 2D representation of data, 3D visualization with rotation, and 1D
  representation of single attributes.
Tabs in WEKA Tool
1. Explorer-
The WEKA Explorer window shows the following tabs:
 Preprocess: Choose & modify the loaded data.
 Classify: Apply training & testing algorithms to the data in order to classify it.
 Cluster: Form clusters from the data.
 Associate: Mine association rules from the data.
 Select Attributes: Attribute selection measures are applied here.
 Visualize: A 2D representation of the data is shown here.
 Log Button: Stores a log of all actions in WEKA with timestamps.
 Status Bar: Shows the status of the current task.
2. Experimenter-
 The WEKA Experimenter window allows the user to create, run & modify different
   schemes in one experiment on a dataset.
    The Experimenter has 2 types of configuration: Simple and Advanced.
3. Knowledge Flow-
    This window shows a graphical representation of WEKA algorithms.
    The data can be handled batch-wise or incrementally.
 Threads can be created for different workflows.
4. Workbench:
 WEKA has a Workbench module that contains all the GUIs in a single window.
5. Simple CLI:
 Simple CLI is the WEKA shell, with a command line & output.
    Simple CLI offers access to all classes such as classifiers, clusters, filters, etc.
WEKA contains:
a) 49 data preprocessing tools.
b) 76 classification + regression algorithms.
c) 8 clustering algorithms.
d) 15 attribute/subset evaluators.
e) 10 search algorithms for feature selection.
f) 3 association rule learning algorithms.
Steps to install WEKA
1. Download the stable version of WEKA from www.cs.waikato.ac.nz.
2. Open the downloaded file; it will prompt for confirmation to make changes to your
   system. Click on Yes.
3. The Setup Wizard/Screen will appear; click on "Next".
4. The License Agreement terms will open; after reading them, click on "I Agree".
5. According to your requirements, select the components to be installed & click on "Next".
6. Select the destination folder and click on "Next".
7. Once the installation process is complete, click on the Finish button.
8. The user is now ready to work with the WEKA tool.
Requirements and Installation of Weka
    We can install WEKA on Windows, MAC OS, and Linux.
    The minimum requirement is Java 8 /above for the latest stable versions of Weka.
 Five options are available in the Applications category of the WEKA GUI Chooser.
    The Explorer is the central panel where most data mining tasks are performed. We will
     further explore this panel in upcoming sections.
 The tool provides an Experimenter panel, in which we can design and run
  experiments.
    WEKA provides the KnowledgeFlow panel.
    It provides an interface to drag and drop components, connect them to form a knowledge
     flow and analyze the data and results.
    The Simple CLI panel provides the command line powers to run WEKA.
 For example, to fire up the ZeroR classifier on ARFF data, we can run from the
  command line:
  java weka.classifiers.rules.ZeroR -t iris.arff
Load data
 Data loading is the process of copying and loading data or data sets from a source
  file, folder or application into a database or similar application.
 It is usually implemented by copying digital data from the source and posting or
  loading the data into a data storage or processing utility.
 For example: when data is copied from a word-processing file to a database
  application, the data format is changed from .doc or .txt to .CSV or .DAT. Usually,
  this process is performed as, or during, the last phase of the Extract, Transform and
  Load (ETL) process.
    WEKA allows you to load data from four types of sources:
     1. The local file system
     2. A public URL
     3. Query to a database
    4. Generate artificial data to run models.
 Load data files in the formats: ARFF, CSV, C4.5, binary.
 Import from a URL or an SQL database (using JDBC); a small loading sketch
  appears at the end of this section.
   Once data is loaded from different sources, the next step is to preprocess the data. For this
    purpose, we can choose any suitable filter technique.
 All the methods come with default settings that are configurable by clicking on the
  name.
 If there are errors or outliers in one of the attributes, such as sepal length, we can
  remove or update it from the Attributes section.
Types of Algorithms by Weka
   WEKA provides many algorithms for machine learning tasks. Because of their core
    nature, all the algorithms are divided into several groups. These are available under the
    Explorer tab of the WEKA.
   Let's look at those groups and their core nature:
1. Bayes: consists of algorithms based on Bayes' theorem, like Naive Bayes
    2. functions: comprises the algorithms that estimate a function, including Linear
        Regression
    3. lazy: covers all algorithms that use lazy learning similar to KStar, LWL
    4. meta: consists of those algorithms that use or integrate multiple algorithms for their
        work like Stacking, Bagging
    5. misc: miscellaneous algorithms that do not fit any of the given categories
    6. rules: combines algorithms that use rules such as OneR, ZeroR
    7. trees: contains algorithms that use decision trees, such as J48, RandomForest
 Each algorithm has configuration parameters such as batchSize, debug, etc. Some
  configuration parameters are common across all the algorithms, while some are
  specific. These configurations can be edited once the algorithm is selected for use
  (see the sketch below).
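A sketch of setting such parameters from code, using J48 as an example; the option string "-C 0.25 -M 2" mirrors the GUI defaults and is shown only for illustration:

    import weka.classifiers.trees.J48;
    import weka.core.Utils;

    public class ConfigureDemo {
        public static void main(String[] args) throws Exception {
            J48 tree = new J48();
            // -C 0.25 sets the pruning confidence, -M 2 the minimum
            // number of instances per leaf
            tree.setOptions(Utils.splitOptions("-C 0.25 -M 2"));
            tree.setDebug(true); // the 'debug' parameter seen in the GUI
            // Print the effective option settings back
            System.out.println(Utils.joinOptions(tree.getOptions()));
        }
    }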
File formats
 File formats are designed to store specific types of information, such as CSV,
  XLSX, etc.
 The file format also tells the computer how to display or process its content.
 Common file formats include CSV, XLSX, ZIP, TXT, etc.
 Different types of file formats:
 1. CSV - CSV stands for comma-separated values; as the name suggests, a CSV file
    uses commas to separate values. Each line in a CSV file is a data record, and each
    record consists of one or more data fields separated by commas.
 2. ZIP - ZIP files are used as data containers; they store one or more files in
    compressed form and are widely used on the internet. After you download a ZIP
    file, you need to unpack its contents in order to use it.
 3. XLSX - An XLSX file is a Microsoft Excel Open XML Format spreadsheet file. It
    can store any type of data but is mainly used to store financial data and to create
    mathematical models.
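Since WEKA's native format is ARFF (see the earlier Weka section), here is a minimal, made-up example of an .arff file: lines starting with % are comments, @attribute lines declare the columns, and the rows after @data are the records:

    % A toy weather dataset illustrating the ARFF layout
    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute play {yes, no}
    @data
    sunny,85,85,no
    overcast,83,86,yes
    rainy,70,96,yes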
Features of Weka Tool:
1.   Preprocessing data
2.   Classifiers
3.   Clustering
4.   Association
5.   Select Attributes/ Feature Selection
6.   Visualize
Preprocessing data
    The preprocessing of data is a crucial task in data mining. Because most of the data is
     raw, there are chances that it may contain empty or duplicate values, have garbage values,
     outliers, extra columns, or have a different naming convention. All these things degrade
     the results.
 To make the data cleaner and more consistent, WEKA comes with a comprehensive
  set of options under the filter category. Here, the tool provides both supervised &
  unsupervised types of operations.
 Data preprocessing is a step in the data mining and data analysis process that takes
  raw data and transforms it into a format that can be understood and analyzed by
  computers and machine learning algorithms.
 Here is a list of some filters for preprocessing (a usage sketch follows the list):
     1. ReplaceMissingWithUserConstant: to fix empty or null value issue.
     2. ReservoirSample: to generate a random subset of sample data.
     3. NominalToBinary: to convert the data from nominal to binary.
     4. RemovePercentage: to remove a given percentage of data.
     5. RemoveRange: to remove a given range of data.
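A usage sketch for one of the filters above, RemovePercentage; the input file iris.arff is an assumption:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.RemovePercentage;

    public class PreprocessDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");

            // Drop 20% of the instances from the dataset
            RemovePercentage filter = new RemovePercentage();
            filter.setPercentage(20.0);
            filter.setInputFormat(data); // must be called before filtering
            Instances reduced = Filter.useFilter(data, filter);

            System.out.println("Before: " + data.numInstances()
                    + ", after: " + reduced.numInstances());
        }
    }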
Classifiers
 Classification is a supervised function where the learned attribute is categorical, and
  the model is used to classify data.
 It is used after the learning process to classify new records.
 Click the Open file button to load a dataset.
 Click the Classify tab. This is the area for running algorithms against the loaded
  dataset in WEKA.
 Click the Start button to run the algorithm.
 By default the algorithm is run with 10-fold cross-validation: it is given the
  opportunity to make a prediction for each instance of the dataset, and the
  predictions are presented.
 Note: in this way you can use classifiers from the different groups, like bayes,
  functions, lazy, meta, misc, rules, trees, etc.
    Classification is one of the essential functions in machine learning, where we assign
     classes or categories to items.
    The classic examples of classification are: declaring a brain tumour as "malignant" or
     "benign" or assigning an email to a "spam" or "not_spam" class.
    After selecting the desired classifier, we select test options for the training set. Some of
     the options are:
    Use training set: the classifier will be tested on the same training set.
    A supplied test set: evaluates the classifier based on a separate test set.
 Cross-validation Folds: assessment of the classifier based on cross-validation using
  the number of provided folds.
 Percentage split: the classifier will be evaluated on a specific percentage split of the
  data.
 Other than these, we can also use more test options such as Preserve order for %
  split, Output source code, etc. (a sketch of programmatic evaluation follows).
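A sketch of the default test option, 10-fold cross-validation, run from Java; iris.arff and the J48 classifier are assumptions for illustration:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassifyDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation, as in the Classify tab's default option
            J48 tree = new J48();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString("\n10-fold CV results\n", false));
        }
    }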
Clustering
 In clustering, a dataset is arranged into different groups/clusters based on some
  similarities; in other words, clustering allows a user to make groups of data to
  determine patterns from the data.
 Clustering is an unsupervised machine-learning technique that groups data points
  into clusters so that similar objects belong to the same group. Clustering helps to
  split data into several subsets.
 In this case, the items within the same cluster are similar to each other but different
  from those in other clusters.
   Examples of clustering include identifying customers with similar behaviors and
    organizing the regions according to homogenous land use.
   Clustering has its advantages when the data set is defined and a general pattern needs to
    be determined from the data.
   We can create a specific number of groups, depending on business needs.
 Clustering methods (a k-means sketch follows the list):
  1. Model-Based Method
  2. Hierarchical Method
  3. Constraint-Based Method
  4. Density-Based Method
  5. Partitioning Method
  6. Grid-Based Method
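A sketch of a partitioning method, WEKA's SimpleKMeans with k = 3; iris.arff is an assumption, and the class attribute is removed first because clustering is unsupervised:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClusterDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");

            // Drop the class attribute (assumed to be the last one)
            Remove remove = new Remove();
            remove.setAttributeIndices("last");
            remove.setInputFormat(data);
            Instances input = Filter.useFilter(data, remove);

            // Partition the data into k = 3 clusters
            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(3);
            kmeans.buildClusterer(input);
            System.out.println(kmeans); // prints the cluster centroids
        }
    }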
Association
 Association is a data mining function/technique that discovers the probability of the
  co-occurrence of items in a collection. The relationships between co-occurring items
  are expressed as association rules.
 Association rules are often used to analyze sales transactions. Association is another
  prevalent data mining task.
 The goal of the association task is as follows:
  o Finding frequent item-sets
  o Finding association rules
 There are a few association rule algorithms implemented in WEKA.
   They try to find associations between different attributes instead of trying to predict the
    value of the class attribute.
   Association rules highlight all the associations and correlations between items of a
    dataset. In short, it is an if-then statement that depicts the probability of relationships
    between data items.
   A classic example of association refers to a connection between the sale of milk and
    bread. Association is also termed as market basket analysis.
 The tool provides the Apriori, FilteredAssociator, and FPGrowth algorithms for
  association rule mining in this category (an Apriori sketch follows).
 Association rule learning is based on the principle of if-then statements, e.g. "if A,
  then B".
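A sketch of rule mining with Apriori; it requires nominal attributes, and the file weather.nominal.arff (which ships in WEKA's data directory) is an assumption about the local setup:

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AssociationDemo {
        public static void main(String[] args) throws Exception {
            // Apriori needs nominal (categorical) attributes
            Instances data = DataSource.read("weather.nominal.arff");

            Apriori apriori = new Apriori();
            apriori.setNumRules(10); // mine the 10 best rules
            apriori.buildAssociations(data);
            System.out.println(apriori); // prints the discovered if-then rules
        }
    }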
Feature Selection
   Feature selection refers to the process of reducing the inputs for processing and analysis,
    or of finding the most meaningful inputs.
 It aims to select a subset of attributes/features that makes the most meaningful
  contribution to a machine learning activity.
 For example: predict the weight of students based on past information about similar
  students, which is captured in a student weight data set.
 The data set has 4 features: Enroll (roll number), Age, Height, Weight. The roll
  number has no effect on weight, so we eliminate this feature; the new data set then
  has only 3 features.
 This subset of the data set is expected to give better results than the full set; for
  instance, the reduced Age values might read 12, 11, 13, 11, 14, 12.
   Every dataset contains a lot of attributes, but several of them may not be significantly
    valuable.
   Therefore, removing the unnecessary and keeping the relevant details are very important
    for building a good model.
 WEKA provides many attribute evaluators and search methods, including BestFirst,
  GreedyStepwise, and Ranker (a selection sketch follows).
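A sketch of attribute selection combining the CfsSubsetEval evaluator with the BestFirst search method; iris.arff is an assumption:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FeatureSelectDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval()); // scores attribute subsets
            selector.setSearch(new BestFirst());        // the BestFirst search method
            selector.SelectAttributes(data);            // note: capital S in WEKA's API
            System.out.println("Selected attribute indices: "
                    + java.util.Arrays.toString(selector.selectedAttributes()));
        }
    }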
Visualize
  In the visualize tab, different plot matrices and graphs are available to show the trends
   and errors identified by the model.
 Visualization is for the depiction of data and to gain intuition about the data being
   observed.
 It assists the analysts in selecting display formats, viewer perspectives, and data
   representation schema.
 Data visualization aims to communicate data clearly and effectively through graphical
   representation.
 Data visualization has been used extensively in many applications—for example, at work
   for reporting, managing business operations, and tracking progress of tasks.
 The basic concept of data visualization has several representative approaches,
  including pixel-oriented techniques, geometric projection techniques, icon-based
  techniques, & hierarchical & graph-based techniques.
 WEKA visualizes a single dimension (1D) for single attributes & two dimensions
  (2D) for pairs of attributes, letting the user visualize the current relation in 2D plots.
***************************************************************************