

An extended process model of knowledge discovery in database

Tianrui Li and Da Ruan
Reactor Physics and Myrrha Department,
Belgian Nuclear Research Centre (SCK-CEN), Mol, Belgium

Abstract
Purpose – Much research on knowledge discovery in database (KDD) pays attention merely to data
mining, one of many interacting steps in the process of discovering previously unknown and
potentially interesting patterns in large databases, and little to the whole process. Such
approaches cannot satisfy the needs of real KDD applications. The purpose of this work is to
extend the KDD process model for practice at large.
Design/methodology/approach – A new model based on research experiences of the knowledge
discovery process is formalized as an extension of the model by Fayyad et al. A case study using a
reduct method from rough set theory illustrates why the process model is proposed and in what
situations it can be used in practice.
Findings – This model incorporates data collection in the KDD process to supply a sound framework
to better support KDD applications.
Research limitations/implications – This model reflects the nature of KDD in some tested cases.
Further research may be needed before it can be used in all other situations.
Practical implications – The model can be used in areas such as information security, medical
treatment and other information management.
Originality/value – Using this model, one can directly collect the data that are essential and useful
for the mining results. It also offers practical help to KDD researchers from both industry and
academia.
Keywords Databases, Process planning
Paper type Research paper

Introduction
Knowledge discovery in database (KDD) is the process of discovering previously
unknown and potentially interesting patterns in large databases. It is currently a fast
growing field both from an application and from a research point of view. The reason is
that companies see a high chance for deriving valuable information from huge amounts
of available data that can then be used for improving their business. Many successful
applications have been reported from varied sectors such as marketing, finance,
banking, manufacturing, and telecommunications. Different from other fields, such as
statistics, machine learning and artificial intelligence, knowledge discovery focuses on
the overall process of knowledge discovery from large volumes of data, including the
storage and accessing of such data, scaling of algorithms to massive data sets,
interpretation and visualization of results, and the modeling and support of the overall
human-machine interaction (Fayyad et al., 1996).

This work was partially supported by the National Natural Science Foundation of China (NSFC)
under the grant No. 60474022.
Data mining is a step in the KDD process consisting of an enumeration of patterns
over the data, subject to some acceptable computational-efficiency limitations. Since
the patterns enumerable over any finite dataset are potentially infinite and the
enumeration of patterns involves some form of search in a large space, computational
constraints place severe limits on the subspace that can be explored by a data mining
algorithm (Fayyad et al., 1996). Therefore, much attention in the literature has been
paid to data mining, while little has been paid to the whole process. However, it is critical
to recognize that KDD is first of all a process. If we only emphasize data mining in practice,
we may not see the difficulty of data selection, organization and presentation in the
KDD process. Anecdotal evidence (and our own experience within the KDD Project)
suggests that the other steps account for up to 95 per cent of the effort; it is important,
then, to understand the whole KDD process.
Presently, although several process models are available, they cannot satisfy the
real needs of applications, as will be illustrated in Section 4. In this paper, we characterize our
experiences of the knowledge discovery process and formalize a new model as an
extension of the model in Fayyad et al. (1996). A case study in Section 4 illustrates
why the process model is proposed and in what situations it can be used in practice. A
demonstration of the application of the process model is given by using a reduct method
from rough set theory (Pawlak, 1982).

Related work
Knowledge discovery in database is aimed at the development of methods, techniques,
and tools that support human analysts in the overall process of discovering useful
information and knowledge in databases. Many real-world knowledge discovery tasks
are both too complex to be addressed by simply applying a single learning or data
mining algorithm and too knowledge-intensive to be performed without the repeated
participation of the domain expert. Therefore, knowledge discovery is considered an
interactive and iterative process between the user and a database that may strongly involve
background knowledge of the domain expert analyzing the data. This process-centered view of
KDD is accepted by many researchers. Fayyad et al. (1996), for instance, identified the
nine steps in the KDD process shown in Figure 1, which is the most authoritative
description of the data mining process.
Brachman and Anand (1996) gave a practical view of the KDD process, emphasizing
the interactive nature of the process. Their model (see Figure 2) clearly
emphasizes the process orientation of KDD tasks and argues in favor of a more
human-centered approach for the successful development of knowledge discovery
support tools.
John (1997) addressed another process model (see Figure 3), which emphasized the
cooperation of the data mining analyst and the domain expert in the whole KDD
process.

Figure 1. Process-centered view model
Figure 2. Human-centered process model
Figure 3. John's process model

Usually a problem to be solved is well understood by an expert, and the KDD
process begins with the expert attempting to explain the problem to the analyst and the
analyst explaining the KDD process to the expert. Once they have jointly defined the
problem to be solved, the remaining steps are as follows: the analyst must collect
relevant data if it does not already exist, clean the data, engineer the data to be
maximally useful for the problem at hand, engineer a mining algorithm, run the
mining algorithm, and explain the results to the experts.
The Cross Industry Standard Process Model for Data Mining (CRISP-DM), in which data
mining is regarded as KDD, was developed as an open standard by leading KDD practitioners and
a tool supplier (Chapman et al., 2000). The current CRISP-DM Process Model for KDD
provides an overview of the life cycle of a KDD project (see Figure 4). It contains the
corresponding phases of a project, their respective tasks, and relationships between
these tasks.
Moreover, there are many other process models successfully applied to real-world
problems that are not included in this paper; please refer to Williams and Huang (1996),
Zhu et al. (1998), Witten and Frank (2000) and Gao (2000).

Figure 4. CRISP-DM's process model

A new KDD process model

The above models all reflect the nature of the KDD process, consisting of a number
of interacting, iterative steps involving various data manipulation and transformation

operations. Information flows from one step to the next, as well as backwards to
previous steps. However, none of them pays attention to data collection, which is
very important for applying KDD techniques in some real applications such as
information security and medical treatment, as illustrated in the next section. Therefore,
the new KDD process model incorporates data collection as one part of the KDD
process; it is depicted in Figure 5 and consists of an iterative sequence of the
following steps:
. Data collection: using techniques to collect data from real applications according
to the current discovery task and previous mining results. This step avoids
non-relevant features (attributes) being collected in the data set as early as possible,
reduces the time required to execute a discovery and helps to enhance the
efficiency of knowledge discovery.
. Data selection: the formulation of a data set that is appropriate for the current
discovery task. This step may require joining together multiple data sources in
order to obtain an appropriate set of examples.
. Preprocessing: eliminating or modifying examples from the selected data set that
are either noisy, inconsistent or contain missing values, e.g. blood pressure = 240.
This step improves the overall quality of the discovered information.
. Transformation: data are transformed or consolidated into forms appropriate for
the mining step.
. Data mining: the selection of a data mining type (clustering, classification,
association, sequence analysis, etc.) as well as a specific method to extract data
patterns.
. Interpretation/evaluation: visualizing and interpreting the discovered knowledge
to the user and evaluating the discovered information with respect to validity,
novelty, usefulness and simplicity.

Figure 5. A new KDD process model

This process is iterative in the sense that each step can inspire rectifications to
preceding steps. It is interactive in the sense that a user must be able to limit the
amount of work done by the system to what he is really interested in (Fayyad et al.,
1996). Moreover, the key point of this proposed model is that one can use the mining
results to direct how to collect data, i.e. which data should be collected and which not.
We believe this truly reflects the nature of KDD in some cases, even though it may not
be suitable for all situations, such as the discovery task over sequence databases in
bioinformatics. For example, this model is suitable for areas where the cost of collecting
data is huge, for cases where correct or satisfactory decision information cannot be
obtained from the data already collected, and so forth.
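To make this feedback loop concrete, the following Python sketch is our own illustration; the function names, the "hints" dictionary and the stopping rule are assumptions for this example only and are not part of the published model. It shows how an iterative KDD pipeline could let each round of mining results adjust what is collected in the next round.

```python
# Illustrative sketch of the extended KDD process: mining results feed back into
# data collection. The callables and the "hints" structure are assumptions made
# for this example only.

def run_extended_kdd(collect, select, preprocess, transform, mine, evaluate,
                     max_rounds=5):
    """Iterate the KDD steps; `collect` is guided by hints from earlier results."""
    hints = None       # no mining results yet, so collect with default settings
    patterns = None
    for _ in range(max_rounds):
        raw = collect(hints)                       # data collection
        data = transform(preprocess(select(raw)))  # selection, preprocessing, transformation
        patterns = mine(data)                      # e.g. decision rules from a reduct method
        report = evaluate(patterns, data)          # interpretation/evaluation
        if report.get("satisfactory"):
            break
        # Derive collection hints for the next round: attributes to stop collecting
        # (e.g. those outside a reduct) and attributes to start collecting
        # (e.g. those needed to resolve incompatible rules).
        hints = {
            "drop": report.get("redundant_attributes", []),
            "add": report.get("suggested_attributes", []),
        }
    return patterns
```

The design point is simply that the evaluation step returns not only a quality verdict but also directions for the next data collection round, which is where the proposed model differs from the models of Fayyad et al. (1996) and CRISP-DM.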

Case study
With the development of computer network technology, intrusion detection, which is
used to capture malicious activities occurring in computer network systems, is
becoming more and more important. Intrusion detection techniques fall into two
general categories: misuse detection and anomaly detection, which complement each
other. Misuse detection is based on knowledge of system vulnerabilities and known
attack patterns, while anomaly detection assumes that an intrusion will always reflect
some deviation from normal patterns (Sundaram, 1996). Our case study focuses on
anomaly detection techniques for intrusion detection. Presently, data mining methods
have been used to build automatic anomaly detection systems that profile the normal
system activities so that abnormal activities can be detected by comparing the current
activities with the profile; such systems performed well in the latest DARPA (Defense
Advanced Research Projects Agency) evaluations (Kemmerer and Vigna, 2002). However,
although today's anomaly detection systems detect known intrusion methods well,
they do not perform well against new intrusion techniques. From the literature and
our own experience, the main reason is that the audit techniques are ineffective; in other
words, the techniques for collecting data should be improved. Therefore, if we can use
the mining results to direct how to collect data, that is, which data should be collected
and which not, the ability of the intrusion detection system will be improved, because
we can make the intrusion detection system adjust its audit methods as soon as possible,
collect network activity data and analyze the information to determine whether an
attack is occurring.
The other reason to propose the current model is that many research areas are
strongly related to various kinds of investigations. It is highly costly to carry out
different kinds of investigations, for example in nuclear safeguards information
management (Ruan et al., 2003). As we know, what kinds of information should be
collected in the process of investigation is decided by the domain experts. However,
this can be improved if we use KDD techniques and mining results to assist the domain
experts in making decisions, namely on how to perform data collection.

Example
An information system (see Table I) from a medical domain is devised to demonstrate
how to apply this model in real applications. It has four condition attributes
(symptoms) a1, a2, a3, a4, one decision attribute (disease) d1 and six objects (cases) x1,
. . . , x6 (Pawlak, 1982). Namely, the values of the four condition attributes and one
decision attribute of all objects are collected during the data collection phase. From
this information system, clearly, the attribute set {tired, fever, sneeze} is a reduct of the
condition attribute set C (Li and Xu, 2000). Therefore, we need not collect the
information on the attribute "headache" in the subsequent data collection, since
whether it is present or not does not affect our decision making in this system.
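As a hedged illustration (our own code, not the algorithm of Li and Xu, 2000), the following Python sketch encodes Table I and checks that dropping "headache" leaves the rough-set positive region of the decision "flu" unchanged, which is why that attribute need not be collected further; it does not perform the minimality check that a full reduct computation requires.

```python
# Minimal rough-set sketch: check that dropping "headache" from Table I does not
# change the positive region of the decision "flu", i.e. "headache" is dispensable
# and need not be collected in later rounds. (Reduct minimality is not checked here.)

TABLE_I = [
    {"headache": "Yes", "tired": "Yes", "fever": "Normal",    "sneeze": "Yes", "flu": "No"},
    {"headache": "No",  "tired": "Yes", "fever": "High",      "sneeze": "No",  "flu": "Yes"},
    {"headache": "No",  "tired": "Yes", "fever": "Very high", "sneeze": "Yes", "flu": "No"},
    {"headache": "Yes", "tired": "Yes", "fever": "Normal",    "sneeze": "Yes", "flu": "No"},
    {"headache": "No",  "tired": "No",  "fever": "High",      "sneeze": "No",  "flu": "Yes"},
    {"headache": "No",  "tired": "Yes", "fever": "Very high", "sneeze": "Yes", "flu": "Yes"},
]

def blocks(rows, attrs):
    """Partition row indices into indiscernibility classes w.r.t. `attrs`."""
    part = {}
    for i, row in enumerate(rows):
        part.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return list(part.values())

def positive_region(rows, cond_attrs, dec_attr):
    """Objects whose indiscernibility class has a unique decision value."""
    pos = set()
    for block in blocks(rows, cond_attrs):
        if len({rows[i][dec_attr] for i in block}) == 1:
            pos.update(block)
    return pos

full = ["headache", "tired", "fever", "sneeze"]
reduced = ["tired", "fever", "sneeze"]
# Note: x3 and x6 fall outside the positive region even with all four attributes,
# since their condition values coincide but their decisions differ.
assert positive_region(TABLE_I, reduced, "flu") == positive_region(TABLE_I, full, "flu")
print("'headache' is dispensable: the positive region is unchanged")
```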
Moreover, according to Li et al. (2004b), two incompatible decision rules can be
obtained as follows from the reduced information system (see Table II):
(1) tired = "Yes", fever = "Very high", sneeze = "Yes" ⇒ flu = "Yes".
(2) tired = "Yes", fever = "Very high", sneeze = "Yes" ⇒ flu = "No".

At this time, the information collected previously is not sufficient for our exact decision
making. We need to update the content of data collection, for example by collecting the
symptom information on angina for the newly arriving objects shown in Table III, which
is called an incomplete information system (Li et al., 2004b). In this table, the above two
incompatible decision rules no longer exist, because we can determine whether flu is
present or not by using the condition attribute "angina".

Table I. Information system
(headache, tired, fever and sneeze form the condition attribute set C; flu is the decision attribute D)

U     headache   tired   fever       sneeze   flu
x1    Yes        Yes     Normal      Yes      No
x2    No         Yes     High        No       Yes
x3    No         Yes     Very high   Yes      No
x4    Yes        Yes     Normal      Yes      No
x5    No         No      High        No       Yes
x6    No         Yes     Very high   Yes      Yes

Table II. Reduced information system

U     tired   fever       sneeze   flu
x1    Yes     Normal      Yes      No
x2    Yes     High        No       Yes
x3    Yes     Very high   Yes      No
x4    Yes     Normal      Yes      No
x5    No      High        No       Yes
x6    Yes     Very high   Yes      Yes

Table III. Updated information system
(– denotes a value of "angina" that was not collected for the original objects)

U     tired   fever       sneeze   angina   flu
x1    Yes     Normal      Yes      –        No
x2    Yes     High        No       –        Yes
x3    Yes     Very high   Yes      –        No
x4    Yes     Normal      Yes      –        No
x5    No      High        No       –        Yes
x6    Yes     Very high   Yes      –        Yes
x7    Yes     Very high   Yes      Yes      Yes
x8    Yes     Very high   Yes      No       No

Now the decision rules that can be obtained by using the method proposed in
Li et al. (2004a) are as follows:

(1) tired = "Yes", fever = "Very high", sneeze = "Yes", angina = "Yes" ⇒ flu = "Yes".
(2) tired = "Yes", fever = "Very high", sneeze = "Yes", angina = "No" ⇒ flu = "No".
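The following Python sketch (again our own illustration, not the incremental algorithm of Li et al., 2004a, b) shows how the mining step can flag the need for new data: it detects incompatible decision rules as groups of objects that agree on every collected condition attribute but disagree on the decision. Table II yields one such conflict, while the newly collected objects x7 and x8, for which "angina" is available, yield none. Handling the missing angina values of x1-x6 would require the incomplete-information-system techniques of Li et al. (2004b), which are not reproduced here.

```python
# Illustrative check (our own sketch, not the authors' algorithm): detect
# incompatible decision rules, i.e. objects that agree on every collected
# condition attribute but disagree on the decision. Such conflicts signal that
# a new attribute (here "angina") should be added to data collection.

from collections import defaultdict

def incompatible_groups(rows, cond_attrs, dec_attr):
    """Return groups of rows with identical condition values but mixed decisions."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row.get(a) for a in cond_attrs)].append(row)
    return [g for g in groups.values() if len({r[dec_attr] for r in g}) > 1]

# Table II: the reduced system, before "angina" is collected.
table_ii = [
    {"tired": "Yes", "fever": "Normal",    "sneeze": "Yes", "flu": "No"},
    {"tired": "Yes", "fever": "High",      "sneeze": "No",  "flu": "Yes"},
    {"tired": "Yes", "fever": "Very high", "sneeze": "Yes", "flu": "No"},
    {"tired": "Yes", "fever": "Normal",    "sneeze": "Yes", "flu": "No"},
    {"tired": "No",  "fever": "High",      "sneeze": "No",  "flu": "Yes"},
    {"tired": "Yes", "fever": "Very high", "sneeze": "Yes", "flu": "Yes"},
]
print(len(incompatible_groups(table_ii, ["tired", "fever", "sneeze"], "flu")))  # 1 conflict (x3 vs x6)

# New objects x7, x8 collected together with the extra attribute "angina":
# the previously conflicting condition pattern is now split by angina.
new_objects = [
    {"tired": "Yes", "fever": "Very high", "sneeze": "Yes", "angina": "Yes", "flu": "Yes"},
    {"tired": "Yes", "fever": "Very high", "sneeze": "Yes", "angina": "No",  "flu": "No"},
]
print(len(incompatible_groups(new_objects, ["tired", "fever", "sneeze", "angina"], "flu")))  # 0 conflicts
```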

Conclusions
Data collection directly affects the mining results. Generally, the cost of data collection
is huge. If we can use KDD techniques and mining results to improve data collection,
we can not only save costs, including those of collection and preprocessing, but also
directly improve the ability of knowledge discovery. Our model incorporates data
collection in the KDD process to supply a framework that supports KDD applications
better, whereas the previous models either do not emphasize this point enough or pay
no attention to this step at all. Yet, building models is only one step towards knowledge
discovery. Our future work will realize this model in some real-world applications.
References
Brachman, R.J. and Anand, T. (1996), "The process of knowledge discovery in databases: a
human-centered approach”, Advance in Knowledge Discovery and Data Mining,
AAAI/MIT Press, Menlo Park, CA/Cambridge, MA, pp. 33-58.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. (2000),
"CRISP 1.0 process and user guide", CRISP-DM Consortium, pp. 1-15, available at: www.crisp-dm.org
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996), “The KDD process for extracting useful
knowledge from volumes of data”, Communications of the ACM, Vol. 39 No. 11, pp. 27-34.
Gao, Y.R. (2000), “Data mining and its applications to engineering diagnosis”, PhD thesis, Xi’an
Jiaotong University, Xi’an.
John, G.H. (1997), “Enhancements to the data mining process”, PhD thesis, Stanford University,
Palo Alto, CA.
Kemmerer, R.A. and Vigna, G. (2002), “Intrusion detection: a brief history and overview”,
Computer, Vol. 35 No. 4, pp. 27-30.
Li, T.R. and Xu, Y. (2000), “A generalization rough set approach to attribute generalization in
data mining”, Proceedings of the Fourth International FLINS Conference on Intelligent
Techniques and Soft Computing in Nuclear Science and Engineering, World Scientific,
pp. 126-33.
Li, T.R., Qing, K.Y., Yang, N. and Xu, Y. (2004a), “Study on reduct and core computation in
incompatible information systems”, Lecture Notes in Artificial Intelligence, Vol. 3066,
pp. 471-6.
Li, T.R., Yang, N., Xu, Y. and Ma, J. (2004b), “An incremental algorithm for mining classification
rules in incomplete information systems”, Annual Meeting of the North American Fuzzy
Information Processing Society, IEEE Press, Piscataway, NJ, pp. 446-449.
Pawlak, Z. (1982), "Rough sets", International Journal of Computer and Information Sciences, Vol. 11 No. 5,
pp. 341-56.
Ruan, D., Liu, J. and Carchon, R. (2003), “Linguistic assessment approach for managing nuclear
safeguards indicator information”, Logistics Information Management, Vol. 16 No. 6,
pp. 401-19.
Sundaram, A. (1996), “An introduction to intrusion detection”, Crossroads: The ACM Student
Magazine, Vol. 2 No. 4, pp. 3-7.
Williams, G. and Huang, Z.H. (1996), “Modelling the KDD process”, CSIRO DIT Data Mining
Technical Report, TR-DM-96013, available at: www.act.cmis.csiro.au/edm/papers/
kddmodel.pdf
Witten, I.H. and Frank, E. (2000), “Data mining: practical machine learning tools with Java
implementations”, Morgan Kaufmann, San Francisco, CA.
Zhu, T.S., Gao, W., Ling, C.X., Gao, Z.Q. and Li, J.T. (1998), “Research on KDD process model”,
Proceedings of the Sixth China Workshop on Machine Learning, Beijing, available at:
www.cs.ualberta.ca/~tszhu/paper/CWML98.doc

Further reading
Li, T.R. and Ruan, D. (2004), “A revised process model of knowledge discovery in database”,
Proceedings of the sixth International FLINS Conference on Applied Computational
Intelligence, World Scientific, pp. 185-8.
About the authors
Tianrui Li is an Associate Professor at the Department of Mathematics, Southwest Jiaotong
University, Chengdu, P. R. China, and a Postdoctoral Researcher (2005-2006) at the Belgian
Nuclear Research Centre (SCK-CEN), Mol, Belgium. He received his PhD degree from Southwest
Jiaotong University in 2002. His major research interests include mathematical modeling, data
mining, granular computing and bioinformatics. He has published over 20 research papers and
co-edited two books. Tianrui Li is the corresponding author and can be contacted at:
tli@sckcen.be
Da Ruan is a scientific staff member at the Belgian Nuclear Research Centre (SCK-CEN), and
Guest Professor in the Dept. of Applied Math. and CS at Ghent University, Belgium. He gained a
PhD degree in Mathematics from Ghent University, Belgium in 1990. He was a Post-Doctoral
Researcher at SCK-CEN from 1991-93 and since 1994 has been a senior researcher and FLINS
Project Leader at SCK-CEN. He is the principal investigator for the research project on intelligent
control for nuclear reactors at SCK-CEN. He was a guest research scientist at the OECD Halden
Reactor Project (HRP), Norway from April 2001 to September 2002 as a principal investigator for
the research project on computational intelligent systems for feed-water flow measurements at
HRP. His major research interests lie in the areas of mathematical modeling, computational
intelligence methods, uncertainty analysis and information/sensor fusion, decision support
systems and soft computing applications to information management, cost/benefit analysis
under uncertainty, nuclear reactors, and safety related engineering fields. He has authored and/or
co-authored over 60 peer-reviewed journal articles, two text books, and about 100 book chapters
and conference papers on his research topics.
