Feature selection for document classification based on topology
O.G. El Barbary, A.S. Salama
Egyptian Informatics Journal
Article history: Received 11 June 2017; Revised 3 November 2017; Accepted 5 January 2018; Available online xxxx

Keywords: Information retrieval system; Document classification; Topological space; Feature selection; Near open sets; Rough sets

Abstract: Feature selection is the method of selecting the best subset of the features occurring in a document collection for use in data mining or other applications. In this paper, we introduce a new technique that uses topological spaces to develop an information retrieval system (IRS). First, we introduce the definition of a topological information retrieval system (TIRS) as a generalization of the information retrieval system. Second, we apply some topological near open sets to these systems for feature selection. The indiscernibility of keywords in these systems is discussed and applications are given. We suggest and examine the order relation that represents the relationships among documents of the document space.

Corresponding author e-mail addresses: ualbarbari@su.edu.sa (O.G. El Barbary), dr_salama75@yahoo.com (A.S. Salama).

© 2018 Production and hosting by Elsevier B.V. on behalf of Faculty of Computers and Information, Cairo University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction

The growth of the global network has, more than ever before, driven the evolution of information retrieval techniques. Instead of going to the local library to look for information, people can search the Web. Therefore, the balance of manual versus computer-assisted searches for information has shifted radically in the past few years. In addition, automated information retrieval over many document collections has helped in reading, understanding, indexing and tracking large amounts of data. For this reason, researchers in the fields of document retrieval, computational linguistics, and textual data mining are working hard on developing new methods to process these data [1-6].

The bag-of-words representation suffers from two major challenges: the problem of feature selection and the problem of high dimensionality. In the bag-of-words model, every word in a document can be selected as a feature, and the dimension of the feature space equals the number of different words in all of the documents. The number of occurrences of a word in a document, in a group of documents, and in the whole collection is very important for the information retrieval process.

There are several methods for term selection; in this paper, we present a new method for term selection using topology.

The concepts of topological spaces (near open sets) are among the most powerful recent tools of data analysis. Many researchers have lately applied near open sets in their fields, for instance to information analysis in chemistry and in physics. The aim of the present work is to provide a starting point for applying topological structures in information retrieval. Rough set theory is a mathematical tool introduced by Pawlak in 1982 [7]; it supports uncertainty reasoning, although only qualitatively. The basic concepts and relations of this theory have been studied in [8,9].

Topology is a rich field of mathematics that appears in nearly all branches of mathematics; in addition, it is used in many real-life applications. We consider topological near open sets to be the central basis for knowledge extraction from incomplete information tables and for data processing [10-15].

In this paper, we propose a topological information retrieval system based on the notion of some topological near open sets. The knowledge used in these systems consists of an information retrieval system in which each document is represented by its values on a finite set of keywords. We define topological bases on the set of keywords of this system. With the topological information retrieval system, we are able to perform approximate retrieval. We introduce the basic mathematical operations on topological systems based on general topology. We suggest and examine the order relation that represents the relationships among documents of the document space. The approximate retrieval is carried out by the reduction of the unique query. This is done using topological methods, such as topological near open sets and their generalizations.
2. Feature selection

Feature selection is the process that leads to the reduction of the dimensionality of the original data set. The selected term set should contain enough reliable information about the original data set. To this end, many criteria are used [16-18]. Feature selection can be applied in two ways. The first is forward selection, which starts with no terms and adds them one by one, at each step adding the term that reduces the error the most. The second is backward selection, which starts with all the terms and eliminates them one by one, at each step eliminating the term whose removal reduces the error the most, until no further elimination improves the error.
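For concreteness, a minimal Python sketch of the forward variant is given below. It is our own illustration, not an algorithm from the paper, and it assumes an error(term_subset) estimator such as a cross-validated classification error.

def forward_select(terms, error, max_terms=None):
    """Greedy forward selection: start with no terms and repeatedly add the term
    whose inclusion lowers the error estimate the most; stop when nothing helps."""
    selected = []
    best_err = error(selected)
    while max_terms is None or len(selected) < max_terms:
        candidates = [t for t in terms if t not in selected]
        if not candidates:
            break
        term, err = min(((t, error(selected + [t])) for t in candidates),
                        key=lambda pair: pair[1])
        if err >= best_err:
            break                      # no remaining term reduces the error
        selected.append(term)
        best_err = err
    return selected

# Usage (hypothetical): forward_select(vocabulary, error=cross_validated_error)

Backward selection is the mirror image: start from the full term set and repeatedly remove the term whose elimination lowers the error estimate the most.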
3. Feature selection methods

Many feature selection methods have relied greatly on the analysis of the characteristics of a particular data set through statistical or information-theoretic procedures. For text learning tasks, they mainly compute vocabulary-specific characteristics of a given textual data set in order to spot good term features. Even though the statistics themselves do not care about the meaning of the text, these methods are useful for text learning tasks [19]. One statistical feature selection algorithm, called RELIEF, uses instance-based learning to assign a relevance weight to each feature [20]. Furthermore, feature selection should depend not only on the features and the goal concept, but also on the induction algorithm.
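As an illustration of such instance-based weighting, the following simplified RELIEF-style sketch in Python is our own (the original RELIEF algorithm in [20] differs in its details); X is assumed to be a numeric document-term matrix, y the class labels, and at least two classes are assumed to be present.

import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    """Simplified RELIEF: for each sampled instance, compare it with its nearest
    hit (same class) and nearest miss (other class) and update feature weights."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # avoid division by zero
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))
        dist = np.abs(X - X[i]).sum(axis=1)    # L1 distance to the sampled instance
        dist[i] = np.inf                       # an instance is not its own neighbour
        hit = np.argmin(np.where(y == y[i], dist, np.inf))
        miss = np.argmin(np.where(y != y[i], dist, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / (span * n_iter)
    return w                                   # larger weight = more relevant feature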
4. Basic concepts of topology and rough sets

A family τ of subsets of a universe U is a topology on U (and (U, τ) is a topological space) if it satisfies the following conditions:

- U and ∅ belong to τ;
- the union of any subfamily of τ belongs to τ;
- the intersection of any finite subfamily of τ belongs to τ.

In an information retrieval system, each attribute (keyword) s ∈ KW takes values in a set C_s of attribute values; finally, f_s is the information function of the system, which assigns to each document d ∈ DS its set of values f_s(d) ⊆ C_s.

In a multi information retrieval system (MIRS), each attribute s ∈ KW defines a relation R_s ⊆ DS × C_s by (d, c) ∈ R_s ⟺ c ∈ f_s(d). In this way, each element of the document space can be described by means of a subset SB ⊆ KW of keywords, called the SB description and denoted by SB(d). The SB description SB(d) is defined as follows:

SB(d) = ∏_{s∈SB} f_s(d), where SB ⊆ KW.

The information retrieval system is extended to a topological information retrieval system (TIRS) by introducing a general relation R ⊆ C_s × C_s on the range C_s. By imposing a general relation on the set C_s we can make topological constructions on the keyword values. For every R_s(C), C ⊆ C_s, s ∈ SB, we define the topology τ_s which has {R_s(C) : C ⊆ C_s} as a sub-base. Then TMIRS = (DS, SB, {C_s : s ∈ SB}, {f_s : s ∈ SB}, {τ_s : s ∈ SB}) is a topological multi information retrieval system. These topologies on the attribute values describe the semantic closeness of attribute values and provide a simple and convenient instrument for describing a finite set of documents by a finite and non-empty set of keywords.

In a TMIRS, we can distinguish among attribute values C ⊆ C_s topologically. If an element d ∈ DS of the universe has f_s(d) ⊆ C, we say that the keywords of the document d are discerned by the topological rough pair (int(C), cl(C)) with respect to the topology τ_s.

For C ⊆ C_s, the set int(C) is the set of values that certainly belong to C, and cl(C) is the set of values that possibly belong to C. The set C_s \ cl(C) is the negative region, consisting of those values that certainly do not belong to C. This interpretation extends to elements of the universe as shown below.

If d is a document of the universe DS such that f_s(d) = C, then:

- The attribute values in int(C) certainly belong to the values of d; we say that int(C) is the set of certain values of d.
- The attribute values in cl(C) possibly belong to the values of d; we say that cl(C) is the set of possible values of d.
- The attribute values in C_s \ cl(C) certainly do not belong to the values of d.
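On a small finite value set C_s this construction can be carried out by brute force. The following Python sketch is our own illustration (the names topology_from_subbase, interior, closure and negative_region are not from the paper): it generates the topology from a sub-base and reads off int(C), cl(C) and the negative region used above.

from itertools import combinations

def topology_from_subbase(subbase, universe):
    """Topology on a small finite universe generated by a sub-base:
    close under finite intersections (base), then under arbitrary unions.
    Brute force; intended only for small illustrative examples."""
    subbase = [frozenset(s) for s in subbase]
    base = {frozenset(universe)}               # empty intersection = whole space
    for r in range(1, len(subbase) + 1):
        for combo in combinations(subbase, r):
            base.add(frozenset.intersection(*combo))
    topology = {frozenset()}                   # empty union = empty set
    for r in range(1, len(base) + 1):
        for combo in combinations(base, r):
            topology.add(frozenset.union(*combo))
    return topology

def interior(C, topology):
    """int(C): the largest open set contained in C."""
    return frozenset().union(*(O for O in topology if O <= frozenset(C)))

def closure(C, topology, universe):
    """cl(C): the complement of the largest open set disjoint from C."""
    missing = frozenset().union(*(O for O in topology if not (O & frozenset(C))))
    return frozenset(universe) - missing

def negative_region(C, topology, universe):
    """The negative region: values that certainly do not belong to C."""
    return frozenset(universe) - closure(C, topology, universe)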
Example 6.1. Suppose a set of documents DS such that each document has a number of keywords from KW. Hence, a given document may be characterized by several keywords.

Suppose DS = {d1, d2, d3, d4} and KW = {{KW11, KW12, KW13}, {KW21, KW22}, {KW31}, {KW41}, {KW51}}, such that d1 = {KW13, KW21, KW31}, d2 = {KW12, KW21, KW41}, d3 = {KW12, KW22, KW31, KW51} and d4 = {KW11, KW22, KW41, KW51}.

The R_s blocks are R_s(d1), R_s(d2), R_s(d3) and R_s(d4), where R_s(di) is the set of keywords of the document di. To obtain the topological elementary sets discernible by R_s, we construct the topological space (C_s, τ_s), where τ_s is the topology generated by the sub-base S = {R_s(di) : i = 1, 2, ..., 4}.

The base of this topology is given by β = {R_s(di) ∩ R_s(dj) : i, j = 1, 2, 3, 4}; hence, identifying each document with its keyword set, we have β = {d1, d2, d3, d4, {KW21}, {KW31}, ∅, {KW12}, {KW41}, {KW22, KW51}}.

Now if we consider a document di with the keyword set C0 = {KW11, KW51}, then the interior approximation is int(C0) = ∅ and the closure is cl(C0) = {KW11, KW22, KW51}; hence C0 is not definable in our topological space. But C0 can be approximated using the topological rough pair (∅, {KW11, KW22, KW51}). Hence, we can only state that the attribute values KW11, KW22 and KW51 are possibly keywords of di according to the interpretation of topological rough pairs.
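The numbers in Example 6.1 can be checked mechanically. The short Python sketch below (our own illustration) builds the base from the keyword sets R_s(di) and their pairwise intersections and computes int(C0) and cl(C0) for C0 = {KW11, KW51}, reproducing the rough pair (∅, {KW11, KW22, KW51}).

from itertools import combinations

# Keyword sets of the documents in Example 6.1 (R_s(di) = keywords of di).
docs = {
    "d1": {"KW13", "KW21", "KW31"},
    "d2": {"KW12", "KW21", "KW41"},
    "d3": {"KW12", "KW22", "KW31", "KW51"},
    "d4": {"KW11", "KW22", "KW41", "KW51"},
}
universe = set().union(*docs.values())                 # C_s: all keyword values

# Base: the documents themselves plus their pairwise intersections.
base = {frozenset(d) for d in docs.values()}
base |= {frozenset(a & b) for a, b in combinations(docs.values(), 2)}

C0 = {"KW11", "KW51"}
interior = set().union(*(b for b in base if b <= C0))          # union of base sets inside C0
closure = universe - set().union(*(b for b in base if not b & C0))

print("int(C0) =", interior)   # -> set()                 (the empty set)
print("cl(C0)  =", closure)    # -> {'KW11', 'KW22', 'KW51'}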
7. Document classification in topological information retrieval systems

A topological information retrieval classification (TIRC) has the purpose of grouping the documents of a set into classes or categories according to some knowledge.

Given a topological single-granular information retrieval system S = (DS, {τ_s : s ∈ KW}, {C_s : s ∈ KW}, f_s) and an attribute s ∈ KW, a topological classification over C_s, or class topological classification, represents a type of knowledge about the information retrieval system S. This specific type of knowledge about the information retrieval system S with respect to an attribute s ∈ KW is defined to be a set of subsets of C_s, denoted L_I and defined as:

L_I = {C_s : s ∈ I, ∀ s ≠ m, int(C_s) ∩ cl(C_m) = ∅}

By forming the subsets C_s we are able to express that some documents are reclassified or assigned to a category s in the index set I.

We define the set L_S as:

L_S = {E ∩ C_s : C_s ∈ L_I, E ∈ C_a/τ_s}, where E ∩ C_s = ∪{C ∩ C_s : C ∈ E}.

The set L_S is the set of classes that are assigned to category n in our topological single-granular information retrieval system.

Let L_S be a topological classification and C ∈ L_S for a fixed set S. We use the usual rough set interpretation; hence, given an attribute value c_s ∈ C, the following interpretation is used:

- If c_s ∈ int(C), then we say that c_s certainly belongs to the category n, denoted by the topological classification rule c_s →(Ok) n.
- If c_s ∈ cl(C), then we say that c_s possibly belongs to the category n, denoted by the topological classification rule c_s →(?) n.

Note that the condition int(C_s) ∩ cl(C_m) = ∅, used in the definition of L_S, ensures that an attribute value cannot be certainly assigned to a category s and at the same time possibly assigned to another category m. Consequently, we cannot certainly assign an attribute value to two different categories. Topological classification rules allow us to represent knowledge without the need to discern all keywords of an information retrieval system.
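Once the topology on C_s is available, the two rules above can be applied mechanically. The following Python sketch is our own illustration (the interior and closure are recomputed inline, as in the earlier sketch); category is a set of attribute values and topology a collection of open sets given as frozensets.

def classify_value(value, category, topology, universe):
    """Topological classification of an attribute value c_s with respect to a
    category C of attribute values, following the int/cl interpretation above."""
    category = frozenset(category)
    interior = frozenset().union(*(O for O in topology if O <= category))
    closure = frozenset(universe) - frozenset().union(
        *(O for O in topology if not (O & category)))
    if value in interior:
        return "certain"    # rule  c_s -(Ok)-> n
    if value in closure:
        return "possible"   # rule  c_s -(?)-> n
    return "excluded"       # value lies in the negative region C_s \ cl(C)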
8. Experimental

8.1. Preprocessing

Before extracting features from the test data, some preprocessing of the text is performed. All the experiments are performed after normalizing the text. In the normalization process the text is converted to UTF-8 encoding, and punctuation and non-letters are removed. Stop words are very common words that appear in the text but carry little meaning; they serve only a syntactic function and do not indicate the subject matter. These stop words have different impacts on the information retrieval process: they can affect retrieval effectiveness because they have a very high frequency and tend to diminish the impact of the frequency differences among less common words. Identifying a stop-word list (stop-list) that contains such words, in order to eliminate them from text processing, is essential to an information retrieval system. We explore the use of stop words and their effect on Arabic information retrieval. A general stop-list is created, based on the structure and characteristics of the Arabic language, without any additions.
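A minimal Python sketch of this normalization and stop-word filtering is given below; the tiny stop-list is purely illustrative, since the paper's general Arabic stop-list is not reproduced here.

import re
import unicodedata

# Illustrative stop-list only; the paper builds its own general Arabic stop-list.
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "أن", "هذا", "التي", "هو", "هي"}

def normalize(text):
    """Normalize the Unicode form and drop punctuation, digits and other non-letters."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[^\w\s]|[\d_]", " ", text)

def tokenize(text):
    """Split normalized text into tokens and remove stop words."""
    return [tok for tok in normalize(text).split() if tok not in STOP_WORDS]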
8.2. Experimental evaluation

8.2.1. Simulation results for classification

Input data: (keywords, text).
Output: classified documents according to feature selection by topology.

We have used about 130 keywords selected by a human from the corpus.

In the information retrieval context, precision and recall are defined in terms of a set of retrieved documents (e.g. the list of documents produced by a web search engine for a query) and a set of relevant documents (e.g. the list of all documents on the internet that are relevant to a certain topic); cf. relevance.

Precision is the number of correct outcomes divided by the number of all returned outcomes. Namely, the precision P(di) of documents di is the cardinality of the intersection of the retrieved documents Ret(di) with the documents Rel(di) that are relevant to the query, divided by the cardinality of the retrieved documents:

P(di) = |Rel(di) ∩ Ret(di)| / |Ret(di)|.

Precision is used together with recall, the percentage of all relevant documents that is returned by the search. The two measures are sometimes combined into a single measurement for a system, called the F measure.

Recall is the number of correct outcomes divided by the number of outcomes that should have been returned. The recall R(di) of documents di is the cardinality of the intersection of the retrieved documents Ret(di) with the relevant documents Rel(di), divided by the cardinality of the relevant documents:

R(di) = |Rel(di) ∩ Ret(di)| / |Rel(di)|.

The F measure F(di) is twice the product of precision and recall divided by the sum of precision and recall:

F(di) = 2 · P(di) · R(di) / (P(di) + R(di)).
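These measures can be computed per query directly from the retrieved and relevant document sets, as in the following Python sketch (the document identifiers in the usage line are illustrative).

def precision_recall_f(retrieved, relevant):
    """Precision, recall and F measure for one query, from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

# Example: 3 of the 4 returned documents are relevant, out of 5 relevant overall.
print(precision_recall_f({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d5", "d9"}))
# -> (0.75, 0.6, 0.666...)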
8.3. Experimental data

For evaluating the performance of our approach, we adopt the performance measures Precision (P) and Recall (R) in our system.