0% found this document useful (0 votes)
229 views4 pages

Egyptian Informatics Journal: O.G. El Barbary, A.S. Salama

This document proposes a new technique for feature selection in document classification using topological spaces. It introduces the concept of a topological information retrieval system (TIRS) which generalizes a traditional information retrieval system. In a TIRS, each document is represented by values on a set of keywords. Topological bases are defined on the keyword sets, allowing approximate retrieval using topological near open sets and generalizations. The relationships between documents are represented using an order relation based on these topological structures. The goal is to reduce a unique query to relevant documents using topological methods for approximate retrieval.

Uploaded by

arciblue
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
229 views4 pages

Egyptian Informatics Journal: O.G. El Barbary, A.S. Salama

This document proposes a new technique for feature selection in document classification using topological spaces. It introduces the concept of a topological information retrieval system (TIRS) which generalizes a traditional information retrieval system. In a TIRS, each document is represented by values on a set of keywords. Topological bases are defined on the keyword sets, allowing approximate retrieval using topological near open sets and generalizations. The relationships between documents are represented using an order relation based on these topological structures. The goal is to reduce a unique query to relevant documents using topological methods for approximate retrieval.

Uploaded by

arciblue
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

Egyptian Informatics Journal xxx (2018) xxx–xxx

Contents lists available at ScienceDirect

Egyptian Informatics Journal


journal homepage: www.sciencedirect.com

Full length article

Feature selection for document classification based on topology


O.G. El Barbary ⇑, A.S. Salama
Mathematics Department, Faculty of Science, Tanta University, Egypt

a r t i c l e i n f o a b s t r a c t

Article history: Feature selection is the method of how to select the best subset of the document occurring in data core
Received 11 June 2017 for using it in purposes of data mining or applications. In this paper, we introduced a new technique using
Revised 3 November 2017 topological spaces for developing Information Retrieval System (IRS). First, we introduced the definition
Accepted 5 January 2018
of topological information retrieval systems (TIRS) as a generalization of the information retrieval system.
Available online xxxx
Second, we applied some topological near open sets to these systems for feature selection. Indiscernibility
of keywords in these systems are discussed and their applications are given. We suggested and examined
Keywords:
the order relation that representing the relationships among documents of the document space.
Information retrieval system
Document classification
Ó 2018 Production and hosting by Elsevier B.V. on behalf of Faculty of Computers and Information, Cairo
Topological space University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/
Feature selection licenses/by-nc-nd/4.0/).
Near open sets
Rough sets

1. Introduction ments. The occurrences of the word in the document, in the group,
and in the whole gathering is very important for information
The beginning of the global network has greater than before, retrieval process.
through the evolution of information retrieval technique. As a good There are several methods for term selection, in this paper; we
replacement for going to the local library to look for information, will present new method for term selection using topology.
people can search on the Web. Therefore, the virtual number of The concepts of topological space (near open sets) are one of the
manual versus computer-assisted searches for information has recently mainly powerful tools of data analysis. Many researchers
shifted radically in the past few years. In addition, the automated have appeared lately, they are applied the near open sets in their
information retrieval for many document collections helped in fields, for instance information analysis in chemistry and in phy-
reading, understanding, indexing and tracking a large amount of sics. The principle of the present work is to put a starting point
data. For this cause, researchers in fields of document retrieval, using topological structures for the applications in information
computational linguistics, and textual data mining are working retrieval. Rough set theory is a mathematical tool introduced by
hard on development new methods to process these data [1–6]. Pawlak in 1982 [7], it supports the uncertainty reasoning although
This representation suffers from two major challenges the prob- qualitatively. The basic concepts and relations of this theory have
lem of feature selection, and the problem of high dimensionality. In been studied in [8,9].
the bag-of-words model, every word in the document can be Topology is the rich field of mathematics that exist in nearly all
selected as a feature, and the dimension of the feature space is branches of mathematics; in addition, it is used in many real life
equivalent to the number of different words in all of the docu- applications. We consider that the topological near open sets are
the central base for knowledge extraction from incomplete infor-
mation tables and in data processing [10–15].
⇑ Corresponding author. In this paper, we proposed the topological information retrieval
E-mail addresses: ualbarbari@su.edu.sa (O.G. El Barbary), dr_salama75@yahoo. system based on the notion of some topological near open sets. The
com (A.S. Salama). knowledge used in these systems consist of an information retrie-
Peer review under responsibility of Faculty of Computers and Information, Cairo val system. In this system, each document is represented by its
University.
values on a finite set of keywords. We defined the topological bases
on the set of keywords of this system. With the topological infor-
mation retrieval system, we can able to perform approximate
retrieval. We introduced the basic mathematical operations on
Production and hosting by Elsevier
topological systems based on general topology. We suggested

https://doi.org/10.1016/j.eij.2018.01.001
1110-8665/Ó 2018 Production and hosting by Elsevier B.V. on behalf of Faculty of Computers and Information, Cairo University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Please cite this article in press as: El Barbary OG, Salama AS. Feature selection for document classification based on topology. Egyptian Informatics J (2018),
https://doi.org/10.1016/j.eij.2018.01.001
2 O.G. El Barbary, A.S. Salama / Egyptian Informatics Journal xxx (2018) xxx–xxx

and examined the order relation that representing the relation- attribute values. Finally, f s is the information function of the
ships among documents of the document space. The approximate system.
retrieval is carried out by the reduction of the unique query. This In multi information retrieval system (MIRS), each attribute
is finished using topological methods, such as topological near s 2 KW defines a relation Rs # DS  C s by ðd; cÞ 2 Rs () c 2 f s ðxÞ.
open sets and their generalizations. By this way, each element of the document space can be descrip-
tion by means of a subset SB 2 KW of keywords, called the SB
description and denoted by SBðdÞ. The SB description SBðdÞ is
2. Feature selection
defined as follows:
Q
Feature selection is the process that leads to the reduction of SBðdÞ ¼ s2SB f s ðdÞ, such that SB 2 2C s .
dimensionality of the original data set. The selection term set The extended information retrieval system to topological infor-
should contain enough or more reliable information about the mation retrieval system (TIRS) is done by familiarizing a general
original data set. To this end, many criteria are used [16–18]. For relation R # C s  C s on the range C s .
apply the feature selection there are two ways to select it. The first By important a general relation on the set C s we can make topo-
is forward selection starts with no terms and adds them one by logical constructions on the keyword values. For every
one, at each one adding the one that reductions the mistakes. Rs ðCÞ; C # C s ; s 2 SB, we define the topology ss which has
The second is the backward selection that starts with all the terms fRs ðCÞ; C # C s g as a sub-base. TMIRS ¼ ðDS; SB;fCs : s 2 SBg;
and eliminates them one by one. Hence, eliminate the one that ff s : s 2 SBg; fss : s 2 SBgÞ is a topological information retrieval sys-
reductions the most error, in hopes no further elimination up to tem? Topology systems of attribute values SB describe the
the error. semantic closeness of attribute values and provide a simple and
convenient instrument for telling a finite set of pamphlets by a
finite and non-empty set of keywords.
3. Feature selection methods In TMIRS, we can distinguish among attribute values C # C s
topologically. If an element d # DS of the universe has f s ðdÞ # C,
Many of feature selection methods contain relied greatly on the we say that the keywords of the document d is discerned by the
analysis of the character of a particular data set through statistical
topological rough pair ðintðCÞ; clðCÞÞ with respect to the topology
or information-theoretical procedures. For text learning tasks,
ss .
there are mainly calculation on the vocabulary-specific character-
For C # C s the set intðCÞ is the set of documents, which certainly
istics of given textual data set to spot excellent term features. Even
belong to C. Also, clðCÞ is the set of documents, which possibly
though the statistics itself do not care concerning the meaning of
belong to C. The set C s  clðCÞ is the negative region of those doc-
the text, but these methods are useful for text learning tasks [19].
uments that certainly does not belong to C. This interpretation
Many feature selection methods described a statistical feature
extends to elements of the universe as shown below.
selection algorithm call RELIEF that uses instance base learning
If d is a document of the universe DS such as f s ðdÞ ¼ C, then:
to hand over a relevance weight to each feature [20].
Furthermore, that feature selection should depend not only on
 The attribute values of the intðCÞ are certainly belong to the
the features and the goal concept, but also on the induction
document values of d. We say that intðCÞ is the certain values
algorithm.
of d.
 The attribute values of clðCÞ are possibly belong to the attribute
4. Basic concepts of topology and rough sets values of d. We say that clðCÞ are the possible attribute
values of d.
A family s of subset U is a topological space be topological space  The attribute values of C s  clðCÞ are certainly do not belong to
if it satisfying the following conditions: the attribute values of d.

1. u; U 2 s. Let us classify topological multi-valued information retrieval


2. s is closed under uninformed union. systems into single– granular topological information retrieval sys-
3. s is closed under limited intersection. tems and multi-granular topological information retrieval systems.
In single–granular topological systems, we restrict elements of the
The subsets of U belong to s are called the open sets. universe to have their documents in one and only one class Rs ðCÞ
It often happens that the open sets of a topological space can be i.e., f s ðdÞ ¼ Rs ðCÞ; C # C s is a singleton for every document and for
complicated and yet they can have described using a selection of each keyword s .We do not have such a restriction in multi–gran-
simple special ones. In addition, it is chance that many topological ular topological information retrieval systems.
concepts can be characterized in terms of these simpler bases or
subbase elements. Officially, b # s is a base for ðU; sÞ if the non-
empty open subbase of U represent a union of a subfamily of b. A
family d # s is a subbase if all finite intersections construct a base. 6. Indiscernibility of keywords in topological information
clðXÞ ¼ \fY # U : X # Y; U  Y 2 sg and intðXÞ ¼ [fY # U : Y # X; retrieval systems
Y 2 sg are closure and interior of X # U, respectively.
The approximation space is a pair A ¼ ðU; RÞ, where R is called Approximately, of the keywords may not be apparent from each
equivalence relation or indiscernibility relation. Furthermore, other in the wisdom that they may have identical interior and clo-
½xR ; x 2 U is the equivalence class containing the element x. sure approximations in the topological space ðC s ; ss Þ. More for-
mally, two attributes values C; C0 # C s are indiscernible from each
5. Topological information retrieval systems other, denoted Cts C0 if and only if intðCÞ ¼ intðC0Þ and
clðCÞ ¼ clðC0Þ.
We define the information retrieval system as follows: It is easy to see that ts is an equivalence relation on the set 2C s .
IRS ¼ ðDS; KW; fC s : s 2 KWg; ff s : s 2 KWgÞ, where DS is the uni- The equivalence class of a subset C # C s in ts is denoted ½Cts and
verse of documents. KW is the set of attribute where C s is the set of is an element of the quotient set 2C s =ts

Please cite this article in press as: El Barbary OG, Salama AS. Feature selection for document classification based on topology. Egyptian Informatics J (2018),
https://doi.org/10.1016/j.eij.2018.01.001
O.G. El Barbary, A.S. Salama / Egyptian Informatics Journal xxx (2018) xxx–xxx 3

Example 6.1. Suppose a set of documents (DS) such that each 8. Experimental
document has a number of keywords (KW). Hence, a given
document may characterize by several keywords. 8.1. Preprocess

Suppose, DS= {d1, d2, d3, d4} and KW = {{KW11, KW12, KW13}, Before extracting features of testing data, some preprocessing in
{KW21, KW22}, {KW31}, {KW41}, {KW51}} such thatd1= {KW13, the text being performed. All the experiments are performed after
KW21, KW31}, d2= {KW12, KW21, KW41}, d3= {KW12, KW22, normalizing the text. In normalization process the text is converted
KW31, KW51} and d4= {KW11, KW22, KW41, KW51}. to UTF-8 encoded and punctuations and non-letters are removed.
As shown above the Rw blocks are named Rs ðd1Þ, Rs ðd2Þ, Rs ðd3Þ They are very common words that appear in the text that carry
and Rs ðd4Þ. To obtain the topological elementary sets which little meaning; they serve only a syntactic function but do not indi-
discernible by Rw , we construct the topological space ðC s ; ss Þ where cate the subject matter. These stop words have two different
ss is the topology generated by the subbase S ¼ fRs ðdiÞ : impacts on information retrieval process. They can affect the
i ¼ 1; 2; . . . ; 4g. retrieval effectiveness because they have a very high frequency
The base of this topology is given by b ¼ fRsðdiÞ\ and tend to diminish the impact of the frequency difference among
RsðdjÞ : i; j ¼ 1; 2; 3; 4g, hence we have b = {d1, d2, d3, d4, {KW21}, less common words. Identifying a stop words list or a stop-list that
{KW13}, u,{KW12}, {KW41}, {KW22, KW51}}. contains such words in order to eliminate them from text process-
Now if we consider a document di with the keyword ing is essential to an information retrieval system. We explore the
C0 ¼ fKW11; KW51g, then the interior approximation intðC0Þ ¼ u use of stop words and their effect on Arabic information retrieval. A
and clðC0Þ ¼ d4 hence, C0 is not definable in our topological space. general stop-list is created, based on the Arabic language structure
But C0 can be approximated using the topological rough pair and characteristics without any additions.
ðu; d4Þ. Hence, we can only state that the attribute values
‘‘KW11”, ‘‘KW21”, ‘‘KW41” and ‘‘KW51” are possibly keywords of 8.2. Experimental evaluation
di according to the interpretation of topological rough pairs.
8.2.1. Simulation results for classification
7. Document classification in topological information retrieval
systems  Input data: (keywords, text),
 Output: classified documents according to feature selection by
A TIRC has a purpose to group documents of a set into classes or topology.
categories according to some knowledge.
Given a topological single–granular information retrieval sys- We have used about 130 keywords selected by a human from
tem S = (DS, {ss : s 2 KW}, {Cs: s 2 KW}, f s ) and an attribute w of the corpus.
KW, a topological classification over Cs, or class topological clas- In information retrieval content, precision and recall are defined
sification, represents a type of knowledge about the information in expressions of a set of retrieved documents (e.g. the list of doc-
retrieval system S. This specific type of knowledge about the uments created by a web search engine for a query) and a set of
information retrieval system S with respect to an attribute s 2 related documents (e.g. the list of all documents on the internet
KW is defined to be a set of subsets of C s , denoted LI , and that are relevant for a certain topic), cf. relevance.
defined as: Precision is the number of correct outcomes divided by the num-
ber of all returned outcomes. Namely, precision (PðdiÞ) of documents
LI ¼ fC s : s 2 I; 8s – m; intðC s Þ \ clðC m Þ ¼ ug
di is the division of the cardinality of the intersection of retrieved
By forming the subsets C s we are able to express that some doc- documents (RetðdiÞ) with the relevant documents (RelðdiÞ) that are
uments are reclassified or assigned to a category S in the index relevant to the query by the cardinality of retrieved documents:
set I.
We defined the set LS as: PðdiÞ ¼ jRelðdiÞ \ RetðdiÞj=jRetðdiÞj:

LS ¼ fE \ C s : C s 2 LI ; E 2 C a =  ss g; where E \ C s ¼ [fC \ C s : C 2 Eg: Precision is second-hand with recall, the percent of all relevant
documents that is returned by the search. The two measures are
The set LS is the set of classes that are assigned to category sometimes used together to present a single measurement for a
n in our topological single–granular information retrieval system called F measure.
system. Recall is the number of correct outcomes divided by the number
Let LS be a topological classification and C 2 LS for a fixed set S. of outcomes that should have been returned. Recall (RðdiÞ) of doc-
We use the usual rough set interpretation; hence, given an attri- uments di is the division of the cardinality of the intersection of
bute value cs 2 C the following interpretation is used: retrieved documents (RetðdiÞ) with the relevant documents
(RelðdiÞ) that are relevant to the query by the cardinality of rele-
 If cs 2 intðCÞ then we say that cs is certainly belong to the vant documents:
category S, denoted by the topological classification rule
Ok RðdiÞ ¼ jRelðdiÞ \ RetðdiÞj=jRelðdiÞj:
cs ! n.
 If cs 2 clðCÞ then we say that cs is possibly belong to the category The F measure (FðdiÞ) is the division of twice precision and
n, denoted by the topological classification rule cs ! n.
? recall by the sum of precision and recall.

2  PðdiÞ  RðdiÞ
Note that the condition intðC s Þ \ clðC m Þ ¼ u, Used in the defini- FðdiÞ ¼ :
PðdiÞ þ RðdiÞ
tion of LS , ensures that an attribute value cannot certainly assign to
category S and at the same time, possibly assigned to another cat-
egory m. Consequently, we cannot certainly assign an attribute 8.3. Experimental data
value to two different categories. Topological classification rules
allow us to represent knowledge without the need to discern all For evaluating the performance of our approach, we take on the
keywords of an information retrieval system. performance measures Precision (P) and Recall (R) in our system.

Please cite this article in press as: El Barbary OG, Salama AS. Feature selection for document classification based on topology. Egyptian Informatics J (2018),
https://doi.org/10.1016/j.eij.2018.01.001
4 O.G. El Barbary, A.S. Salama / Egyptian Informatics Journal xxx (2018) xxx–xxx

Table 1 one (using a subset of the features). We present a new technique


Precision, recall and F- measure for document classification using TIRC method. for term selection using topological space. Moreover, introduced
Name of field Precision Recall F-measure the topological classification method.
medicine 0.67 0.91 0.77
Arts 0.74 0.66 0.69 References
Commercial 0.72 0.84 0.77
Politics 0.98 0.79 0.87 [1] Abd El-Monsef ME, El-Sayed Atlam, Amin M, El-Barbary O. Arabic document
Technology 0.57 0.86 0.68 classification: a comparative study. J Comput 2011; 3(4).
Science 0.44 0.75 0.55 [2] Abd El-Monsef ME, El-Sayed Atlam, Amin M, El-Barbary O. Field association
History 0.67 0.84 0.75 words with Naive Bayes classifier base Arabic document classification. IJCSI
Sport 0.59 0.64 0.73 2011; 8(3).
Health 0.81 0.95 0.87 [3] El-Sayed Atlam, El-Barbary O. Combining FA words with vector space models
for Arabic text categorization. Inf-Int Interdisciplinary J 2013; 16(6) (A): 3517–
Economics 0.59 0.79 0.69
28.
Biology 0.78 0.98 0.88
[4] Atlam El-Sayed, El-Barbary O. Arabic document summarization using fuzzy
ontology. Int J Innovative Comput 2014;10(4):1351–67.
[5] El-Barbary OG. Using field association words and K-means cluster for Arabic
document classification. Ciniea ea Tecnia J 2015;30(3):287–99.
Our experiments, trained the system using documents collected [6] El-Barbary OG. Using Arabic Skelton morphology and maximum entropy for
from the Internet. Our data are composed of Al-Jazeera news, Al- Arabic document classification. Br J Math Comput Sci 2016;14(3).
Ahram broadsheet, Al-Watan paper, Al Akhbar, Al Arabiya and [7] Pawlak Z. Rough sets. IJCIS 1982;11:341–56.
[8] Pawlak Z, Skowron A. Rough sets: some extensions. Inf Sci 2007;177:28–40.
Wikipedia the free encyclopedia and more. The number of files in [9] Pawlak Z, Skowron A. Rough sets and Boolean reasoning information sciences
our corpus is 1819 files and it is about 26.4 megabytes. There were 2007;177:41–73.
various topics related to Sports, Computers, Politics, Economics, [10] Rosario SF, Thangadurai K. Karur and Tamil Nadu, RELIEF: Feature Selection
Approach. ijird, 2015; 4(11).
etc. The corpus partitioned into 11 super fields and 24 subfields.
[11] Salama AS, El-Barbary OG. New topological approaches for data granulation. J
Precision, Recall, and F-measure are measured extracted from the Software Eng Appl 2013. doi: https://doi.org/10.4236/jsea. 2013.67B001,1-6.
evaluation results, it turns out as the results given in Table 1. Effec- [12] Salama AS, El-Barbary OG. Future applications of topology by computer
programming. Life Sci J 2014; 11(4): 168–72.
tiveness of document retrieval system is evaluated by using pair
[13] Salama AS, El-Barbary OG. New approach to vocabulary mining and document
wise document comparison. Comparison of Recall and Precision classification. Life Sci J 2014: 84–91.
allows ranking of the retrieval efficiency. [14] Salama AS, El-Barbary OG. Multi topological approximations of rough set
The F measure accuracy of Table 1 record the maximum result theory. Int J Granular Comput, Rough Sets Intell Syst 2012: 1–19. http://doi.
org/10.1504/IJGCRSIS.2013.054120.
in the super field Biology, its reach to 0.88. This is due to the high [15] Salama AS, El-Barbary OG. Fuzzy-rough set and fuzzy ID3 decision approaches
recall that is 0.98. If we calculate the average mean for all experi- to knowledge discovery in datasets. J Fuzzy Set Valued Anal 2012: 1–25.
ment, it will be about 0.75 accuracy for the F measure. This is con- http://doi.org/10.5899/2012/jfsva-00118.
[16] Jannik Strtgen, Leon Derczynski, Ricardo Campos, Omar Alonso. Time and
sider a very good result for document classification. information retrieval: introduction to the special issue. Inf Process Manage
2015; 51(6): 786–90.
[17] Silvestri Fabrizio, De Francisci Gianmarco, Morales Roi Blanco. Into the news:
9. Conclusion and future work
online news retrieval using closed captions. Inf Process Manage 2015;51
(1):148–62.
We suggested a new method depending on the neighborhoods [18] Shakery Azadeh, Rahimi Razieh, King Irwin. Extracting translations from
for improving information retrieval. The proposed model has many comparable corpora for Cross-Language Information Retrieval using the
language modeling framework. Inf Process Manage 2016;52(2):299–318.
advantages such as topological information retrieval systems that [19] Zhu William. Topological approaches to covering rough sets. Inf Sci, Inf
afford approximate retrieval may play an important role in the Comput Sci Intell Syst Appl 2007;177:1499–508.
wild development of existing information. [20] Zhenjun Zhang, Bo Jiang, Feiyue Qiu, Liping Wang, Bi-level weighted multi-
view clustering via hybrid particle swarm optimization. Inf Process Manage
In addition, we can view feature selection as a technique for 2016; 52(3): 387–398.
replacing a complex classifier (using all features) with a simpler

Please cite this article in press as: El Barbary OG, Salama AS. Feature selection for document classification based on topology. Egyptian Informatics J (2018),
https://doi.org/10.1016/j.eij.2018.01.001

You might also like