International Journal of Scientific & Engineering Research Volume 2, Issue 6, June-2011 1
ISSN 2229-5518
Evolving Data Mining Algorithms on the
Prevailing Crime Trend – An Intelligent Crime
Prediction Model
A. Malathi and Dr. S. Santhosh Baboo
Abstract— Crime is behavior that deviates from the norms of society and causes loss and harm to people. Crimes are a social nuisance and cost our society dearly in several ways. In this paper we look at the use of missing value and clustering algorithms for crime data using data mining. We look at the MV (missing value) algorithm and the Apriori algorithm, with some enhancements, to aid in filling missing values and identifying crime patterns. We applied these techniques to real crime data. Crime prevention is a significant issue that people have been dealing with for centuries. We also use a semi-supervised learning technique in this paper for knowledge discovery from the crime records and to help increase the predictive accuracy.
Index Terms— Crime-patterns, clustering, data mining, law-enforcement, Apriori.
—————————— ——————————
A. Malathi, Assistant Professor, PG and Research Department of Computer Science, Government Arts College, Coimbatore. She is currently pursuing a Doctorate program in the Research and Development Centre, Bharathiar University, India, PH-09942526000. E-mail: malathi.arunachalam@yahoo.com
Dr. S. Santhosh Baboo, Reader, Post Graduate and Research Department of Computer Science, D. G. Vaishnav College, Chennai, India. PH-0999. E-mail: santhos2001@sify.com
IJSER © 2011
http://www.ijser.org

1 INTRODUCTION
Crime is a behavior disorder that is an integrated result of social, economic and environmental factors. Crimes are a social nuisance and cost our society in several ways. Crime analysis is gaining significance in the world today, and one of its most popular disciplines is crime prediction. Stakeholders intend to forecast the place, time, number and types of crimes in order to take precautions. With these intentions in mind, a crime prediction model is developed in this paper.

Today, security is considered one of the major concerns, and the issue continues to grow in intensity and complexity. Security is given top priority by political leaders and governments worldwide, which are aiming to reduce crime incidence [5]. In light of serious incidents such as the September 11, 2001 attack, the Indian Parliament attack of 2001 and the Taj Hotel attack of 2008, and amid growing concerns about theft, arms trafficking and murders, the importance of crime analysis based on previous history is growing. Law enforcement agencies are actively collecting domestic and foreign intelligence to prevent future attacks.

The model is built using crime data for the years 2006 to 2010. The methodology starts by obtaining clusters with different clustering algorithms. The clustering methods are then compared to select the most appropriate clustering algorithm. Later, the crime data is divided into daily epochs to observe the spatiotemporal distribution of crime. To predict crime in the time dimension, a model is fitted for each weekday, and the forecasted crime occurrences in time are then disaggregated according to the spatial crime cluster patterns. Hence the model proposed in this work can give crime predictions in both space and time to help police departments in tactical and planning operations.

The high volume of crime datasets and the complexity of the relationships between these kinds of data have made criminology an appropriate field for applying data mining techniques. Identifying crime characteristics is the first step for developing further analysis. The knowledge gained from data mining approaches is a very useful tool that can help and support police forces [8]. According to [9], solving crimes is a complex task that requires human intelligence and experience, and data mining is a technique that can assist with crime detection problems. The idea here is to capture years of human experience in computer models via data mining.

In the present scenario, criminals are becoming technologically sophisticated in committing crimes [1]. Therefore, the police need a crime analysis tool to catch criminals and to remain ahead in the eternal race between criminals and law enforcement. The police should use current technologies [4] to give themselves the much-needed edge. The availability of relevant and timely information is of utmost necessity in the daily business and activities of the police, particularly in crime investigation and the detection of criminals. Police organizations everywhere have been handling a large amount of such information and a huge volume of records. There is an urgent need to analyze the increasing number of crimes, which amount to approximately 17 lakh Indian Penal Code (IPC) crimes and 38 lakh local and Special Law
crimes per year.

An ideal crime analysis tool should be able to identify crime patterns quickly and efficiently for future crime pattern detection and action. However, in the present scenario, the following major challenges are encountered.

o The increase in the size of the crime information that has to be stored and analyzed.
o The problem of identifying techniques that can accurately and efficiently analyze these growing volumes of crime data.
o The different methods and structures used for recording crime data.
o The available data is inconsistent and incomplete, which makes formal analysis far more difficult.
o The investigation of a crime takes a long time due to the complexity of the issues.

All the above challenges motivated this research work to focus on providing solutions that can enhance the process of crime analysis for identifying and reducing crime in India. The main aim of this research work is to develop analytical data mining methods that can systematically address the complex problems related to various forms of crime. Thus, the main focus is to develop a crime analysis tool that assists the police in

o Detecting crime patterns and performing crime analysis
o Providing information to formulate strategies for crime prevention and reduction
o Identifying and analyzing common crime patterns to reduce further occurrences of similar incidents

The present research work proposes the use of an amalgamation of data mining techniques that are linked by the common aim of developing such a crime analysis tool. For this purpose, the following specific objectives were formulated.

o To develop a data cleaning algorithm that
   - cleans the crime dataset by removing unwanted data
   - uses techniques to fill missing values in an efficient manner
o To explore and enhance clustering algorithms to identify crime patterns from historical data
o To explore and enhance classification algorithms to predict future crime behaviour based on previous crime trends
o To develop anomaly detection algorithms to identify changes in crime patterns

Clustering techniques do not have a set of predefined classes for assigning items. Some researchers use the statistics-based concept space algorithm to automatically associate different objects such as persons, organizations, and vehicles in crime records [7]. Using link analysis techniques to identify similar transactions, the Financial Crimes Enforcement Network AI System [10] exploits Bank Secrecy Act data to support the detection and analysis of money laundering and other financial crimes. Clustering crime incidents can automate a major part of crime analysis but is limited by the high computational intensity typically required.

2 LITERATURE REVIEW
Data mining in the study and analysis of criminology can be categorized into two main areas, crime control and crime suppression. Crime control tends to use knowledge from the analyzed data to control and prevent the occurrence of crime, while criminal suppression tries to catch a criminal by using his or her history recorded in the data.

Brown [3] constructed a software framework called ReCAP (Regional Crime Analysis Program) for mining data in order to catch professional criminals using data mining and data fusion techniques. Data fusion was used to manage, fuse and interpret information from multiple sources. The main purpose was to overcome confusion from conflicting reports and cluttered or noisy backgrounds. Data mining was used to automatically discover patterns and relationships in large databases.

Crime detection and prevention techniques are applied to different applications ranging from cross-border security and Internet security to household crimes. Abraham and de Vel [2] proposed a method that employs computer log files as history data to search for relationships by using the frequency of occurrence of incidents. They then analyzed the results to produce profiles, which can be used to understand the behavior of criminals.
de Bruin et al. [6] introduced a framework for crime trends using a new distance measure for comparing all individuals based on their profiles and then clustering them accordingly. This method also provided a visual clustering of criminal careers and identification of classes of criminals.

From the literature study, it can be concluded that crime data is growing to very large quantities, running into yottabytes (10^24 bytes). This in turn increases the need for advanced and efficient techniques for analysis. Data mining as an analysis and knowledge discovery tool has immense potential for crime data analysis. As is the case with any other new technology, the requirements of such a tool change over time, and this change is further driven by the new and advanced technologies used by criminals. All these facts confirm that the field is not yet mature and needs further investigation.

3 PREPROCESSING
Data preprocessing consists of data cleaning, data integration and data transformation, and is usually carried out by a computer program. It is intended to reduce noise and incomplete or inconsistent data. The results of the preprocessing step can later be processed by a data mining algorithm.

The dataset used in the experiment contains various items such as year, state code, status of the administrative unit, name of the administrative unit, number of crimes with respect to murder, dacoity, riots and arson, area of the administrative unit in sq. meters, estimated mid-year population of the administrative unit in 1000s (begins in 1964), actual civil police strength (number of personnel), actual armed police strength (number of personnel) and total police strength (civil and armed police).

3.1 Missing value handling
The experiment concentrates only on those attributes that are related to crime data, that is, year, state, administrative name and number of crimes, for the years 1971 to 2006. The quality of the results of the mining process is directly proportional to the quality of the preprocessed data. Careful scrutiny revealed that the dataset has missing data in the state and number of crimes attributes.

3.2 Missing value handling for the number of crimes occurred attribute
In the present research work, two methods were used to fill the missing numbers of crimes related to murder, dacoity, riots and arson. Initially, all four fields are checked for empty values. If all four attributes are empty for a particular record, then the entire record is considered irrelevant and is deleted.

When individual attributes are taken into consideration, a novel KNN-based imputation method is proposed. In this method, the missing values of an instance are imputed by considering a given number of instances that are most similar to the instance of interest. The similarity of two instances is determined using a distance function.

The new algorithm is as follows.

1. Divide the data set D into two parts. Let Dm be the set containing the instances in which at least one of the features is missing. The remaining instances, which have complete feature information, form a set called Dc.

2. For each vector x in Dm:
   a. Divide the instance vector into observed and missing parts as x = [xo, xm].
   b. Calculate the distance between xo and all the instance vectors from the set Dc.
   c. Use only those features in the instance vectors from the complete set Dc that are observed in the vector x.
   d. Use the P closest instance vectors (the related instances) and perform a majority voting estimate of the missing values for categorical attributes. For continuous attributes, replace the missing value with the mean value of the attribute over the P related instances.

The decisions that have to be made carefully are:

(i) The choice of the distance function. In the present work, four distance measures, Euclidean, Manhattan, Mahalanobis and Pearson, are considered and the one that produces the best result is used.
(ii) The KNN algorithm searches through the whole dataset looking for the most similar instances. This is a very time-consuming process and can be critical in data mining, where large databases are analyzed.
To speed up this process, a method that combines the missing value handling process with classification is proposed.
(iii) The choice of k, the number of neighbors. Experiments showed that a value of 10 produces the best results in terms of accuracy, and hence it is used in further experimentation.

Thus, the traditional KNN imputation method was enhanced in two ways. The first enhancement is achieved by proposing a new distance metric, and the second by using LVQ (Learning Vector Quantization) methods combined with generalized relevance learning to perform the classification and the missing value treatment simultaneously. Combined, these two enhancements produce a model (E-KDD) that is efficient in terms of speed and accuracy.

3.3 Missing value handling in the prediction of the size of the population of the city
The first task is the prediction of the size of the population of a city. The calculation of per capita crime statistics helps to put crime statistics into proportion. However, some of the records were missing one or more values. Worse yet, half the time the missing value was the "city population size", which means there were no per capita statistics for the entire record. Moreover, some of the cities did not report any population data for any of their records. To improve the calculation of "yearly average per capita crime rates", and to ensure the detection of all "per capita outliers", it was necessary to fill in the missing values. The basic approach was to cluster population sizes, create classes from the clusters, and then classify records with unknown population sizes. The justification for using clustering is as follows: classes from clusters are more likely to represent the actual population size of the cities. The only value needed to cluster population sizes was the population size of each record. These values were clustered using the EM algorithm, and initially 10 clusters were chosen because this produced clusters with mean values that would yield per capita calculations close to the actual value.

4 CRIME PREDICTION MODEL
Given a set of objects, clustering is the process of class discovery, where the objects are grouped into clusters and the classes are unknown beforehand. Two clustering techniques, K-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), are considered for this purpose. The algorithm for k-means is given below.

The HYB algorithm is given below.
The HYB algorithm clusters the data into m groups, where m is predefined.
Input – Crime type, Number of Clusters, Number of Iterations
Initial seeds might play an important role in the final result.
Step 1: Randomly choose cluster centers
Step 2: Assign instances to clusters based on their distance to the cluster centers
Step 3: Adjust the centers of the clusters
Step 4: Go to Step 2 until convergence
Step 5: Output C0, C1, C2, C3

From the clustering result, the city crime trend for each type of crime was identified for each year. Further, by slightly modifying the clustering seed, the various states were grouped into high crime zones, medium crime zones and low crime zones. From these homogeneous groups, the efficiency of the police administration units, i.e. the states, can be measured using the method given below.

Output Function of Crime Rate = 1/Crime Rate

Here, the crime rate is obtained by dividing the total crime density of the state by the total population of that state. The police of a state are called efficient if its crime rate is low, i.e. if the output function of crime rate is high.

Thus the two clustering techniques were analyzed in terms of their efficiency in forming accurate clusters, the speed of creating clusters, and their efficiency in identifying crime trends, crime zones, the crime density of a state and the efficiency of a state in controlling its crime rate. Experimental results showed that the HYB algorithm gives improved results when compared with the k-means algorithm, and it was therefore used in further investigations.

Crime Trend Prediction
The next task is the prediction of future crime trends. This involves tracking crime rate changes from one year to the next and using data mining to project those changes into the future. The basic method involves clustering the states having the same crime trend and then using "next year" cluster information to classify records. This is combined with the state poverty data to create a classifier that will predict future crime trends.
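
As a rough illustration of grouping states that share the same crime trend, the sketch below computes each state's year-to-year crime rate changes and clusters them with a plain k-means loop (choose centers, assign, adjust, repeat until the assignment stops changing). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `init` and `seed` parameters and the sample rates are all hypothetical, and plain k-means stands in for the HYB algorithm, whose details are not given in the paper.

```python
import random
from typing import Dict, List, Optional, Sequence


def yearly_changes(rates: Sequence[float]) -> List[float]:
    """Crime rate change from each year to the next."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: Sequence[Sequence[float]],
    k: int,
    init: Optional[Sequence[Sequence[float]]] = None,
    iterations: int = 100,
    seed: int = 0,
) -> List[int]:
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    # Choose initial centers: random points unless explicit ones are given.
    start = init if init is not None else rng.sample(list(points), k)
    centers = [list(c) for c in start]
    labels: List[int] = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignment unchanged, so the centers have converged
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_states_by_trend(
    rates_by_state: Dict[str, Sequence[float]],
    k: int,
    init: Optional[Sequence[Sequence[float]]] = None,
    seed: int = 0,
) -> Dict[str, int]:
    """Cluster states on their vectors of year-to-year rate changes."""
    states = sorted(rates_by_state)
    points = [yearly_changes(rates_by_state[s]) for s in states]
    return dict(zip(states, kmeans(points, k, init=init, seed=seed)))


# Hypothetical crime rates per state for four consecutive years.
sample_rates = {
    "State A": [10.0, 9.0, 8.0, 7.0],
    "State B": [20.0, 19.0, 18.0, 17.0],
    "State C": [5.0, 7.0, 9.0, 11.0],
    "State D": [8.0, 10.0, 12.0, 14.0],
}
trend_clusters = cluster_states_by_trend(
    sample_rates, k=2, init=[[-1.0, -1.0, -1.0], [2.0, 2.0, 2.0]]
)
print(trend_clusters)
```

Explicit initial centers are passed here so that the grouping is reproducible: the two states with falling rates share one label and the two with rising rates share the other. With random initial centers the grouping can differ from run to run, which is consistent with the remark that the initial seeds may play an important role in the final result. Labels of this kind could then serve as the "next year" class when training a classifier.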
The major crimes under property crime are discussed here. There are many categories of crime, such as crime against women, property crime and road accidents.

o Murder
o Murder for Gain
o Dacoity
o Robbery
o Burglary
o Theft
A classification algorithm was applied to the clustered results to predict the future crime pattern. The classification was performed to find which category a cluster would be in the next year. This allows us to build a predictive model that predicts next year's records using this year's data. The C4.5 decision tree algorithm was used for this purpose. The generalized tree was used to predict the unknown crime trend for the next year. Experimental results showed that the technique used for prediction is accurate and fast. The following four clusters were produced, depending on the nature of the crime.

C0: Crime is steady or dropping. Theft is the primary crime, increasing a little and then dropping.
C1: Crime is rising or in flux. Dacoity is the primary crime whose rate is changing.
C2: Crime is generally increasing. Robbery, Murder, Murder for Gain and Burglary are the primary crimes on the rise.
C3: Few crimes are in flux. Dacoity is in flux: it went down, increased, and then went down once again.

5 IMPLEMENTATION
Two major crimes, Burglary and Murder, were taken to analyse existing crime. Burglary decreased in the year 2006 and then kept increasing until 2010. Murder kept increasing from 2006 to 2010. The sample crimes Burglary and Murder belong to cluster C2.

Fig. 1. Crime Burglary Analysis

Fig. 2. Crime Murder Analysis

Murder was taken to analyse future crime prediction. This crime was analysed for the period 2006 to 2009. Both the existing algorithm and the new algorithm were executed on the same data set. The existing algorithm predicted the crime with 83% accuracy, and the new algorithm with 89% accuracy.

6 CONCLUSION
A major challenge facing all law-enforcement and intelligence-gathering organizations is accurately and efficiently analyzing the growing volumes of crime data. As information science and technology progress, sophisticated data mining and artificial intelligence tools are increasingly accessible to the law enforcement community. These techniques, combined with state-of-the-art computers, can process thousands of instructions in seconds, saving precious time. In addition, installing and running software often costs less than hiring and training personnel. Computers are also
less prone to errors than human investigators, especially those who work long hours.

This research work focuses on developing a crime analysis tool for the Indian scenario using different data mining techniques that can help law enforcement departments handle crime investigation efficiently. The proposed tool enables agencies to easily and economically clean, characterize and analyze crime data to identify actionable patterns and trends. Applied to crime data, the proposed tool can serve as a knowledge discovery tool for reviewing extremely large datasets and can incorporate a vast array of methods for the accurate handling of security issues.

The development of the crime analysis tool has four steps, namely data cleaning, clustering, classification and outlier detection. The data cleaning stage removes unwanted records and predicts missing values. The clustering technique is used to group data according to the different types of crime. From the clustered results it is easy to identify the crime trend over the years, which can be used to design precautionary methods for the future. The classification of data is mainly used to predict future crime trends. The last step is mainly used to identify newly emerging crimes by applying outlier detection to crime data.

Experimental results show that the tool is effective in terms of analysis speed, identification of common crime patterns and future prediction. The developed tool has promising value in the current changing crime scenario and can be used as an effective tool by the Indian police and law enforcement organizations for crime detection and prevention.

REFERENCES
1. Amarnathan, L.C. (2003) Technological Advancement: Implications for Crime, The Indian Police Journal, April-June.
2. Abraham, T. and de Vel, O. (2006) Investigative profiling with computer forensic log data and association rules, in Proceedings of the IEEE International Conference on Data Mining (ICDM'02), Pp. 11-18.
3. Brown, D.E. (1998) The regional crime analysis program (RECAP): A framework for mining data to catch criminals, in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Vol. 3, Pp. 2848-2853.
4. Corcoran, J.J., Wilson, I.D. and Ware, J.A. (2003) Predicting the geo-temporal variations of crime and disorder, International Journal of Forecasting, Vol. 19, Pp. 623-634.
5. David, G. (2006) Globalization and International Security: Have the Rules of the Game Changed?, Annual meeting of the International Studies Association, California, USA, http://www.allacademic.com/meta/p98627_index.html.
6. de Bruin, J.S., Cocx, T.K., Kosters, W.A., Laros, J. and Kok, J.N. (2006) Data mining approaches to criminal career analysis, in Proceedings of the Sixth International Conference on Data Mining (ICDM'06), Pp. 171-177.
7. Hauck, R.V., Atabakhsh, H., Ongvasith, P., Gupta, H. and Chen, H. (2002) Using Coplink to Analyze Criminal-Justice Data, Computer, Volume 35, Issue 3, Pp. 30-37.
8. Keyvanpour, M.R., Javideh, M. and Ebrahimi, M.R. (2010) Detecting and investigating crime by means of data mining: a general crime matching framework, Procedia Computer Science, World Conference on Information Technology, Elsevier B.V., Vol. 3, Pp. 872-830.
9. Nath, S. (2007) Crime data mining, Advances and innovations in systems, K. Elleithy (ed.), Computing Sciences and Software Engineering, Pp. 405-409.
10. Senator, T.E., Goldberg, H.G., Wooton, J., Cottini, M.A., Khan, A.F.U., Klinger, C.D., Llamas, W.M., Marrone, M.P. and Wong, R.W.H. (1995) The FinCEN Artificial Intelligence System: Identifying Potential Money Laundering from Reports of Large Cash Transactions, AI Magazine, Vol. 16, No. 4, Pp. 21-39.