Socio-Economical Status of India Using Machine Learning Algorithms
                                                                             Published By:
    Retrieval Number: E6610018520 /2020©BEIESP                               Blue Eyes Intelligence Engineering
    DOI:10.35940/ijrte.E6610.018520                                  3804    & Sciences Publication
agencies, and also discussed census by-products and the need for census-based area classification, the publication of sample data for public use and the related confidentiality issues; alternatives to the census, such as national surveys, were discussed as well. Even though there have been developments in census data dissemination and in alternative sources of information, the basic published volumes are sufficient for general analysis.

   Bin Sheng et al. [2010] presented the importance of data mining techniques for analyzing census data. Census data contains a rich set of information, and many meaningful insights are hidden in it. CART (Classification and Regression Trees) is a decision-tree-based classification model; using this model, census data was analysed and predictions were made about the socio-economic conditions of the people. Classification decision trees give highly accurate predictions, and they have the advantage that the results are easy to understand. CART has a sound theoretical foundation and consists of three phases: tree building, tree pruning, and tree estimation. The CART model was implemented in C++ on the fifth census data of Chengy and Laixi. The Gini index was used as the evaluation function in this work. Even though CART is a recursive algorithm, a non-recursive version was used to improve performance. The main aim of this work was to classify the residents of Chengy and Laixi into four categories: poor, general, better, and best. The results showed that people with a middle-level per capita income formed the largest group in this area. Based on these results, the local authority can plan for economic development.

   Jian Ming et al. [2018] presented an implementation of artificial neural networks on the economic and technical data of a mining enterprise. Mining enterprise data is generally multi-dimensional and nonlinear. One of the important indicators of a mining enterprise is the sales price of its mineral products. Due to some technical limitations in the environment, part of the geological data was lost, and the authors reconstructed this data using artificial neural networks and geostatistics. The back-propagation algorithm was used to train a neural network model that predicts the mineral product price. The network was three-layered, with 5 input neurons, a single hidden layer, and 1 output neuron. The predictions showed that the model is strong and that its prediction accuracy is high; for the missing geological data, the prediction results and the interpolation were reliable.

   Maria Beatriz Bernabe Lacranca et al. [2004] described a statistical classification approach applied to a subset of the XII household and population census data of the Metropolitan Zone of the Mexico Valley (MZMV) in order to present the properties of the population data. The main idea was to classify the zones of the MZMV region. Cluster analysis was applied to the population data, with the K-means procedure used to create the clusters. The study area was represented by 57 variables, which were selected through correlation analysis. With K-means, the number of clusters was varied over the range 8 to 70. K-means requires the number of clusters to be fixed in advance, and it also requires k initial centroids. The statistical software R was used to classify 4925 census zones. Clusters were generated for zones of greater size, zones with the lowest average in all variables, zones with the lowest population, zones whose mean values were very close to the global means, and high-income areas of the city, and these zones were analysed. This work mainly provides an analysis of the relationship between the population and transportation trips in the MZMV area.

   Jose Cazal et al. [2015] explored predictive models for economic systems. It was observed that selecting which data to include and which to discard is a big challenge. Some variables are complex, and many factors affect them. Data mining techniques are scientifically proven and give reliable results in the analysis and prediction of socio-economic studies. In this work, data mining techniques were proposed as a valid option that is better than traditional econometric methodology. The SEMMA model, which stands for Sample, Explore, Modify, Model, and Assess, was used for data mining. Predictive algorithms were chosen based on the type of data set (time series or cross-sectional). Experiments were done in two cultures: the data modelling culture (where the data is assumed to be generated by a given stochastic model) and the algorithmic modelling culture (which treats the data as complex and unknown). The author performed analysis and prediction using both traditional econometrics with the EViews tool and data mining techniques with the EMMA tool. The results show that data mining techniques were more effective and efficient than econometric techniques.

   Joon Heo et al. [2014] proposed user-driven economic data analysis by means of a mobile app. The user sends the data and the analysis parameters from a mobile phone to the server. The server uses big data and mathematical algorithms to perform the analysis for the given parameters. Normally, big data analysis is done by only a few companies because they can afford it, and user-oriented analysis has generally been neglected. The framework contains two entities: the server, and the user application running on the mobile phone. Stock data of 8 countries over a period of 33 years was used for this work. The data was first cleaned and then processed. The Android SDK was used to build the mobile app. On the server side, three algorithms were implemented (minimum spanning tree, principal component analysis, and clustering). The user selects economic parameters on the mobile device and transmits them to the server, and the server sends the results back to the mobile device.

   Sharath R et al. [2016] studied and analysed socio-economic conditions using US household data. The data set in this work is very large, containing information on 3.5 million households. The major conclusion was that the income of individuals determines various aspects of life such as education, health, standard of living, household decisions, and economic status. Five modules were implemented to study the importance of the income domain: gender distribution in occupation (male vs. female); the education-salary relation (higher degrees and their income vs. professional degrees and their income); economic hierarchies, to find the economic classes using classifiers; Benford's law applied to US income (the law outlines the frequency distribution of leading digits in many data sets); and the mean and median of the income distribution across all the states. These economic hierarchy predictions are useful in many areas, such as planning houses for poor and middle-class people, better pension plans for retired people, and planning various welfare programs for the poor. To conduct this work, five tools were used; they
                                                    International Journal of Recent Technology and Engineering (IJRTE)
                                                                       ISSN: 2277-3878, Volume-8 Issue-5, January 2020
are Hadoop, Java 1.7, Python, R, and Pig. The data was first normalised by eliminating less important attributes and null values. After normalisation, economic hierarchies were created using K-means, and those economic classes were then predicted using classifiers. Three classifiers were applied to improve the efficiency. All of them performed nearly the same on the data set. The accuracies of the classifiers were as follows:

              Table 1: Classifiers and Accuracy

          Classifier        Accuracy
          Naïve Bayes       48.13%
          C4.5              51.3%
          Boosted C5.0      53.7%

Finally, demographic graphs were plotted using the relevant attributes.

   Octavio Juraz Espinosa et al. [1999] described visual techniques for economic input/output data. These visualisations are necessary because the data is represented in matrix form, so navigating the data becomes a problem on a screen of limited size. The techniques were designed around user tasks such as querying the interaction of two sectors, labelling data points, magnifying a matrix area, comparing two industries based on the goods they produce, comparing two sectors based on their coefficient values, finding patterns in a matrix, and modifying values and re-computing the matrix. In this work, better visualisation techniques were created to represent economic input-output data. The main purpose of the analysis is to study the interdependencies between industries in a regional economy. The data was maintained in four different matrices: the make matrix (commodities produced by industries), the total requirements matrix (direct and indirect interactions between industrial sectors), the use matrix (inter-industrial activities and commodity inputs for industrial production), and the direct matrix (based on the use matrix and total output). Each matrix is displayed in a window using one pixel per cell, with colours assigned according to the cell's category. Each window is partitioned into three parts: a large area that contains the matrix, a bottom part that contains the detailed information, and a left part that represents the data to be visualised. Apart from these matrices, economic IO data was also represented as geographical information. In this way, the visualisation makes it easy to perform analysis on economic input-output data.

   Ying Cai et al. [2010] studied online analytical mining for analyzing regional economic data. Regional economic data is gathered from statistical data. According to this work, older statistical data was kept in Word or Excel format, and such data is structured in a non-hierarchical manner. This creates a problem for analysis and research. The regional economic data covers a large number of industries in various regions. In this work the data was analysed with the help of online analytical mining (OLAM), which was designed by combining a data warehouse, analytical processing, and data mining. There are three main components: the data layer, the middle layer, and the client layer. The data layer was built with MS SQL Server 2005 as the data warehouse. The middle layer was built with MS Analysis Services and provides the data cubes, metadata, OLAP engine, and data mining engine. At the client layer, a web application was developed using the OLAP service, in which operations such as data cube browsing, rotation, slicing, and drilling can be performed. Economic data of Zhenjiong province was used, multidimensional cubes were created, and a clustering algorithm was implemented. The final results showed that manufacturing, the building trade, and the wholesale and retail markets are more developed than other trades.

   Avery Sandborn et al. [2016] described the relationship between spatial features derived from high-resolution satellite imagery and census data of Accra, Ghana. Spatial features are metrics that analyze groups of pixels to describe the geometry, orientation, and patterns of objects in an image. Such spatial features can be used to assess housing conditions and living standards in a city. To examine the association between demographic variables and spatial features, five spatial features (PanTex, line support regions, histograms of oriented gradients, Fourier transform, and local binary patterns) were selected, extracted from the image, and then related to census variables. This method was proposed as an alternative methodology to census data. The spatial features and the spectral information, the normalized difference vegetation index (NDVI), were computed and correlated with the census data, and a high correlation was found between the local binary patterns (LBP) and the census-derived variables. The final results show that spatial features can be used to determine the socio-economic conditions of the population.

   Ferdin Joe J et al. [2011] contrasted the horizontal and vertical representations of a data set with a new representation called HoVer (Horizontal over Vertical). Sparse data is a common issue in many applications. Such sparse data is not managed well by either the horizontal or the vertical model in terms of performance, storage, and query processing. The authors applied the new HoVer method to census data and e-commerce data in sparse form. The HoVer representation involves two steps: generating a correlation table and generating subspaces. The subspaces are generated from the correlation table in a heuristic manner. In this work the authors used the RapidMiner tool and systematically measured the subspaces, space usage, execution time, and running time; the changes in these parameters under schema changes were analysed and the differences were observed.

   Zhuang et al. [2014] dealt with the analysis of regional economic indicators. Traditional economic methods are not effective in finding the factors that influence an economy. In this work the key factors of a regional economy were analysed with the help of the k-means and CADD algorithms. High-dimensional data is sparse in nature, and clustering such data on the basis of distance and density was not efficient. To improve the efficiency of clustering, the author proposed a weighted CADD, which partitions the clusters based on the idea of adaptive density reachability. The natural and economic attributes of the regional economic data were first clustered using k-means, but the results were not ideal, so the CADD algorithm was used, which reduced the total number of dimensions. Finally, a comparative analysis was carried out on Chinese regional economies.

                                  III. METHODOLOGY

   In this work we studied the socio-economic status of the various states and union territories of India. We have taken the India-district-census 2011 data set for this work. This data set contains 640 rows and 118 attributes of different districts from 28 states and 7 union territories. Except for three columns, all the columns contain numerical values.

   The data in this data set contains values in aggregated
form, and the usage of this data is limited. From this data we want to analyze the socio-economic condition of India. The proposed work is divided into two main modules.
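As a small illustration of the data set description above (all columns numeric except three), the following sketch separates the numeric columns of a CSV table from the non-numeric ones. It is only a sketch under stated assumptions: the helper name `split_columns`, the column names, and the toy rows are invented for illustration and are not taken from the India-district-census 2011 file.

```python
import csv
import io


def split_columns(rows):
    """Return (numeric, non_numeric) column names for a list of dict rows.

    A column counts as numeric only if every value in it parses as a float.
    """
    numeric, non_numeric = [], []
    for name in rows[0].keys():
        try:
            for row in rows:
                float(row[name])
            numeric.append(name)
        except ValueError:
            non_numeric.append(name)
    return numeric, non_numeric


# Toy stand-in for the census file; all names and numbers are made up.
TOY_CSV = """District,State,Population,Households_with_Internet
A,X,1000,120
B,X,2500,340
C,Y,1800,90
"""

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
numeric, non_numeric = split_columns(rows)
print(len(rows), "rows;", len(numeric), "numeric columns;", "non-numeric:", non_numeric)
```

Only the numeric columns would take part in a correlation or clustering step; identifier-like columns such as names would be kept aside as labels.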
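The Pearson correlation coefficient is defined, in its standard form, as

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where $x_i$ and $y_i$ are the paired values of two attributes, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of observations.

   A medoid is an object or data point of the cluster whose average dissimilarity to all other objects or data points in the cluster is minimal.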
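The medoid definition above can be made concrete with a short sketch. The Euclidean distance and the sample points are assumptions made for illustration; the text does not state which dissimilarity measure was used, and the function name `medoid` is hypothetical.

```python
import math


def medoid(points):
    """Return the point whose average distance to the other points is smallest."""
    if len(points) == 1:
        return points[0]

    def avg_dissimilarity(i):
        # Average Euclidean distance from points[i] to every other point.
        total = sum(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
        return total / (len(points) - 1)

    best = min(range(len(points)), key=avg_dissimilarity)
    return points[best]


# Made-up cluster: three nearby points and one far-away point.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.5, 1.5), (8.0, 8.0)]
print(medoid(cluster))
```

Unlike a centroid, which is a computed mean and may not coincide with any observation, the medoid is always one of the actual data points, which makes it less sensitive to an outlier such as the far-away point in the example.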
1) Other_Workers
2) LPG_or_PNG_Households
3) Households_with_Internet
4) Households_with_Computer
5) Urban_Households
6) Graduate_Education
7) Households_with_Car_Jeep_Van
8) Households_with_Radio_Transistor
9) Households_with_Scooter_Motorcycle_Moped
10) Households_with_TV_Computer_Laptop_Telephone_mobile_phone_and_Scooter_Car
11) Households_with_Telephone_Mobile_Phone_Both
12) Ownership_Rented_Households

The above 12 fields are more strongly correlated, and they are important for calculating poverty.

Fig. 3(b) The experimental results for poverty in India with K-means

Fig. 4(a) shows the actual state-wise poverty as a bar diagram, and Fig. 4(b) shows the experimental values for the different states as a bar diagram.
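One way to arrive at a short list of strongly correlated fields like the one above is to rank every numeric attribute by the absolute value of its Pearson correlation with a reference variable and keep the top few. The sketch below is an assumption about the general procedure, not the authors' code: the reference variable, the toy values, the column `Hypothetical_Noise`, and the helper names `pearson` and `top_correlated` are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_correlated(columns, target, k):
    """Return the k column names with the largest |r| against the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up values for five districts; only the field names echo the list above.
columns = {
    "Households_with_Internet": [1, 2, 3, 4, 5],
    "Urban_Households": [2, 4, 5, 4, 5],
    "Hypothetical_Noise": [3, 1, 4, 1, 5],
}
target = [1, 2, 3, 4, 5]  # hypothetical reference variable

print(top_correlated(columns, target, 2))
```

Ranking by the absolute value treats strong negative correlations as equally informative as strong positive ones; whether that is appropriate depends on how the selected fields are used afterwards.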
10. Ying Cai, Jiangping Chen, Xiaoqing Fan, Zhenguo Yu, "Study on the
    Regional Economic Data Analysis and Mining Platform Base on OLAM",
    2010 Second International Workshop on Education Technology and
    Computer Science, ETCS 2010 (2010), vol. 3, pp. 817-820.
11. Avery Sandborn and Ryan N. Engstrom, "Determining the Relationship
    Between Census Data and Spatial Features Derived From
    High-Resolution Imagery in Accra, Ghana", IEEE Journal of Selected
    Topics in Applied Earth Observations and Remote Sensing (2016),
    vol. 9, issue 5, pp. 1970-1977.
12. Ferdin Joe J, Dr. T. Ravi, John Justus C, "Classification of Correlated
    Subspaces Using HoVer Representation of Census Data", 2011
    International Conference on Emerging Trends in Electrical and
    Computer Technology, ICETECT 2011 (2011), pp. 906-911.
13. Zhuang Cheng, "Regional Economic Indicators Analysis Based on Data
    Mining", Proceedings - 2014 5th International Conference on Intelligent
    Systems Design and Engineering Applications, ISDEA 2014 (2014),
    issue 2, pp. 726-730.
14. T. P. Vital, B. G. Lakshmi, H. S. Rekha, M. DhanaLakshmi, "Student
    Performance Analysis with Using Statistical and Cluster Studies",
    in Soft Computing in Data Analytics, Springer, Singapore (2019),
    pp. 743-757.
AUTHORS PROFILE