Menendez Llorente

This document discusses the integration of Graph Theory and Unsupervised Learning in the context of Social Data Mining, highlighting their applications in analyzing user behavior and relationships within social networks. It outlines the significance of graph models in data mining processes, including data extraction, preprocessing, model generation, validation, and application. The work also presents various techniques and algorithms that leverage Graph Theory to enhance data analysis, with practical examples from popular social networks like Facebook and Twitter.

The combination of Graph Theory and Unsupervised Learning applied to Social Data

Mining 1

Chapter 1

THE COMBINATION OF GRAPH THEORY AND UNSUPERVISED LEARNING APPLIED TO SOCIAL DATA MINING
Héctor D. Menéndez∗ and José Luis Llorente
Universidad Autónoma de Madrid
Escuela Politécnica Superior. C/ Tomás y Valiente 11, 28049, Cantoblanco, Madrid

Abstract

Over the last few years, Social Data Mining has become an important field within
Data Analysis. Its techniques are rapidly finding applications in a variety of
domains, including artificial intelligence, economics and marketing, amongst others.
They are based on knowledge extraction from the users, focusing on their behaviour
and relationships inside a system which can be modelled as a Social Network where
they act independently or establish relationships. Social Networks have been studied
from different points of view; however, the most representative approaches come from
Graph Theory. On the one hand, the Network is usually represented as a Graph where
the users are the nodes and their relationships are the edges. Different approaches
from Complex Network Analysis are used to describe the Network and its features.
On the other hand, Graph Theory can also be used to analyse the behaviour of the
users, not only from the point of view of their relationships but also through the
information that they generate, creating an independent profile of each user. A
representative selection of these techniques is discussed in detail in this work,
showing how different methods extracted from Graph Theory can be combined with
different approaches of Unsupervised Learning to analyse Social Behaviour from
different perspectives.

1. Introduction
Social Data Mining is one of the most innovative areas in Data Mining [28]. This new
field combines different techniques related to Data Mining and Complex Networks [42].
Several of these approaches focus on graph models, especially in unsupervised
learning, where graph models have been shown to improve the results of the most
classical algorithms [53].

∗ Corresponding author. E-mail address: hector.menendez@uam.es

Different steps of the Data Mining process can be approached from a graph-based
perspective. For example, the data structure can be considered as a manifold,
generating a topology over the data instances by treating them as the nodes of a
graph whose edges are similarity measures between the nodes. New unsupervised
learning algorithms based on Spectral Analysis [58] take a graph as the topology
of the data distribution and use its spectrum in the model generation process of
the Machine Learning algorithm.

Complex Networks are usually used to represent a Social Network [43] (for
example, Facebook or Twitter) as a graph where the users are the nodes and the
relations between the users are the edges. This representation provides a lot of
information about the type of network (Small-World [60], Random [19],
Scale-Free [5], ...) and its features (strength, paths, authorities, hubs, ...).
It also has applications in widely separated fields such as Marketing [54] and
Medicine [56], amongst others.

This work gives a general perspective on the influence of Graph Theory on Data
Mining algorithms, presenting new techniques and algorithms which have been
developed over the last few years. It also introduces some analytical methods
extracted from Complex Networks and provides some examples of important algorithms
and network structures in this field. Finally, it shows two famous Social Networks
where some of these techniques have been successfully applied (Facebook and
Twitter) and recommends some software tools which might be helpful for applying
these processes.
The chapter is structured as follows: Section 2 provides some basic definitions of
Graph Theory which are useful to understand the rest of the chapter; readers who are
familiar with Graph and Complex Network Theory can skip this section. Section 3
gives an overview of the Data Mining process and the different steps used during
the analysis. Next, the Machine Learning methods which can use Graph Theory are
detailed in Section 4. Section 5 focuses on the Complex Network analysis techniques
which are also important in Social Analysis; it explains some issues about the
Network structure and presents two important algorithms from the literature:
PageRank and HITS. Section 6 shows some practical applications of these approaches
in real-world Social Networks: Facebook and Twitter. Section 7 summarizes some
tools which can be used in all the analysis processes. Finally, the last section
presents the conclusions.

2. Basic Definitions from Graph Theory

Some algorithms use concepts and metrics extracted from graph theory. For this
reason, and before describing them, some of those basic concepts are briefly introduced.

Definition 2.1 (Graph). A graph $G = (V, E)$ is a set of vertices or nodes $V$ denoted
by $\{v_1, \ldots, v_n\}$ and a set of edges $E$ where each edge is denoted by $e_{ij}$
if there is a connection between the vertices $v_i$ and $v_j$.

Graphs can be directed or undirected. If all edges satisfy the equality
$e_{ij} = e_{ji}$ $\forall i, j$, the graph is said to be undirected.

The graph can also be represented through its adjacency matrix (the most usual
approach) which can be defined as:

Definition 2.2 (Adjacency Matrix). An adjacency matrix of $G$, $A_G$, is a square
$n \times n$ matrix where each coefficient satisfies:

\[
a_{ij} =
\begin{cases}
1, & \text{if } e_{ij} \in E \\
0, & \text{otherwise}
\end{cases}
\]

When it is necessary to work with weighted edges, a new kind of graph needs to
be defined:

Definition 2.3 (Weighted Graph). $G$ is a weighted graph if there is a function
$w : E \to \mathbb{R}$ which assigns a real value to each edge.

Any algorithm that works with the vertices of a graph needs to analyse the
neighbours of each node. The neighbourhood of a node is defined as follows:

Definition 2.4 (Neighbourhood). If the edges $e_{ij} \in E$ and $e_{ji} \in E$, we say
that $v_j$ is a neighbour of $v_i$. The neighbourhood of $v_i$, $\Gamma_{v_i}$, is defined as
$\Gamma_{v_i} = \{v_j \mid e_{ij} \in E \text{ and } e_{ji} \in E\}$. Then, the number of
neighbours of a vertex $v_i$ is $k_i = |\Gamma_{v_i}|$.
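As an illustration of Definitions 2.1 to 2.4, the following minimal Python sketch builds an adjacency matrix from an edge list and reads a neighbourhood from it. The function names, the 0-based vertex indices and the choice to ignore self-loops are assumptions made for this example only.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build the n x n adjacency matrix for vertices 0..n-1 (Definition 2.2)."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1  # undirected graph: e_ij = e_ji
    return a


def neighbourhood(a, i):
    """Vertices v_j such that both e_ij and e_ji exist (Definition 2.4).

    Self-loops are ignored here, which is an assumption of this sketch.
    """
    return {j for j in range(len(a)) if j != i and a[i][j] and a[j][i]}


# Hypothetical example graph: a triangle 0-1-2 with a pendant vertex 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
gamma_2 = neighbourhood(A, 2)
k_2 = len(gamma_2)  # number of neighbours of vertex 2
print(sorted(gamma_2), k_2)
```

Storing the graph as a dense matrix mirrors the definitions directly; for large sparse networks an adjacency list would normally be preferred.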

Nodes can also be joined by paths through their edges. A path is defined as follows:

Definition 2.5 (Path). A Path of a graph between the nodes $v_i$ and $v_j$ is a sequence
of edges which connects these two nodes. It will be denoted by $P_{ij}$.

Its length is:

Definition 2.6 (Path Length). The Path Length is defined as the number of edges
contained in the path. It will be denoted by $|P_{ij}|$.

It is also important to know the shortest path between two nodes, usually defined
by:

Definition 2.7 (Shortest Path). The Shortest Path is a minimum-length Path between
two nodes. It should satisfy:

\[
\min_{|P_{ij}|} \{ P_{ij} \mid P_{ij} \in G \} \tag{1}
\]

One of the most important metrics of a graph is its diameter:

Definition 2.8 (Graph Diameter). The Graph Diameter is defined as the length of the
longest shortest path of the graph.

Once the most general and simple concepts from graph theory are defined, we can
proceed with the definition of some basic measures of a graph and its nodes: the
average path length, the clustering coefficient and the weighted clustering coefficient.

Definition 2.9 (Average Path Length). Let $G$ be a Graph and $V$ its set of $n$ vertices.
Let $d(v_i, v_j)$ be the shortest distance between $v_i$ and $v_j$. The Average Path Length
is defined by:

\[
l_G = \frac{1}{n \cdot (n - 1)} \cdot \sum_{i,j} d(v_i, v_j) \tag{2}
\]
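The path-based measures above can be computed with a breadth-first search from every node. The sketch below is one possible way to do it for an unweighted, connected graph stored as an adjacency list; the function names and the example graph are assumptions of this illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_average_path_length(adj):
    """Return (diameter, l_G) following Definition 2.8 and Eq. (2)."""
    n = len(adj)
    total, longest = 0, 0
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) < n:
            raise ValueError("the graph must be connected")
        total += sum(dist.values())          # adds d(u, v) for every v
        longest = max(longest, max(dist.values()))
    return longest, total / (n * (n - 1))


# Hypothetical example: the path graph 0 - 1 - 2 - 3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
diameter, avg_length = diameter_and_average_path_length(path_graph)
print(diameter, avg_length)
```

The sum runs over ordered pairs, as in Eq. (2); the terms with $i = j$ contribute zero, so including them does not change the result.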

Definition 2.10 (Local CC [14]). Let $G = (V, E)$ be a graph where $E$ is the set of
edges and $V$ the set of vertices, and $A$ its adjacency matrix with elements $a_{ij}$.
Let $\Gamma_{v_i}$ be the neighbourhood of the vertex $v_i$. If $k_i$ is the number of
neighbours of a vertex, we can define the clustering coefficient (CC) of a vertex as
follows:

\[
C_i = \frac{1}{k_i (k_i - 1)} \sum_{j,h} a_{jh} a_{ij} a_{ih} a_{ji} a_{hi} \tag{3}
\]

The Local CC measure provides values ranging from 0 to 1. A value of 0 means
that the node and its neighbours do not have clustering features, so the neighbours
do not share connections between them. Conversely, a value of 1 means that they are
completely connected. This definition of CC can be extended to weighted graphs as follows:
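Equation (3) translates almost literally into code. The following sketch evaluates it on a small adjacency matrix; the function name and the convention of returning 0 for nodes with fewer than two neighbours (where the formula is undefined) are assumptions of this example.

```python
def local_clustering(a, i):
    """Local clustering coefficient C_i of Eq. (3) for adjacency matrix a."""
    n = len(a)
    k = sum(a[i][j] * a[j][i] for j in range(n))  # k_i, number of neighbours
    if k < 2:
        return 0.0  # assumed convention: Eq. (3) is undefined for k_i < 2
    total = sum(
        a[j][h] * a[i][j] * a[i][h] * a[j][i] * a[h][i]
        for j in range(n)
        for h in range(n)
    )
    return total / (k * (k - 1))


# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
coefficients = [local_clustering(A, i) for i in range(4)]
print(coefficients)
```

Vertex 0 has two neighbours that are linked to each other, so its coefficient is 1; vertex 2 has three neighbours of which only one pair is linked, giving 1/3.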

Definition 2.11 (Local Weighted CC [6]). Following the same assumptions as the Local
Clustering Coefficient definition, let $W$ be the weight matrix with coefficients
$w_{ij}$ and $A$ be the adjacency matrix with coefficients $a_{ij}$. If we define:

\[
S_i = \sum_{j=1}^{|V|} a_{ij} a_{ji} w_{ij} \tag{4}
\]

Then, the Local Weighted Clustering Coefficient can be defined as:

\[
C_i^w = \frac{1}{S_i (k_i - 1)} \sum_{j,h} \frac{(w_{ij} + w_{ih})}{2} a_{jh} a_{ij} a_{ih} a_{ji} a_{hi} \tag{5}
\]

In this new definition, we still consider the connections between the neighbours
of a particular node, but now we add information about the weights related to
the original node. This new measure takes into account the distribution of the
weights of the node that we are analysing, and shows how good the connections of
that cluster are. The following theorem proves that the weighted CC has the same
value as the CC when all the weights are set to the same value:

Theorem 2.1. Let $G$ be a graph, $A$ its adjacency matrix and $W$ its weight matrix. If
we set $w_{ij} = \omega$ $\forall i, j$, then $C_i = C_i^w$.

Proof. Following the definition of $C_i^w$ we have:

\[
C_i^w = \frac{1}{S_i (k_i - 1)} \sum_{j,h} \frac{2\omega}{2} a_{jh} a_{ij} a_{ih} a_{ji} a_{hi}
\]

where $S_i = \sum_{j=1}^{|V|} a_{ij} a_{ji} \omega$. Replacing $S_i$, we have:

\[
C_i^w = \frac{1}{\left(\sum_{j=1}^{|V|} a_{ij} a_{ji} \omega\right) (k_i - 1)} \sum_{j,h} \omega \, a_{jh} a_{ij} a_{ih} a_{ji} a_{hi}
\]

\[
C_i^w = \frac{1}{\left(\sum_{j=1}^{|V|} a_{ij} a_{ji}\right) (k_i - 1)} \sum_{j,h} a_{jh} a_{ij} a_{ih} a_{ji} a_{hi}
\]

We also know, from the neighbourhood definition and the adjacency matrix definition,
that $k_i = \sum_{j=1}^{|V|} a_{ij} a_{ji} = |\Gamma_{v_i}| = |\{v_j \mid e_{ij} \in E \text{ and } e_{ji} \in E\}|$.
And finally:

\[
C_i^w = \frac{1}{k_i (k_i - 1)} \sum_{j,h} a_{jh} a_{ij} a_{ih} a_{ji} a_{hi} = C_i
\]

which proves Theorem 2.1.

As a corollary to this theorem, if $C_i^w = 1$ then $C_i = 1$.
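Theorem 2.1 can also be checked numerically. The sketch below implements Eqs. (3), (4) and (5) and compares both coefficients on a graph whose weights are all equal to a constant; the function names, the example graph and the value chosen for the constant weight are assumptions of this illustration.

```python
def local_clustering(a, i):
    """Unweighted local clustering coefficient, Eq. (3)."""
    n = len(a)
    k = sum(a[i][j] * a[j][i] for j in range(n))
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbours
    total = sum(a[j][h] * a[i][j] * a[i][h] * a[j][i] * a[h][i]
                for j in range(n) for h in range(n))
    return total / (k * (k - 1))


def local_weighted_clustering(a, w, i):
    """Weighted local clustering coefficient, Eqs. (4) and (5)."""
    n = len(a)
    k = sum(a[i][j] * a[j][i] for j in range(n))
    s = sum(a[i][j] * a[j][i] * w[i][j] for j in range(n))  # S_i, Eq. (4)
    if k < 2 or s == 0:
        return 0.0  # same assumed convention as above
    total = sum((w[i][j] + w[i][h]) / 2
                * a[j][h] * a[i][j] * a[i][h] * a[j][i] * a[h][i]
                for j in range(n) for h in range(n))
    return total / (s * (k - 1))


# Hypothetical example: triangle 0-1-2 plus pendant vertex 3, constant weight.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
omega = 2.5
W = [[omega * x for x in row] for row in A]

for i in range(4):
    print(i, local_clustering(A, i), local_weighted_clustering(A, W, i))
```

With a constant weight the two columns printed for each vertex coincide, as the theorem states.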

Finally, if we want to study a general graph, we should study its Global CC:

Definition 2.12 (Global CC [14, 6]). The clustering coefficient of a graph can be
defined as:

\[
C = \frac{1}{|V|} \sum_{i=1}^{|V|} C_i \tag{6}
\]

where $|V|$ is the number of vertices.

The Global Weighted Clustering Coefficient is:

\[
C^w = \frac{1}{|V|} \sum_{i=1}^{|V|} C_i^w \tag{7}
\]

The main difference between Local CC, Local Weighted CC and Global CC is that
the first one represents how connected a node is locally in a graph, the second one
measures the density of these connections using the edge weights, and the last one
provides global information about the connectivity of a graph. In real complex
problems only the first two measures can be computed directly, whereas the third
one is usually estimated [57].

3. Data Mining
Data Mining is “the process of discovering meaningful new correlations, patterns and
trends by sifting through large amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques” [34]. The
Data Mining process is divided into 5 main steps:
1. Data Extraction: The data extraction problem consists of obtaining the datasets
which will be analysed.
2. Data Preprocessing and Normalization: The data preprocessing methods prepare
the data to be analysed. There are three main steps [34]: avoiding
misclassification, dimensionality reduction (through projections or feature
selection techniques) and range normalization.
3. Model Generation: This is the most important part of the data analysis. The
model is created to find the patterns in the data. Machine Learning or other
statistical techniques are commonly used to generate the model [34].
4. Model Validation: The validation process differs depending on the type of
model. This process gives the confidence of the model. Validation with
classifiers is commonly used [34].
5. Model Application: The goal of the model is to be applied, for example, to
predict the behaviour of new inputs.

3.1. Data Extraction

There are several public databases from which data can be obtained according to the
analysis goal. The most widely used in Data Mining work is the UCI Machine Learning
Repository [2], which contains several datasets for testing algorithms. There are
also some applications for Social Network analysis which allow researchers to
extract information from Twitter (such as the Twitter API [38]) or Facebook (the
Facebook API [24]), amongst others.

3.2. Data Preprocessing

Data Mining techniques need an intensive phase of data preprocessing. Initially the
information must be analysed and stored in some kind of database system, cleaned
and separated. This preprocessing phase is used to avoid outliers, misclassifications
and missing data. Methods such as histograms and statistical correlation are used to
clean the dataset and reduce the number of variables [34]. Projections are also
common in dimensionality reduction; however, projection methods [15] such as PCA
(Principal Component Analysis) or LDA (Linear Discriminant Analysis) do not offer a
complete perspective of the problem. These methods create new variables which are
estimated from principal components or linear projections, trying to separate the
data and reduce its dimension. Usually, these techniques lose the original
information of the features, which is unrecoverable once it is projected. This makes
the results of the Data Mining techniques harder for humans to interpret and,
sometimes, it is preferable to avoid them.

There are several techniques which reduce the feature set without using projections.
These methods apply a guided search among the different attributes, looking for the
most useful variables for the analysis. They are usually known as feature
selection methods [32]. Curiel et al. [13] apply genetic algorithms to simplify the
prognosis of endocarditis using an encoding where each individual of the population
is based on a set of features. Blum and Langley [8] show some examples of relevant
feature selection in different datasets and apply them to different machine learning
techniques. They define different degrees of relevance, such as strongly or weakly
relevant features. They also study some methodologies such as heuristic search,
filters and wrapper approaches, which are automatic feature selection methods usually
validated by classification techniques. Some of these techniques tend to introduce
over-fitting into the model and are computationally expensive. Roth and Lange [52]
apply these techniques to the clustering problem.

Finally, the last step is normalization, which makes it possible to compare data
features with different ranges of values. The Z-Score [10] and Min-Max [26]
normalization methods are commonly used for preprocessing the data. Both
normalization algorithms take the attribute records and map them to a standard
range. Min-Max has a fixed range, [0, 1], and is sensitive to outliers, while
Z-Score depends on the mean and the standard deviation: it rescales the values to
zero mean and unit standard deviation and is usually used to reduce the effect of
outliers. These algorithms obtain the normalized values from the data using the
following equations:

• Min-Max: It computes the maximum and minimum values of the attribute and applies:

\[
x' = \frac{x - \min(X)}{\max(X) - \min(X)}
\]

• Z-Score: It computes the mean and standard deviation of the values and applies:

\[
x' = \frac{x - \mathrm{mean}(X)}{\mathrm{SD}(X)}
\]
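Both normalizations can be sketched in a few lines of Python. The function names are illustrative, and the use of the population standard deviation for SD(X) is an assumption, since the text does not say whether the population or the sample estimate is meant.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values to the fixed range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("Min-Max is undefined when all values are equal")
    return [(x - lo) / (hi - lo) for x in values]


def z_score(values):
    """Rescale values to zero mean and unit standard deviation.

    pstdev is the population standard deviation (an assumption of this sketch).
    """
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("Z-Score is undefined when all values are equal")
    return [(x - mu) / sd for x in values]


# Hypothetical attribute column.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = min_max(column)
standardized = z_score(column)
print(scaled)
print(standardized)
```

In this example a single large value such as 9.0 fixes the upper end of the Min-Max range, which illustrates why that method is sensitive to outliers.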

Once the data is ready for the analysis, the model generation phase begins. This
work focuses on unsupervised learning techniques for model generation, especially
clustering techniques, which are presented in the following section.

4. Model Generation: Clustering

The most important part of the Data Mining process is the Model Generation step,
where the Machine Learning algorithms are applied. In this section, several
algorithms related to Graph Theory and Unsupervised Learning techniques are
explained, especially one of the most important clustering algorithms related to
Graph Spectrum Theory: Spectral Clustering. Finally, the section presents the
Community Finding problem, which consists of identifying communities of users in
the Network.

4.1. Graph Clustering

Graph theory has also proved to be an area of important contribution for research in
data analysis, especially in recent years with its application to manifold
reconstruction [23], which uses data distances and a graph representation to create
a structure (the manifold) that can be considered as a Euclidean space.

Graph models are useful for diverse types of data representation. They have
become especially popular over recent years, being widely applied in the Social
Networks area. Graph models can be naturally used in these domains, where each
node or vertex represents an agent and each edge represents an interaction between
agents. Algorithms and methods from Graph Theory have then been used to analyse
different aspects of the network, such as structure, behaviour, stability or
even community evolution inside the graph [14, 21, 41, 60].

A complete roadmap to Graph Clustering can be found in [53], where different
clustering methods are described and compared using different kinds of graphs:
weighted, directed, undirected. These methods are: Cutting, Spectral Analysis and
Degree Connectivity (an exhaustive analysis of connectivity methods can be found in
Hartuv and Shamir [27]), amongst others. This roadmap also provides an overview of
the computational complexity of the studied methods from a theoretical and an
experimental point of view.

Among the graph clustering techniques described above, some of the most recent and
powerful are those based on Spectral Clustering, which is introduced in the following
section.

4.2. Spectral Clustering

Spectral clustering methods are based on a straightforward interpretation of weighted
undirected graphs, as can be seen in [1, 40, 45, 58]. The Spectral Clustering approach
is based on a Similarity Graph, which can be built in three different ways (all of
them equivalent [58]):
1. The ε-neighbourhood graph: all the components whose pairwise distance is
smaller than ε are connected.
2. The k-nearest neighbour graph: the vertex $v_i$ is connected with vertex $v_j$ if
$v_j$ is among the k-nearest neighbours of $v_i$.
3. The fully connected graph: all points with positive similarity are connected
with each other.
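The three constructions can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names, the use of the Euclidean distance, the symmetric version of the k-nearest neighbour rule and the example similarity function are all assumptions.

```python
import math


def epsilon_graph(points, eps):
    """Connect every pair of points whose Euclidean distance is below eps."""
    n = len(points)
    return [[1 if i != j and math.dist(points[i], points[j]) < eps else 0
             for j in range(n)] for i in range(n)]


def knn_graph(points, k):
    """Connect v_i and v_j if either is among the other's k nearest neighbours.

    The symmetrisation keeps the graph undirected (an assumption of this sketch).
    """
    n = len(points)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            w[i][j] = w[j][i] = 1
    return w


def fully_connected_graph(points, similarity):
    """Weight every pair of distinct points by a positive similarity function."""
    n = len(points)
    return [[similarity(points[i], points[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]


# Hypothetical data: two well-separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
eps_graph = epsilon_graph(points, eps=2.0)
nn_graph = knn_graph(points, k=1)
full_graph = fully_connected_graph(
    points, lambda x, y: math.exp(-math.dist(x, y) ** 2))
print(eps_graph)
print(nn_graph)
```

The first two constructions produce sparse graphs, which helps when the dataset is large, while the fully connected graph keeps every pairwise similarity.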
The main problem is how to compute the eigenvectors and the eigenvalues of the
Laplacian matrix of this Similarity Graph. For example, when large datasets are
analysed, the Similarity Graph of the Spectral Clustering algorithm takes too much
memory, which makes the computation of the eigenvalues and eigenvectors difficult.
Some works focus on this problem: von Luxburg et al. [58] present the problem,
Ng et al. [45] apply an approximation to a specific case, and Nadler et al. [40]
apply operators to get better results. The classical algorithms can be found in [58].

The good behaviour observed for Spectral Clustering (SC) is justified theoretically
using perturbation theory [58, 40], random walks and graph cuts [58]. Perturbation
theory also explains, through the eigengap, the behaviour of Spectral Clustering.

Some of the main problems of Spectral Clustering are related to the consistency
of the two classical methods used in the analysis: normalized and un-normalized
spectral clustering. A deep analysis of the theoretical effectiveness of normalized
clustering over un-normalized clustering can be found in [59].

4.2.1. The Spectral Clustering Algorithm

Spectral Clustering methods were introduced by Ng et al. in [45]. These methods
apply the knowledge extracted from graph spectral theory to clustering techniques.
These algorithms are divided into three main steps:
1. The algorithm constructs a graph using the data instances as nodes and applying
a similarity measure to define the edge weights (see Algorithm 1 line 1). The
different types of graphs are explained above. The similarity measure most
commonly used in the literature is the Radial Basis Function (RBF) kernel,
defined by:
\[
s(x_i, x_j) = e^{-\sigma \|x_i - x_j\|^2} \tag{8}
\]
where σ is used to control the width of the neighbourhood.

2. It studies the graph spectrum by calculating the Laplacian Matrix associated with
the graph (see Algorithm 1, lines 2 and 3). There are different definitions of the
Laplacian Matrix. These definitions achieve different results when they are
applied to the Spectral Clustering algorithm. They are used to categorize the
Spectral Clustering techniques as follows [58]:
• Unnormalized Spectral Clustering. It defines the Laplacian matrix as:
\[
L = D - W \tag{9}
\]
• Normalized Spectral Clustering. It defines the Laplacian matrix as:
\[
L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2} \tag{10}
\]
• Normalized Spectral Clustering (related to Random Walks). It defines
the Laplacian matrix as:
\[
L_{rw} = D^{-1} L = I - D^{-1} W \tag{11}
\]
In these formulas $I$ is the identity matrix, $D$ is the diagonal matrix whose
$(i, i)$-element is the sum of the $i$-th row of the similarity matrix, and $W$
represents the Similarity Graph (see Algorithm 1 line 2). Once the Laplacian is
calculated, its eigenvectors are extracted (see lines 4 and 5 of Algorithm 1).
Algorithm 1 uses the Normalized Spectral Clustering algorithm; however, to
simplify, the eigenvalues which it calculates are $1 - \lambda_i$ instead of
$\lambda_i$ (the eigenvectors do not change).
The three Laplacian Matrices have been deeply studied in the related literature
[58, 40, 59]. They are connected to the graph cut problem, which looks
for the best way to cut a graph keeping a high connection amongst the elements
which belong to each partition, and a low connection between the elements of
different partitions.
The graph cut problem is closely related to clustering. In the graph cut
literature this problem has two classical solutions [58]: RatioCut and NCut. Von
Luxburg et al. [58] describe the connection between the different approaches of
SC (focused on the Laplacian Matrices), RatioCut and NCut. They also show
that Unnormalized Spectral Clustering converges to RatioCut and the Normalized
method converges to NCut. On the other hand, a deep analysis of the theoretical
effectiveness of Normalized clustering over Unnormalized clustering can be
found in [59].
3. The eigenvectors of the Laplacian Matrix are considered as points and a cluster-
ing algorithm, such as K-means [37], is applied over them to define the clusters
(see Algorithm 1 lines 7 and 8).
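The three Laplacian definitions in Eqs. (9), (10) and (11) can be computed directly from a similarity matrix. The sketch below uses plain Python lists; the function name and the small example matrix are assumptions of this illustration, and every node is assumed to have a positive degree so that the inverses of $D$ exist.

```python
import math


def laplacians(w):
    """Return (L, L_sym, L_rw) for a symmetric similarity matrix w."""
    n = len(w)
    d = [sum(row) for row in w]  # diagonal of D: row sums of W
    if any(x <= 0 for x in d):
        raise ValueError("every node needs a positive degree")
    # Eq. (9): L = D - W
    l = [[(d[i] if i == j else 0.0) - w[i][j] for j in range(n)]
         for i in range(n)]
    # Eq. (10): L_sym = D^{-1/2} L D^{-1/2}
    l_sym = [[l[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
             for i in range(n)]
    # Eq. (11): L_rw = D^{-1} L
    l_rw = [[l[i][j] / d[i] for j in range(n)] for i in range(n)]
    return l, l_sym, l_rw


# Hypothetical similarity matrix of a three-node weighted graph.
W = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.5],
    [1.0, 0.5, 0.0],
]
L, L_sym, L_rw = laplacians(W)
for name, matrix in (("L", L), ("L_sym", L_sym), ("L_rw", L_rw)):
    print(name, matrix)
```

Two quick sanity checks follow from the definitions: the rows of $L$ and $L_{rw}$ sum to zero, and $L_{sym}$ is symmetric with ones on its diagonal when $W$ has a zero diagonal.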

4.3. Community Finding Approach

The main application of the communities approach is Social Networks. The
clustering problem is more complex when it is applied to find communities in
networks (subgraph identification). A community can be considered as a subset of
individuals with relatively strong, direct and intensive connections [21] between
them. Some algorithms such as Edge Betweenness Centrality (EBC) [22] or the Clique
Percolation Method (CPM) [16] have been designed to solve this problem following a
deterministic process. The EBC algorithm [22] is based on finding the edges of the
network which connect communities and removing them to obtain a good definition of
these communities. CPM [16] finds communities using k-cliques (where k is a fixed
value of connections in a graph), which are defined as complete (fully connected)
subgraphs of k vertices. It defines a community as the maximal union of k-cliques.
CPM has two variants, for directed graphs and for weighted graphs [48].

Algorithm 1 Normalized Spectral Clustering according to Ng et al. (2001) [45]

Require: A dataset of n elements $X = \{x_1, \ldots, x_n\}$ and a fixed number of clusters k.
Ensure: A set of clusters $C = \{C_1, \ldots, C_k\}$ which partitions X.
1: Form the affinity matrix $W \in \mathbb{R}^{n \times n}$ defined by
$W_{ij} = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$ if $i \neq j$, and $W_{ii} = 0$.
2: Define D to be the diagonal matrix whose (i, i)-element is the sum of the i-th row of
W.
3: Construct the matrix $L = D^{-1/2} W D^{-1/2}$.
4: Find $v_1, \ldots, v_k$, the eigenvectors of L associated with its k largest eigenvalues
(chosen to be orthogonal to each other in the case of repeated eigenvalues) and form
the matrix $V = [v_1 v_2 \ldots v_k] \in \mathbb{R}^{n \times k}$ by stacking the
eigenvectors in columns.
5: Form the matrix Y from V by renormalizing each row of V to have unit length (i.e.
$Y_{ij} = V_{ij} / (\sum_j V_{ij}^2)^{1/2}$).
6: Apply K-means (or any other algorithm) treating each row of Y as a point in $\mathbb{R}^k$.
7: Assign the points $x_i$ to cluster $C_j$ if and only if the row i of the matrix Y was
assigned to cluster j.
8: return C
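The steps of Algorithm 1 can be sketched with NumPy as shown below. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes NumPy is available, and the function name, the default parameters, the farthest-first initialisation and the small Lloyd-style K-means loop are choices made only for this example.

```python
import numpy as np


def spectral_clustering(x, k, sigma=1.0, n_iter=50):
    """Sketch of normalized spectral clustering following Algorithm 1."""
    x = np.asarray(x, dtype=float)
    # Line 1: Gaussian affinity matrix with zero diagonal.
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    w = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    # Lines 2-3: L = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    lap = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # Line 4: eigh returns eigenvalues in ascending order, so the eigenvectors
    # of the k largest eigenvalues are the last k columns.
    _, vectors = np.linalg.eigh(lap)
    v = vectors[:, -k:]
    # Line 5: renormalize each row to unit length.
    y = v / np.linalg.norm(v, axis=1, keepdims=True)
    # Line 6: simple K-means with farthest-first initial centres (assumption).
    centres = [y[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((y - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(y[np.argmax(dist)])
    centres = np.array(centres)
    for _ in range(n_iter):
        labels = np.argmin(
            ((y[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = y[labels == c].mean(axis=0)
    # Line 7: row i of Y and point x_i share the same cluster label.
    return labels


# Hypothetical data: two compact groups of points far from each other.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = spectral_clustering(data, k=2)
print(labels.tolist())
```

For datasets of realistic size a sparse similarity graph and an iterative eigensolver would normally replace the dense computations used here.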

5. Complex Network Analysis

The analysis of complex networks has become a very important field, especially in
physics. It is used to analyse Social Networks, which are usually represented as
Complex Networks.

5.1. Types of Networks

There are four basic types of Networks which are generally considered:

• Random Network [19]: This network is based on random connections. The graph is
usually generated by assigning an edge between each pair of nodes with a
predefined connection probability, which is usually small. These graphs are
also called Erdős-Rényi graphs because the model used to generate them was
introduced by these authors in 1959. One of the main properties of Random
Graphs concerns the Clustering Coefficient: this metric usually has small
values, which is unusual in real networks [4]. This gives the intuition that real
networks have a correlation factor amongst their connections. Figure 1 shows an
example of a Random Graph generated by the Erdős-Rényi model using 50 nodes
and a connection probability of p = 0.05.

Figure 1. Example of a Random Network with 50 nodes and a connection probability of
p = 0.05
• Regular Network [62]: In this kind of network each vertex has the same number
of neighbours. The consequence is that every node has the same degree. These
networks have been studied theoretically, but it is difficult to find a
real-world regular network. However, the Watts and Strogatz algorithm [61]
takes advantage of this network structure to generate Small-World networks.
An example of this kind of network can be found in Figure 2, where a regular
network has been generated with 20 nodes and the degree of each node is 4.
• Scale-Free Network [5]: A scale-free network has a degree distribution where
the probability of a node having a given degree has a scale-invariant decay as
the degree grows. Hence, it follows a power law of the form
\[
P(n) \sim n^{-\gamma}, \tag{12}
\]
where $\gamma > 1$ is a constant and $n = 1, 2, \ldots, N$. They were introduced by
Barabási and Albert in [5] and are currently known as the Barabási-Albert (BA)
or preferential attachment model [43]. This kind of distribution is very different
from that of homogeneous random networks. They are neither random nor
small-world networks. The features of these networks are frequently observed
in real-world networks. Figure 3 shows an example of this kind of network for
45 nodes.
• Small-World Network [61]: Small-World networks are characterized by “the
small-world effect”. This term is used to describe networks whose average
path length is comparable with that of a Random Network, without any regard to
the clustering coefficient. Small-World networks were introduced by Watts and
Strogatz [61]. They can be obtained via the following algorithm [14]:
1. First, arrange all nodes on a ring and connect each node with its k = 2
nearest neighbours (see Figure 2).
2. Second, start with an arbitrary node i and rewire its connection to its nearest
neighbour on, for example, the left side with probability p to any other node
j in the network. Choose the next node and repeat the process.
3. Third, after all nearest-neighbour connections have been checked, repeat this
procedure for the second-nearest and all higher-order neighbours successively.
This algorithm guarantees that each connection occurring in the network is
chosen exactly once to test for a rewiring with a fixed probability which controls
the disorder of the resulting topology.
Taking a regular graph as a starting point, in which the diameter is proportional
to the size of the network, it can be transformed into a “small world” in which
the average number of edges between any two vertices is very small, while the
clustering coefficient stays large. Figure 4 shows an example of how the Regular
Network of Figure 2 can be transformed into a “Small-World” Network.

Figure 2. Example of a Regular Network with 20 nodes and where all nodes have degree 4

Figure 3. Example of a Scale-Free Network where N = 45

Figure 4. Example of a Small-World Network generated with the algorithm of Watts and
Strogatz [61] where p = 0.05, k = 4 (neighbours), and N = 20 (nodes) and taking the
regular graph of Figure 2 as starting point.
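The generation of random, regular and small-world networks described above can be sketched with the standard library only. The function names and the edge representation (a set of sorted node pairs) are assumptions of this example, and the rewiring loop is one possible reading of the three steps listed above, in which duplicate edges and self-loops are avoided.

```python
import random


def random_network(n, p, rng):
    """Erdős-Rényi style graph: each pair is linked with probability p."""
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p}


def ring_lattice(n, k):
    """Regular network: each node is linked to its k nearest ring neighbours.

    k is assumed to be even and smaller than n.
    """
    return {tuple(sorted((i, (i + s) % n)))
            for i in range(n) for s in range(1, k // 2 + 1)}


def small_world(n, k, p, rng):
    """Rewire each lattice edge once with probability p (Watts-Strogatz style)."""
    edges = ring_lattice(n, k)
    for s in range(1, k // 2 + 1):       # nearest neighbours first, then further
        for i in range(n):
            old = tuple(sorted((i, (i + s) % n)))
            if old in edges and rng.random() < p:
                # Candidate targets: no self-loops and no duplicate edges.
                candidates = [j for j in range(n) if j != i
                              and tuple(sorted((i, j))) not in edges]
                if candidates:
                    edges.remove(old)
                    edges.add(tuple(sorted((i, rng.choice(candidates)))))
    return edges


rng = random.Random(0)  # fixed seed so the example is reproducible
er_edges = random_network(50, 0.05, rng)
regular_edges = ring_lattice(20, 4)
sw_edges = small_world(20, 4, 0.05, rng)
print(len(er_edges), len(regular_edges), len(sw_edges))
```

The parameters mirror those quoted in the captions of Figures 1, 2 and 4; rewiring moves edges without creating or destroying them, so the small-world graph keeps the same number of edges as the regular one.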

5.2. Page Rank and HITS

The analysis of a Complex or Social Network can focus on all the information
that has been mentioned above; however, there are other algorithms which deserve to
be explained in this work. These algorithms are PageRank [9] and HITS [31]. They
can be used to obtain information about the most representative nodes of the network
and how they affect it.

5.2.1. PageRank
PageRank [9] is a link analysis algorithm initially used by the Google web search
engine. It assigns a numerical weigh to each element of a linked set of nodes (which
14 Héctor D. Menéndez and José Luis Llorente

Figure 5. Example of PageRank application

in the original implementation was though as a hyperlinked set of web pages such as
the World Wide Web). The purpose is to measure the importance of each node within
the graph. The numerical weight assigned to each node ni is referred to the PageRank
value of ni denoted by P R(ni ).

The PageRank algorithm is an iterative algorithm which calculates recurrently the

following values:

1−d X P R(nj )
P R(ni ) = +d (13)
N L(nj )
nj ∈M(ni )

Where PR(n_j) is the PageRank value of node n_j, d is the damping factor which
is used to adjust the algorithm, N is the number of nodes, L(n_j) is the number of
outbound links of node n_j and M(n_i) is the set of nodes with links pointing to n_i.
This algorithm is usually solved using either an algebraic process or an iterative
process. In addition, when the iterative process is used, the PageRank values are
usually normalized.
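
The iterative process of Equation (13) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name pagerank, the dictionary-based graph representation, the default damping factor d = 0.85 and the stopping tolerance are all assumptions made for this example.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate Eq. (13) until the values stop changing.

    out_links maps every node to the list of nodes it links to;
    each link target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    # M(n_i): the nodes that have a link pointing to n_i.
    in_links = {v: [] for v in nodes}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    pr = {v: 1.0 / n for v in nodes}  # uniform starting values
    for _ in range(max_iter):
        new_pr = {}
        for v in nodes:
            incoming = sum(pr[u] / len(out_links[u]) for u in in_links[v])
            new_pr[v] = (1 - d) / n + d * incoming
        # Normalize so that the values sum to 1 after every iteration.
        total = sum(new_pr.values())
        new_pr = {v: x / total for v, x in new_pr.items()}
        change = sum(abs(new_pr[v] - pr[v]) for v in nodes)
        pr = new_pr
        if change < tol:
            break
    return pr


# Hypothetical four-link example: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

In this toy graph node C receives links from both A and B, so it ends up with the highest value, followed by A, which receives the whole weight of C.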

Figure 5 shows an example of the PageRank application to a directed graph. The
biggest nodes represent the highest PageRank values and vice versa.

5.2.2. HITS
Hyperlink-Induced Topic Search (HITS) [31] (also known as hubs and authorities) is
a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. It was a
precursor to PageRank.

Figure 6. Example of HITS algorithm application: (a) Authorities, (b) Hubs

The HITS algorithm calculates two main values: hub and authority. These values
are calculated iteratively. The authority value is calculated as follows:

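\[
PA(n_i) = \sum_{\{n_j \mid e_{ji} \in G\}} PH(n_j) \tag{14}
\]

Where e_ji is an edge from node n_j to node n_i and G is the graph, so the sum runs
over the nodes that link to n_i. PH(n_j) is the hub value of n_j. The hub value is
calculated as follows:

\[
PH(n_i) = \sum_{\{n_j \mid e_{ij} \in G\}} PA(n_j) \tag{15}
\]

Here e_ij is an edge from node n_i to node n_j, so the sum runs over the nodes that
n_i links to.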

The algorithm usually begins with PA(n_i) = PH(n_i) = 1 ∀i. Figure 6 shows an
example of the HITS application to a directed graph. The biggest nodes in sub-figure
(a) represent the highest Authority values, while those in sub-figure (b) represent
the highest Hub values. The smallest nodes represent the lowest values in each case.
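
The iteration can be sketched as follows, using the usual formulation in which the authority of a node is the sum of the hub values of the nodes linking to it, and the hub value of a node is the sum of the authority values of the nodes it links to. The function name hits, the fixed number of iterations and the normalization step after each update (added here so that the values do not grow without bound) are assumptions of this sketch.

```python
import math


def hits(out_links, iterations=50):
    """Return (authority, hub) dictionaries for a directed graph.

    out_links maps every node to the list of nodes it links to;
    each link target must also appear as a key.
    """
    nodes = list(out_links)
    in_links = {v: [] for v in nodes}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    def normalize(values):
        # Scale to unit Euclidean length; leave an all-zero vector untouched.
        norm = math.sqrt(sum(x * x for x in values.values())) or 1.0
        return {v: x / norm for v, x in values.items()}

    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub values of the nodes that link to v.
        auth = normalize({v: sum(hub[u] for u in in_links[v]) for v in nodes})
        # Hub: sum of the authority values of the nodes that v links to.
        hub = normalize({v: sum(auth[w] for w in out_links[v]) for v in nodes})
    return auth, hub


# Hypothetical example: A links to C and D, B links to C.
graph = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
authority, hub_score = hits(graph)
print("best authority:", max(authority, key=authority.get))
print("best hub:", max(hub_score, key=hub_score.get))
```

In this toy graph C is linked from both A and B, so it is the strongest authority, while A, which links to both authorities, is the strongest hub.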

6. Applications in Social Networks Analysis

This section gives some practical applications of Social Data Mining. Social
Analysis has been one of the most challenging fields since Web 2.0 started in 1999.
This type of web site provides a more interactive source of information, which
promotes interactions between users, originally through forums and chats. In 2004
Mark Zuckerberg founded Facebook, one of the most relevant Social Networks at
present. Facebook allows users to share comments and opinions. Two years later, in
2006, Jack Dorsey created Twitter. This Social Networking and Microblogging service
is currently one of the most popular and challenging Social Networks for text analysis.

6.1. Facebook
A social network can be analysed from the different perspectives described above.
A good example is Facebook, one of the most important Social Networks. It was
originally created to exchange photos between users who are friends inside the
Social Context; however, it is currently also used for videos, messages, games, etc.
The most important features of Facebook as a Social Network are:
• The ‘Friendship’ structure, where each user belongs to a community of friends
formed by people from his or her personal context.
• The ‘Like’ button, which expresses the interest of users in photos, videos,
comments, etc., posted by other users or by themselves.
• The comment option, which allows users to comment on everything (including
other comments) and generates interactions between them.
Using this structure as a starting point, it is possible to analyse the resulting
Network. The analysis can be approached from several points of view. For example,
[20] discusses the mesoscopic features of the community structure of this network.
After unveiling the communities, which represent the aggregation units among which
users gather and interact, the author analyses the statistical features of those
communities, discovering and characterizing some specific organization patterns
followed by individuals interacting in online social networks.

In [11] the authors focus on the participants of online Social Networks. The
data is anonymous and organized as an undirected graph. They develop a set of tools
to analyse specific properties such as degree distribution, centrality measures,
scaling laws and distribution of friendship.
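
Two of these properties are easy to illustrate on an undirected friendship graph. The sketch below is not the tool set developed in [11]; it only shows, for a hypothetical list of friendships, how a degree distribution and a degree centrality can be computed. It assumes that every friendship is listed once and that every user has at least one friend.

```python
from collections import Counter


def degrees(edges):
    """Number of friends of every user in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_distribution(edges):
    """Fraction of users that have exactly k friends, for every observed k."""
    deg = degrees(edges)
    counts = Counter(deg.values())
    return {k: counts[k] / len(deg) for k in sorted(counts)}


def degree_centrality(edges):
    """Degree of each user divided by the maximum possible degree (n - 1)."""
    deg = degrees(edges)
    n = len(deg)
    return {user: d / (n - 1) for user, d in deg.items()}


# Hypothetical friendships between four users.
friendships = [("ana", "ben"), ("ana", "carl"), ("ana", "dana"), ("ben", "carl")]
print(degree_distribution(friendships))
print(degree_centrality(friendships))
```

Here ana is a friend of every other user and therefore has the maximum degree centrality of 1.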

In [3] the authors face a link prediction problem: given a snapshot of a network,
they infer which interactions among existing members are likely to occur in the near
future, or which existing interactions are missing from the data.

Finally, [35] introduces a new public dataset based on manipulations and
embellishments of Facebook. In the second half of that paper, the authors use a
community finding algorithm to find subgroups defined by gender, race/ethnicity, and
socioeconomic status.

6.2. Twitter
Twitter is a Social Network where people usually publish personal opinions. It
distinguishes two kinds of user relationship: follower and following. As a follower,
the user receives the information published by the people he or she follows; as
someone who is followed, the user's information is sent to his or her followers. The
messages that users share are called Tweets. Tweets are short texts limited to 140
characters which can contain personal opinions, photos, links, etc. A user can also
re-tweet the information of other users to share it.

The information of Twitter or other networks based on text interchange (such as
forums or Facebook) can be used for analysis.

In [55] the authors examine the influence of geographic distance, national
boundaries, language, and frequency of air travel on the formation of social ties on
Twitter. They show that a substantial share of ties lies within the same metropolitan
region, and that between regional clusters, distance, national borders and language
differences all predict Twitter ties.

In [29] the authors analyse the usage of Twitter. They also present a taxonomy
characterizing the underlying intentions users have in making microblogging posts.
By aggregating the apparent intentions of users in implicit communities extracted
from the data, they show that users with similar intentions connect with each other.

In [39] the authors propose a method for the Twitter Social Network that takes a
single static snapshot of network edges and user account creation times to accurately
infer when these edges were formed. This method can be exact in theory, and they
demonstrate empirically for a large subset of Twitter relationships that it is
accurate to within a few hours in practice.

Finally, [33] studies the topological characteristics of Twitter and its power as a
new medium of information sharing. The authors found a non-power-law follower
distribution, a short effective diameter, and low reciprocity. In order to identify
influential users on Twitter, they ranked them by the number of followers and by
PageRank, and found the two rankings to be similar.
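
Two of the quantities mentioned here, the ranking by number of followers and the reciprocity of the network, can be illustrated on a small directed graph. The sketch below is not the procedure used in [33]; the data and the function names are hypothetical, and reciprocity is taken to be the fraction of follow relations that are returned.

```python
def follower_counts(follows):
    """Count followers per user; follows holds (follower, followed) pairs."""
    counts = {}
    for follower, followed in follows:
        counts.setdefault(follower, 0)  # users nobody follows still appear
        counts[followed] = counts.get(followed, 0) + 1
    return counts


def reciprocity(follows):
    """Fraction of follow relations (a, b) for which (b, a) also exists."""
    follows = set(follows)
    mutual = sum(1 for (a, b) in follows if (b, a) in follows)
    return mutual / len(follows)


# Hypothetical follow relations: (follower, followed).
follows = {("a", "b"), ("b", "a"), ("c", "a"), ("d", "a"), ("d", "b")}
counts = follower_counts(follows)
# Rank by number of followers, breaking ties alphabetically.
ranking = sorted(counts, key=lambda user: (-counts[user], user))
print(ranking)
print(reciprocity(follows))
```

A ranking obtained in this way could then be compared with the one given by the PageRank values of the same graph.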

7. Software Tools
There are several tools used in Data Mining and Social Data Mining analysis. Some
relevant and straightforward tools are the following:
• Gephi¹: Gephi [7] is a visualization and graph analysis software oriented to all
kinds of networks and complex systems, including dynamic and hierarchical graphs.
It implements several algorithms for Social Network analysis, such as PageRank
and HITS. It also has algorithms to improve the graph visualization using
different layouts, and it is able to calculate different metrics of the graph such
as degree (power-law), betweenness, closeness, density, path length, diameter,
modularity and clustering coefficient.
• Graphviz²: Graphviz [18] (Graph Visualization Software) is open source graph
visualization software. It can represent diagrams, graphs and networks. It has
been applied to networking, bioinformatics, software engineering, database and
web design, machine learning and visual interfaces for other technical domains.
• JUNG³: JUNG [46] (the Java Universal Network/Graph Framework) is a Java
library that provides tools for graph or network modeling, analysis and
visualization. It has been designed to support a variety of representations and
analytic tools for complex data sets. It also provides a visualization framework
with different layouts.
• Mahout⁴: The Apache Mahout™ [47] machine learning library was designed
to provide scalable machine learning algorithms. It has several Machine Learning
tools for clustering, classification and batch-based collaborative filtering, which
are implemented on top of Apache Hadoop⁵ using the map/reduce paradigm.
¹ https://gephi.org
² http://www.graphviz.org
³ http://jung.sourceforge.net
⁴ http://mahout.apache.org
⁵ http://hadoop.apache.org/

• Matlab⁶: MATLAB® [63] is a high-level language and interactive environment
for numerical computation, visualization, and programming. It has several tools
to analyze data, develop algorithms, and create models and applications.
• Octave⁷: Octave [17] is a high-level interpreted language for numerical
computations. It provides extensive graphics capabilities for data visualization
and manipulation. The Octave language is similar to Matlab, so most programs
are easily portable.
• R⁸: R [50] is a language and environment for statistical computing and graphics.
It is a GNU project which provides a wide variety of statistical techniques (such
as modelling, statistical tests, time-series analysis, classification and
clustering, amongst others) and graphical techniques, and is highly extensible.
• Weka⁹: Weka [25] is a collection of machine learning algorithms for data mining
tasks. It contains tools for data pre-processing, classification, regression,
clustering, association rules, and visualization. It is open source software issued
under the GNU General Public License.

8. Conclusions
Graph Theory has become an important tool in different Social Data Mining
methods. It has influenced not only analysis procedures, such as Complex Network
methods, but also the creation of new techniques such as Spectral Clustering.

The analysis can follow either a Data Mining approach or a Complex Network
approach, depending on its goal. From the Data Mining point of view, unsupervised
methods such as clustering algorithms or community finding algorithms can be used
to find groups of similar users within the Social Network, while from the Complex
Network point of view, the HITS and PageRank algorithms can be used to find the
most relevant users, and network analysis can be used to define the nature and
robustness of the network.

Finally, these methods have several applications, especially for current Social
Networks such as Facebook and Twitter, using different software tools developed by
the research communities and companies.

References
[1] Francis Bach and Michael Jordan. Learning Spectral Clustering, With Applica-
tion To Speech Separation. Journal of Machine Learning Research, 7:1963 –
2001, October 2006.
[2] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[3] L. Backstrom and J. Leskovec. Supervised Random Walks: Predicting and Rec-
ommending Links in Social Networks. November 2010.
⁶ http://www.mathworks.co.uk/products/matlab/index.html
⁷ http://www.gnu.org/software/octave
⁸ http://www.r-project.org
⁹ http://www.cs.waikato.ac.nz/ml/weka/

[4] A.L. Barabási. Linked: how everything is connected to everything else and what
it means for business, science, and everyday life. Plume book. Plume, 2003.
[5] Albert-László Barabási and Réka Albert. Emergence of Scaling in Random Net-
works. Science, 286(5439):509–512, October 1999.
[6] A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani. The architec-
ture of complex weighted networks. Proceedings of the National Academy of
Sciences of the United States of America, 101(11):3747–3752, March 2004.
[7] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An open
source software for exploring and manipulating networks. 2009.
[8] Avrim L. Blum and Pat Langley. Selection of relevant features and examples in
machine learning. Artif. Intell., 97:245–271, December 1997.
[9] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual
web search engine. In Proceedings of the seventh international conference on
World Wide Web 7, WWW7, pages 107–117, Amsterdam, The Netherlands, 1998.
Elsevier Science Publishers B. V.
[10] Susan Rovezzi Carroll and David J. Carroll. Statistics Made Simple for School
Leaders. Rowman & Littlefield, 2002.
[11] Salvatore A. Catanese, Pasquale De Meo, Emilio Ferrara, Giacomo Fiumara, and
Alessandro Provetti. Crawling Facebook for Social Network Analysis Purposes.
pages 1+, May 2011.
[12] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community
structure in very large networks. Physical Review E, 70(6):066111+, December
2004.
[13] Leticia Curiel, Bruno Baruque, Carlos Dueñas, Emilio Corchado, and Cristina
Pérez-Tárrago. Genetic algorithms to simplify prognosis of endocarditis. In
Proceedings of the 12th international conference on Intelligent data engineering
and automated learning, IDEAL’11, pages 454–462, Berlin, Heidelberg, 2011.
Springer-Verlag.
[14] M. Dehmer, editor. Structural Analysis of Complex Networks. Birkhäuser Pub-
lishing, 2010. in press.
[15] K. Delac, M. Grgic, and S. Grgic. Independent comparative study of PCA, ICA,
and LDA on the FERET data set. International Journal of Imaging Systems and
Technology, 15(5):252–260, 2005.
[16] Imre Derényi, Gergely Palla, and Tamás Vicsek. Clique Percolation in Random
Networks. Physical Review Letters, 94(16):160202–1 – 160202–4, Apr 2005.
[17] John W. Eaton. GNU Octave Manual. Network Theory Limited, 2002.
[18] John Ellson, Emden R. Gansner, Eleftherios Koutsofios, Stephen C. North, and
Gordon Woodhull. Graphviz - Open Source Graph Drawing Tools. Graph Draw-
ing, pages 483–484, 2001.
[19] P. Erdős and A. Rényi. On random graphs. I. Publ. Math. Debrecen, 6:290–297,
1959.
[20] Emilio Ferrara. A Large-Scale Community Structure Analysis In Facebook,
March 2012.
[21] Santo Fortunato, Vito Latora, and Massimo Marchiori. Method to find commu-
nity structures based on information centrality. Physical Review E (Statistical,
Nonlinear, and Soft Matter Physics), 70(5):056104, 2004.
[22] M. Girvan and M. E. J. Newman. Community structure in social and biological
networks. Proceedings of the National Academy of Sciences of the United States
of America, 99(12):7821–7826, June 2002.
[23] Alexander N. Gorban and Andrei Zinovyev. Principal manifolds and graphs in
practice: From molecular biology to dynamical systems. International Journal
of Neural Systems, 20(3):219 – 232, 2010.
[24] W. Graham. Facebook API Developers Guide. Apresspod Series. Apress, 2008.
[25] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reute-
mann, and Ian H. Witten. The weka data mining software: an update. SIGKDD
Explor. Newsl., 11(1):10–18, November 2009.
[26] Jiawei Han and Micheline Kamber. Data mining: concepts and techniques. Mor-
gan Kaufmann, 2006.
[27] Erez Hartuv and Ron Shamir. A clustering algorithm based on graph connectiv-
ity. Information Processing Letters, 76(4–6):175–181, 2000.
[28] D. Helbing and S. Balietti. From social data mining to forecasting socio-
economic crises. The European Physical Journal - Special Topics, 195(1):3–68,
May 2011.
[29] Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. Why We Twit-
ter: An Analysis of a Microblogging Community. In Haizheng Zhang, Myra
Spiliopoulou, Bamshad Mobasher, Giles, Andrew McCallum, Olfa Nasraoui,
Jaideep Srivastava, and John Yen, editors, Advances in Web Mining and Web
Usage Analysis, volume 5439 of Lecture Notes in Computer Science, chapter 7,
pages 118–138. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[30] Keehyung Kim, RI (Bob) McKay, and Byung-Ro Moon. Multiobjective evolu-
tionary algorithms for dynamic social network clustering. In Proceedings of the
12th annual conference on Genetic and evolutionary computation, GECCO ’10,
pages 1179–1186, New York, NY, USA, 2010. ACM.
[31] Jon M. Kleinberg. Hubs, authorities, and communities. ACM Comput. Surv.,
31(4es), December 1999.
[32] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artif.
Intell., 97:273–324, December 1997.
[33] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter,
a social network or a news media? In Proceedings of the 19th international
conference on World wide web, WWW ’10, pages 591–600, New York, NY,
USA, 2010. ACM.
[34] Daniel T. Larose. Discovering Knowledge in Data. John Wiley and Sons, 2005.
[35] Kevin Lewis, Jason Kaufman, Marco Gonzalez, Andreas Wimmer, and Nicholas
Christakis. Tastes, ties, and time: A new social network dataset using Face-
book.com. Social Networks, 30(4):330–342, October 2008.
[36] Marek Lipczak and Evangelos Milios. Agglomerative genetic algorithm for clus-
tering in social networks. In Proceedings of the 11th Annual conference on Ge-
netic and evolutionary computation, GECCO ’09, pages 1243–1250, New York,
NY, USA, 2009. ACM.
[37] J. B. Macqueen. Some methods of classification and analysis of multivariate
observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, pages 281–297, 1967.

[38] Kevin Makice. Twitter API: Up and Running: Learn How to Build Applications
with the Twitter API. O’Reilly Media, Inc., 1 edition, April 2009.
[39] Brendan Meeder, Brian Karrer, Amin Sayedi, R. Ravi, Christian Borgs, and Jen-
nifer Chayes. We know who you followed last summer: inferring social link
creation times in twitter. In Proceedings of the 20th international conference on
World wide web, WWW ’11, pages 517–526, New York, NY, USA, 2011. ACM.
[40] Boaz Nadler, Stéphane Lafon, Ronald Coifman, and Ioannis G. Kevrekidis. Dif-
fusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Opera-
tors. pages 955 – 962, 2005.
[41] Mariá Cristina Vasconcelos Nascimento and André C. P. L. F. Carvalho. A graph
clustering algorithm based on a clustering coefficient for weighted graphs. J.
Braz. Comp. Soc., 17(1):19–29, 2011.
[42] Tamás Nepusz. Data mining in complex networks: Missing link prediction and
fuzzy communities. PhD thesis, Budapest University of Technology and Eco-
nomics, 2008.
[43] M. E. J. Newman. The Structure and Function of Complex Networks. SIAM
Review, 45(2):167–256, 2003.
[44] M. E. J. Newman and M. Girvan. Finding and evaluating community structure
in networks. Phys. Rev. E, 69:026113, Feb 2004.
[45] A. Ng, M. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an al-
gorithm. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in
Neural Information Processing Systems, pages 849–856. MIT Press, 2001.
[46] J. O’Madadhain, D. Fisher, S. White, and Y. Boey. The JUNG (Java Universal
Network/Graph) Framework. Technical report, UCI-ICS, October 2003.
[47] Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action.
Manning Publications, 1 edition, January 2011.
[48] Gergely Palla, Imre Derenyi, Illes Farkas, and Tamas Vicsek. Uncovering the
overlapping community structure of complex networks in nature and society.
Nature, 435(7043):814–818, June 2005.
[49] P. Pons and M. Latapy. Computing communities in large networks using random
walks (long version). ArXiv Physics e-prints, December 2005.
[50] R Development Core Team. R: A Language and Environment for Statistical Com-
puting. R Foundation for Statistical Computing, Vienna, Austria, 2010. ISBN
3-900051-07-0.
[51] Jörg Reichardt and Stefan Bornholdt. Statistical mechanics of community detec-
tion. Phys. Rev. E, 74:016110, Jul 2006.
[52] Volker Roth and Tilman Lange. Feature selection in clustering problems. In
Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in
Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[53] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64,
2007.
[54] Tom Smith. The social media revolution. International Journal of Market Re-
search, 51(4):559–561, July 2009.
[55] Yuri Takhteyev, Anatoliy Gruzd, and Barry Wellman. Geography of Twitter
networks. Social Networks, 34(1):73–81, January 2012.

[56] Lindsay A. Thompson, Kara Dawson, Richard Ferdig, Erik W. Black, J. Boyer,
Jade Coutts, and Nicole Paradise P. Black. The intersection of online social
networking with medical professionalism. Journal of general internal medicine,
23(7):954–957, July 2008.
[57] A Tsonis, K Swanson, and G Wang. Estimating the clustering coefficient in
scale-free networks on lattices with local spatial correlation structure. Physica
A: Statistical Mechanics and its Applications, 387(21):5287–5294, 2008.
[58] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing,
17(4):395–416, December 2007.
[59] Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spec-
tral clustering. The Annals of Statistics, 36(2):555–586, April 2008.
[60] D. J. Watts. Small worlds : the dynamics of networks between order and ran-
domness. 1999.
[61] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks.
Nature, 393(6684):409–10, 1998.
[62] Waigai Zhen. Graph Theory and its Engineering Applications. Advanced series
in electrical and computing engineering. World Scientific Publishing Company,
Incorporated, 1997.
[63] Dongli Zhou, Wesley K. Thompson, and Greg Siegle. MATLAB toolbox for
functional connectivity. NeuroImage, 47(4):1590–1607, October 2009.
