Self-Healing in Emerging Cellular Networks: Review, Challenges and Research Directions

This document discusses the need for self-healing solutions in emerging cellular networks. It notes that network operators currently spend a large portion of their budgets resolving outages, but that manual methods become less viable as networks grow more dense and complex. Self-healing aims to automate outage detection, diagnosis and resolution to reduce costs. The document reviews existing self-healing approaches and identifies key challenges for developing solutions that can meet 5G requirements like low latency and high quality of experience.

Ahmad Asghar, Member, IEEE, Hasan Farooq, Member, IEEE, and Ali Imran, Senior Member, IEEE

A. Asghar, H. Farooq and A. Imran are with the Department of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135 USA (e-mail: ahmad.asghar@ou.edu, hasan.farooq@ou.edu, ali.imran@ou.edu; see http://www.bsonlab.com for complete information).

Abstract—Mobile cellular network operators spend nearly a quarter of their revenue on network management and maintenance. Incidentally, a significant proportion of that budget is spent on resolving outages that degrade or disrupt cellular services. Historically, operators have mainly relied on human expertise to identify, diagnose and resolve such outages. However, with growing cell density and diversifying cell types, this approach is becoming less and less viable, both technically and financially. To cope with this problem, research on Self-healing solutions has gained significant momentum in recent years. Self-healing solutions either assist in resolving these outages or carry out the task autonomously without human intervention, thus reducing costs while improving mobile cellular network reliability. However, despite their growing popularity, to date no survey has been undertaken for Self-healing solutions in mobile cellular networks. This study aims to bridge this gap by providing a comprehensive survey of Self-healing solutions proposed in the domain of mobile cellular networks, along with an analysis of the techniques and methodologies employed in those solutions. This article begins by providing a quantitative analysis to highlight why in emerging mobile cellular networks Self-healing will become a necessity instead of a luxury. Building on this motivation, the article provides a review and taxonomy of existing literature on Self-healing. Challenges and prospective research directions for developing Self-healing solutions for emerging and future mobile cellular networks are also discussed in detail. Particularly, we identify that the most demanding challenges from a Self-healing perspective are the difficulty of meeting the 5G low latency and high quality of experience requirements.

Index Terms—Self Organizing Network; Self Healing; 5G; Future Mobile Cellular Networks

I. INTRODUCTION

At a time when mobile cellular network operators are competing for customers demanding higher data rates and greater data capacity at lower costs, keeping revenue margins up is proving increasingly difficult. Furthermore, the rising network operating expenses add to the stress on network operator revenues. Mobile cellular network expenditures are divided into two primary categories, i.e., capital expenditure, which is spent on acquiring and updating network entities, and operational expenditure, which is spent on managing and maintaining existing network resources. Based on industry estimates, mobile cellular network operators spend between 23% and 26% of their total revenue on mobile cellular network operation [1, 2]. A breakdown of operational expenses reveals that a significant proportion of it is spent on managing mobile cellular network outages and performance degradations. Such service interruptions require human intervention and may sometimes go unnoticed, leading to poor customer experience and eventually to high customer churn. According to one survey estimate [3], mobile cellular network operators worldwide spent nearly $20 billion in the year 2015 to counter issues caused by network outages and service degradations, which accounts for nearly 1.7% of total revenue and nearly 7% of total operational expenses.

The inevitable introduction of 5G technologies for mobile cellular networks brings with it a key challenge of increased load on network resources in terms of network performance management. The primary solution to this challenge proposed by researchers and the mobile cellular network standardization body, 3GPP, is the deployment of Self-Organizing Network (SON) solutions to automate processes that would otherwise require skilled human input. SON solutions are broken down into three key areas: Self-configuration, Self-optimization [4] and Self-healing [5]. Self-configuration is dedicated to solutions that autonomously configure mobile cellular network nodes for plug and play. Self-optimization is related to solutions that target mobile cellular network performance optimization based on operator specifications. Self-healing is focused on solutions that identify performance issues in the mobile cellular network such as cell outages and key performance indicator (KPI) degradations. On top of the three components of SON mentioned above, Self-coordination was also introduced by the 3GPP as part of Release 10 specifications for 4th Generation mobile cellular networks [6] to address the potential conflicts arising between SON solutions that would lead to KPI degradations.

To understand how the four SON components are related, a generalized SON framework is given in Fig. 1. While Self-configuration and Self-optimization represent more implicit areas of operational expenditure reduction, Self-healing provides the clearest quantifiable path towards operational expenditure reduction by minimizing the impact of mobile cellular network outages [3]. These include outages caused by failure of physical or soft components of the network entities, rendering them non-functional and causing complete or full outage, or significant service degradations leading to partial outage that may not necessarily generate any system level alarms.

Fig. 1: Self-Organizing Network Framework for Cellular Networks

An overview of the key research drivers for Self-healing is presented as follows.

a) Reduction of network operating expenses: As mentioned already, mobile cellular network operators can spend as much as 1.7% of the total revenue on fixing issues due to network outages. Network outages have the potential to disrupt service to millions of subscribers, as recently observed in case studies [7] and [8]. Overwhelming reliance on manual
outage detection, diagnosis and compensation not only slows down the recovery process, but is also more expensive than autonomous solutions. Thus, autonomous Self-healing solutions are one of the most inviting areas for mobile cellular network operators to cut down their operational costs for managing network outages.

b) Increase in network data: Human experts have a limited capability to absorb large amounts of network information at the same time and come to conclusions about the existence of outages or KPI degradations in the mobile cellular network. This means that as the number of entities in the network grows, the number of experts needed to monitor the network would grow proportionally. This will put further strain on the operators’ already inflated operating expenses. Self-healing can reduce the load on human experts by providing solutions for the detection of service degradations and disruptions.

c) Complexity of network architecture: With small cells expected to make up a significant part of future cellular network infrastructure [9], solutions specifically focusing on them must be developed. This concern is further fueled by the fact that small cells are subject to sparse reporting due to the low percentage of users associated with them and a more packed mobile cellular network topology in terms of inter-node distances. This makes it more difficult to identify service disruptions at small cells through traditional means.

d) Increase in network density: The increasing number of radio nodes in the 5G mobile cellular network can result in an increase in node failures [10]. This is demonstrated in Fig. 2, which shows the outage probability of a cell as mobile cellular network density increases, obtained using a Poisson distribution-based method for estimating node failures derived from [10]. Fig. 2 shows the probability of a single node failure in one day (lower line chart), in three days (middle line chart), and seven days (top line chart). We can see that the probability of node failures is relatively low in a low density network such as a 2nd Generation mobile cellular network. However, as the network density increases, the probability of node failure increases, so much so that on any given day the probability of node failure could be anywhere between 60% and 99.8%.

Fig. 2: Outage Probability of One Cell with Increase in Cell Density

Hardware failures are already a significant area of concern for network operators. In [11] the authors present an analysis of customer complaints over a period of nine months in an enterprise network. The authors conclude that nearly 39% of all customer complaints are due to hardware failures. Therefore, it is safe to assume that if the number of network nodes is increased significantly, the corresponding probability of hardware failure will also increase. In the wake of an increasing number of nodes per unit area, dealing with such high rates of node failures will be very difficult if mobile cellular network operators continue the practice of manual outage management. In short, Self-healing solutions will be less of a luxury and more of a necessity in future 5G networks.

Fig. 3: Probability of Single Parameter Misconfiguration with Increase in Configurable Parameters

e) Increase in network parameters: With the introduction of 5G services and the associated technologies discussed above, the number of configuration and optimization parameters is expected to grow significantly [12]. The increasing number of network control parameters and entities can raise the probability of parameter misconfiguration significantly. The frequency and impact of parametric misconfiguration have been noted by Yin et al. [13]. Based on an analysis of a large number of customer complaints, the authors conclude that nearly 31% of high-severity customer complaints are due to misconfigured parameters. Out of this, 85.5% issues were due to mistakes in parameter configuration and in only
15% of the cases does a misconfiguration lead to an actual alarm. Otherwise, the misconfiguration is only identified when a customer complains about service outage. Though the actual count of customer complaints is not shared in [13], if we assume that there are 2000 parameters in the network and 10,000 complaints are received over a period of two years, the probability of a parametric misconfiguration every 100 days is 1.5%.

A quantitative analysis of parameter misconfiguration in 5G mobile cellular networks is presented in Fig. 3, which shows the probability of misconfiguration of one parameter per cell every 100 days as the total number of configurable parameters per cell increases. The parameter misconfiguration probability is also derived using the Poisson distribution-based method of failure estimation presented in [10]. In Fig. 3, three different probabilities, 0.01% (bottom line chart), 0.05% (middle line chart), and 0.1% (top line chart), of parametric misconfiguration per 100 days are assumed. These probabilities are well below the parameter misconfiguration probability estimated from [13]. Furthermore, since the data in [13] comes from an analysis of customer complaints, it is safe to argue that parametric misconfiguration does lead to a disruption of service. From Fig. 3 it is clear that parametric misconfiguration will become a major concern for mobile network operators in 5G networks.

f) Increased focus on Quality of Experience (QoE) calls for increased focus on Self-healing: Very high user QoE requirements in 5G mobile cellular networks mean near ubiquitous spatial and temporal network availability for various 5G use cases. The state-of-the-art network availability estimation process depends on classic drive test-based methods. However, the process is time and resource consuming while lacking comprehensiveness due to inaccessibility of a major portion of the network, i.e., all areas other than paved roads. Therefore, better methods are needed for network availability estimation and outage detection for 5G networks.

Additionally, low latency requirements for several 5G use cases mean that classic methods of manual outage diagnosis and manual outage compensation will not suffice. To address this challenge, autonomous mechanisms to compensate outages quickly and seamlessly need to be developed.

A. Past Work and Contributions

In terms of mobile cellular networks, SON and Self-optimization have received significant attention, with comprehensive studies published highlighting the contributions in both areas. Aliu et al. [14] present an overview of the recent studies carried out under the scope of SON for cellular networks, while Peng et al. [15] have presented an overview of state-of-the-art in Self-configuration and Self-optimization in mobile cellular networks.

Another area of automation in wireless networks is cognitive radio technologies. Cognitive radio technologies refer to dynamic spectrum access techniques that enable need-based bandwidth allocation to mobile users via heterogeneous physical layer resource usage [16]. A survey of cognitive radio technologies has been presented by Akyildiz et al. [17]. Discussion on state-of-the-art and future challenges of cognitive radio technologies has been presented by Akyildiz et al. [18] while Akhtar et al. [19] have discussed the exploitation of unlicensed and unused spectral resources for dynamic spectrum allocation. Furthermore, Zhang et al. [20] have presented a survey of the research studies on Self-optimization for cognitive radio technologies.

In terms of Self-healing, a survey of applications from natural systems to software engineering has been presented in [21] where analogies between self-rectifying software systems and natural systems have been studied. Psaier and Dustdar [22] discuss the applications of Self-healing in autonomous systems pertaining to the fields of information technology and communications. Furthermore, Paradis and Han [23] have surveyed studies on Self-healing capabilities in wireless sensor networks.

Self-healing techniques in mobile cellular networks have briefly been discussed in [14] in the larger context of SON. The authors have presented a description of Self-healing in mobile
cellular networks accompanied by a review of four outstanding works in the area. Since the publication of [14], research on Self-healing techniques for mobile cellular networks has grown significantly and, to the best of our knowledge, this study is the first attempt to provide a consolidated review of these developments. With the efforts to propose and standardize SON solutions for 5G technologies reaching their climax, the need for a comprehensive study on Self-healing highlighting the efforts of research groups, equipment manufacturers and standardization bodies could not be higher. Furthermore, this study aims to go well beyond the limited contributions of [14] towards surveying Self-healing techniques for mobile cellular networks by breaking down the studies in terms of the type of outages, the measurements and methodologies used, and their results.

The primary contributions of this paper are summarized as follows:

• This paper identifies the need for Self-healing solutions in the wake of 5G mobile cellular networks and explains why Self-healing functionality will not remain a luxury but will become a necessity in 5G and beyond.
• The paper provides a brief introduction and tutorial on Self-healing and provides a comprehensive review of major contributions from individual projects and collective standardization efforts undertaken so far with respect to Self-healing for mobile cellular networks.
• Following the intrinsic flow of Self-healing in nature and in practical applications, the paper organizes the literature on Self-healing into the three primary areas of Self-healing, i.e., Detection, Diagnosis and Compensation.
• The paper further categorizes the reviewed studies on Self-healing in terms of the network topology, performance metrics, control mechanisms, and methodologies used for detection, diagnosis and compensation of full and partial outages in a mobile cellular network. This allows easy understanding and comparison of studies within each particular area of Self-healing.
• The paper presents a comprehensive discussion of challenges in Self-healing and identifies the research directions therein. Notably, it discusses the two primary types of challenges faced by existing Self-healing solutions to adapt to 5G network requirements: 1) challenges that stem from ambitious QoE and low latency requirements in 5G, and 2) challenges that arise from the idiosyncrasies of anticipated 5G technologies, i.e., ultra-dense deployments, millimeter wave cells (in which outage is the norm, not an anomaly) and an increased rate of emergence of sudden traffic hotspots due to higher data rate per user, leading to sudden change in KPIs (partial outage).
• In order to enable the advancement of research in Self-healing solutions for future 5G mobile cellular networks, we also discuss possible solution methodologies for each of the aforementioned challenges.

The organization of this paper is as follows: Section II presents a brief tutorial on SON and Self-healing including possible taxonomies. Section III presents key definitions and terminologies used in the development of Self-healing solutions for mobile cellular networks. Based on the generally accepted trifurcation of Self-healing in literature specific to mobile cellular networks, Sections IV, V and VI provide a survey of Detection, Diagnosis and Compensation techniques for outages occurring in mobile cellular networks respectively. In Section VII, we identify key challenges faced by the Self-healing paradigm to become adaptable by 5G and beyond, along with prospects for future work in the field of Self-healing for mobile cellular networks. Section VIII concludes the key aspects of this survey. For ease of reference, key acronyms are given in Table I.

TABLE I: Key Acronym Definitions

Acronym     Definition
SON         Self-Organizing Network
KPI         Key Performance Indicator
QoE         Quality of Experience
SINR        Signal to Interference and Noise Ratio
LOF         Local Outlier Factor
kNN         k-Nearest Neighbors
(OC)SVM     (One Class) Support Vector Machines
SOM         Self-Organizing Maps
NBC         Naïve Bayes Classifier
HC          Healing Channel
UAV         Unmanned Aerial Vehicles

II. SELF-HEALING: BACKGROUND STUDY

A. Self-Organizing Networks in Cellular Mobile Networks

SON functions gained popularity with the introduction of 4th Generation cellular networks, primarily due to the increased network complexity. The efficacy of a SON function depends on four key design components [24]: Autonomy: SON functions must be independent of human input; Scalability: any SON functions deployed in the mobile cellular network must be scalable in terms of both time and space; Adaptability: the functions must be able to adapt to outside influences and internal failures. Additionally, it has been proposed that future SON networks must be intelligent [12], i.e., they must be able to learn from the information generated by the users and mobile cellular network entities to become completely independent in terms of adapting network parameters based on the primary goals of the operator.

As described previously, SON functions for cellular networks can be broadly classified into three main categories, i.e., Self-configuration, Self-optimization and Self-healing, with Self-coordination being introduced to manage SON function interactions. Since SON functions in general [14] and Self-optimization in particular [20] have already been the subject of comprehensive studies, this study is aimed at covering the work done in the domain of Self-healing for mobile cellular networks.

B. Self-healing in Mobile Cellular Networks

Traditionally, mobile cellular network operators employ human experts to detect, diagnose and recover the network from any faults and outages in the network. As per the
standard fault management framework defined by the 3GPP [25], faults and outages include issues such as hardware failures of mobile cellular network nodes, software failure issues at the nodes, failures of functional resources in which case no hardware component is responsible for the fault, loss of node functionality due to system overloading, and communication failure between two nodes due to internal or external influence. In such cases, the node will become completely dysfunctional leading to a full outage. As per 3GPP specifications, faults must be accompanied by the generation of an alarm that identifies the node and the type of failure that has occurred. The alarm may contain additional information to aid the recovery of the system but that is dependent on the equipment manufacturer.

Conversely, many service affecting issues in mobile cellular networks do not generate alarms or may not specifically be classified as faults or failures. Such issues are labeled partial outages. One such example is the degradation of a performance metric due to sudden changes in the mobile cellular network environment. Partial outages may include service degradations due to environmental effects, sudden variations in traffic, or the presence of man-made interference sources that hinder normal operation of the network. Thus, mobile cellular network operators are dependent on human experts to monitor the network data to identify any such anomalies and to execute recovery actions to counter them. However, with the advent of 4G and the growth in network sizes and subscribers, network operators can no longer rely purely on human experts to sift through the vast amounts of network performance data generated consequently in search of anomalies.

1) Research in Self-healing: Self-healing specifically for cellular networks has been studied as part of several research projects focusing on SON for cellular networks including the EUREKA Gandalf project [26] which explored the parametric interactions in 2G, 2.5G and 3G networks with the environment and studied the impact of automation in wireless networks, especially UMTS and Wi-Fi networks. The key deliverable of the project was a Bayesian Networks based fault identification and diagnosis toolkit.

Similarly, the SOCRATES project [27] was aimed at investigating the impact of automation particularly in LTE networks, while the QSON project [28] investigated SON solutions primarily for Self-optimization and Self-healing along with preliminary analysis of the interactions of parameters and metrics as part of SON coordination. The project investigated new techniques, especially the exploitation of big data analytics [12], to empower existing SON solutions. Recently, the SEMAFOUR project [29] has been launched which aims to develop a unified self-management system for heterogeneous radio access networks, comprising multiple radio access technologies and SON solutions including solutions for network anomaly detection, diagnosis and compensation for 4G standards and possible future 5G cellular networks.

2) Self-healing Framework for Cellular Networks: As the number of physical entities in a network increases, the probability of network outages, both full and partial, increases proportionally as demonstrated in Figs. 2 and 3. In order to respond to these network outages, typical Self-healing solutions employ a 3-stage framework. The first stage is detecting network outages for which outage detection algorithms are deployed. For effective Self-healing, the outage detection solution must be able to detect both full and partial outages. In case a network outage is detected, the outage detection solution flags the affected network node for further actions, depending on the outage type. For example, in case a cell experiences hardware failure and is no longer able to send and receive data, it will be flagged for Self-healing.

Once the outage has been detected, diagnostic algorithms will execute routines to identify the exact cause of the network outage. For the sample case of hardware failure, the diagnosis algorithm will examine alarms and fault codes to pinpoint the hardware component whose failure led to the outage. This information will then be relayed to the Network Controller which will either command field teams to replace the failed component or activate the redundancy elements to take over operations of the failed entity. Conversely, if the outage is partial, the diagnosis algorithm will break down the degraded KPI or KPIs in order to identify the reason for the outage.

Upon completion of outage diagnosis, the information is passed along to the final stage of the Self-healing function, i.e., outage compensation. In the outage compensation stage, the Self-healing function determines the impact of the outage on neighboring entities and the subscribers, which is then used to execute changes to mitigate the outage. For example, in the case of hardware failure, the outage compensation solution will identify the coverage hole created as a result of the outage and execute changes in neighboring cells to provide temporary coverage to affected subscribers. Alternatively, in the case of partial outage, the outage compensation solution may execute emergency parameter changes at either the affected cell or its neighbors or both to recover the degraded KPI or KPIs. The complete Self-healing framework, along with relevant studies, is demonstrated in Fig. 4. A taxonomy of studies based on these components is presented in Fig. 5.

Fig. 4: Self-Healing Framework
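The three-stage flow described in this subsection (detection, then diagnosis, then compensation) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the `CellReport` structure, the `KPI_LIMITS` thresholds, the simplifying rule that any alarm means a full outage, and the textual "actions" returned by the compensation step. It is not the framework of Fig. 4 or any surveyed solution, only a toy rule-based instance of the same staging.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class OutageType(Enum):
    NONE = "none"
    FULL = "full"        # node is down; assumed here to be accompanied by an alarm
    PARTIAL = "partial"  # KPI degradation that raises no alarm


@dataclass
class CellReport:
    cell_id: str
    alarms: List[str] = field(default_factory=list)
    kpis: Dict[str, float] = field(default_factory=dict)


# Hypothetical upper limits for "lower is better" KPIs; real limits are operator-specific.
KPI_LIMITS = {"session_drop_rate": 0.05, "handover_failure_rate": 0.10}


def detect(report: CellReport) -> OutageType:
    """Stage 1: flag the cell as full outage, partial outage, or healthy."""
    if report.alarms:
        return OutageType.FULL
    if any(report.kpis.get(name, 0.0) > limit for name, limit in KPI_LIMITS.items()):
        return OutageType.PARTIAL
    return OutageType.NONE


def diagnose(report: CellReport, outage: OutageType) -> List[str]:
    """Stage 2: return alarm codes (full outage) or the degraded KPIs (partial outage)."""
    if outage is OutageType.FULL:
        return list(report.alarms)
    if outage is OutageType.PARTIAL:
        return [n for n, limit in KPI_LIMITS.items() if report.kpis.get(n, 0.0) > limit]
    return []


def compensate(report: CellReport, outage: OutageType,
               causes: List[str], neighbors: List[str]) -> List[str]:
    """Stage 3: propose actions; neighbors cover a full outage, the cell is retuned otherwise."""
    if outage is OutageType.FULL:
        return [f"extend coverage of {n} toward {report.cell_id}" for n in neighbors]
    if outage is OutageType.PARTIAL:
        return [f"retune {report.cell_id} to recover {kpi}" for kpi in causes]
    return []


def self_heal(report: CellReport, neighbors: List[str]) -> List[str]:
    """Run the three stages in order and return the proposed actions."""
    outage = detect(report)
    causes = diagnose(report, outage)
    return compensate(report, outage, causes, neighbors)


if __name__ == "__main__":
    failed = CellReport("cell-7", alarms=["HW_FAULT_RRU"])
    print(self_heal(failed, ["cell-6", "cell-8"]))
```

Keeping the stages as separate functions mirrors the text: each stage only consumes the output of the previous one, so a learning-based detector could replace `detect` without touching diagnosis or compensation.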
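The claim that outage probability grows with the number of network entities (Figs. 2 and 3) rests on a Poisson distribution-based estimate derived from [10], which the text does not spell out. One common reading, stated here as an assumption rather than as the authors' exact method, is that failures arrive independently at a constant rate, so the probability of at least one failure among n entities over t days is 1 - exp(-rate * n * t). The sketch below uses that reading; the rate value is hypothetical and is not taken from the paper or from [10].

```python
import math


def prob_at_least_one_failure(rate_per_entity_per_day: float,
                              num_entities: int,
                              days: float) -> float:
    """P(N >= 1) where N ~ Poisson(rate * entities * days)."""
    expected_failures = rate_per_entity_per_day * num_entities * days
    return 1.0 - math.exp(-expected_failures)


# Hypothetical per-node failure rate, chosen only to show the trend.
ASSUMED_RATE = 1e-3

if __name__ == "__main__":
    for nodes in (10, 100, 1000):
        row = [prob_at_least_one_failure(ASSUMED_RATE, nodes, d) for d in (1, 3, 7)]
        print(nodes, [f"{p:.3f}" for p in row])
```

Under this assumption the probability rises monotonically with both the entity count and the observation window, which matches the qualitative shape described for Figs. 2 and 3; the absolute values depend entirely on the assumed rate.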

III. K EY C OMPONENTS OF S ELF -H EALING T ECHNIQUES algorithms presented in any study rely heavily on the choice
FOR M OBILE C ELLULAR N ETWORKS of performance metrics employed in the study to construct
To develop a comprehensive review of the work pertaining and evaluate them. The performance metrics most relevant to
to Self-Healing for mobile cellular networks, we present a studies on Self-healing can be classified under the umbrella
collection of key definitions that will enable the reader to term network health.
quickly comprehend the nuances of the reviewed studies. The Network health is a broad term used to describe the per-
five core components that constitute the logical structure of formance of the network in terms of universally accepted
these studies are: 1) methodology, 2) network topology, 3) KPIs such as Accessibility, Retainability and Mobility [39].
performance metrics, 4) control mechanism, and 5) direction Accessibility is the ability of subscribers to access the network
of control. resources for data transmission and includes KPIs such as
attach success rate, radio resource control setup success rate,
connection setup success rate, random access success rate etc.
A. Methodology
Retainability is the ability of the network to carry a data
Each study presenting a solution for detection, diagnosis or session to its completion without drop and is characterized
compensation of outages follows an underlying methodology. by the session drop rate KPI. Mobility is the ability of the
These can be split into three broad categories: 1) Heuristic, 2) network to allow successful transition of a subscriber from
Analytical and 3) Learning-based. Heuristic solutions follow one cell to another with minimal impact on services and is
a set of pre-defined rules and are built upon intuition or prior generally represented by handover attempt, success and failure
knowledge gained from existing literature or experience. Two rate KPIs.
heuristic solutions commonly found in literature are rule-based Additionally, measurements signifying network coverage in-
algorithms, which follow a set of if-else rules, and frameworks, cluding reference signal received power (RSRP), and network
which mostly consist of guidelines. Analytical solutions break quality including spectral efficiency, signal-to-interference and
down a given problem into its mathematical components, which are then solved to achieve an optimal or close-to-optimal solution. Analytical solution methodologies include techniques such as convex optimization [30], non-convex optimization such as pattern search [31], genetic algorithms [32], simulated annealing [33], etc., multi-objective optimization [30], and game theory [34]. Learning-based solutions are built on machine learning techniques popularized by the field of computer science. These algorithms rely overwhelmingly on user and network data and very little on expert knowledge [35]. Machine learning techniques are generally split into three overarching categories [36, 37], i.e., supervised, unsupervised and reinforcement learning.

B. Network Topology
The term network topology is defined as the architecture or layout of the network in terms of cell deployments. More specifically, network topology is used to describe the tiered structure of the network. There are two main types of network topologies used in the literature. Homogeneous networks consist of only one tier of cells. These cells may be only macro cells with large coverage areas or only small cells, which have lower power and consequently lower coverage. Conversely, a combination of macro and small cells forming a multi-tier cellular network is referred to as a heterogeneous network or HetNet. While most studies on legacy mobile cellular networks employ a homogeneous network topology as the baseline, HetNets are quickly gaining popularity due to their flexibility and their potential to achieve the goals set out for 5G cellular networks [38].

C. Performance Metrics
Performance metrics are the benchmark measurements used to evaluate network performance and can be obtained from network entities and user-generated reports. The solutions and

noise ratio (SINR), reference signal received quality (RSRQ), network and user data throughputs, channel quality indicators and data latency are also often employed in the design and analysis of Self-healing solutions.

D. Control Mechanism
Control mechanism is defined as the method of controlling SON solution functionality and can be categorized as: 1) Centralized, 2) Distributed, and 3) Hybrid. Centralized control implies that the SON functions are controlled from one central controller connected to every node in the network, whereas distributed control implies that the control of SON functions resides within the network nodes. Hybrid control is a combination of central and distributed control and implies that while some SON functions may reside inside a centralized SON controller, other less computationally heavy functions, which do not directly impact neighboring nodes, can be distributed to the nodes.

E. Direction of Control
Direction of control defines whether a SON function is designed to optimize the node-to-user link, the user-to-node link, or both. Solutions designed to optimize the node-to-user link are downlink controlled, whereas solutions optimizing the user-to-node link are uplink controlled. Some solutions optimize both downlink and uplink and thus offer bidirectional control of network performance.

IV. OUTAGE DETECTION IN CELLULAR MOBILE NETWORKS
While the standardized Self-healing framework [5] does present a roadmap to a fully integrated Self-healing framework, the precise inner workings of each component have been deliberately left open-ended. This has allowed researchers and

Fig. 5: Proposed Taxonomy
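The taxonomy dimensions described above (methodology, network topology, performance metrics, control mechanism and direction of control) can be recorded per study as a small data structure, which makes the later comparison tables easy to filter. The following is only an illustrative sketch: the class names, field names and the two example entries labeled [A] and [B] are hypothetical and are not taken from the surveyed studies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Methodology(Enum):
    HEURISTIC = "heuristic"
    ANALYTICAL = "analytical"
    LEARNING_BASED = "learning-based"


class Topology(Enum):
    HOMOGENEOUS = "homogeneous"
    HETNET = "HetNet"


class ControlMechanism(Enum):
    CENTRALIZED = "centralized"
    DISTRIBUTED = "distributed"
    HYBRID = "hybrid"


class Direction(Enum):
    DOWNLINK = "DL"
    UPLINK = "UL"
    BIDIRECTIONAL = "UL/DL"


@dataclass(frozen=True)
class StudyProfile:
    """One row of a qualitative comparison table (hypothetical layout)."""
    reference: str
    methodology: Methodology
    topology: Topology
    metrics: Tuple[str, ...]
    control: ControlMechanism
    direction: Direction


def filter_studies(
    studies: List[StudyProfile],
    *,
    control: Optional[ControlMechanism] = None,
    direction: Optional[Direction] = None,
) -> List[StudyProfile]:
    """Keep the studies matching every criterion that is not None."""
    return [
        s for s in studies
        if (control is None or s.control is control)
        and (direction is None or s.direction is direction)
    ]


# Hypothetical placeholder entries, for illustration only.
EXAMPLES = [
    StudyProfile("[A]", Methodology.HEURISTIC, Topology.HOMOGENEOUS,
                 ("retainability", "quality"),
                 ControlMechanism.CENTRALIZED, Direction.DOWNLINK),
    StudyProfile("[B]", Methodology.LEARNING_BASED, Topology.HETNET,
                 ("coverage",),
                 ControlMechanism.HYBRID, Direction.BIDIRECTIONAL),
]
```

Calling `filter_studies(EXAMPLES, control=ControlMechanism.HYBRID)` would return only the second placeholder entry.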

network equipment manufacturers to come up with proprietary algorithms to suit the needs of evolving mobile cellular networks. In this and the following sections, we describe the research done in each of the Self-healing framework components, beginning with a review of outage detection techniques. The studies in this section are ordered based on the type of outage and the methodology employed.

A. Full Outage Detection in Mobile Cellular Networks
The following subsections describe techniques and methodologies proposed for full outage detection in mobile cellular networks. The studies included in this section are summarized in Table II in terms of the techniques, network architectures, measurements and tools used within them.

1) Heuristic Solutions for Full Outage Detection: Heuristic algorithms and frameworks for cell outage detection rely heavily on the pre-existing knowledge of domain experts, which makes them extremely useful for deployment in existing mobile cellular networks. One such framework has been proposed by Amirijoo et al. [40], which employs a rule-based decision tree algorithm for full outage detection in mobile cellular networks. The framework derives its rules from expert knowledge to create full outage detection trigger thresholds for performance metrics such as cell load, radio link failures, handover failures, user throughputs and cell coverage. A more comprehensive approach to rule-based outage detection has been proposed by Liao et al. [41], which uses variations in user performance metric distributions to detect outages. The authors propose the construction of a weighted cost function composed of the channel quality indicator distribution, the time correlation of the channel quality differential, and radio resource connection re-establishment requests. The cost function is treated as a hypothesis of normal cell performance. A cell is considered in outage if its neighboring cells fail this hypothesis, i.e., their targeted KPIs deviate from normal. The authors demonstrate that, using measurements from cell edge users, the proposed algorithm can detect neighbor cell outages almost instantaneously.

2) Learning-based Solutions for Full Outage Detection: Beyond the heuristic methodologies of identifying outages in the network, machine learning based algorithms have been the prevailing method for full outage detection in research. Most of the studies on full outage detection that employ learning based algorithms can be split into two categories, i.e., supervised learning techniques for full outage detection solutions and unsupervised learning techniques for full outage

TABLE II: Qualitative Comparison of Cell Outage Detection Algorithms

Columns: Solution, Reference, Methodology, Sub-Method, Network Topology, Performance Metrics, Control Mechanism, Direction of Control
Retainability,
[40] Mobility,
Heuristic Rule-Based
Quality DL
Homogeneous Centralized
Accessibility,
[41]
Quality
[42] Coverage
Supervised
Retainability,
Full Outage Learning
[45] Mobility, UL/DL
Detection Quality
[73] HetNet Coverage Hybrid
Learning Based Accessibility,
[48] Mobility,
Coverage
Centralized
[56] Homogeneous Coverage DL
Unsupervised Learning
[59,62,63,66] Mobility, Coverage
Retainability,
[68]
Mobility
[54] Coverage Distributed
[71] Coverage
Centralized
[74] HetNet Retainability
[70] Coverage Hybrid
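
Several of the learning-based detectors listed in Table II, such as the kNN-based schemes, rest on one idea: a cell whose measurement vector lies far from previously observed normal behavior is suspicious. The sketch below illustrates that idea with a global k-nearest-neighbor distance score. It is a minimal illustration and not the formulation of any cited study; the function names, the choice of features, the value of k and the threshold are all assumptions.

```python
import math
from typing import List, Sequence


def knn_score(sample: Sequence[float],
              reference: List[Sequence[float]],
              k: int = 3) -> float:
    """Mean Euclidean distance from `sample` to its k nearest reference points.

    `reference` is assumed to hold feature vectors (for example, scaled
    received power and quality values) collected during normal operation.
    """
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the number of reference points")
    distances = sorted(math.dist(sample, r) for r in reference)
    return sum(distances[:k]) / k


def flag_outliers(samples: List[Sequence[float]],
                  reference: List[Sequence[float]],
                  k: int = 3,
                  threshold: float = 1.0) -> List[bool]:
    """Flag each sample whose kNN score exceeds an assumed, expert-set threshold."""
    return [knn_score(s, reference, k) > threshold for s in samples]


# Hypothetical two-dimensional feature vectors, for illustration only.
normal_profile = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
observed = [(0.05, 0.05), (5.0, 5.0)]
flags = flag_outliers(observed, normal_profile, k=3, threshold=1.0)
```

Because every sample is scored against the whole reference set, this is a global outlier test. A local method such as LOF would instead compare the density around a sample with the density around its neighbors.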

detection.

a) Supervised Learning Techniques for Full Outage Detection: Supervised algorithms are a popular choice in terms of full outage detection due to their reliance on pre-classified data. In the study by Mueller et al. [42], the authors have compared the performance of a rule-based heuristic algorithm against a decision tree algorithm [43] and a linear discriminant binary classification function [44] to identify complete cell outages. The algorithms use user reports containing downlink signal power measurements to detect when a cell stops featuring in neighbor cell lists due to outage. The results show that the expert system is faster but less successful in detecting neighbor cell outages, while the linear discriminant binary classification function performs the best in terms of true positive detection rate.

Another supervised learning approach for full outage detection is developing cell profiles for outage detection. Alias et al. [45] have proposed to develop performance profiles of cells in mobile cellular networks using hidden Markov chains [46], which track the state progression of network nodes that undergo outages. The proposed framework requires execution of controlled outages to build state profiles using signal quality and signal strength measurements of the outage-affected cell and its neighbors. These measurements are then used to identify cell performance in real time to predict if a cell has experienced an outage. The results show that the proposed approach can reach an accuracy of up to 90% in low fading environments.

Since the idea of executing controlled outages to build cell profiles may be prohibitive for live mobile cellular networks, Szilágyi and Novaczki [47] have proposed to construct default activity profiles of cells using simulated network data to detect when a cell faces an outage. The proposed algorithm uses level functions which continuously monitor downlink signal metrics such as channel quality, call drop rate and handover timing advance to detect when a cell falls below the acceptable threshold set by human experts. The authors have demonstrated that the proposed approach can act in near-real time by detecting outages within a few minutes of occurrence, which is a significant improvement over the detection time by human experts, especially in very large networks.

b) Unsupervised Learning Techniques for Full Outage Detection: The unique ability of unsupervised learning algorithms to cluster data into distinct groups without any pre-classification makes them highly popular in outage detection applications. A major application of unsupervised learning is the detection of cells that are in outage but do not generate any alarms, otherwise known as sleeping cells. Detection of such cells is not immediately possible manually due to the lack of alarms accompanying the outage, which makes their detection a highly useful application of unsupervised learning.

An extensive comparison of clustering algorithms for sleeping cell detection has been presented by Chernov et al. [48], where they have compared the performance of k-Nearest Neighbors (kNN) [49], Self-Organizing Maps (SOM) [50], Locality-Sensitive Hashing [51] and Probabilistic Anomaly Detection. The authors use random access channel access failure measurements in addition to the high-dimensional minimization-of-drive-test (MDT) data [52] as input data for clustering algorithms. To compare the performance of individual clustering algorithms, receiver operating characteristics and precision-recall curves are used. The results show that Probabilistic Anomaly Detection has the best receiver operating characteristics out of the four algorithms and a higher precision-recall curve compared to the other algorithms. Additionally, the authors have compared the training time

of the four clustering algorithms, which shows that Locality-Sensitive Hashing has a training time of linear order, whereas Probabilistic Anomaly Detection takes the least amount of time to detect sleeping cells compared to the other algorithms.

Another clustering algorithm, Dynamic Affinity Propagation [53], has been utilized for sleeping cell detection by Ma et al. [54]. The proposed algorithm uses Dynamic Affinity Propagation to calculate user clusters based on received power values of neighboring and serving cells reported by users, while the Silhouette index [55] is used as the clustering quality criterion to estimate the number of significant user clusters. The resultant clustering is mapped to physical data including user location to identify cells in outage. While the approach clearly succeeds in identifying sleeping cells using simulated outages, it is possible that in a live network, some users suffering deep fade may be wrongly clustered.

Dimensionality Reduction for Unsupervised Learning: While the above unsupervised learning solutions have a high degree of accuracy, their computational cost is equally high because network and user data can have very high dimensions. In addition to being resource hungry, the highly dimensional network and user data may cause increased detection latency as well as over-fitting. Although the implications of these caveats are likely to surface in large-scale real networks, they are not explicitly addressed in the above studies, which rely on simulated small-scale networks and user populations for performance evaluation.

To tackle high dimensional network and user data, Chernogorov et al. [56] have proposed to construct diffusion maps [57] of user handover attempts and successes data. These diffusion maps are obtained through eigendecomposition of the Markov matrix obtained from the diffusion maps of network and user data. The resulting low-dimensional data is used to create cell coverage dominance maps, which are then used to detect sleeping cells through k-means clustering [58] of cells into normal and sleeping cell clusters. Alternatively, Chernogorov et al. [59] have employed principal component analysis [60] to reduce the dimensionality of network and user data. The lower dimension data is then used to identify sleeping cells using the FindCBLOF algorithm [61], which separates clusters of normal cells from sleeping cells. Although a direct comparison of the results of the approaches in [56] and [59] has not been presented, the authors separately demonstrate that the proposed algorithms in [56] and [59] can identify sleeping cells and the neighboring cells affected by the outage with a high level of accuracy, and can also quantify the impact of the outages in terms of failed handover and call events.

Alternatively, Zoha et al. [62, 63] have addressed the challenges posed by high dimensionality through multi-dimensional scaling [64]. Multi-dimensional scaling allows easy visualization of the high dimensional network and user data by translating it into fewer dimensions using kernel transformations. This reduces the convergence time of clustering algorithms. In [62], the resulting low dimensional data is passed to the Local Outlier Factor (LOF) [65] algorithm for sleeping cell identification, whereas kNN and LOF are compared with each other in [63]. It is observed that kNN outperforms LOF in terms of speed and reliability since LOF can sometimes misclassify normal cells.

The concepts from [62] and [63] are further extended by Zoha et al. [66] to include a comparison of LOF with the One-class Support Vector Machine (OCSVM) algorithm [67] under different shadowing scenarios. The results show that, like kNN, the OCSVM algorithm also outperforms LOF. Since LOF is limited to identifying outliers local to cell clusters, the algorithm is prone to identifying normal cells as sleeping cells. This is avoided in both kNN and OCSVM because of the global approach adopted by both algorithms, which only identifies global outliers. However, OCSVM takes significantly longer to train compared to either the kNN or LOF algorithms.

3) Full Outage Detection in HetNets: In the studies described above, the target topology for outage detection was invariably a homogeneous mobile cellular network of macro cells. Due to the large serving radii of macro cells and the high subscriber count associated with them, generating measurements for full outage detection is not a primary concern.

a) What makes outage detection in HetNets different from homogeneous networks?: Cell outage detection in HetNets differs from that in homogeneous networks due to the architectural difference between the two topologies. The low computational ability of small cells, sparse network information due to fewer connected users, and proposed future 5G solutions such as network densification mean that outage detection algorithms for HetNets must be designed separately. The influence of sparse network data on outage detection algorithms plays an extremely important role in the accuracy of the algorithm. Less data can mean less accurate outage detection and an increase in the false positive rate.

This fact is demonstrated by Chernov et al. [68], who compare the performance of several learning-based outage detection algorithms using radio link and handover failure metrics under different subscriber density levels. The results demonstrate that as the number of subscribers per cell, and consequently the number of performance metric report samples, starts to decrease, the area under the curve of the true positive rate plot decreases exponentially. The authors also demonstrate that this result holds regardless of the outage detection algorithm, which makes it a universal issue. Similar evidence is also implicit in the results presented in [45, 62, 63, 66].

b) Outage Detection in Sparse Data Environment: In a sparse data environment such as a HetNet with control-data separation architecture [69], Onireti et al. [70, 71] have proposed to use the Grey first order one variable prediction model [72] to predict the downlink received power of the cell at locations where no such data is reported. Outage detection is triggered when sudden changes in user associations are observed. The Grey prediction model predicts the downlink received power of the cells if user associations had remained the same. The predicted information is then compared to actual downlink measurement reports to identify cells in outage. For this purpose, the authors use the kNN and LOF algorithms, with kNN demonstrating higher prediction accuracy just as it did for the case of homogeneous networks [63]. The choice of Grey prediction models in this study stems from the fact that these models have been shown to have higher prediction accuracy

in sparse data environments compared to other prediction algorithms such as linear regression.

The algorithm proposed by Wang et al. [73] also refers to a HetNet with control-data separation, and outages in small cells are detected through a comparison of predicted versus actual measurements. Measurement prediction is made using collaborative filtering, where data collected during normal circumstances from highly correlated users is used to generate predictions for normal cell performance. The predicted data is then passed through sequential hypothesis testing, which measures the likelihood of a hypothesis being true and returns the hypothesis with the maximum likelihood of being true, i.e., whether a cell is in outage or not. The proposed algorithm is accurate nearly 75% of the time even in very low user density (1 user per 10,000 m²) and very high fading (8 dB).

Finally, Xue et al. [74] have proposed to use simulated radio link failure data of normal and outage-hit cells to overcome the lack of data generated per cell in an ultra-dense HetNet. The authors propose to use kNN clustering to detect outages in HetNets using simulated outages in the network to train the algorithm.

B. Partial Outage Detection in Cellular Networks
Partial outage detection has historically been the domain of network optimization experts since, unlike full outage, KPI degradation generally does not generate network alarms. Degradation of network performance can lead to poor user QoE and may go unnoticed not only because no alarms are generated, but also because, unlike full outage, the effect of partial outage may not manifest itself right away in the form of customer complaints. Therefore, it is essential to include partial outage detection in the autonomous Self-healing framework. In this sub-section, we discuss the recently proposed solutions for partial outage detection in mobile cellular networks, while Table III presents a qualitative comparison of the studies included in this sub-section. Before presenting techniques for partial outage detection, it is clarified that the terms partial outage and performance degradation are used interchangeably in this sub-section.

1) Heuristic Solutions for Partial Outage Detection:
a) Heuristic solutions leveraging large-scale network data for Partial Outage Detection: Karatepe and Zeydan [75] have proposed a heuristic rule based algorithm for network misconfiguration detection due to its scalability and speed of operation compared to learning-based approaches, especially when dealing with large-scale network data. The authors deploy a Hadoop [76] based data processing cluster to process large amounts of customer call detail record data, which contains timestamps, handover attempts and successes, and all the cells a user is associated with during the call. After data processing, the information is forwarded to a heuristic algorithm that matches user location with the associated cells and returns any misconfigurations observed during the call. The authors claim that the proposed algorithm can detect misconfigured cells over 82% of the time.

Similarly, Shafiq et al. [77] have proposed to compare cell profiles during routine network operation with performance during heavy traffic situations to identify partial outages. The authors use data from a large mobile cellular network operator to study the trend of several network performance metrics including radio link setup failures, user counts, dropped calls, blocked calls, data session count, data session duration and the average time between consecutive data sessions of a user. The resulting time series profiles of cells during routine operation are compared with their operation during an unusual traffic activity period such as a sporting event. The authors demonstrate that if the normal cell performance during routine operations is known, it is possible to predict the level of cell performance degradation during non-routine events with a high degree of accuracy.

b) Comparative analysis-based Heuristic solutions for Partial Outage Detection: In order to facilitate partial outage detection through comparative analysis of normal and degraded cell behavior, Novaczki and Szilagyi [78] propose construction of faultless network performance profiles by fitting network performance metrics such as channel quality to a β-distribution. The detection algorithm compares the α and β parameters of the real-time cell performance distribution with the faultless performance distribution parameters. If the real-time parameters differ from the faultless profile parameters by more than a threshold decided by experts, the cell is considered to be suffering partial outage.

Comparison of time-series distributions has also been explored by D’Alconzo et al. [79], who propose to construct univariate probability distribution functions of performance metrics including the number of synchronization packets and the number of distinct network addresses contacted. The baseline distribution functions are constructed for different temporal resolutions to avoid false detections. The approach in [79] differs from that in [78] since the proposal is to identify partial outages using the Kullback-Leibler divergence [80], or relative entropy, of the current behavior distribution from the baseline behavior distribution, while the behavior distribution modeling is not limited to β-distributions.

Correlational comparison of time-series is an alternative methodology among comparative analysis-based techniques for partial outage detection. An example of correlational comparison has been presented by Asghar et al. [81], who have proposed to utilize Pearson’s correlation coefficient to match cells based on cell load estimated through the number of active users associated with the cell. The algorithm states that if a cell falls below an arbitrary correlation threshold with multiple cells with which it was previously well correlated, it is considered to be degraded. The authors demonstrate that not only is the proposed method effective for detecting slow partial outages, it is also effective for full outage detection. However, the performance of this algorithm is highly dependent on correlated cells, i.e., if multiple correlated cells suffer the same degradation, it may go undetected. To avoid this pitfall, Muñoz et al. [82] have proposed to correlate the successful handover count and call drop count time series of a cell with a synthesized data series that represents partial outage and a reference data series of the cell itself during normal behavior as a preventive measure against false flags. High correlation with the synthesized data and low correlation with the reference data signifies partial outage. The authors

TABLE III: Qualitative Comparison of Partial Outage Detection Algorithms

Columns: Solution, Reference, Methodology, Sub-Method, Network Topology, Performance Metrics, Control Mechanism, Direction of Control
[75] Mobility
Centralized DL
[78,84] Rule-Based Quality
[79] Retainability Distributed UL
Heuristic Accessibility,
[77] Retainability,
Quality
Framework
[81] Quality
Partial Outage Homogeneous Retainability, DL
Detection [82] Centralized
Mobility
Retainability,
[83] Mobility,
Quality
Retainability,
[86] Supervised
Quality
Learning
Accessibility,
Retainability,
[87,88]
Learning Based Mobility,
Quality
[91,93] Quality
Accessibility,
Unsupervised Retainability,
[95] DL/UL
Learning Coverage,
Quality
[96] Quality Distributed
Retainability,
[97] DL
Mobility
Coverage
Accessibility,
[102,104]
Retainability
[105] Retainability

advocate the use of time-series correlations over cumulative data correlations since cumulative correlation may hide any short-term degradations in cell performance. However, time-series correlation requires more and faster computation, especially if more performance metrics are included in the comparison process.

c) Other heuristic solutions for Partial Outage Detection: In their work on partial outage detection, Sanchez-Gonzalez et al. [83] propose a decision tree based solution to identify partial outages in a mobile cellular network. The proposed algorithm applies a set of expert-defined rules separating normal and degraded behavior on the uplink and downlink received power measurements, handover failures, and radio link failures to categorize the performance of each cell. If a cell fails said rules, it is considered to be in partial outage and diagnostic functions are initiated. The solution is validated using real-network data, where it is able to effectively identify the degraded cells.

Merging heuristic and learning-based methodologies, Kumpulainen et al. [84] have proposed a hybrid solution for partial outage detection. The proposed solution evaluates channel quality measurements of a cell over one day and categorizes the quality samples as good, medium and bad based on a heuristic algorithm developed using expert knowledge. Additionally, the solution utilizes fuzzy C-means clustering [85] to generate cell clusters based on the commonality of their profiles in terms of channel quality data distribution over a day [84, Fig. 7]. Based on the similarity of the channel quality measurement distribution of a cell over a day with the fuzzy clusters, the solution decides if the cell is degraded. The authors have demonstrated that the proposed solution can identify not only degraded cell performance but also the amount of time the cell spends as degraded. However, the scalability of the solution requires further investigation since the proposed approach is limited to the evaluation of one performance metric over a period of a whole day.

2) Learning-based Solutions for Partial Outage Detection: One of the application areas of machine learning is the estimation of network reliability, explored by Sattiraju et al. [86]. The authors capture long-term reliability data such as link availability and apply a semi-Markov transition process to construct renewal models for normal and degraded network link states. Link reliability is defined as the amount of time network links spend in normal states, and two transition actions, i.e., failure and repair, exist in the network. The authors find that lower reliability states are highly absorbing states, i.e., once a link is sufficiently degraded, its recovery probability approaches zero.

Ciocarlie et al. [87] have also explored the feasibility of deploying time-series averaging based anomaly detection algorithms over variable window lengths. However, unlike the heuristic approaches presented in [78] and [79], the proposed algorithm uses autoregressive integrated moving average to compute predicted KPI values for a cell which are then

compared with an ensemble of models for different unspecified KPIs. The authors propose to construct normal and anomalous KPI models using different techniques including the empirical cumulative distribution function and SVM with a radial basis function kernel. The proposed solution is validated against human experts using visualization tools. Results show that while the proposed approach is able to accurately predict a partial outage, the detection delay between an outage occurring and being detected was never less than five hours. Another important concern raised by the authors is the exponential training time of the machine learning algorithms, which can make the proposed methodology prohibitive in live networks. The authors have provided further refinement of this approach in [88] by including the utilization of the Kolmogorov-Smirnov test [89] to identify the sliding window size for data streams used to train the SVM models. Another key distinction of [88] over [87] is that the authors use seasonal trend decomposition based on Loess [90] to identify and remove outliers from the original training data to create true performance models.

A key commonality among [78, 79, 87, 88] is the use of individual data streams as input to outage detection algorithms. However, Barreto et al. [91] postulate that using single variable data streams for anomaly detection, though simple, is not always effective. Therefore, the authors have proposed a joint neural network that takes univariate and multivariate data containing channel quality measurements, traffic loads and user throughputs from the network as inputs to generate global and local network performance profiles, which are used to detect anomalous cells via percentile-based confidence intervals computed over global and local network profiles. The authors demonstrate the efficacy of training a multivariate neural algorithm by presenting a comparison with a single-threshold neural algorithm using several neural network-based algorithms including winner-take-all, frequency sensitive competitive learning [92], Self-Organizing Map (SOM) and the neural gas algorithm. Results show that the proposed multivariate partial outage detection algorithm consistently outperforms the single-threshold method in terms of false positive alarm rate by 0.6% to over 5.5%.

Frota et al. [93] have presented an extension to the work in [91] where the authors combine the originally proposed multivariate neural networks with a Gaussian distribution based SOM clustering algorithm to create a partial outage detection algorithm. The authors use network core traffic statistics to train the Gaussian distribution based SOM clustering algorithm, which is compared with multivariate heuristic anomaly detection methods. It is demonstrated that the proposed technique can lower the false partial outage detection rate by nearly 30% when trained over 10% of the dataset compared to the algorithm proposed in [94] for fault diagnosis in rotating machines. However, the solution proposed in [93] builds on an underlying assumption that network performance metrics such as user count, throughput, noise levels and interference levels are normally distributed, which may not always hold true in typical real networks.

a) Partial Outage Detection using Self-Organizing Maps: Self-Organizing Maps are a popular neural network based clustering technique. SOMs work by projecting input vectors of large size onto a 2-dimensional space using weights obtained by training the underlying neural network. A number of studies have proposed SOM-based algorithms for partial outage detection including [91, 93, 95, 96, 97].

As already discussed, Barreto et al. [91] and Frota et al. [93] have used SOMs for comparison-based partial outage detection. On the other hand, Lehtimäki and Raivio [95] harness the capability of SOMs to arrange together similar input vectors of network measurements including call request blocking, traffic channel availability, channel quality, voice call traffic, and uplink/downlink signal strength. The authors use this arrangement to identify cells with partial outage through the k-means clustering algorithm. The proposed scheme is compared with principal component analysis and independent component analysis [98] to detect partial outages in control signaling and traffic channel statistics of a real 2G network. Results show that SOM and principal component analysis performed equally well while outperforming independent component analysis.

Kumpulainen and Hätönen [96] also use SOM based clustering to detect localized partial outages, in contrast to the general global partial outage detection models. The proposed algorithm first creates a SOM, which is then used to identify best matching units for each node in the map, and the distance (quantization error) between the two units is calculated. A cell is considered in partial outage if its best matching unit is also in outage and the distance between the two is less than a pre-defined threshold. The authors compare the usage of the local partial outage detection model using SOM with Gaussian Mixture Models and k-means clustering, with results showing that the local anomaly detection scheme detects not only all the outages but also whenever the activity level of a cell changes.

Gómez-Andrades et al. [97] employ a similar approach to [96] in their work, where SOM is used to arrange the cells based on signal strength, quality, call drop and handover failure metrics, and the cells are then clustered using Ward’s hierarchical clustering [99]. The authors use the Davies-Bouldin index [100] and the Kolmogorov-Smirnov test [101] to set the number of clusters to be created in the SOM. The clusters are labeled as normal or faulty based on expert knowledge. A comparison of the proposed methodology with a rule-based algorithm and a Bayesian network classifier shows that the proposed approach outperforms them by 31% and 12% respectively.

b) Partial Outage Detection using clustering techniques: Apart from SOMs, other unsupervised clustering techniques such as k-means, density based and hierarchical clustering, topic modeling and LOF clustering have also been explored in the literature for partial outage detection. Rezaei et al. [102] have presented a comparison of several supervised partial outage detection schemes in a 2G network. The study uses input data including call blocking and drops, as well as signal quality measurements. Classification techniques explored by the authors for partial outage detection include chi-squared automatic interaction detection [103], quick unbiased efficient statistical tree, Bayesian networks, SVM, and classification and regression trees. The authors find that SVM has the best detection rate among supervised learning techniques (94%) but requires longer training time while quick unbiased
efficient statistical tree has the shortest training time with relatively high accuracy (93%).

Ciocarlie et al. [104] use topic modeling to detect partial outages in a cellular network. The method resembles other clustering techniques with the difference that it assigns a probability to the presence of commonality within the cluster of cells. Once the clusters have been developed, the framework uses domain knowledge to identify which cluster represents anomalous behavior. The approach is tested on real-network data with verification of results performed using visual analysis of data by experts. Alternatively, Dandan et al. [105] have used kernel-based LOF anomaly detection, which is simply LOF with kernel based distance calculation. The authors propose using kernel-based LOF to identify cells in partial outage by associating a degree of anomaly to each cell in a density map for LOF based on kernel Gaussian distance (kGD). Normal cells are characterized by having a kGD of 1 and any cells with a kGD above 1 are outliers. The authors also suggest that kernel-based LOF can better deal with non-uniform distributions of cells in real datasets compared to the typical LOF algorithm. The proposed method has a 91% success rate in detecting outages compared to 70% for normal LOF.

C. Summary and Insights

Outage detection is one of the most labor intensive processes in a mobile cellular network. Researchers have devoted a lot of attention to autonomous full and partial outage detection solutions. The majority of these solutions attempt to detect outages based on coverage metrics such as received signal strength. For outage detection in future 5G networks with millimeter wave cell deployment, researchers will need to consider additional metrics. This is because millimeter wave cells have a very high pathloss leading to natural loss of coverage even at a distance of a few hundred meters [106]. A challenge for future studies is to come up with solutions that can detect outages in spite of the coverage limitations of millimeter wave cells.

A common theme among the studies for full and partial outage detection is the growing use of machine learning techniques in general, and unsupervised clustering techniques in particular, for outage detection. This reduces the chances of outages due to unconventional reasons, such as weather anomalies, being missed. This is not the case for heuristic and supervised machine learning based solutions since they are only trained to look for evidence of outage based on human expert knowledge. This does not mean that unsupervised learning solutions for outage detection can become industry standard as is. Some of the major issues concerning unsupervised learning solutions include:

1) Machine learning techniques in general are prone to errors due to noise in the recorded dataset, as demonstrated in [45, 62, 63, 66]. This means that unsupervised learning solutions deployed for outage detection in areas with high shadowing and multipaths, such as metro hubs, can result in higher false negatives. Future solutions for outage detection must address this issue before they can become practically viable.

2) The majority of techniques for outage detection discussed above only consider spatial data for outage detection purposes. This means that the KPI data used for outage detection is gathered over a set of spatial points representing user locations for one time instance. Therefore, outages detected by these solutions are instantaneous. This raises the issue of outages that are extremely short-lived, have little impact on subscriber QoE, and may be gone by the time they can be compensated. To address this issue, future solutions for outage detection must consider the temporal dimension as well as the spatial dimension of user reported data to differentiate between temporary and long-term outages.

3) Most of the approaches for outage detection reviewed above require a secondary analysis by a human expert to confirm the existence of the outage, which can add some delay before outage compensation is triggered. This can be an issue in 5G networks where low latency and high QoE requirements mean that the outages would have to be detected and compensated as quickly as possible.

In addition to addressing the above issues, future studies for outage detection must also incorporate the effects of millimeter wave propagation and capacity enhancement solutions such as massive MIMO. Additionally, detecting partial outages in massive MIMO cells such as failure of some beams will also need to be addressed. Based on the review of existing literature, there are no current studies that expressly include either of these two features, which makes them prime candidates for future research in outage detection.

V. OUTAGE DIAGNOSIS IN CELLULAR MOBILE NETWORKS

Once a network outage (full or partial) is detected, the next phase is to diagnose the underlying cause of the outage. In this section, we analyze the literature on Outage Diagnosis. Some full outages can trigger fault alarms, thus eliminating the need for full outage detection in those particular cases. However, the exact cause of the failure still needs to be diagnosed. Conversely, the key difficulty in diagnosis with partial outage is the lack of fault alarms associated with the anomalies, which makes their diagnosis more difficult, thus requiring sophisticated diagnostic techniques. Table IV provides the qualitative comparison of studies describing full and partial outage diagnosis techniques.

A. Diagnosis of Full Outages in Cellular Networks

A starting point towards full outage diagnosis is building the knowledge-base of possible faults. A quite extensive description of standard faults in cellular networks has been presented in [25] which are applicable to 2G, 3G and 4G networks. The standard documentation also provides alarm descriptions for faults associated with hardware failure, software failure, functionality failure or any other faults that cause the network node to stop performing its routine operations. However, outage diagnostics have remained in the domain of human experts who use their knowledge to identify outage causes.
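TABLE IV: Qualitative Comparison of Outage Diagnosis Algorithms

Solution                  Reference  Methodology     Sub-Method             Network Topology  Performance Metrics                              Control Mechanism  Direction of Control
Full Outage Diagnosis     [47]       Heuristic       Rule-Based             Homogeneous       Retainability, Mobility, Quality                 Centralized        UL/DL
Full Outage Diagnosis     [107]      Learning Based  Supervised Learning    Homogeneous       Accessibility, Retainability, Mobility, Quality  Centralized        DL
Full Outage Diagnosis     [108,110]  Learning Based  Supervised Learning    Homogeneous       Retainability                                    Centralized        DL
Full Outage Diagnosis     [102]      Learning Based  Unsupervised Learning  Homogeneous       Accessibility, Retainability                     Centralized        DL
Partial Outage Diagnosis  [77]       Heuristic       Framework              Homogeneous       Accessibility, Retainability, Quality            Centralized        DL
Partial Outage Diagnosis  [104]      Learning Based  Supervised Learning    Homogeneous       Accessibility, Retainability                     Centralized        DL
Partial Outage Diagnosis  [114]      Learning Based  Supervised Learning    Homogeneous       Accessibility, Retainability, Mobility, Quality  Centralized        DL
Partial Outage Diagnosis  [115]      Learning Based  Supervised Learning    Homogeneous       Retainability                                    Centralized        DL
Partial Outage Diagnosis  [97]       Learning Based  Unsupervised Learning  Homogeneous       Retainability, Mobility                          Centralized        DL
Partial Outage Diagnosis  [116]      Learning Based  Unsupervised Learning  Homogeneous       Quality                                          Centralized        UL
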

While this method is effective, it cannot remain as the method of choice going forward towards ultra-dense networks.

To this end, some studies have proposed techniques combining expert knowledge with mobile cellular network data to create autonomous outage diagnosis algorithms. One such approach has been demonstrated by Szilágyi and Novaczki [47] which utilizes expert knowledge to create targets for network performance such as channel quality, dropped calls and handover failures. The solution uses weighted sums of the difference of actual KPI value to the target value to calculate a diagnostic score. The algorithm then uses expert knowledge to associate a range of scores with different fault causes to complete the diagnosis process. The proposed technique is validated using real data, with results showing that the algorithm was able to diagnose each outage correctly.

1) Learning-based solutions for Full Outage Diagnosis: Solutions for outage diagnosis using stationary KPI targets derived from expert knowledge can become obsolete quickly in the face of changing network dynamics. Khanafer et al. [107] argue this point and propose an alternate learning-based solution using a Naïve Bayes Classifier (NBC) to predict possible causes of hardware faults and KPI degradations in the network given the symptoms (failures). The algorithm uses discretized value ranges for various KPIs including blocked calls, dropped calls, connection request failures, and HO failures to indicate normal and faulty performance states. The authors compare two different techniques of KPI value discretization, namely percentile-based discretization and entropy minimization discretization. Results show that outage diagnoses are over 10% more accurate when entropy minimization discretization is used compared to percentile-based discretization.

Barco et al. [108] compare the performance of an NBC for outage diagnosis with a modified NBC which assumes the independence of causal influence [109]. The two methods are compared using data from a live network containing faults such as call drops, handover failures and call blocking, with results showing modified NBC to be more efficient in terms of simplicity with the same level of accuracy as regular NBC. However, in order for modified NBC to diagnose outages accurately, it needs knowledge of prior KPI distributions in the event of an outage. Barco et al. [110] have discussed the process of developing this knowledge using a knowledge acquisition tool. The tool combines past diagnoses performed by experts with fault data from the mobile cellular network. The tool takes faults such as high network congestion or high call drops, possible causes such as high interference, observed performance metrics at the time of the fault such as handovers due to high interference, and cell parameter settings. Combining this information, the tool outputs the prior probabilities of different diagnoses.

Unlike other techniques for full outage diagnosis, Rezaei et al. [102] propose to use unsupervised clustering techniques for fault diagnosis and present a comparison of several such techniques including expectation maximization, density-based spatial clustering of applications with noise [111], agglomerative hierarchical clustering [112], X-means and k-means clustering. The authors use clustering algorithms to split cells based on their call drops and blocking values. Diagnosis is done by comparing cells in clusters to faulty cells with known diagnosis. Validation is done using expert knowledge to confirm the result of fault diagnosis through clustering. The clustering results are verified using the Silhouette Coefficient [113] and show that expectation maximization is the most successful technique in terms of data clustering, with the clearest cluster divisions between different sets of faulty cells.
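
As a rough illustration of the cluster-then-compare idea described for [102], the following Python sketch groups cells by their call drop and call blocking rates and then gives each cell the diagnosis of a known faulty cell that falls in the same cluster. It is not the implementation used in [102]: plain k-means stands in for any of the compared clustering techniques, and the cell names, KPI values, initial centroids and fault labels are all hypothetical.

```python
from math import dist

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain k-means; initial centroids are supplied so the run is deterministic."""
    for _ in range(iterations):
        members = [[] for _ in centroids]
        for p in points:
            members[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else old
            for group, old in zip(members, centroids)
        ]
    return centroids

def diagnose(cells, centroids, known_faults):
    """Label each cell with the cause of a known faulty cell in the same cluster."""
    cause_of_cluster = {nearest(kpis, centroids): cause for kpis, cause in known_faults}
    return {
        name: cause_of_cluster.get(nearest(kpis, centroids), "no known fault")
        for name, kpis in cells.items()
    }

# Hypothetical KPIs per cell: (call drop rate %, call blocking rate %).
cells = {
    "A": (0.5, 0.4), "B": (0.6, 0.5),   # low drops, low blocking
    "C": (5.0, 0.6), "D": (5.5, 0.4),   # high drops
    "E": (0.7, 6.0), "F": (0.5, 6.5),   # high blocking
}
# Hypothetical reference cells whose cause was already diagnosed by experts.
known_faults = [((5.2, 0.5), "interference"), ((0.6, 6.2), "congestion")]

centroids = kmeans(list(cells.values()), [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)])
result = diagnose(cells, centroids, known_faults)
print(result)
```

In this toy setting the quality of the diagnosis depends entirely on how well the clusters separate the fault types, which is why a cluster-separation measure such as the Silhouette Coefficient is a natural way to compare clustering techniques.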
B. Partial Outage Diagnosis in Cellular Networks

Diagnostic techniques are primarily needed in mobile cellular networks for performance degradation scenarios, i.e., partial outages, which generally do not generate any alarms. The operators can define thresholds for KPI values to generate customized alarms; however, apart from being useful only for KPI degradation detection, this technique cannot help in diagnosis or root cause analysis. For this reason, partial outage diagnosis carries great importance in autonomous Self-healing solutions for SON.

Shafiq et al. [77] have presented an analysis of real-time measurements from some cells of a large mobile cellular network before, during and after two abnormally high traffic events. The results have been used to present heuristic detection and diagnosis schemes for network congestion and dropped calls during such events along with suggestions on how to rectify these problems. The authors analyze network performance measurements for call connections, link performance and data service performance, and suggest that major issues in terms of call drops and congestion occur when users access the network without coordination. While this would not pose problems during routine network operations since the network is designed to handle such traffic, it becomes an issue during major events or gatherings if additional capacity is not deployed. The analysis presented in the paper solely relies on expert knowledge to derive diagnostic inferences from the real data.

1) Partial Outage Diagnosis using Learning-based Techniques: Other than heuristic techniques, learning-based techniques have also been exploited in literature [97, 104, 114, 115, 116] for KPI degradation diagnosis.

a) Supervised Learning Techniques for Partial Outage Diagnosis: Ciocarlie et al. [104] propose to use Markov Logic Networks and Principal Component Analysis to diagnose weather-related and parameter misconfiguration-related partial outages from real network data. The proposed technique generates clusters of degraded cells using Principal Component Analysis which are then passed through a Markov Logic Network for diagnosis. The Markov Logic Network generates a sequence of events that would lead to a degradation in call drop rate, throughput or handover failures, thus leading to the diagnosis. Weights for each sequence of events in the Markov Logic Network leading to a diagnosis are initialized using expert knowledge and updated with each successful and unsuccessful diagnosis. The diagnostic results of the proposed approach have been validated against expert diagnoses. The proposed approach also relies heavily on expert knowledge to generate the event sequences used in the Markov Logic Networks.

Barco et al. [114] present a comparison of the impact of continuous versus discretized data models for auto-diagnostic systems in cellular networks using a Bayesian network classifier. The authors use β-distributions to construct continuous models from KPI data streams, and selective entropy minimization discretization [117] to construct discrete KPI models. The study uses dropped call rate, blocked call rate, handover blocking, throughput, and active neighbor set update rate KPIs to generate probability of degradation in the network given a set of symptomatic KPI distributions. The results show that continuous models exhibit nearly 10% higher diagnosis accuracy when the training set size is sufficiently large (∼2000 examples) while the discrete models are more accurate (∼20%) when the training data is sparse (∼50 examples).

The results from [114] have been used by Barco et al. [115] to propose a hybrid KPI modeling methodology called Smoothed Bayesian Networks which can decrease the sensitivity of diagnosis accuracy to imprecision in the model parameters. The posterior probabilities of the causes follow a smoother transition near the boundaries between states given their related symptoms in Smoothed Bayesian Networks than in traditional Bayesian networks. The authors compare the accuracy of diagnoses for both Smoothed Bayesian Networks and Discrete Bayesian Networks on real network data for diagnosis of call drop rate. The results suggest that Smoothed Bayesian Networks perform better by almost 10% when there is a certain degree of inaccuracy in the model brought about by sparseness in data. However, Discrete Bayesian Networks perform better on a larger dataset resulting in a more accurate KPI model.

b) Unsupervised Learning Based Solutions for Partial Outage Diagnosis: SOMs have been used frequently not only to detect KPI degradations [93, 95, 96, 97], but also to diagnose them [97, 116]. Gómez-Andrades et al. [97] have used SOM based clustering of cells in 4G networks based on call drop rate, channel interference, handover failures, received signal strength, channel quality, and throughput to diagnose the possible cause of performance degradations in the eNBs. The clustering algorithm arranges cells based on their degree of association with other degraded cells by finding the best matching unit for each cell. If a cell is experiencing KPI degradations, it will be clustered with pre-existing degraded cells with known diagnosis. The authors demonstrate that the proposed scheme can outperform rule-based algorithms and Bayesian Network Classifiers by ∼32% and ∼12% respectively but takes longer to train compared to the other two techniques. Laiho et al. [116] have proposed a similar solution to diagnose degradations in channel quality and frame error rate in 3G networks with the exception that the cells are clustered using k-means clustering. Cells are diagnosed by taking the diagnosis of the nearest known degraded cell and the results are validated using real-network data and comparing expert diagnoses with the diagnoses generated by the technique.

C. Summary and Insights

Outage diagnosis is a relatively under-explored aspect of Self-healing in mobile cellular networks compared to outage detection and compensation techniques. Part of the reason for this is the standardized fault and alarm codes that are automatically generated in the event of a full outage due to hardware/software failure. However, no such standardized diagnostics exist for partial outages. This is because the same partial outage may be caused by two different sets of circumstances. For this reason, the majority of studies on outage diagnosis use supervised learning solutions such as Bayesian
networks and Markov logic networks which can associate a probability with each known cause leading to an outage. However, the use of such solutions can be challenging in practical networks since training them would require constructing a database of every root cause resulting in an outage.

To address this issue, future studies on outage diagnosis should focus on how this database of root causes can be created without creating artificial outages. In addition, causes of full and partial outages in future 5G networks with millimeter wave cells, massive MIMO and ultra-dense cell deployment must also be explored since they are an uncharted territory as yet.

VI. OUTAGE COMPENSATION IN CELLULAR MOBILE NETWORKS

Outage compensation forms the core element of the Self-healing framework; therefore, it is no surprise that, among the three components of Self-healing, outage compensation has received the most attention from the research community. Compensation actions and algorithms are designed specifically to provide temporary service to users in case of a full outage or partial outage since both events are not immediately recoverable. While detection and diagnosis of full outage and partial outage in a mobile cellular network require different methodologies, compensatory actions for both events involve similar techniques. The majority of studies on compensation algorithms are presented as a solution for full outage but lend themselves seamlessly to compensation for partial outages.

The key principle of outage compensation is to leverage resources from neighboring cells of outage-affected cells to provide temporary services in the affected area. These resources include cell bandwidth and user associations which can be modified using primary parameters such as cell/user equipment transmit powers and antenna parameters, as well as secondary parameters such as neighbor lists and cell selection parameters [40]. In the following subsections, compensation algorithms are presented based on the optimization objective with description of their methodology of optimization along with parameters of choice and other taxonomically significant insights.

A. Coverage Area Optimization for Outage Compensation

One of the key consequences of network outages and KPI degradations is the loss of network coverage near the affected network entity. Several studies [66, 71, 118, 119, 120, 121, 122] have presented outage compensation algorithms that focus on coverage optimization. A list of these studies along with their proposed techniques is presented in Table V.

1) Choosing the right neighboring cells, optimization parameters, and recovery action: Choice of neighboring cells, optimization parameters, and recovery action plays an important role in the effectiveness of an outage compensation solution and has been investigated in [118], [119], and [120] respectively. The Self-healing framework proposed by Asghar et al. [118] defines an outage compensation algorithm that uses received power measurements from users of the outage-affected cell to create coverage polygons for neighboring cells. The algorithm then iterates through different antenna configurations of key neighboring cells with potential coverage overlap to the outage cell until coverage constraints of all users are met. Additionally, the algorithm monitors downlink throughputs and radio link failures of the neighboring cells to benchmark network recovery. A demonstration of the algorithm by the authors on real network outages shows it can effectively compensate for outages within 2 hours of their occurrence.

The outage compensation framework proposed by Amirijoo et al. [119] compares the compensation potential of different control parameters suggested in [40], i.e., reference signal power, uplink target received power level P0 and antenna tilt, in mitigating outage-induced performance degradations. An iterative algorithm is used to update the parameters of neighboring cells and their results are benchmarked. Results in terms of cell coverage and user throughput indicate that uplink target received power level P0 and antenna tilt are the most effective parameters for improving coverage, while P0 is most effective for improving throughput.

Frenzel et al. [120] discuss the choice of optimal recovery action based on three inputs, i.e., the probability of effectiveness of a solution which depends on the outage cause, the preference of the network operator for a recovery action, and the preference of the network operator for a degradation resolution. The authors propose a weighted-sum function which returns the cost of selecting a solution, action and resolution tuple. The proposed framework is flexible to changing network technology as more tuples can be added for future networks; however, the determination of probabilities and preferences requires manual input by experts.

2) Non-convex Coverage Optimization Techniques for Outage Compensation: Several studies have explored the use of non-convex optimization methods for outage compensation based on the analysis that in a large network with a diverse set of optimization parameters, outage compensation can be an NP-hard non-convex problem. Conversion of the outage compensation problem into a convex problem requires too many generalizations and assumptions which can make the result unsuitable for practical implementation. Jiang et al. [121] and Wenjing et al. [122] base their solutions on this premise and use non-convex optimization techniques to solve the problem of coverage optimization.

Jiang et al. [121] have proposed a cost function minimization approach which uses a weighted sum of downlink channel quality and received signal strength. The authors state that the problem is a large scale non-convex optimization problem. Outage compensation is carried out by calculating the optimal uplink target received power P0 using a non-convex optimization technique called immune algorithm [123] for cost function maximization. The authors show that the immune algorithm improves both coverage and channel quality after optimization and can converge in a very short time period. The results, compared against two other techniques [124], [125], show that the proposed methodology can significantly improve coverage post-optimization by 10% without significantly sacrificing cell edge throughput. However, it is observed that the immune algorithm is highly sensitive to initial parameters, i.e., it may
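TABLE V: Qualitative Comparison of Coverage Optimization Algorithms for Compensation

Solution               Reference  Methodology     Sub-Method               Network Topology  Performance Metrics               Control Mechanism  Direction of Control
Coverage Optimization  [118]      Heuristic       Framework                Homogeneous       Retainability, Coverage, Quality  Centralized        DL
Coverage Optimization  [119]      Heuristic       Framework                Homogeneous       Coverage, Quality                 Centralized        UL/DL
Coverage Optimization  [120]      Heuristic       Framework                Homogeneous       Coverage, Quality                 Centralized        DL
Coverage Optimization  [121]      Analytical      Non-convex Optimization  Homogeneous       Coverage, Quality                 Centralized        UL/DL
Coverage Optimization  [122]      Analytical      Non-convex Optimization  Homogeneous       Coverage, Quality                 Centralized        DL
Coverage Optimization  [66]       Learning Based  Reinforcement Learning   Homogeneous       Coverage                          Centralized        DL
Coverage Optimization  [71]       Learning Based  Reinforcement Learning   HetNet            Coverage                          Centralized        DL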

not be able to escape the infeasible solution set if initial parameters are not set correctly.

Similarly, Wenjing et al. [122] propose that the minimization of coverage holes and pilot pollution using downlink pilot powers of neighboring cells for outage compensation is also a non-convex problem. In this study, the authors propose to use a non-convex optimization technique called particle swarm algorithm [126]. Results on the analysis of the algorithm indicate that it is highly efficient in terms of execution time while also recovering over 98% of the coverage area in terms of signal strength without significantly degrading link quality. However, like the immune algorithm, the particle swarm algorithm is also highly dependent on initialization parameters for convergence.

3) Learning-based Coverage Optimization Solutions for Outage Compensation: Examples of learning-based algorithms for outage detection and diagnosis covered in the previous sections mostly employed classification and clustering techniques. However, reinforcement learning [37] represents the most effective learning-based solution for outage compensation algorithms, primarily due to its ability to identify maximum reward strategies over a learning period. One reinforcement learning-based solution for outage compensation has been proposed by Zoha et al. [66] within a complete learning-based Self-healing framework. The outage compensation component of the framework is built upon fuzzy-logic based reinforcement learning which adjusts antenna tilts and cell transmit powers to achieve the desirable compensated performance in terms of cell coverage. The compensation algorithm makes incremental or decremental step changes in optimization parameters after an outage using exploration of new rewards or exploitation of past rewards. The resulting network state from the reinforcement learning database is interpreted through the fuzzy-logic regulator as better or worse than the previous state, which then dictates the next step of the reinforcement learning algorithm. The authors demonstrate that the proposed solution can improve post-outage cell edge coverage by 5 dB while also helping to regain mean data rate to pre-outage levels.

A similar approach to [66] has been presented by Onireti et al. [71] for heterogeneous networks with the difference that the fuzzy logic component has been replaced with an actor-critic module for enabling reinforcement learning. The actor-critic module executes an exploratory or exploitative action such as changing antenna tilt or transmit power of a neighboring cell based on probability of reward learned over time. The critic then evaluates the reward associated with the action taken and updates past rewards and probabilities. The solution is compared against the one presented in [66] with results showing it improves cell coverage and channel quality, particularly for cell edge users, and brings them closer to pre-outage levels.

B. SINR Optimization for Outage Compensation

A secondary consequence of outage compensation can be the degradation of SINR of existing users in neighboring cells due to parameter reconfiguration. Therefore, some studies [124, 127, 128, 129, 130] use SINR as the objective to be optimized while including the existing and outage-affected users into the optimization process. This allows them to avoid or minimize the degradation of SINR in areas not affected by outage. Table VI lists a qualitative comparison of the studies targeting SINR optimization for outage compensation.

1) Heuristic SINR Optimization Solutions for Outage Compensation: Wang et al. [127] present a distributed heuristic outage compensation algorithm for SINR optimization in HetNets. The proposed algorithm minimizes the number of neighboring cells to be reconfigured to achieve the desired post-outage SINR. This is done by calculating an inner group of femtocells that can recover the outage-affected femtocell through reconfiguration of transmit powers, and by creating a second outer group of femtocells beyond which no further outage compensation actions can be propagated to prevent the effects of reconfigurations from rippling outwards. The authors demonstrate that the proposed technique requires fewer neighboring cells for SINR optimization compared to other solutions such as [131] while also reducing the number of cells with negative differential SINR compared to pre-outage values. However, the authors also show that as the density of the mobile cellular network increases, the grouping algorithm takes longer to converge.

While the solution in [127] endeavors to find the optimal set of compensating neighbors, the solution put forth by Amirijoo et al. [124] focuses on optimization parameters of the neighboring cells for outage compensation. The algorithm iterates
TABLE VI: Qualitative Comparison of SINR Optimization Algorithms for Compensation

Solution           Reference  Methodology     Sub-Method           Network Topology  Performance Metrics  Control Mechanism  Direction of Control
SINR Optimization  [124]      Heuristic       Rule Based           Homogeneous       Coverage, Quality    Centralized        UL/DL
SINR Optimization  [127]      Heuristic       Rule Based           HetNet            Quality              Distributed        DL
SINR Optimization  [128]      Analytical      Convex Optimization  HetNet            Quality              Distributed        DL
SINR Optimization  [129]      Learning Based  Supervised Learning  Homogeneous       Quality              Centralized        DL
SINR Optimization  [130]      Learning Based  Supervised Learning  Homogeneous       Quality              Distributed        DL

through values of uplink target received power P0 and the antenna tilts of neighboring cells in a homogeneous network. The optimal set is obtained when cell coverage can no longer be improved without affecting SINR. Results indicate that the algorithm can regain pre-outage SINR and coverage values in low network load scenarios. Moreover, the compensation potential of the solution in terms of SINR improves as the network load decreases, while quality degradation is most visible for high and medium loads.

2) Convex SINR Optimization Solution for Outage Compensation: Lee et al. [132] present an outage compensation solution using the concept of a collaborative resource allocation strategy. The solution is based on reallocation of dedicated bandwidth called Healing Channels (HCs) to provide physical channel resources to users affected by an outage. The concept has been used in associated studies for outage compensation, such as the one by Lee et al. [128], who use a fairness-aware collaborative resource allocation algorithm with the objective of maximizing the sum of logarithmic user rates. The maximization process guarantees user fairness in terms of resource allocation while maximizing user throughput, which is directly related to bandwidth and user SINR. Because the logarithm of a rate tends to negative infinity as the rate approaches zero, use of the log-rate removes the possibility of outage-affected users not being allocated any resources and ensures that the rate maximization algorithm treats all users fairly. The proposed scheme is compared with a number of competing resource allocation solutions for outage compensation including regular collaborative resource allocation [128], non-cooperative resource allocation, and the outage compensation solution for wireless sensor networks proposed in [133]. Results show that even though regular collaborative resource allocation offers nearly 10% more mean throughput gains, those gains are overshadowed by a large disparity between maximum and minimum throughput levels. On the other hand, the fairness-aware collaborative resource allocation algorithm offers a fairer throughput distribution between users.

3) Learning-Based SINR Optimization Algorithms for Outage Compensation: Saeed et al. [129], and Moysen and Giupponi [130] employ reinforcement learning techniques to optimize SINR for outage compensation. Saeed et al. [129] propose a fuzzy Q-learning algorithm for compensation of SINR loss due to outage. The algorithm configures transmit power and antenna tilts of neighboring cells iteratively using fuzzy logic control and records the rewards in terms of change in downlink SINR of affected users. The rewards are used by the reinforcement learning algorithm for learning future actions which might lead to better outage compensation in terms of overall DL SINR. Simulation results indicate that around 40% of affected users are restored to their original SINR under low load conditions. Similarly, Moysen and Giupponi [130] propose a reinforcement learning technique for adjusting neighbor cell coverage using antenna tilt and the downlink transmission power. The approach differs from the one in [129] in that the actions and rewards are calculated using the actor-critic approach discussed previously in [71] for coverage optimization instead of fuzzy logic. To make the algorithm in [130] work, each cell reserves a certain amount of frequency bandwidth for users affected by the outage. Neighboring cells are informed of this bandwidth through the inter-cell interface so that a distributed and cooperative outage compensation solution can be achieved. The algorithm modifies cell power and antenna tilts in fixed step sizes to exploit the reward of each change, which is based on the SINR of users affected by the outage. Simulation results indicate that compensation delay is around 500 ms and the approach can compensate 98% of outage users.

One key observation regarding reinforcement learning solutions is that solutions such as the ones presented in [66], [71], [129] and [131] require a considerable number of training examples, or outages, before their actions can become effective. This can make effective deployment of such solutions a challenge for mobile cellular network operators.

C. Cell Capacity Optimization for Outage Compensation

Like degradation in SINR, cell overloading is another consequence of network outages resulting from re-association of affected users to neighboring cells. Moreover, compensatory actions to achieve another objective, such as coverage optimization, can also result in overloading of neighboring cells. This can lead to users being blocked and service requests being discarded, which affects subscriber QoE. To circumvent these problems, some studies [131, 132, 134, 135, 136, 137, 138] have focused on outage compensation solutions that optimize user associations so that the load is fairly distributed among neighboring cells. Table VII presents a qualitative comparison of these studies.

1) Convex Capacity Optimization Solution for Outage Compensation: As mentioned previously, Lee et al. [132] have proposed an outage compensation solution for HetNets based on collaborative resource allocation. The authors state that users in faulty femtocells cannot be served reliably by the macro cells due to power imbalance between macro cell and small cells, and cell edge performance limitations of macro

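TABLE VII: Qualitative Comparison of Capacity Optimization Algorithms for Compensation

Solution                   | Reference | Methodology    | Sub-Method              | Network Topology | Performance Metrics    | Control Mechanism | Direction of Control
Cell Capacity Optimization | [132,134] | Analytical     | Convex Optimization     | HetNet           | Accessibility, Quality | Distributed       | DL
Cell Capacity Optimization | [135]     | Analytical     | Non-convex Optimization | Homogeneous      | Accessibility, Quality | Centralized       | DL
Cell Capacity Optimization | [136]     | Analytical     | Non-convex Optimization | HetNet           | Accessibility, Quality | Distributed       | DL
Cell Capacity Optimization | [137]     | Learning Based | Supervised Learning     | HetNet           | Accessibility, Quality | Centralized       | DL
Cell Capacity Optimization | [131,138] | Learning Based | Supervised Learning     | Homogeneous      | Accessibility, Quality | Centralized       | DL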
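
The load-balancing objective attributed to [135] in this section (the sum of squared differences between cell utilization and the average network utilization) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names, the load and capacity figures, and above all the search method. The study itself uses a genetic algorithm; the sketch swaps in an exhaustive search over user associations, which is only practical for a handful of users.

```python
from itertools import product


def utilization_objective(loads, capacities):
    """Sum of squared gaps between each cell's utilization and the network mean."""
    utils = [load / cap for load, cap in zip(loads, capacities)]
    mean_util = sum(utils) / len(utils)
    return sum((u - mean_util) ** 2 for u in utils)


def best_association(base_loads, capacities, user_demands):
    """Try every assignment of outage-affected users to neighboring cells.

    Exhaustive search stands in for the genetic algorithm of the original
    study; it returns (cost, assignment) where assignment[i] is the index of
    the cell that serves user i.
    """
    n_cells = len(base_loads)
    best = None
    for assignment in product(range(n_cells), repeat=len(user_demands)):
        loads = list(base_loads)
        for demand, cell in zip(user_demands, assignment):
            loads[cell] += demand
        cost = utilization_objective(loads, capacities)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Hypothetical neighbor cells: current load and capacity in resource units.
base_loads = [60.0, 20.0, 40.0]
capacities = [100.0, 100.0, 100.0]
# Hypothetical demand of two users left without service by the outage.
user_demands = [10.0, 10.0]

cost, assignment = best_association(base_loads, capacities, user_demands)
print(assignment, round(cost, 4))
```

With these numbers both users are sent to the least loaded cell, because that assignment leaves the three utilizations closest to their common mean. The search space grows as the number of cells raised to the number of users, which is why heuristic search becomes attractive for realistic network sizes.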

cells. Therefore, only normal small cells can support users in a faulty small cell. To this end, the reserved HCs of healthy small cells are allocated cooperatively to users of the outage-affected cell. The proposed scheme finds an adaptable set of HCs, sub-channels and power allocation to maximize network capacity through convex optimization implemented via an iterative gradient descent algorithm. The solution is quick and improves the total capacity utilization of neighboring cells by nearly 30% while also ensuring fairness in terms of user throughputs.

The collaborative resource allocation solution [132] is further extended by Lee et al. [134] to include a collaborative beamforming strategy along with HC allocation for outage compensation. The proposed cooperative beamforming strategy can be performed without power cooperation between nodes, and is also the optimal transmission strategy under individual power constraints. The proposed algorithm performs HC selection through convex optimization based on maximizing system capacity in the outage scenario, and then carries out sub-channel allocation and power allocation based on an iterative algorithm. The proposed solution is compared against several resource allocation schemes including regular collaborative resource allocation, equal power allocation [133] and multi-user iterative water filling [139] schemes, with the results showing that for 10 HCs, the proposed algorithm improves the average cell capacity by 5% and user fairness by 10%.

2) Non-convex Capacity Optimization Solutions for Outage Compensation: As already discussed, a diverse set of problem constraints and parameters can result in the outage compensation problem becoming non-convex. To solve these problems, researchers must resort to non-convex optimization methods. One such solution presented by Xia et al. [135] uses a genetic algorithm [32] to solve the capacity optimization problem for outage compensation. The problem objective is to minimize the sum of squared differences between the capacity utilization of a compensated cell and the average network capacity utilization in a homogeneous network. In this study, the genetic algorithm searches over the user association sets including users affected by the outage to find the set that minimizes the capacity utilization objective. Results show that the proposed methodology can improve average resource utilization by at least 5% compared to non-optimized cell capacity utilization. The key advantage of using genetic algorithms is their insensitivity to the initialization point and their ability to get out of the non-feasible zones in the solution set. However, as the size of a system grows larger, the genetic algorithm takes longer to converge.

Rohde and Wietfeld [136] propose to use probabilistic network performance estimation to compensate network outages through ad-hoc deployment of relays mounted on unmanned aerial vehicles (UAVs). Aerial relays can help to exploit unused local capacities of nearby macro cells which cannot be used optimally for connectivity by users or ground-based relays when no line-of-sight link is available. The proposed algorithm builds probabilistic estimation models of interference and throughputs through iterative modification of relay positions to achieve stable cell loads. The authors have compared results using 1 to 6 aerial relays at different distances from the outage cell under stationary user locations, with results showing that as the number of relays increases and the distance from the outage cell center decreases, average resource utilization on neighboring cells decreases.

3) Learning-Based Capacity Optimization Solutions for Outage Compensation: Aráuz and McClure [137] utilize probabilistic graphical models derived from Bayesian Networks to detect sleeping cells in HetNets and compensate for their outage. Probabilistic graphical models are used to predict user distribution in the outage-affected cell as well. The approach also allows the categorization of incoming load based on the user distribution and the active cell load without the need to store lengthy baseline data. Each neighboring cell of the faulty cell arranges the predicted load probabilities in increasing order and decides the expansion of its coverage. The authors report that the probabilistic graphical model can successfully predict the expected user distribution and incoming loads for the majority of cases, which results in total coverage recovery in 91.1% of the cases with just two sectors cooperating by expanding their footprint. Total recovery is reported for 96% of the cases with three sectors cooperating. The key advantage of the proposed approach is that instead of using all neighboring sites or sectors it can yield substantial recovery using only two or three neighboring sectors.

In another study based on supervised learning, Tiwana et al. [131] use statistical learning with constrained optimization for outage compensation. The study utilizes logistic regression to extract the functional relationships between the noisy KPIs, including file transfer time, blocked call rate and dropped call rate, and cell resource utilization. These relationships are then processed by an optimization engine to calculate the optimized resource allocation which improves the KPIs of a degraded cell. The process is iterative and converges to the optimum value in a few iterations, which makes it suitable for large mobile cellular networks. Results using Monte Carlo

TABLE VIII: Qualitative Comparison of Spectral Efficiency Optimization Algorithms for Compensation

Solution                         | Reference | Methodology | Sub-Method                   | Network Topology | Performance Metrics | Control Mechanism | Direction of Control
Spectral Efficiency Optimization | [140]     | Analytical  | Convex Optimization          | HetNet           | Quality             | Distributed       | DL
Spectral Efficiency Optimization | [141]     | Analytical  | Game Theory                  | HetNet           | Quality             | Distributed       | DL
Spectral Efficiency Optimization | [142]     | Analytical  | Multi-objective Optimization | HetNet           | Quality             | Centralized       | DL
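
The α-fair scheduling trade-off described for [138] in this part of the survey can be made concrete with a small sketch. It uses the standard α-fair utility from the fairness literature, U(r) = r^(1−α)/(1−α) for α ≠ 1 and U(r) = ln r for α = 1. Whether [138] uses exactly this parameterization is an assumption, and the two rate allocations below are invented for illustration.

```python
import math


def alpha_fair_utility(rate, alpha):
    """Standard alpha-fair utility of a single user's rate (rate must be > 0)."""
    if alpha == 1:
        return math.log(rate)
    return rate ** (1 - alpha) / (1 - alpha)


def total_utility(rates, alpha):
    """Network utility: the sum of per-user alpha-fair utilities."""
    return sum(alpha_fair_utility(r, alpha) for r in rates)


# Two hypothetical allocations (Mbit/s) for three users.
skewed = [10.0, 1.0, 1.0]    # higher total throughput, two users nearly starved
balanced = [3.5, 3.5, 3.5]   # lower total throughput, equal shares


def preferred(alpha):
    """Name of the allocation with the larger alpha-fair network utility."""
    if total_utility(skewed, alpha) > total_utility(balanced, alpha):
        return "skewed"
    return "balanced"


for alpha in (0, 1, 2):
    print(alpha, preferred(alpha))
```

At α = 0 the utility reduces to the sum of rates, so the allocation with the larger total throughput wins. At α = 1 it becomes the sum of logarithmic rates (proportional fairness), and the equal-share allocation is preferred despite its lower total.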

simulations indicate a 44% improvement in blocked call rate and an approximately 26% improvement in file transfer time.

The algorithm in [131] has been extended by Tiwana [138] to utilize α-fair packet scheduling for radio resource allocation at neighboring cells for outage compensation. At α = 0, the scheduler acts as a max-throughput scheduler, whereas at α = 1, the scheduler becomes proportional fair. Changing the value of α allows a compromise between higher capacity (higher throughput for its mobile users) and greater coverage (serving a higher number of users concurrently). The results indicate that for α = 1.3, the average blocked call rate decreases by 61%, which is a gain of 17% compared to the scheme in [131], while average bit rate falls by 4%. However, for α = 0.8, the average bit rate increases by 3% while blocked call rate falls by 5%.

D. Spectral Efficiency Optimization for Outage Compensation

Spectral efficiency is the ratio of data rate to the used bandwidth and depends on factors which include user distribution, interference, neighboring cell load, geographical SINR distribution, topology, spectrum reuse, modulation schemes, and the number of data links between the communicating nodes, among others. Therefore, spectral efficiency is heavily dependent on the outage compensation actions and has been used as the optimization objective in several studies [140, 141, 142], which are presented below while their qualitative comparison is given in Table VIII.

The physical implementation of HCs, described in [132], has been discussed by Lee et al. [140] for outage compensation. The study assumes that indoor base stations or small cells can support scalable bandwidths which can be used to compensate users affected by outage in neighboring small cells. Furthermore, it is shown that the maximum spectral efficiency in the event of an outage is achieved when the minimum number of HCs, predetermined by an indoor central unit, is assigned to support users covered by the outage-affected cell. The proposed technique achieves the largest average cell capacity and user fairness in terms of spectral efficiency when compensating cells can be selected by affected users opportunistically for each HC, which is called the multi-cell diversity effect.

Fan and Tian [141] employ game theory to address outage compensation in HetNets. The authors propose a resource allocation scheme in which data transmission can be done cooperatively by the cells. Similar to the approach in [134], channel allocation and cooperation is done at sub-channel level, i.e., by splitting the bandwidth of healthy cells for the purpose of compensating users affected by the outage. The problem is formulated as a rate maximization coalition game with weights for individual users and is solved using an equal power allocation strategy. Once coalitions are formed between users and compensating cells, the authors use Lagrangian multipliers to solve for the optimal power set with the objective function of maximizing rate over a coalition. The approach requires users to go through multiple iterations of cell coalitions until the Pareto-optimal coalition is found, which may require significant time.

Finally, He et al. [142] present a multi-objective optimization based approach for outage compensation in Cloud-RAN architecture. The optimization objective is the weighted sum of spectral efficiency of edge users of outage-affected remote radio units, and average spectral efficiency of users in outage and compensating remote radio units. Optimization parameters, i.e., antenna tilt of adjacent remote radio units, are adjusted to expand the coverage in an online-iterative manner. The algorithm is designed to maximize spectral efficiency of compensating cells and users affected by the outage but does not guarantee global maximization. Results show that the solution can recover spectral efficiency of users affected by an outage by 90%.

E. Summary and Insights

A review of techniques for outage compensation in Self-healing mobile cellular networks suggests four basic metrics are targeted in the event of an outage. These are: 1) coverage area, 2) SINR, 3) cell capacity/load, and 4) spectral efficiency. The optimization of these metrics is suitable for legacy mobile cellular networks. However, future 5G cellular networks will be more complex and QoE-focused. This means that outage compensation solutions of the future will have to focus on more than just these basic metrics. Some examples of potential metrics which will be important in 5G cellular networks include energy efficiency, service latency, and throughput fluctuations [38].

Ensuring service latency by itself will be a major challenge for network operators in 5G mobile cellular networks due to the complex nature of these networks. A review of outage compensation studies suggests that the most popular techniques for outage compensation are convex and non-convex optimization. Both of these techniques are computationally expensive and require far more time than would be acceptable in a 5G network. Furthermore, as these networks become

denser, and the number of tunable parameters increases, the optimization process will get slower and more complex. Thus, one of the foremost challenges for future outage compensation solutions will be to reduce the time it takes for an optimization algorithm to reach its solution. Exploring trade-offs between different metrics for outage compensation in 5G networks will also be an interesting future area of study.

Another important research area in terms of outage compensation solutions is their integration into the larger SON framework. The SON framework includes Self-optimization techniques, which often use the same parameters as outage compensation techniques. For example, coverage and capacity optimization solutions use transmit powers, antenna tilts and beamforming parameters which are also key for outage compensation techniques, as evidenced by the review of studies above. To avoid conflicts over these shared parameters, network operators will need to incorporate a Self-coordination entity to resolve them. Additionally, coordination will be important to avoid the triggering of Self-optimization as a result of some outage compensation action. For example, changing the azimuth of a cell to provide coverage to subscribers of a cell affected by a full outage might trigger coverage and capacity optimization in a neighboring cell. This could, in turn, trigger a cascade of changes in neighboring cells. While some studies have proposed the use of exclusion zones to reduce the impact of outage compensation on other cells [127], this area needs further research.

Finally, like existing outage detection and outage diagnosis techniques, outage compensation techniques do not incorporate technologies such as massive MIMO and millimeter wave spectrum utilization. To enable Self-healing in 5G networks, more solutions must be explored which focus on these technologies, making this a key area of research.

VII. CHALLENGES AND FUTURE PROSPECTS IN SELF-HEALING FOR 5G AND BEYOND

In order for future 5G mobile cellular networks to achieve the desired gains laid out by the research and standardization community [38], SON solutions must play a far greater role than ever before [12]. This means that future mobile cellular networks must be intelligent, proactive, knowledge-rich and interactive at the same time. To achieve this goal, researchers must develop solutions which enable the network to achieve self-reliance, and harness the power of vast quantities of data generated by the users and network nodes to empower such solutions. However, Self-healing in future mobile cellular networks must cope with several research challenges, which are discussed below.

Challenge 1: Coping with increased number of conventionally undetectable outages arising from SON conflicts

SON functions deployed independently can potentially come into conflict with each other. A list of potential parametric SON conflicts has been presented in [143]. Similarly, [144] identifies the types of potential SON conflicts that may occur in the network when multiple SON functions are deployed concurrently. A consequence of these conflicts is parametric misconfiguration, which can lead to degradation in user QoE. While a number of studies, including but not limited to [143] and [144], have proposed solutions for coordination of SON functions, the general approach utilized for coordination is reactive rather than proactive in nature. While this may be feasible in existing 4G and legacy networks, it cannot be the way forward in 5G mobile cellular networks.

Possible Solution and Future Research Direction: In order to proactively overcome outages due to parametric misconfigurations in 5G mobile cellular networks, the Self-healing framework may benefit from the ability to predict when a parametric misconfiguration might occur and take preventive measures to rectify it. One method of doing that is to explore the probabilistic reliability behavior of SON functions. This can be done by exploring techniques such as hidden Markov prediction models, as explored in [46].

Using hidden Markov models we can calculate the stationary probability of a parameter being misconfigured given a sequence of parametric reconfigurations. This allows us to analyze the long-term reliability behavior of SON-enabled mobile cellular networks to estimate the time of first occurrence of misconfigurations and the fraction of time the network spends in outage. Fig. 6 shows a Markovian SON coordination framework that can project the effect of activation of SON functions on the overall performance of the network. Such a solution can also be used to identify the selection priorities of proactive SON functions as well as their network parameters.

Challenge 2: Coping with increased outages due to increased network density

Network densification, driven by the need to meet capacity and data rate requirements of 5G mobile cellular networks, means that future mobile cellular networks will have to handle far more network nodes than before. Higher cell densities coupled with technologies such as millimeter wave spectrum utilization, and more configurable parameters will result in frequent network outages driven by both parametric misconfigurations and routine equipment failures as demonstrated in Figs. 2 and 3.

Possible Solution and Future Research Direction: A number of research areas have been highlighted in recent studies that can aid in dealing with network outages quickly and efficiently, especially in the context of dense and ultra-dense HetNets. One such approach is the control-data separation architecture (CDSA) [69] where the control functionality lies with macro cells while data transmission is handled by small cells. This adds redundancy to the network architecture. For example, in the event of a small cell failure, the macro cell can handle both control and data transmissions to the affected users.

Furthermore, with the development of UAV technology for enabling 5G mobile cellular networks, UAV-based outage compensation techniques, such as the one presented in [136], can become ubiquitous. Additionally, the decreasing cost of small cell deployment will mean network densification itself can be used to create redundancies within the network such that the UE-to-cell ratio becomes less than 1. This will mean that in the event of a small cell failure, there will be additional small

Fig. 6: Markovian SON coordination framework
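
The long-run quantities that the survey associates with a Markovian view of SON reliability (the stationary probability of being misconfigured and the time to the first misconfiguration) can be sketched on a toy model. This is a deliberate simplification and not the hidden Markov model of [46]: the states are assumed to be directly observable, there are only two of them, and the transition probabilities `p_fail` and `p_fix` are invented for illustration.

```python
def stationary_distribution(P, iterations=10_000, tol=1e-12):
    """Power iteration: apply the transition matrix until the vector stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# States: 0 = parameter correctly configured, 1 = misconfigured (outage).
p_fail = 0.02   # hypothetical chance that a reconfiguration step misconfigures
p_fix = 0.50    # hypothetical chance that a misconfiguration is corrected per step

P = [
    [1 - p_fail, p_fail],
    [p_fix, 1 - p_fix],
]

pi = stationary_distribution(P)
fraction_in_outage = pi[1]               # long-run share of steps spent misconfigured
mean_steps_to_first_fault = 1 / p_fail   # mean of the geometric first-passage time

print(round(fraction_in_outage, 4), mean_steps_to_first_fault)
```

For a two-state chain the stationary outage probability has the closed form p_fail / (p_fail + p_fix), which the power iteration reproduces. A coordination framework could compare such figures across candidate SON function activations, although how Fig. 6 does this in detail is not specified in the text.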

cells ready to serve the users without affecting their QoE. Network densification will play an especially significant role in the context of millimeter wave cells, where coverage will be limited to line-of-sight links and outages due to link obstruction will be frequent.

Challenge 3: Coping with sparsity of data due to smaller number of users per cell

With network densification, another challenge arises in the form of data sparsity due to fewer users per cell. This will make full outage detection and partial outage detection extremely difficult since there will not be enough measurements to accurately distinguish between cell edge users and outage scenarios. Moreover, even though the expected throughput per user will increase, decreasing user density per cell will mean fewer users will consume more data, hence data sparsity will remain an issue for Self-healing in 5G mobile cellular networks.

Possible Solution and Future Research Direction: As we saw in Section IV, the overwhelming majority of full outage detection and partial outage detection solutions relied on machine learning techniques. However, unlike analytical or heuristic techniques, learning-based algorithms are overwhelmingly dependent on data from the network, which can be sparse especially in the case of ultra-dense small cell deployment. To improve the accuracy of learning-based outage detection solutions and to counter data sparsity in future mobile cellular networks, measurement prediction techniques can be used. Predictive techniques such as the Grey prediction model [72], and smoothing techniques such as Witten-Bell smoothing [145] and Good-Turing smoothing [146] can be used to remove knowledge gaps in the measurement data.

Challenge 4: Meeting 5G latency requirements in Self-healing

5G mobile cellular networks are expected to have end-to-end data latency of 1 ms. This means that any Self-healing solution deployed in the network must be able to detect, diagnose and compensate any outage in far less time than state-of-the-art solutions.

Possible Solution and Future Research Direction: Given the nature of detection and compensation tasks within the Self-healing framework, future Self-healing solutions must be proactive in nature. This implies that the Self-healing framework will predict when and where an outage might occur with some probability, and execute changes in neighboring cells proactively. Despite the seemingly random nature of outages, especially full outages, outage prediction is possible and has been demonstrated by Kumar et al. [147], who have used different machine learning techniques such as neural networks, NBC and SVM to predict the next fault from real network data. Similarly, Kogeda and Agbinya [148] have predicted fault occurrences by collecting past data and calculating the maximum likelihood of the next fault location using Bayesian Network prediction models.

All of the above-mentioned techniques rely on exploitation of big data [12] to identify key patterns in cell and user performance data and to associate this information with previous outage information and data. This will allow the proactive Self-healing algorithms to identify changes in network performance that lead to a failure or an outage. Fig. 7 illustrates the concept of exploiting big data resources for prediction of faults in a future mobile cellular network. The definition of big data in the context of the Self-healing framework includes historical fault data, user transition and handover data, network traffic and cell load data, and contextual data mined from sources such as social media.
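
Two ideas from this part of the survey, smoothing sparse data (Challenge 3) and estimating where the next fault is likely to occur from past fault records (Challenge 4), can be combined in a minimal sketch. It is a simplification under stated assumptions: the fault log and cell names are hypothetical, the estimator is a plain frequency model rather than the Bayesian network of [148], and the unigram form of Witten-Bell smoothing is used only to keep cells with no recorded faults at a non-zero probability. How [145] would be applied to radio measurements in practice is not specified in the text.

```python
from collections import Counter


def witten_bell(counts, outcomes):
    """Unigram Witten-Bell smoothing over a fixed, known set of outcomes.

    Seen outcomes get c / (N + T); the reserved mass T / (N + T) is shared
    equally among the Z unseen outcomes, where N is the number of
    observations and T the number of distinct outcomes observed.
    """
    n_obs = sum(counts[o] for o in outcomes)
    seen = [o for o in outcomes if counts[o] > 0]
    n_seen, n_unseen = len(seen), len(outcomes) - len(seen)
    if n_obs == 0:
        # No data at all: fall back to a uniform distribution.
        return {o: 1.0 / len(outcomes) for o in outcomes}
    if n_unseen == 0:
        # Nothing to reserve mass for: plain relative frequencies.
        return {o: counts[o] / n_obs for o in outcomes}
    probs = {}
    for o in outcomes:
        if counts[o] > 0:
            probs[o] = counts[o] / (n_obs + n_seen)
        else:
            probs[o] = n_seen / (n_unseen * (n_obs + n_seen))
    return probs


# Hypothetical cells and a short, sparse fault history.
cells = ["cell_A", "cell_B", "cell_C", "cell_D"]
fault_log = ["cell_A", "cell_B", "cell_A", "cell_A", "cell_B"]

probs = witten_bell(Counter(fault_log), cells)
most_likely = max(cells, key=lambda c: probs[c])

print(most_likely, {c: round(p, 3) for c, p in probs.items()})
```

With five recorded faults over two of the four cells, the unsmoothed estimate would declare faults in the other two cells impossible. The smoothed estimate instead reserves 2/7 of the probability mass for them, which is the kind of gap-filling the survey has in mind for sparse measurement data.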

Fig. 7: Proactive Self-Healing Framework for Future Cellular Networks

Challenge 5: Meeting QoE requirements in Self-healing

The combination of requirements for 5G mobile cellular networks including low latency, high capacity, high throughput and low energy consumption means 5G networks will be user QoE centric compared to legacy networks which were user quality of service centric. This implies that meeting user QoE requirements will be the utmost priority in future mobile cellular networks, even in the event of an outage. Given that outages due to failures and parameter misconfigurations are likely to increase, meeting user QoE will be a key challenge for Self-healing solutions.

Possible Solution and Future Research Direction: The solution to meeting user QoE requirements despite outages is to deploy an intelligence-rich proactive Self-healing framework such as the one shown in Fig. 7. The user-centricity of the framework will be driven by spatio-temporal user activity models. These include user mobility models derived from user transition data in the form of MDT reports [52] along with user location information which can easily be harvested from the positioning sensors inside modern cell phones. Additionally, user behavior load prediction models can be generated using machine learning techniques shown in Fig. 8, while contextual data from social media sources such as Twitter and Facebook can be mapped to network topology, which would help to identify potential traffic hotspots and failures. Historical fault data collection can be done by setting up databases that would include network failure records as well as the KPI data immediately preceding the failure. All this information will be fed to the proactive fault prediction algorithms, which would sit alongside a reactive Self-healing triggering algorithm which monitors fault data from the live network.

Challenge 6: Coping with bandwidth constraints for Self-healing

Bandwidth constraints are one of the greatest limiting factors for mobile cellular network capacity. Limited bandwidth means extra capacity can only be added by adding more cells into the network. However, as discussed previously, network densification can lead to a rise in network outages itself. Furthermore, bandwidth limitation becomes even more acute in the event of an outage, when already strained neighboring cell resources can become completely choked, causing partial outages.

Possible Solution and Future Research Direction: While millimeter wave spectrum utilization has been promoted as the primary solution to bandwidth limitation [106], it is still in exploratory phases. In addition, the limited range of millimeter wave cells does not make them ideal candidates for outage compensation solutions unless they are deployed in very high densities. One possible solution to the issue of bandwidth limitation for Self-healing is to deploy spectrum sensing or cognitive radio solutions [18, 19]. Some outage compensation solutions based on spectrum splitting have been proposed in [132, 134, 140, 141], but these solutions propose to reserve special bandwidth called Healing Channels (HCs) specifically for outage compensation. Given that mobile cellular networks are already facing bandwidth shortage, this approach may not be suitable, especially when there are no outages. To avoid dedicating bandwidth for outage compensation, cognitive radio technologies can be explored to split the spectrum between HCs and normal bandwidth specifically in the event of an outage. Not only would this improve radio resource utilization under normal circumstances, it can also improve the service provided to outage-affected users by assigning them low-interference resources.

Challenge 7: Enabling Self-healing with future 5G services

Future 5G mobile cellular networks will be a combination of a multitude of services including legacy call, text and data services as well as Internet of Things services such as connected homes and smart grids. Each of these services has its own requirements. For example, providing wireless connectivity to smart grids does not require very high data rates but data security and robustness are highly important [149]. As

Fig. 8: Machine Learning Tools to Enable Proactive Self-Healing Framework for Future Cellular Networks

discussed in Section VI, existing studies on Self-healing only address how legacy services such as data transmission and call connectivity would be restored in the event of an outage and do not tackle other services expected to be part of 5G networks.

Possible Solution and Future Research Directions: Self-healing for future services such as Internet of Things connectivity is still an open research topic despite being flagged as one of the primary challenges to the technology [150]. Similarly, Self-healing with respect to smart grids has been raised as a key issue [151]. Use of mobile cellular networks to empower smart grids has been a long-standing concept [149]. However, due to the differences in performance level requirements for different services, the task of coming up with a unified Self-healing solution is very difficult. Some studies have proposed to use cognitive radio technologies to provide the required performance levels in smart grids [152], which means they can also be a potentially useful tool in restoring performance levels in the event of an outage in the mobile cellular network within the unified Self-healing framework.

VIII. CONCLUSION

Self-healing is potentially the most powerful SON component in terms of reducing mobile cellular network operational expenses, especially in future networks. However, to date, a comprehensive study of the existing literature on Self-healing techniques for cellular networks had not been carried out. This study is an attempt to rectify this issue through a complete background review of Self-healing in terms of mobile cellular networks along with a description of the complete Self-healing framework. Moreover, we have presented the methodologies, topologies, design metrics and control mechanisms employed in the reviewed studies, along with their descriptions. We have also surveyed the studies in each of the three Self-healing framework components, i.e., outage detection, diagnosis and compensation in the event of a failure or KPI degradation.

In addition to the review of existing literature supporting Self-healing for mobile cellular networks, this study presents and elaborates the challenges faced by Self-healing functions in terms of future 5G mobile cellular networks while also presenting possible solutions and future research directions. It is hoped that this survey and the prospective research areas presented within it will empower and encourage researchers to create Self-healing solutions for future mobile cellular networks that can address the limitations of existing research.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grant Numbers 1559483, 1619346 and 1730650.

REFERENCES

[1] Capgemini, “Quest for Margins: Operational Cost Strategies for Mobile Operators in Europe,” Telecom Media and Insights, no. 42, 2009.
[2] A. Networks, “Top Ten Pain Points of Operating Networks,” Report, 2011.
[3] P. Donegan, “Mobile Network Outages and Service Degradations,” Heavy Reading, Report, 2016.
[4] 3GPP, “TR 36.902 - V9.3.1 Evolved Universal Terrestrial Radio Access Network (E-UTRAN); Self-configuring and self-optimizing network (SON) use cases and solutions,” 2011.
[5] 3GPP, “TS 32.541 - V10.0.0 Telecommunication management; Self-Organizing Networks (SON); Self-healing concepts and requirements,” 2011.
[6] 3GPP, “TS 32.522 - V11.7.0 Telecommunication management; Self-Organizing Networks (SON) Policy Network Resource Model (NRM) Integration Reference Point (IRP) Information Service (IS),” 2013.
[7] CircleID, “Misconfiguration brings down entire .se domain in Sweden. [Online] www.circleid.com/posts/misconfiguration_brings_down_entire_se_domain_in_sweden.”
[8] R. Johnson, “More details on today’s outage. [Online] http://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919.”

[9] N. Bhushan, J. Li, D. Malladi, R. Gilmore, D. Brenner, A. Damnjanovic, R. Sukhavasi, C. Patel, and S. Geirhofer, “Network densification: the dominant theme for wireless evolution into 5G,” IEEE Communications Magazine, vol. 52, no. 2, pp. 82–89, 2014.
[10] S.-I. Yang, D. M. Frangopol, and L. C. Neves, “Service life prediction of structural systems using lifetime functions with emphasis on bridges,” Reliability Engineering and System Safety, vol. 86, no. 1, pp. 39–51, 2004.
[11] D. Turner, K. Levchenko, J. C. Mogul, S. Savage, and A. C. Snoeren, “On failure in managed enterprise networks,” HP Labs HPL-2012-101, 2012.
[12] A. Imran, A. Zoha, and A. Abu-Dayya, “Challenges in 5G: how to empower SON with big data for enabling 5G,” IEEE Network, vol. 28, no. 6, pp. 27–33, 2014.
[13] Z. Yin, X. Ma, J. Zheng, Y. Zhou, L. N. Bairavasundaram, and S. Pasupathy, “An empirical study on configuration errors in commercial and open source systems,” in Proc. 23rd ACM Symposium on Operating Systems Principles, 2011, pp. 159–172.
[14] O. G. Aliu, A. Imran, M. A. Imran, and B. Evans, “A Survey of Self Organisation in Future Cellular Networks,” IEEE Communications Surveys and Tutorials, vol. 15, no. 1, pp. 336–361, 2013.
[15] M. Peng, D. Liang, Y. Wei, J. Li, and H. H. Chen, “Self-configuration and self-optimization in LTE-advanced heterogeneous networks,” IEEE Communications Magazine, vol. 51, no. 5, pp. 36–45, May 2013.
[16] I. F. Akyildiz, W.-Y. Lee, and K. R. Chowdhury, “CRAHNs: Cognitive radio ad hoc networks,” Ad Hoc Networks, vol. 7, no. 5, pp. 810–836, 2009.
[17] I. F. Akyildiz, W.-Y. Lee, M. C. Vuran, and S. Mohanty, “NeXt generation/dynamic spectrum access/cognitive radio wireless networks: A survey,” Computer Networks, vol. 50, no. 13, pp. 2127–2159, 2006.
[18] I. F. Akyildiz, W.-Y. Lee, M. C. Vuran, and S. Mohanty, “A survey on spectrum management in cognitive radio networks,” IEEE Communications Magazine, vol. 46, no. 4, pp. 40–48, April 2008.
[19] F. Akhtar, M. H. Rehmani, and M. Reisslein, “White space: Definitional perspectives and their role in exploiting spectrum opportunities,” Telecommunications Policy, vol. 40, no. 4, pp. 319–331, 2016.
[20] Z. Zhang, K. Long, and J. Wang, “Self-organization paradigms and optimization approaches for cognitive radio technologies: A survey,” IEEE Wireless Communications, vol. 20, no. 2, pp. 36–42, 2013.
[21] D. Ghosh, R. Sharman, H. Raghav Rao, and S. Upadhyaya, “Self-healing systems - survey and synthesis,” Decision Support Systems, vol. 42, no. 4, pp. 2164–2185, 2007.
[22] H. Psaier and S. Dustdar, “A survey on self-healing systems: approaches and systems,” Computing, vol. 91, no. 1, pp. 43–73, 2011.
[23] L. Paradis and Q. Han, “A Survey of Fault Management in Wireless Sensor Networks,” Journal of Network and Systems Management, vol. 15, no. 2, pp. 171–190, 2007.
[24] M. Marwangi, N. Fisal, S. Yusof, R. A. Rashid, A. S. Ghafar, F. A. Saparudin, and N. Katiran, “Challenges and practical implementation of self-organizing networks in LTE/LTE-Advanced systems,” in Proc. International Conference on Information Technology and Multimedia (ICIM), 2011, pp. 1–5.
[25] 3GPP, “TS 32.111-1 - V13.0.0 Telecommunication management; Fault Management; Part 1: 3G fault management requirements,” 2016.
[26] P. Stuckmann, Z. Altman, H. Dubreil, A. Ortega, R. Barco, M. Toril, M. Fernandez, M. Barry, S. McGrath, G. Blyth, P. Saidha, and L. M. Nielsen, “The EUREKA Gandalf project: monitoring and self-tuning techniques for heterogeneous radio access networks,” in Proc. IEEE 61st Vehicular Technology Conference (VTC), 2005, pp. 2570–2574.
[27] L. Schmelz, J. Van Den Berg, R. Litjens, K. Zetterberg, M. Amirijoo, K. Spaey, I. Balan, N. Scully, and S. Stefanski, “Self-organisation in wireless networks use cases and their interrelation,” in Proc. Wireless World Res. Forum Meeting, vol. 22, pp. 1–5.
[28] A. Imran, “Quality of Service Aware Energy Efficient Self Organizing Future Cellular Networks.” [Online]. Available: http://qson.org/
[29] R. Litjens, F. Gunnarsson, B. Sayrac, K. Spaey, C. Willcock, A. Eisenblatter, B. G. Rodriguez, and T. Kurner, “Self-management for unified heterogeneous radio access networks,” in Proc. IEEE 77th Vehicular Technology Conference (VTC), 2013-Spring, pp. 1–5.
[30] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
[31] R. M. Lewis, V. Torczon, and M. W. Trosset, “Direct search methods: then and now,” Journal of Computational and Applied Mathematics, vol. 124, no. 1, pp. 191–207, 2000.
[32] D. Whitley, “A genetic algorithm tutorial,” Statistics and computing, vol. 4, no. 2, pp. 65–85, 1994.
[33] P. J. Van Laarhoven and E. H. Aarts, Simulated annealing. Springer, 1987, pp. 7–15.
[34] J. F. Nash Jr., “The bargaining problem,” Econometrica: Journal of the Econometric Society, pp. 155–162, 1950.
[35] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Machine learning: An artificial intelligence approach. Springer Science and Business Media, 2013.
[36] K. P. Murphy, Machine learning: A probabilistic perspective. MIT Press Cambridge, 2012.
[37] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press Cambridge, 1998, vol. 1.
[38] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong, and J. C. Zhang, “What will 5G be?” IEEE Journal on selected areas in communications, vol. 32, no. 6, pp. 1065–1082, 2014.
[39] 3GPP, “TS 32.450 - V13.0.0 Telecommunication management; Key Performance Indicators (KPI) for Evolved Universal Terrestrial Radio Access Network (E-UTRAN): Definitions,” 2016.
[40] M. Amirijoo, L. Jorguseski, T. Kurner, R. Litjens, M. Neuland, L. Schmelz, and U. Turke, “Cell outage management in LTE networks,” in Proc. 6th International Symposium on Wireless Communication Systems, 2009. ISWCS, 2009, pp. 600–604.
[41] Q. Liao, M. Wiczanowski, and S. Stańczak, “Toward cell outage detection with composite hypothesis testing,” in Proc. IEEE International Conference on Communications (ICC), 2012, pp. 4883–4887.
[42] C. M. Mueller, M. Kaschub, C. Blankenhorn, and S. Wanke, “A cell outage detection algorithm using neighbor cell list reports,” in Proc. International Workshop on Self-Organizing Systems, pp. 218–229.
[43] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and regression trees. CRC press, 1984.
[44] R. O. Duda and P. E. Hart, Pattern classification and scene analysis. John Wiley, 1973.
[45] M. Alias, N. Saxena, and A. Roy, “Efficient Cell Outage Detection in 5G HetNets Using Hidden Markov Model,” IEEE Communications Letters, vol. 20, no. 3, pp. 562–565, 2016.
[46] L. Rabiner and B. Juang, “An introduction to hidden Markov models,” IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, 1986.
[47] P. Szilágyi and S. Novaczki, “An automatic detection and diagnosis framework for mobile communication systems,” IEEE Transactions on Network and Service Management, vol. 9, no. 2, pp. 184–197, 2012.
[48] S. Chernov, M. Cochez, and T. Ristaniemi, “Anomaly detection algorithms for the sleeping cell detection in LTE networks,” in Proc. IEEE 81st Vehicular Technology Conference (VTC), 2015-Spring, pp. 1–5.
[49] K. Fukunaga and P. M. Narendra, “A branch and bound algorithm for computing k-nearest neighbors,” IEEE Transactions on Computers, vol. 100, no. 7, pp. 750–753, 1975.
[50] T. Kohonen, “The self-organizing map,” Neurocomputing, vol. 21, no. 1, pp. 1–6, 1998.
[51] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proc. 20th Annual Symposium on Computational Geometry, pp. 253–262.
[52] 3GPP, “TS 37.320 - V10.0.0 Universal Terrestrial Radio Access (UTRA) and Evolved Universal Terrestrial Radio Access (E-UTRA); Radio measurement collection for Minimization of Drive Tests (MDT); Overall description; Stage 2,” 2010.
[53] J. Zhang, X. Tuo, Z. Yuan, W. Liao, and H. Chen, “Analysis of fMRI data using an integrated principal component analysis and supervised affinity propagation clustering approach,” IEEE Transactions on Biomedical Engineering, vol. 58, no. 11, pp. 3184–3196, 2011.
[54] Y. Ma, M. Peng, W. Xue, and X. Ji, “A dynamic affinity propagation clustering algorithm for cell outage detection in self-healing networks,” in Proc. IEEE Wireless Communications and Networking Conference (WCNC), 2013, pp. 2266–2270.
[55] P. K. Velamuru, R. A. Renaut, H. Guo, and K. Chen, “Robust clustering of positron emission tomography data,” Joint Interface CSNA, 2005.
[56] F. Chernogorov, J. Turkka, T. Ristaniemi, and A. Averbuch, “Detection of sleeping cells in LTE networks using diffusion maps,” in Proc. IEEE 73rd Vehicular Technology Conference (VTC), 2011-Spring, pp. 1–5.
[57] R. R. Coifman and S. Lafon, “Diffusion maps,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 5–30, 2006.
[58] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[59] F. Chernogorov, T. Ristaniemi, K. Brigatti, and S. Chernov, “N-gram analysis for sleeping cell detection in LTE networks,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 4439–4443.
[60] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.
[61] Z. He, X. Xu, and S. Deng, “Discovering cluster-based local outliers,” Pattern Recognition Letters, vol. 24, no. 9, pp. 1641–1650, 2003.
[62] A. Zoha, A. Imran, A. Abu-Dayya, and A. Saeed, “A Machine Learning Framework for Detection of Sleeping Cells in LTE Network,” in Proc. Machine Learning and Data Analysis Symposium.
[63] A. Zoha, A. Saeed, A. Imran, M. A. Imran, and A. Abu-Dayya, “A SON solution for sleeping cell detection using low-dimensional embedding of MDT measurements,” in Proc. IEEE 25th Annual International Symposium on Personal, Indoor, and Mobile Radio Communication (PIMRC), 2014, pp. 1626–1630.
[64] J. B. Kruskal and M. Wish, Multidimensional scaling. Sage, 1978, vol. 11.
[65] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local outliers,” in ACM Sigmod Record. ACM, pp. 93–104.
[66] A. Zoha, A. Saeed, A. Imran, M. A. Imran, and A. Abu-Dayya, “A learning-based approach for autonomous outage detection and coverage optimization,” Transactions on Emerging Telecommunications Technologies, 2015.
[67] J. Weston and C. Watkins, “Multi-class support vector machines,” Citeseer, Report, 1998.
[68] S. Chernov, M. Pechenizkiy, and T. Ristaniemi, “The influence of dataset size on the performance of cell outage detection approach in LTE-A networks,” in Proc. 10th International Conference on Information, Communications and Signal Processing (ICICS), 2015, pp. 1–5.
[69] A. Mohamed, O. Onireti, M. A. Imran, A. Imran, and R. Tafazolli, “Control-data separation architecture for cellular radio access networks: A survey and outlook,” IEEE Communications Surveys and Tutorials, vol. 18, no. 1, pp. 446–465, 2016.
[70] O. Onireti, A. Imran, M. A. Imran, and R. Tafazolli, “Cell outage detection in heterogeneous networks with separated control and data plane,” in Proc. 20th European Wireless Conference, 2014, pp. 1–6.
[71] O. Onireti, A. Zoha, J. Moysen, A. Imran, L. Giupponi, M. A. Imran, and A. Abu-Dayya, “A cell outage management framework for dense heterogeneous networks,” IEEE Transactions on Vehicular Technology, vol. 65, no. 4, pp. 2097–2113, 2016.
[72] D. Julong, “Introduction to grey system theory,” The Journal of Grey System, vol. 1, no. 1, pp. 1–24, 1989.
[73] W. Wang, J. Zhang, and Q. Zhang, “Cooperative cell outage detection in self-organizing femtocell networks,” in Proc. IEEE International Conference on Computer Communications (INFOCOM), 2013, pp. 782–790.
[74] W. Xue, H. Zhang, Y. Li, D. Liang, and M. Peng, “Cell outage detection and compensation in two-tier heterogeneous networks,” International Journal of Antennas and Propagation, vol. 2014, 2014.
[75] I. A. Karatepe and E. Zeydan, “Anomaly detection in cellular network data using big data analytics,” in Proc. 20th European Wireless Conference, 2014, pp. 1–5.
[76] Apache, “http://hadoop.apache.org/.”
[77] M. Z. Shafiq, L. Ji, A. X. Liu, J. Pang, S. Venkataraman, and J. Wang, “Characterizing and optimizing cellular network performance during crowded events,” Biological Cybernetics, vol. 24, no. 3, pp. 1308–1321, 2016.
[78] S. Novaczki and P. Szilagyi, “Radio channel degradation detection and diagnosis based on statistical analysis,” in Proc. IEEE 73rd Vehicular Technology Conference (VTC), 2011-Spring, pp. 1–2.
[79] A. D’Alconzo, A. Coluccia, F. Ricciato, and P. Romirer-Maierhofer, “A distribution-based approach to anomaly detection and application to 3G mobile traffic,” in Proc. IEEE Global Telecommunications Conference (GLOBECOM), 2009, pp. 1–8.
[80] J. M. Joyce, Kullback-leibler divergence. Springer, 2011, pp. 720–722.
[81] M. Z. Asghar, R. Fehlmann, and T. Ristaniemi, “Correlation-based cell degradation detection for operational fault detection in cellular wireless base-stations,” in Proc. International Conference on Mobile Networks and Management, pp. 83–93.
[82] P. Muñoz, R. Barco, I. Serrano, and A. Gómez-Andrades, “Correlation-Based Time-Series Analysis for Cell Degradation Detection in SON.”
[83] J. Sanchez-Gonzalez, O. Sallent, J. Pérez-Romero, R. Agusti, M. Díaz-Guerra, J. A. Moreno, and D. Paul, “A new methodology for RF failure detection in UMTS networks,” in Proc. IEEE Network Operations and Management Symposium, 2008, pp. 718–721.
[84] P. Kumpulainen, M. Särkioja, M. Kylväjä, and K. Hätönen, “Analysing 3G radio network performance with fuzzy methods,” Neurocomputing, vol. 107, pp. 49–58, 2013.
[85] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
[86] R. Sattiraju, P. Chakraborty, and H. D. Schotten, “Reliability analysis of a wireless transmission as a repairable system,” in Proc. Globecom Workshops (GC Wkshps), 2014, pp. 1397–1401.
[87] G. Ciocarlie, U. Lindqvist, K. Nitz, S. Nováczki, and H. Sanneck, “On the feasibility of deploying cell anomaly detection in operational cellular networks,” in Proc. IEEE Network Operations and Management Symposium (NOMS), 2014, pp. 1–6.
[88] G. F. Ciocarlie, U. Lindqvist, S. Nováczki, and H. Sanneck, “Detecting anomalies in cellular networks using an ensemble method,” in Proc. 9th international conference on Network and service management (CNSM), pp. 171–174.
[89] F. J. Massey Jr., “The Kolmogorov-Smirnov test for goodness of fit,” Journal of the American statistical Association, vol. 46, no. 253, pp. 68–78, 1951.
[90] R. B. Cleveland, W. S. Cleveland, and I. Terpenning, “STL: A seasonal-trend decomposition procedure based on loess,” Journal of Official Statistics, vol. 6, no. 1, p. 3, 1990.
[91] G. A. Barreto, J. C. M. Mota, L. G. M. Souza, R. A. Frota, and L. Aguayo, “Condition monitoring of 3G cellular networks through competitive neural models,” IEEE Transactions on Neural Networks, vol. 16, no. 5, pp. 1064–1075, 2005.
[92] C.-C. Hung, “Competitive learning networks for unsupervised training,” International Journal of Remote Sensing, vol. 14, no. 12, pp. 2411–2415, 1993.
[93] R. A. Frota, G. A. Barreto, and J. Mota, “Anomaly detection in mobile communication networks using the self-organizing map,” Journal of Intelligent and Fuzzy Systems, vol. 18, no. 5, pp. 493–500, 2007.
[94] M. Tanaka, M. Sakawa, I. Shiromaru, and T. Matsumoto, “Application of Kohonen’s self-organizing network to the diagnosis system for rotating machinery,” in Proc. IEEE International Conference on Systems, Man and Cybernetics/Intelligent Systems for the 21st Century, 1995, pp. 4039–4044.
[95] P. Lehtimäki and K. Raivio, “A SOM based approach for visualization of GSM network performance data,” in Proc. International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 588–598.
[96] P. Kumpulainen and K. Hätönen, “Local anomaly detection for mobile network monitoring,” Information Sciences, vol. 178, no. 20, pp. 3840–3859, 2008.
[97] A. Gómez-Andrades, P. Muñoz, I. Serrano, and R. Barco, “Automatic root cause analysis for LTE networks based on unsupervised techniques,” IEEE Transactions on Vehicular Technology, vol. 65, no. 4, pp. 2369–2386, 2016.
[98] T.-W. Lee, Independent component analysis. Springer, 1998, pp. 27–66.
[99] J. H. Ward Jr., “Hierarchical grouping to optimize an objective function,” Journal of the American statistical association, vol. 58, no. 301, pp. 236–244, 1963.
[100] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 224–227, 1979.
[101] J. C. Bezdek and N. R. Pal, “Some new indexes of cluster validity,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 28, no. 3, pp. 301–315, 1998.
[102] S. Rezaei, H. Radmanesh, P. Alavizadeh, H. Nikoofar, and F. Lahouti, “Automatic fault detection and diagnosis in cellular networks using operations support systems data,” in Proc. IEEE/IFIP Network Operations and Management Symposium (NOMS), pp. 468–473.
[103] D. A. Hill, L. M. Delaney, and S. Roncal, “A chi-square automatic interaction detection (CHAID) analysis of factors determining trauma outcomes,” Journal of Trauma and Acute Care Surgery, vol. 42, no. 1, pp. 62–66, 1997.
[104] G. F. Ciocarlie, C. Connolly, C.-C. Cheng, U. Lindqvist, S. Nováczki, H. Sanneck, and M. Naseer-ul Islam, “Anomaly detection and diagnosis for automatic radio network verification,” in International Conference on Mobile Networks and Management, pp. 163–176.
[105] M. Dandan, Q. Xiaowei, and W. Weidong, “Anomalous cell detection with kernel density-based local outlier factor,” China Communications, vol. 12, no. 9, pp. 64–75, 2015.
[106] T. S. Rappaport, G. R. MacCartney, M. K. Samimi, and S. Sun, “Wideband Millimeter-Wave Propagation Measurements and Channel Models for Future Wireless Communication System Design,” IEEE Transactions on Communications, vol. 63, no. 9, pp. 3029–3056, Sept 2015.
[107] R. M. Khanafer, B. Solana, J. Triola, R. Barco, L. Moltsen, Z. Altman, and P. Lazaro, “Automated diagnosis for UMTS networks using Bayesian network approach,” IEEE Transactions on Vehicular Technology, vol. 57, no. 4, pp. 2451–2461, 2008.
[108] R. Barco, V. Wille, L. Diez, and P. Lázaro, “Comparison of probabilistic models used for diagnosis in cellular networks,” in Proc. IEEE 63rd Vehicular Technology Conference (VTC), 2006-Spring, pp. 981–985.
[109] D. Heckerman and J. S. Breese, “Causal independence for probability assessment and inference using Bayesian networks,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 26, no. 6, pp. 826–831, 1996.
[110] R. Barco, P. Lázaro, V. Wille, and L. Díez, “Knowledge acquisition for diagnosis in cellular networks based on bayesian networks,” in Proc. International Conference on Knowledge Science, Engineering and Management, pp. 55–65.
[111] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Kdd, pp. 226–231.
[112] W. H. Day and H. Edelsbrunner, “Efficient algorithms for agglomerative hierarchical clustering methods,” Journal of classification, vol. 1, no. 1, pp. 7–24, 1984.
[113] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[114] R. Barco, P. Lázaro, L. Díez, and V. Wille, “Continuous versus discrete model in autodiagnosis systems for wireless networks,” IEEE Transactions on Mobile Computing, vol. 7, no. 6, pp. 673–681, 2008.
[115] R. Barco, L. Díez, V. Wille, and P. Lázaro, “Automatic diagnosis of mobile communication networks under imprecise parameters,” Expert systems with Applications, vol. 36, no. 1, pp. 489–500, 2009.
[116] J. Laiho, K. Raivio, P. Lehtimaki, K. Hatonen, and O. Simula, “Advanced analysis methods for 3G cellular networks,” IEEE Transactions on Wireless Communications, vol. 4, no. 3, pp. 930–942, 2005.
[117] U. Fayyad and K. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” 1993.
[118] M. Z. Asghar, S. Hämäläinen, and T. Ristaniemi, “Self-healing framework for LTE networks,” in Proc. IEEE 17th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), 2012, pp. 159–161.
[119] M. Amirijoo, L. Jorguseski, R. Litjens, and R. Nascimento, “Effectiveness of cell outage compensation in LTE networks,” in Proc. IEEE Consumer Communications and Networking Conference (CCNC), 2011, pp. 642–647.
[120] C. Frenzel, H. Sanneck, and B. Bauer, “Automated rational recovery selection for self-healing in mobile networks,” in Proc. International Symposium on Wireless Communication Systems (ISWCS), 2012, pp. 41–45.
[121] Z. Jiang, Y. Peng, Y. Su, W. Li, and X. Qiu, “A cell outage compensation scheme based on immune algorithm in LTE networks,” in Proc. 15th Asia-Pacific Network Operations and Management Symposium (APNOMS), 2013, pp. 1–6.
[122] L. Wenjing, Y. Peng, J. Zhengxin, and L. Zifan, “Centralized management mechanism for cell outage compensation in LTE networks,” International Journal of Distributed Sensor Networks, 2012.
[123] G. Hong and M. Zong-Yuan, “Immune algorithm,” in Proc. 4th World Congress on Intelligent Control and Automation, 2002, pp. 1784–1788.
[124] M. Amirijoo, L. Jorguseski, R. Litjens, and L.-C. Schmelz, “Cell outage compensation in LTE networks: algorithms and performance assessment,” in Proc. IEEE 73rd Vehicular Technology Conference (VTC), 2011-Spring, pp. 1–5.
[125] F. Li, X. Qiu, W. Li, and H.-L. Wan, “High load cell outage compensation method in TD-SCDMA wireless access network,” Journal of Beijing University of Posts and Telecommunications, vol. 35, no. 1, pp. 32–35, 2012.
[126] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in Proc. 6th International Symposium on Micro Machine and Human Science, 1995, pp. 39–43.
[127] W. Wang, J. Zhang, and Q. Zhang, “LOGA: Local grouping architecture for self-healing femtocell networks,” in Proc. IEEE Global Communications Conference (GLOBECOM), 2012, pp. 5136–5141.
[128] K. Lee, H. Lee, and D.-H. Cho, “Fairness-aware cooperative resource allocation for self-healing in SON-based indoor system,” IEEE Communications Letters, vol. 16, no. 7, pp. 1030–1033, 2012.
[129] A. Saeed, O. G. Aliu, and M. A. Imran, “Controlling self healing cellular networks using fuzzy logic,” in Proc. IEEE Wireless Communications and Networking Conference (WCNC), 2012, pp. 3080–3084.
[130] J. Moysen and L. Giupponi, “A Reinforcement Learning based solution for Self-Healing in LTE networks,” in Proc. IEEE 80th Vehicular Technology Conference (VTC), 2014-Fall, pp. 1–6.
[131] M. I. Tiwana, B. Sayrac, and Z. Altman, “Statistical learning in automated troubleshooting: Application to LTE interference mitigation,” IEEE Transactions on Vehicular Technology, vol. 59, no. 7, pp. 3651–3656, 2010.
[132] K. Lee, H. Lee, and D.-H. Cho, “Collaborative resource allocation for self-healing in self-organizing networks,” in Proc. IEEE International Conference on Communications (ICC), 2011, pp. 1–5.
[133] H. Lee and K. Lee, “Resource allocation considering fault management in indoor mobile-WiMAX system,” in Proc. IEEE 20th International Symposium on Personal, Indoor and Mobile Radio Communications, 2009, pp. 1492–1496.
[134] K. Lee, H. Lee, Y.-U. Jang, and D.-H. Cho, “CoBRA: Cooperative beamforming-based resource allocation for self-healing in SON-based indoor mobile communication system,” IEEE Transactions on Wireless Communications, vol. 12, no. 11, pp. 5520–5528, 2013.
[135] L. Xia, W. Li, H. Zhang, and Z. Wang, “A cell outage compensation mechanism in self-organizing RAN,” in Proc. 7th International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM), 2011, pp. 1–4.
[136] S. Rohde and C. Wietfeld, “Interference aware positioning of aerial relays for cell overload and outage compensation,” in Proc. IEEE Vehicular Technology Conference (VTC), 2012-Fall, pp. 1–5.
[137] J. Aráuz and W. McClure, “PGM structures in self-organized healing for small cell networks,” in Proc. International Conference on Selected Topics in Mobile and Wireless Networking (MoWNeT), 2013, pp. 7–12.
[138] M. I. Tiwana, “Enhancement of the Statistical Learning Automated Healing (SLAH) technique using packet scheduling,” in Proc. International Conference on Emerging Technologies (ICET). IEEE, 2012, Conference Proceedings, pp. 1–5.
[139] W. Yu, “Multiuser water-filling in the presence of crosstalk,” in Proc. Information Theory and Applications Workshop, 2007, pp. 414–420.
[140] H. Lee, H. Kim, and K. Lee, “Collaborative self-healing with opportunistic IBS selection in indoor wireless communication systems,” IEEE Communications Letters, vol. 18, no. 12, pp. 2209–2212, 2014.
[141] S. Fan and H. Tian, “Cooperative resource allocation for self-healing in small cell networks,” IEEE Communications Letters, vol. 19, no. 7, pp. 1221–1224, 2015.
[142] L. He, X. Su, J. Zeng, X. Xu, and Y. Kuang, “Automated healing approach in cloud base-station with high load using RRUs cooperation,” in Proc. IEEE Globecom Workshops (GC Wkshps), 2012, pp. 285–290.
[143] H. Y. Lateef, A. Imran, M. A. Imran, L. Giupponi, and M. Dohler, “LTE-advanced self-organizing network conflicts and coordination algorithms,” IEEE Wireless Communications, vol. 22, no. 3, pp. 108–117, 2015.
[144] H. Y. Lateef, A. Imran, and A. Abu-Dayya, “A framework for classification of Self-Organising network conflicts and coordination algorithms,” in Proc. IEEE 24th International Symposium on Personal Indoor and Mobile Radio Communications (PIMRC), 2013, pp. 2898–2903.
[145] I. H. Witten, T. C. Bell, M.-E. Harrison, M. L. James, and A. Moffat, “Textual image compression,” 1991.
[146] I. J. Good, “The population frequencies of species and the estimation of population parameters,” Biometrika, pp. 237–264, 1953.
[147] Y. Kumar, H. Farooq, and A. Imran, “Fault Prediction and Reliability Analysis in a Real Cellular Network,” in Proc. 13th International Wireless Communications and Mobile Computing Conference (IWCMC), 2017.
[148] O. P. Kogeda and J. I. Agbinya, “Proactive Cellular Network Faults Prediction Through Mobile Intelligent Agent Technology,” in Proc. 2nd International Conference on Wireless Broadband and Ultra Wideband Communications, 2007, pp. 55–55.
[149] A. Qaddus and A. A. Minhas, “Wireless communication a sustainable solution for future smart grid networks,” in Proc. International Conference on Open Source Systems Technologies (ICOSST), Dec 2016, pp. 13–17.
[150] J. A. Stankovic, “Research Directions for the Internet of Things,” IEEE Internet of Things Journal, vol. 1, no. 1, pp. 3–9, Feb 2014.
[151] M. Amin, “Challenges in reliability, security, efficiency, and resilience of energy infrastructure: Toward smart self-healing electric power grid,” in Proc. IEEE Power and Energy Society General Meeting - Conversion and Delivery of Electrical Energy in the 21st Century, July 2008, pp. 1–5.
[152] A. A. Khan, M. H. Rehmani, and M. Reisslein, “Cognitive Radio for Smart Grids: Survey of Architectures, Spectrum Sensing Mechanisms, and Networking Protocols,” IEEE Communications Surveys and Tutorials, vol. 18, no. 1, pp. 860–898, First Quarter 2016.

Ahmad Asghar (S’17) received his B.Sc. degree in Electronics Engineering from Ghulam Ishaq Khan Institute of Science and Technology, Pakistan, in 2010 and the M.Sc. degree in Electrical Engineering from Lahore University of Management and Technology, Pakistan, in 2014. Currently he is pursuing the Ph.D. degree in Electrical and Computer Engineering at the University of Oklahoma, USA, as well as contributing to multiple NSF funded studies on 5th Generation Cellular Networks. His research work includes studies on Self-Healing and Self-Coordination of Self-Organizing Functions in Future Big-Data Empowered Cellular Networks using analytical and machine learning tools.

Hasan Farooq (S’14) received his B.Sc. degree in Electrical Engineering from the University of Engineering and Technology, Lahore, Pakistan, in 2009 and the M.Sc. by Research degree in Information Technology from Universiti Teknologi PETRONAS, Malaysia, in 2014, where his research focused on developing ad hoc routing protocols for smart grids. Currently he is pursuing the Ph.D. degree in Electrical and Computer Engineering at the University of Oklahoma, USA. His research area is Big Data empowered Proactive Self-Organizing Cellular Networks, focusing on Intelligent Proactive Self-Optimization and Self-Healing in HetNets utilizing a dexterous combination of machine learning tools, classical optimization techniques, stochastic analysis and data analytics. He has been involved in the multinational QSON project on Self Organizing Cellular Networks (SON) and is currently contributing to two NSF funded projects on 5G SON. He is a recipient of the Internet Society (ISOC) First Time Fellowship Award towards the Internet Engineering Task Force (IETF) 86th Meeting held in USA, 2013.

Dr. Ali Imran (M’15) is the founding director of the Big Data Enabled Self-Organizing Networks Research Lab (www.bsonlab.com) at The University of Oklahoma. His current research interests include Self-Organizing Wireless Networks (SON) functions design for enabling 5G; Big Data Enabled SON (BSON); artificial intelligence enabled wireless networks (AISON); and new RAN architectures for enabling low cost human-to-human as well as IoT and D2D communications. On these topics, he has published over 70 refereed journal and conference papers. He has given tutorials on these topics at several international conferences including IEEE ICC, WF-IoT, PIMRC, WCNC, CAMAD, European Wireless and Crowncom. He has been and is currently principal investigator on several multinational research projects focused on next generation wireless networks, for which he has secured research grants of over $3 million. He is an Associate Fellow of Higher Education Academy (AFHEA), UK; president of ComSoc Tulsa Chapter; Member of Advisory Board for Special Technical Community on Big Data at IEEE Computer Society; board member of ITERA; and Associate Editor of IEEE Access special section on heterogeneous networks.
