Source: Catholic University of America
Contributed by: Malcomb, Armelle (Author); The Catholic University of America (Degree granting institution); Lucko, Gunnar (Thesis advisor); Thompson, Rick (Committee member); Agbelie, Bismark (Committee member)
Stable URL: https://www.jstor.org/stable/community.38760519
New Universal Law: Application of Tracy-Widom Theory for Construction Network Schedule
Resilience
A DISSERTATION
School of Engineering
Doctor of Philosophy
© Copyright by
Armelle P. Malcomb
Washington, D.C.
2022
New Universal Law: Application of Tracy-Widom Theory for Construction Network Schedule
Resilience
A methodology based on random matrix theory (RMT) has been proposed to investigate the
underlying behavior of project network schedules. The approach relies on a devised mathematical
model and three premises. The first assumption demands that the probabilistic activity durations
have an identical triangular distribution with known parameters. A repetitive joint sampling of
activity durations serves to create a sample data matrix 𝑿 using the identified scheme for
translating a project network of size p into a random matrix utilizing its dependency structure
matrix. Although the joint sampling distribution was unknown, it served to draw each of the n
rows of 𝑿. The second assumption is that the Tracy-Widom (TW1) distribution is the natural
limiting distribution of the sampling that generates each row of 𝑿. Interactions between numerous parties participating
in project management and construction cause a project network schedule to fall within the class of
complex systems marked by a phase transition and a tipping point. In addition, the striking similarities
between the fields of applications of the TW distributions and those of project scheduling support
this assumption. The last assumption is that a project network schedule with sufficient correlation
in its structure, like that of complex systems, can be investigated within the framework of RMT.
This assumption is justified by the interdependence structure defined by the various pairwise links
between project activities. This assumption enabled the application of RMT's universality results,
whereby scaled eigenvalues of sample covariance matrices serve as test statistics for such a study.
As a result, a carefully engineered sample covariance matrix 𝑺^(n,p,α) was developed, and two
standardization approaches (Norm I and Norm II) for its eigenvalues were identified. Both
standardization approaches relate to the universality of the TW1 limit law, which many authors
have extended (e.g., Soshnikov 2002, Péché 2008) to a broad class of matrices that are not
necessarily Gaussian under relaxed assumptions. Although some of these assumptions have been
eased, others must still be met. Among these extra requirements, the formulation of 𝑺^(n,p,α) was
chosen. Its formulation necessitated the centering and scaling of the matrix 𝑿 consisting of n
samples of p early finish (EF) times of a project network's activities. In addition, it included the
significance level α to test the TW1 distributional assumption. The Kolmogorov-Smirnov (K-S)
goodness-of-fit test with α values of 5, 10, and 20% was found suitable for this study.
Thirty-five project networks of diverse sizes and complexity values were identified from the study's
2,040 benchmark networks obtained from the Project Scheduling Problem Library (PSPLIB). Their
sizes (resp. restrictiveness RT values) ranged from 30 to 120 activities (resp. 0.18 to 0.69). Kelley's
(1961) forward and backward passes of the critical path method (CPM) determined the EF times.
Using the devised methodology, the set of 100 simulations of network schedules yielded three
significant findings. First, the scatterplot of 100 pairs of the normalized largest eigenvalue (𝑙₁)
of 𝑺^(n,p,α) and the sample size n revealed a distinct and consistent pattern. The pattern is a concave
upward curve that steepens to the left and flattens to the right as n increases. Surprisingly, networks
of varying sizes and complexity showed the same pattern regardless of the normalization method.
Second, the deviations ∆μ of 𝑙₁ from the mean of the TW distribution (𝜇_TW) were determined using the same 100 outputs.
They enabled the graphing of scatterplots of sample size n against ∆μ. The resulting pattern
highlighted the association between n and 𝑙₁. Similarly, the deviations ∆var between the
variances of 𝑙₁ and 𝑣𝑎𝑟_TW were calculated. The resulting pattern, also consistent across
networks, helped determine an optimum sample size (𝑛_opt) that would maximize variance in a
project network schedule's sampled durations. This sample size was found at the mean deviation
curve's intersection with the horizontal axis (n-axis). One may view 𝑛_opt as the required pixel
count for high-quality printing. The size 𝑛_opt was found to be related to the network size p but not
its RT value. Moreover, an 𝑛_opt value was found for all 35 networks, and including α in the
expression of 𝑺^(n,p,α) was not necessary. Still, leaving it out resulted in higher values of 𝑛_opt.
Subsequently, the derived 𝑛_opt was used in a series of 1000 simulations to validate the
distributional assumption on activity durations. The K-S test statistics were the normalized first
through fourth-largest eigenvalues 𝑙₁, 𝑙₂, 𝑙₃, and 𝑙₄ of the matrix 𝑺^(n,p,α). Comparing
results across the normalization approaches indicated that Norm II (Baik et al. 1999, Johansson 1998)
may be better suited to studying project network scheduling behavior than Norm I (Johnstone 2001).
Under Norm I, 18 of the 35 project networks validated the null hypothesis when using
two of the normalized eigenvalues of their matrices 𝑺^(n,p,α). Norm II supported the null hypothesis for 19 of the 21
networks evaluated when using two of the normalized eigenvalues of the matrices 𝑺^(n,p,α). This discovery is significant,
perhaps expected, since Baik et al. (1999) introduced Norm II while studying the length of the
longest increasing sequence of random permutations, which was governed by a TW limit law. The
distributional results were further examined with Q-Q plots and histograms. The graphs corroborated the K-S test results that the TW1 distribution is the
limiting joint sampling distribution of project network schedules. Also, the Q-Q plots showed that a
proper normalization of the mth largest eigenvalue should improve the K-S test performance.
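As an illustration of this validation step (not the dissertation's actual code), the sketch below normalizes the largest eigenvalue of a Wishart-type sample covariance matrix with Johnstone's (2001) centering and scaling constants (Norm I) and prepares a K-S test against the TW1 law. The data here are placeholder Gaussian samples rather than CPM early-finish times, and tw1_cdf is a hypothetical stand-in for any Tracy-Widom CDF evaluator.

import numpy as np
from scipy import stats

def johnstone_constants(n, p):
    # Johnstone (2001) centering and scaling for the largest eigenvalue
    # of an n x p Wishart-type matrix (the "Norm I" standardization).
    mu = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
    sigma = (np.sqrt(n - 1) + np.sqrt(p)) * (1.0 / np.sqrt(n - 1) + 1.0 / np.sqrt(p)) ** (1.0 / 3.0)
    return mu, sigma

def normalized_l1(X):
    # Center and scale the n x p data matrix, form S = X'X, and normalize l1.
    n, p = X.shape
    Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    l1 = np.linalg.eigvalsh(Xc.T @ Xc)[-1]
    mu, sigma = johnstone_constants(n, p)
    return (l1 - mu) / sigma

rng = np.random.default_rng(0)
n, p, runs = 200, 30, 100  # placeholder sizes; the study used p = 30 to 120
l1_samples = [normalized_l1(rng.standard_normal((n, p))) for _ in range(runs)]

def tw1_cdf(x):
    # Hypothetical TW1 CDF evaluator; substitute a lookup table or package here.
    raise NotImplementedError

# stat, p_value = stats.kstest(l1_samples, tw1_cdf)
# reject_tw1 = p_value < 0.05  # alpha = 5%, as in the study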
After validating the assumed limiting distribution for the durations of project schedules, another
methodology was proposed to help design better project schedules. The intended methodology is
formulated based on the previous model and assumptions, with the matrices' eigenvalues evaluated at the
significance level α = 5%. At this α level, the TW1 distribution is the natural limiting distribution
for the sampling of durations of project activities. The suggested methodology relies on three rules
to help choose which principal components (PCs) to keep. The simulations on four networks of
various sizes yielded the following findings. First, using the scree plot rule and the proportions of each
PC to the total variance of the sample covariance matrix 𝑺^(n,p,α) or the population correlation matrix R, the
study discovered a link between 𝑛_opt and PC retention for any of the networks. In addition, the
eigenvalues of both 𝑺^(n,p,α) and R are very nearly equal. This is a significant result.
Furthermore, the investigation demonstrated that Johnstone's (2001) spiked covariance model might
forecast project network activities' limiting durations via a PCA-based linear regression model. On
scree plots, one or a few largest eigenvalues stood out from the rest. While the proportion of total
variance with an 80% cutoff criterion helped select the number of ranked eigenvalues 𝑙ᵣ
to retain as PCs, the hypothesis testing criterion based on TW p-values did not. The issue was the
unavailability of TW p-value estimates for testing eigenvalues beyond the fourth largest. Finally, the
study identified a phase transition at which project network schedules shift from stable to unstable. This discovery is
critical because it may help practitioners determine when a construction project's schedule may
become problematic. Since the empirical study involved only four networks, more study is needed.
In conclusion, while the uncovered universal pattern may not be suited for manual applications, it
can be offered as an add-on to project network scheduling applications. Doing so would aid in
simulating the necessary network schedules and determining the optimal sample size
corresponding to the tipping point associated with a project network schedule. At that point, a
project schedule may transition from a strong-coupling phase with activities in concert to a weak-
coupling phase with independent activities. In addition, because the optimal sample size
corresponds to the maximum variance in the project activity durations, it may determine the
limiting duration of each activity and the total project duration, which, if exceeded, may result in
project schedule instability with unrecoverable delays. The proposed PCA-based linear regression
model, based on Johnstone's (2001) spiked covariance model, is intended to forecast project
limiting durations. These durations may aid practitioners in predicting project schedules and costs
that are resilient. Finally, the significant discoveries of this pioneering study have resulted in
proposals for contributions to the body of knowledge and recommendations for future research.
This dissertation by Armelle P. Malcomb fulfills the dissertation requirement for the doctoral
degree in Civil Engineering and Management, approved by Gunnar Lucko, Ph.D., as Director, and
by Richard C. Thompson, Jr., Ph.D., and Bismark Agbelie, Ph.D., as Readers.
Table of Contents
Signature Page
Table of Contents
List of Figures
List of Tables
List of Equations
List of Abbreviations
Acknowledgment
Introduction
Abstract
1.1 Background
1.2 Network Schedules
1.2.1 Definitions
1.2.2 Planning a Construction Project
1.2.3 Visual Displaying a Construction Project Schedule
1.2.4 Scheduling a Construction Project
1.2.5 Network Schedule Structures
1.3 Construction Scheduling Techniques
1.3.1 PERT
1.3.2 Critical Path Method (CPM)
1.4 Vectors and Matrices
1.4.1 Vector Definitions and Operations
1.4.2 Matrix Definitions and Operations
1.5 Descriptive Statistics and Inferential Statistics
1.5.1 Prelude
1.5.2 Describing the Data: Sample and Population
List of Tables
Chapter 1 – Introduction
Table 1-1: Activity Float Types and their Mathematical Equations
Table 1-2: Logic Constraints for Link Connections between Activities in a CPM Network Schedule
Table 1-3: Tabular Presentation of n Measurements on p Variables
Table 1-4: Computations for Constructing a Q-Q Plot: Normal Distribution
Table 1-5: Kolmogorov-Smirnov Test - Critical Values between Data Sample and Hypothesized CDFs
Table 1-6: Illustrations of n and p-Values in Various Fields of Applications of Multivariate Analysis
Table 1-7: Decision Rules When Testing H1 to Reject H0 Given α
Table 1-8: Type I Error versus Type II Error
List of Equations
Chapter 3 - Application of PCA for Data Reduction in Modeling Project Network Schedules Based on the Universality Concept in RMT
Equation 3.1: General Formulation of Population Principal Components
Equation 3.2: Variance and Covariance of Population Principal Components
Equation 3.3: ith Population PC (Result 1)
Equation 3.4: Variance and Covariance of Population PCs (Result 1)
Equation 3.5: Link Between the Population Covariance Matrix and PCs
Equation 3.6: Proportion of the Total Population Variance Due to the kth PC
Equation 3.7: Correlation Coefficients between the Population PCs Y and Random Variables X
Equation 3.8: PCs for Standardized Variables
Equation 3.9: PCs for Standardized Population Variables (Matrix Notation)
Equation 3.10: Covariance of Standardized Population PCs Z
Equation 3.11: ith Standardized PC (Result 4)
Equation 3.12: Link Between Variances of Standardized Variables and PCs (Result 4)
Equation 3.13: Correlation Coefficients between Standardized Variables' PCs and Original Random Variables
Equation 3.14: Proportion of the Total Standardized Population Variance Due to the kth PC
Equation 3.15: Covariance Matrices with Special Structures (Example 1)
Equation 3.16: Covariance Matrices with Special Structures (Example 2)
List of Abbreviations
ADM Arrow diagramming method
AOA Activity-on-Arrow
AON Activity-on-Node
CDF Cumulative distribution function
CI Complexity index
CNC Coefficient of network complexity
CPM Critical Path Method
EF Early Finish
ES Early Start
ESD Empirical Spectral Distribution
FF Free float
FS Finish-to-start
FTS Finish-to-start
GOE Gaussian Orthogonal Ensemble
GPM Graphical Planning Method
GSE Gaussian Symplectic Ensemble
GUE Gaussian Unitary Ensemble
KLT Karhunen-Loeve Transform
LDM Logic Diagramming Method
LF Late finish
LS Late start
LSM Linear scheduling method
NHST Null Hypothesis Significance Testing
OE Orthogonal ensemble
OS Order of Strength
PCA Principal Component Analysis
SK Sherrington-Kirkpatrick
Acknowledgment
First and foremost, I would like to convey my heartfelt gratitude to my adviser, Dr. Gunnar Lucko,
for his guidance and support throughout this research.
I would like to convey my sincere appreciation to Dr. Richard C. Thompson, Jr., and Dr. Bismark
Agbelie for serving as readers of this dissertation.
I am grateful to the Department of Civil Engineering at The Catholic University of America for
allowing me to remotely access many computers required to run simulations at the same time.
Finally, I thank God for the intelligence and patience He has bestowed upon me to complete my
dissertation!
Abstract
This chapter aims to introduce the current research study and other background materials required
in mathematics and probability. This chapter begins by introducing projects and their associated
planning and scheduling approaches. Following that, this chapter discusses vectors and matrices
in general. The chapter next discusses descriptive and inferential statistics in greater detail,
including the prerequisites for probability distributions and random variables.
Additionally, it gives context for multivariate data and their analysis. Finally, the chapter
concludes these preliminaries.
1.1 Background
For decades, practitioners in the construction industry have struggled with project delivery delays.
The risk variables that contribute to its occurrence have been thoroughly explored. For example,
Tafazzoli and Shrestha (2017) discovered fourteen primary reasons for delays in the United States
alone after surveying more than 10,000 construction specialists participating in various projects
with multiple delivery systems and ownerships. Among them, the authors cited "unrealistic
schedules (bid durations that are too short)" and modification orders, both of which have been
linked to increased schedule growth and delayed construction projects in numerous studies
(Tafazzoli and Shrestha 2017, pp. 117 and 144). To address project schedule delays, researchers
have undertaken extensive analysis using a variety of approaches to uncover delay drivers (Stumpf
2000, Lucko et al. 2018, Bagaya and Song 2016) and propose remedies (Youcef and Andersen
2018). On the one hand, several writers have developed ways that are based on traditional
deterministic and probabilistic Critical Path Method (CPM) scheduling techniques. On the other
hand, others (e.g., BD+C Staff 2018) have investigated artificial intelligence to forecast the future of project delivery.
Additionally, resiliency in project schedules has been a goal in resolving schedule-related delays.
Han and Bogus (2011) defended their use of these approaches: "A disrupted schedule becomes
resilient when the construction workers, instead of sitting idle, are reassigned to other tasks they
otherwise were not expecting." Thus, to incorporate resiliency into construction schedules, one
must account for resource availability and interactions between the various parties involved in the
construction activities. However, due to the unpredictable nature of these interactions, particularly
as the number of activities increases, analyzing construction network schedules becomes like
studying the Cuernavaca bus problem, which exhibits a ubiquitous pattern known as universality
(Baik et al. 2006). As a result, to assist in resolving persistent delays and mitigating their
consequences during the design and construction of projects, the current study intends to adapt and
adopt methodologies based on the Tracy-Widom distribution laws, which scholars have observed
across a wide range of complex systems.
Furthermore, because network schedules are composed of construction activities, the Tracy-
Widom limit laws may aid in elucidating the underlying behavior of their complex interactions.
If so, this may help prevent delays, improve on-time delivery using resilient schedules, and propose
remedies that may aid in making more accurate forecasts of project duration and cost. Figure 1.1
depicts the operations that must be performed to accomplish the primary objective of this research
study, which is to develop realistic solutions to the delay problem in the construction of projects
based on the Tracy-Widom theory.
Given the scope of this study, this section provides only an overview of the elements employed in
construction network schedules. The following development will place the most significant emphasis
on the time or duration of each activity involved with the construction of projects.
1.2.1 Definitions
Harris (2006) defines a project as a set of operations or activities that need to be completed in a
logical order. Together, these activities define the project goal, yet they cannot be performed in
any arbitrary order (Adeli and Karim 2001). Accordingly, all project activities must
be carefully planned by considering other project activities and constraints. For instance, in vertical
construction, the activity "pour and finish concrete slab" can only occur after the activity
"formwork for concrete" has been completed. Thus, a rigorous schedule of all project
activities and their relationships is necessary to achieve the project goal, which typically involves
completing the project on time and within budget.
Network
The Free Dictionary by Farlex defines a network, also known as a net, as an interconnected system
of things or people. Depending on the things or people being connected, its meaning varies across
different fields. For instance, in computer science, any system that allows an exchange of
information between connected computers is a network, while in
biology (ecosystem), the term network refers to a system allowing random interactions between
species such as people or turtles. In construction management and engineering, project tasks or
activities that need to be executed in a specific order to complete the project successfully are known
as a network. Because their execution sequence is restricted by the project's available resources
and the construction methodology and practical considerations (Adeli and Karim 2001, p. 49), these
activities form a project network.
Construction schedule
A schedule is generally a term used to refer to a plan for either performing work or achieving an
objective by following a specific order to complete each part within the allotted time. For instance,
a bus schedule lists all departure and arrival times of buses. Similarly, a construction schedule is
"traditionally defined as the timetable of the execution of tasks in a project" (Adeli and Karim
2001). It is designed to define each party's responsibility to complete the project on time and within budget
and to make the project documents available to all parties. Such participants in the construction
industry would be the owner of the project, the contractor or executor of the project, the project's
designer, financial institutions, and the project manager (PM). In addition, the project manager
may have the scheduler's role inherent in developing and maintaining the schedule during the
project life cycle. Since scheduling a project is a painstaking task that can become cumbersome
for large projects with hundreds or thousands of activities, software such as Primavera P6 or
comparable tools is typically employed.
Developing a project schedule requires a collaborative effort between the scheduler in charge of
developing a project schedule and other project team members involved with the project's design
phase. This collaborative effort is crucial for the scheduler to become acquainted with the project
information, which is usually available to all parties involved. The project information includes
the project scope, specifications, plans, project execution plan, contracting and procurement plan,
equipment lists, and operations. Getting technical inputs and feedback from the project
development team (PDT) or subject matter experts is essential to creating sound and resilient
schedules. After understanding the project and collecting the relevant project data such as project
total duration, budget, and applicable construction methods, depending on the project complexity,
a project scheduler would either use software or an Excel spreadsheet to plan and schedule the
project.
In step 1, a project is broken down into individual activities. For this step, the work breakdown
structure (WBS) is used to dissect the project's work into smaller and logical chunks. In step 2,
the duration, predecessors, and successors of each activity are established, and calculations of its
start and finish dates are performed. In step 3, resources such as people, equipment, or materials
for the execution of each activity are allocated. For instance, during this step, also known as
resource leveling for dealing with limited resources, activity start dates may be postponed until
resources become available. In step 4, each activity's actual progress is monitored, and the original
schedule is amended if necessary. In the last step, the consumption of resources is monitored, and
the resources needed to complete the project are re-estimated (Harris 2006). Accordingly, planning and
scheduling a construction project require a
collaborative effort between the project stakeholders and a great understanding of the project's
intricacy. Since this study uses generated networks as benchmark schedules, planning a complete
project is not required except for schedule calculations performed in step 2 to determine activity
start and finish dates. This topic will be covered in a subsequent section.
Pioneered by Henry L. Gantt and Frederick W. Taylor in the early 1900s, Gantt charts are diagrams
partitioned into rows and columns drawn to schedule a project or work. Columns are used as a
timescale to represent activity durations which may be expressed in either hours, days, weeks, or
months. In contrast, project activities are scheduled in rows as horizontal bars drawn within the
timeframe of the Gantt chart. As shown in Figure 1.2, each bar's start and endpoints correspond to
the beginning and end of successive events that must be carried out to complete an activity. For
instance, the activity "pour ready mix concrete" starts with the event "delivery of the ready-mix to
the project site," followed by the event "on-site inspection and testing of the mix" to ensure the
engineered design is met, and finishes with the event "allow a proper curing of concrete." The bar
length represents the activity duration. In addition, horizontal bars can overlap, and the end point
of each horizontal bar, describing a project activity, indicates the relationship between that activity
and the following one (Uher and Zantis 2012). Due to their simplicity and ease of understanding,
Gantt charts would be appropriate for small to midsize projects or even portions of a large project.
However, Gantt charts would not be suitable for scheduling large projects that generally possess
many activities and complex relationships between activities. Moreover, according to Adeli and
Karim (2001), they "are not recommended as the primary tools for monitoring and managing
projects.”
[Figure 1.2: Example Gantt chart over a 12-day timescale with the activities: approve shop drawings by owner; mobilization; excavate soil for concrete work; prepare and install rebars and formwork; pour ready mix concrete; strip forms; clean up site/demobilization.]
Network diagrams are intensively used in different fields for scheduling or modeling purposes. In
the field of construction engineering and management, project planners and managers widely use
them to represent project schedules. They are composed of nodes, arrows, numbering or lettering
to express activity resources and activity logic constraints. Depending on whether activities are
represented on nodes or arrows to form a network, the resulting network schedule is either an
Activity-on-Node (AON) or an Activity-on-Arrow (AOA) diagram. For an AON diagram,
as illustrated in Figure 1.3(a), nodes and arrows serve to represent respectively activities and
precedence relationships between activities. Conversely, for an AOA diagram, such as the one
shown in Figure 1.3(b), nodes are used to represent events or times of importance while arrows
correspond to activities. A good example of times of importance would be the start time or finish
time of an activity. Although AON and AOA diagrams convey the same information, the AON
diagram is widely used mostly due to its compact appearance and ability to display more
information at once.
Besides Gantt charts and network diagrams, a few other forms of graphs are used to represent project
activities. Among them is the Logic Diagramming Method (LDM), a hybrid diagramming method
of the precedence diagramming method and arrow diagramming method (ADM). This method is
used in conjunction with a Graphical Planning Method (GPM) to display activity sequences and
their timing and represent connections between activities in intuitive and versatile ways. LDM
activity notation resembles ADM but is defined on a time scale, while logic links have multiple
arrowheads (Ponce de Leon 2008). Another project planning and management technique
practitioners use to display project schedules is the Linear Scheduling Method (LSM). The LSM
uses a coordinate system with a time axis and another axis to indicate the amount of work already
completed.
Literature on scheduling techniques used by practitioners in the construction industry suggests that
there are various techniques. Among them is the Program Evaluation and Review Technique
(PERT), developed for the Polaris Missile Project of the U.S. Navy in 1958 (Lucko 2017). Another
technique is the CPM, which is well known and extensively used in the industry. Although Walker
of Du Pont and Kelley of Remington Rand later achieved its full potential in the USA, the CPM
was first introduced in Great Britain sometime in the mid-1950s during the construction of a
central electricity-generating complex (Uher and Zantis 2012). As devised, PERT and CPM
perform their schedule calculations differently but both display project schedules using AON network diagrams.
A project network can be characterized by the number of its activities and precedence relationships.
Because of the information that can be derived by studying its structure, many scholars have developed various measures to
gauge network structures. These measures allow not only network comparisons but also their
classifications. For instance, Vanhoucke et al. (2008) measured the topological network structures
to evaluate and compare network generators used to generate project scheduling instances. Hence,
such measures are also employed in this study.
1.3.1 PERT
Unlike the CPM, the PERT scheduling technique is not
deterministic. In other words, durations of activities are uncertain and variable rather than fixed
and certain. PERT is equipped with a probability distribution for the activity durations. Schedules
are drawn as AON network diagrams, as shown in Figure 1.4, where each activity carries three
duration values: optimistic, typical, and pessimistic durations. These durations may be determined
at various α-percentiles, though it can be challenging to find those values. Practitioners find
reliable data from company records, experience, or industry averages. One advantage of
using PERT over the CPM scheduling technique is that one can use probability tables to answer the
following question: "What is the probability that the project is done in x days?" (Lucko 2017).
The following are durations used for the PERT scheduling technique. The first one is the optimistic
duration (a). An optimistic duration is defined as the duration of an activity if everything works
well. Thus, it represents the minimum possible duration of an activity. In general, a₀, a₁, and a₅
denote the optimistic durations at the 0, 1, and 5 percentiles. It is not likely that one has observed,
or ever will observe, the true extreme durations without revising the estimation equations. The second one is a typical duration denoted as m. It
represents the usual (most likely) duration of an activity. Its estimate may vary depending on
whether percentile definitions of a and b are being employed to find the
optimistic and pessimistic durations. The third one is the pessimistic duration, symbolized by the
letter b. The pessimistic duration is the duration of an activity if nothing works as planned.
Accordingly, it is the maximum duration of an activity and can be denoted b₁₀₀, b₉₉, b₉₅ at the 100, 99,
and 95 percentiles, respectively. The last one is termed the expected mean duration and is
denoted by 𝜇_α. Given an activity, the expected mean duration 𝜇̂ at the α-percentile is the weighted
average of a, m, and b as provided in Equation 1.1.

𝜇̂ = (a + 4m + b) / 6 [Eq. 1-1]

The expected mean duration of an entire project
is calculated by summing up the expected mean durations of all activities on the critical path.
Given an α-percentile, for any activity on the network schedule, the expected standard deviation s_α
or 𝜎̂_α at the α-percentile can be found. As defined in Equation 1.2, 𝜎̂_α represents the empirically scaled
difference between a and b, where K = 3.2 for the 5-percentile and K = 2.7 for the 10-percentile.
When α is zero, 𝜎̂_α is simply denoted by s, which is given by Equation 1.3.

𝜎̂_α = (b_α − a_α) / K [Eq. 1-2]

s = 𝜎̂₀ = (b − a) / 6 [Eq. 1-3]
With the PERT scheduling technique, one can calculate the probability of completing a project in
x days. The calculation is performed as follows. First, the critical path for minima "a"
determines the minimum project duration and its associated critical path using each activity's
minimum duration a. Second, the critical path for maxima "b" computes the maximum project
duration and its associated critical path using each activity's maximum duration b. Both
project durations are used to calculate the expected project duration and related path. Third, the
expected variance is calculated for each activity; these are then summed up to find the expected project variance and standard deviation.
The expected project standard deviation of the entire project is the square root of the expected
variance v_E, that is, s_E = √v_E. Last, knowing the project expected duration x, its associated
normalized value z may be computed using Equation 1.4. Thus, a look-up table, such as the normal
distribution lookup table, can be used to find the probability of the project being completed in x
days.

z = (x − μ) / σ = (x − Σ t_E) / s_E [Eq. 1-4]
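As a numerical illustration of these steps, the short sketch below computes the expected duration, variance, and completion probability for a single critical path under the classic PERT weights of Equation 1.1. The activity (a, m, b) triples are hypothetical.

import math
from scipy.stats import norm

# Hypothetical critical-path activities as (a, m, b) duration triples.
activities = [(2.0, 4.0, 6.0), (3.0, 5.0, 9.0), (1.0, 2.0, 3.0)]

# Expected mean duration per activity (Eq. 1-1) and its standard deviation
# using the classic 0-percentile scaling s = (b - a) / 6 (Eq. 1-3).
means = [(a + 4 * m + b) / 6 for a, m, b in activities]
variances = [((b - a) / 6) ** 2 for a, m, b in activities]

mu_project = sum(means)                # expected project duration
s_project = math.sqrt(sum(variances))  # expected project standard deviation

# Probability of completing within x days via the z-score of Eq. 1-4.
x = 13.0
z = (x - mu_project) / s_project
print(f"P(project done in {x} days) = {norm.cdf(z):.3f}")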
The ratio (b − a)/s in Equation 1.3 is susceptible to the shape of each distribution (beta, exponential,
normal, triangular, and uniform) and moderately sensitive to the location of the mode for the
triangular distribution. The normal distribution has a ratio of infinity when plotted with the mode
on the abscissa scale. Except for the normal distribution, the ratio ranges from 3.5 to 6. Conversely, the ratios
(b₉₅ − a₅)/𝜎 and (b₉₀ − a₁₀)/𝜎 respectively vary from 3.12 to 3.35 and 2.52 to 2.77 for all distributions and mode
locations considered. Pearson and Tukey, who studied 29 distributions including Pearson, log-normal,
and non-central t curves, obtained similar results. From a statistical standpoint, the use of the 5 and
95 percentiles, or the 10 and 90 percentiles, provides a method for estimating the standard
deviation that is resilient to both mode location and distribution shape fluctuations (Williams
1992).
Unlike PERT, the critical path method is deterministic in that it does not inherently
support activity durations that are distributed as random variables (Lindeburg 2011). In other words,
the CPM requires that single or fixed values be used to define the durations of activities, considered
as chunks of work necessary to complete a project. The CPM uses an AON network diagram to
display project activity information and precedence relationships using nodes and links.
Nodes show activity identifications and start and finish dates, whereas links describe the activities'
precedence relationships with other activities. Nodes are drawn as rectangular or square boxes to
include any activity scheduling information by following the general layout provided in Figure 1.5.
ID = Activity Identification
ES = Early Start Time
EF = Early Finish Time
LS = Late Start Time
LF = Late Finish Time
ST = Start Time
FT = Finish Time
TF = Total Float
D = Duration
Where:
Activity Identification (ID): an activity ID is required to identify the project activity whose
information is being provided in the box. The activity description (optional) may also be provided.
Duration (D): is defined as the normal duration of an activity under normal circumstances.
Early Start Time (ES): an activity ES time is the earliest time an activity may get started.
Early Finish Time (EF): an activity EF time is the sum of an activity ES time plus the activity
duration.
Late start time (LS): an activity LS time is the latest time at which an activity may be started
without delaying the project completion date.
Late finish time (LF): an activity LF is the sum of the activity LS and the activity D.
Total float (TF): The amount of time an activity can be delayed without delaying its successors or,
more importantly, the overall project completion. It can be computed by subtracting the sum of its
ES time and duration D from the activity LF time. When the TF equals zero, the activity is said to
be critical.
Free float (FF): The amount of time a non-critical activity can be extended without delaying the
final project completion date. It is also defined as the amount of time an activity may be postponed
without affecting the ES times of its succeeding activities (Uher and Zantis 2012).
Starting float (StF): The starting float for an activity is the difference between the activity
LS time and ES time. If the starting float for this activity equals zero, then the start of the activity
is said to be critical. In this case, any delay in the starting time of this activity would delay the
entire project. Note that an activity with a critical starting time is not necessarily a critical activity,
although a critical activity always has a critical starting time.
Finishing float (FnF): The difference between an activity's LF time and its EF time represents the
finishing float for the activity. If the float is “zero,” then the finish time of the activity is critical.
In that case, there is no flexibility in the finishing time of the activity. The activity must finish by its
finishing time for the project to complete on time. Note that a critical activity has a critical finishing
float, but the opposite is not necessarily true (Adeli and Karim 2001).
Table 1-1 provides the mathematics of the float types hereby defined since, as Kelley (1961)
pointed out, "several float measures have been tested … As a result, the following measures
…have been found useful." Those measures were: total, free, and independent floats.
Free float for activity i, which has k succeeding activities: FFᵢ = min(ESⱼ, j = 1, …, k) − EFᵢ (a characteristic of non-critical activities).
Unlike nodes, links are drawn by using any of the four different logic constraints, as shown in
Figure 1.6(a) through Figure 1.6(d). These constraints establish precedence relationships between
successive activities using activity start or finish times (T). In addition, there are two other types
of constraints besides the logic constraints that are out of the scope of this study. One of them is
the absolute constraint, used to indicate a time constraint on an activity's finish and/or start time.
The other is the buffer constraint, defined between pairs of activities to maintain a minimum buffer
while they are being executed (Adeli and Karim 2001).
Where:
Indices “i” and “j”: are used to refer respectively to the preceding and succeeding activities.
Time lags/leads (L): are positive or negative slack times allowed in the constraint. A positive L is
indicative of a lead time, while a negative L refers to a lag time. To illustrate this, one may consider
the finish-to-start relationship depicted in Figure 1.6(a). If L is a lead time, activity "j" can only
start L time units after the completion of activity "i." Else, if L is a lag time, activity "j"
can start L time units before the completion of activity "i."
As part of the CPM scheduling process, constraints need to be specified because activities cannot
be executed in any arbitrary sequence. Thus, constraints will always exist between activities,
even if they may be omitted from a project network schedule. Table 1-2 provides the mathematical
expressions of the four logic constraints, where Tᵢ and Tⱼ denote the start times of activities i and j:
Finish-to-start (FS): Tᵢ + Dᵢ + L ≤ Tⱼ
Start-to-start (SS): Tᵢ + L ≤ Tⱼ
Finish-to-finish (FF): Tᵢ + Dᵢ + L ≤ Tⱼ + Dⱼ
Start-to-finish (SF): Tᵢ + L ≤ Tⱼ + Dⱼ
Table 1-2: Logic Constraints for Link Connections between Activities in a CPM Network
Schedule
A network schedule may have more than one path of critical and/or non-critical activities. The
critical path is a path that sequentially connects all critical activities. A network may have one or
more critical paths, but their number is smaller than that of non-critical paths. Based on real projects
studied with shortest or longest durations, Kelley (1961) suggested that less than 10 percent of the
activities were critical. To explain this deduction, Kelley (1961) added, "This is probably an
illustration of Pareto’s principle….” Nevertheless, since critical activities on a critical path have
zero floats, delaying any of them would delay the entire project completion date. Also, the duration
of the critical path corresponds to the duration of the project (Adeli and Karim 2001).
The forward and backward passes each start with an appropriate numbering of nodes:
nodes are numbered consecutively, with the number assigned to the node located at the
tail of an arrow always being smaller than the one assigned to the node at the head of the
arrow.
Start with numbering all N nodes, including dummy activities, so that each number at the tail of
an arrow is always smaller than the one at the arrow’s head. Note a dummy or fictitious activity
with a zero duration may be added at the beginning or end of a project network schedule to connect
all activities that begin or end the project. Next, prepare the schedule data for use in the forward
and backward passes.
Follow the flowchart provided in Appendix C.1 (p. 409) to conduct the forward pass. This process
allows the determination of each activity's early times. In addition, it enables the calculation
of the total project duration as the maximum of the activity EF times.
Follow the flowchart provided in Appendix C.2 (p. 410) to perform the backward pass. This
process allows the determination of each activity's late times. In addition, it permits the
calculation of the total project duration (D_T) as the maximum of the activity LF times. Note that
the total project duration calculated using the activity late finish times is identical to that estimated in
the forward pass.
The network critical path is identified as the path that sequentially connects all its critical activities,
which is the path connecting all activities with no float. The activity float can be calculated
following the methodology depicted in the flowchart in Appendix C.3 (p. 411). As previously
mentioned, a network may have more than one critical path. The duration of the critical path equals
the total duration of the project. As a result, the critical path is also known as the network's longest
path.
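To make the forward and backward passes concrete, the following is a minimal sketch of CPM time calculations on a small AON network with finish-to-start links. The network and durations are hypothetical, and the flowcharts of Appendices C.1 through C.3 are condensed into two loops over a topologically ordered activity list.

# Minimal CPM forward/backward pass on an AON network (finish-to-start links).
# Activities are listed in a topological order (predecessors first).
durations = {"A": 3, "B": 5, "C": 2, "D": 4}
predecessors = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
order = ["A", "B", "C", "D"]

es, ef = {}, {}
for act in order:  # forward pass: earliest times
    es[act] = max((ef[p] for p in predecessors[act]), default=0)
    ef[act] = es[act] + durations[act]

project_duration = max(ef.values())  # duration of the longest (critical) path

successors = {a: [b for b in order if a in predecessors[b]] for a in order}
ls, lf = {}, {}
for act in reversed(order):  # backward pass: latest times
    lf[act] = min((ls[s] for s in successors[act]), default=project_duration)
    ls[act] = lf[act] - durations[act]

for act in order:
    tf = lf[act] - (es[act] + durations[act])  # total float
    label = "critical" if tf == 0 else f"TF={tf}"
    print(f"{act}: ES={es[act]} EF={ef[act]} LS={ls[act]} LF={lf[act]} ({label})")

Running this prints a zero total float for A, B, and D, identifying A-B-D as the critical (longest) path with a 12-day project duration.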
One often runs into numbers in science and engineering represented by quantities such as time,
volume, mass, length, and temperature. These numbers are called scalars and stay constant even
if their coordinate systems change. Alternatively, one may also encounter various fascinating
physical quantities equipped with magnitudes and directions (Weber and Arfken 2005). Examples
in this group include velocity, force, acceleration, electric current, and momentum. These quantities
are known as vector quantities and are characterized by both a magnitude and a direction.
For differentiation purposes, vector quantities are often written with either boldface letters (e.g.,
V) or arrows (e.g., 𝐕⃗). For this study, all vectors will be written with boldface letters except in a
few subsequent illustrations proposed to understand preliminary definitions better. The following
paragraphs define terms associated with vectors, such as vector length and the graphical and analytical
representation of a vector by its components, followed by vector operations and properties worth
noting.
Weber and Arfken (2005) define a vector as “a quantity having magnitude and direction.” Both
magnitude and direction are generally used to geometrically represent a vector V. In Figure 1.7(a),
the segment 𝐎𝐀⃗ and its magnitude or length 𝑂𝐴 serve to indicate the direction and magnitude of
the vector V. Unlike V, the segment 𝐀𝐎⃗ may serve to plot the vector -V as shown in Figure 1.7(a).
Similarly, more than one vector can be geometrically added together to form one vector given their
directions and magnitudes. Figure 1.7(b) shows the geometrical addition of a couple of vectors,
V1 and V2, defined by their segments 𝐎𝐀⃗ and 𝐀𝐁⃗ and magnitudes 𝑂𝐴 and 𝐴𝐵, to form the
resulting vector V. Figure 1.7(c) depicts the addition of four vectors to obtain the vector V.
[Figure 1.7: Geometric representation of vectors: (a) a vector V and its opposite −V; (b) addition of two vectors; (c) addition of four vectors.]
Unlike the vectors in Figure 1.7, Figure 1.8(a) depicts vector V in the x-y coordinate plane based
on its horizontal and vertical components x and y along the OX and OY axes. If the unit vectors i and j
each have a magnitude of one, and the component vectors x and y have magnitudes a and b respectively,
then Equation 1.5 represents vector V, and Equation 1.6 gives its direction α:

𝐕 = a𝐢 + b𝐣 [Eq. 1-5]

α = tan⁻¹(b/a) [Eq. 1-6]

where tan⁻¹ is the inverse of the trigonometric function "tangent," defined as the ratio of the
length of the side opposite to the angle α, that is b, to the length of the side adjacent to α, that is a.
[Figure 1.8: Components of a vector: (a) vector V in the x-y plane with direction angle α; (b) vector V in space with components x, y, z and direction angles α, β, γ.]
In space, a vector V may likewise be resolved into components x, y,
and z along three mutually perpendicular lines OX, OY, and OZ, as shown in Figure 1.8(b). If
the vectors x, y, and z have magnitudes a, b, and c as given in Equation 1.7 and are represented by
the unit vectors i, j, and k, then the magnitude |𝐕| and directions α, β, and γ of V may be determined as provided in Equation 1.8:

𝐕 = a𝐢 + b𝐣 + c𝐤 [Eq. 1-7]

|𝐕| = √(a² + b² + c²), α = cos⁻¹(a/|𝐕|), β = cos⁻¹(b/|𝐕|), γ = cos⁻¹(c/|𝐕|) [Eq. 1-8]
As with the four vectors geometrically added in Figure 1.7(c), any number of
vectors can be summed up, and the direction and magnitude of the resulting vector sum V can be
determined using Equation 1.9 below. Besides adding vectors, vectors can also be either
multiplied by scalars or multiplied among themselves to produce new vectors or scalars. Multiplying a
vector V by a scalar s produces a vector sV with the same direction as V and a magnitude s times
that of V, as in Equation 1.10.

𝐕 = 𝐕₁ + 𝐕₂ + 𝐕₃ + ⋯ = (a₁ + a₂ + a₃ + ⋯)𝐢 + (b₁ + b₂ + b₃ + ⋯)𝐣 + (c₁ + c₂ + c₃ + ⋯)𝐤 [Eq. 1-9]

s𝐕 = s(a𝐢 + b𝐣 + c𝐤) = (sa)𝐢 + (sb)𝐣 + (sc)𝐤 [Eq. 1-10]
In addition to the previous multiplication type, two vectors can be multiplied through a scalar
product, also known as a dot product. The dot product of two vectors V1 and V2, denoted 𝐕₁ • 𝐕₂,
equals the product of the magnitudes of both vectors times the cosine of the angle θ between them:
𝐕₁ • 𝐕₂ = |𝐕₁||𝐕₂| cos θ, where the angle θ is as shown in Figure 1.9(a). As a result, the dot product of
two vectors is a scalar number. In addition, the dot product is commutative and distributive over
vector addition.
Moreover, the dot product of a vector with itself is equal to the square of its magnitude. This can be expressed
as 𝐕 • 𝐕 = |𝐕|².
Using the properties of the dot product and the fact that the unit vectors i, j, and k are mutually
perpendicular in a plane or space, the following expressions can be derived: 𝐢 • 𝐢 = 𝐣 • 𝐣 = 𝐤 • 𝐤 = 1
and 𝐢 • 𝐣 = 𝐣 • 𝐤 = 𝐤 • 𝐢 = 0. Thus, the dot product 𝐕₁ • 𝐕₂ can be expressed in terms of the
components of V1 and V2, respectively. Another type of vector multiplication is called the vector
product. The vector product of V1 and V2, denoted as 𝐕₁ × 𝐕₂ and shown in Figure 1.9(b), may be computed using
Equation 1.11 below, where p is the unit vector perpendicular to the plane containing 𝐕₁ and 𝐕₂
and directed by the right-hand rule, and θ is the angle oriented from V1 to V2. Unlike the dot
product, the vector product result is another vector. In addition, the vector product is
anticommutative and distributive over vector addition: 𝐕₁ × 𝐕₂ = −(𝐕₂ × 𝐕₁) and 𝐕₁ × (𝐕₂ + 𝐕₃) = 𝐕₁ × 𝐕₂ + 𝐕₁ × 𝐕₃.
Moreover, the vector product of a vector with itself is equal to the null vector. That is, 𝐕 × 𝐕 = 𝟎.

𝐕₁ × 𝐕₂ = |𝐕₁||𝐕₂| sin θ 𝐩 [Eq. 1-11]
Through the application of the properties of the vector product and the knowledge that the unit vectors
i, j, and k are mutually perpendicular in a plane or space, the
expressions of 𝐕₁ × 𝐕₂ in the x-y plane and in space may be derived as in Equation 1.12 and Equation 1.13.

𝐕₁ × 𝐕₂ = (a₁b₂ − a₂b₁)𝐤 [Eq. 1-12]

𝐕₁ × 𝐕₂ = (b₁c₂ − b₂c₁)𝐢 − (a₁c₂ − a₂c₁)𝐣 + (a₁b₂ − a₂b₁)𝐤 [Eq. 1-13]
[Figure 1.9: (a) Dot product of V1 and V2; (b) vector product of V1 and V2; (c) vector projection.]
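These identities are easy to confirm numerically. The sketch below, with arbitrary example vectors, checks the component form of the cross product in Equation 1.13, its anticommutativity, and the null self-product.

import numpy as np

v1 = np.array([2.0, -1.0, 3.0])  # arbitrary example vectors
v2 = np.array([1.0, 4.0, -2.0])

# Dot product: the component form equals |V1||V2| cos(theta).
dot = v1 @ v2
cos_theta = dot / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Cross product via Eq. 1-13 matches numpy's implementation.
cross = np.array([v1[1] * v2[2] - v2[1] * v1[2],
                  -(v1[0] * v2[2] - v2[0] * v1[2]),
                  v1[0] * v2[1] - v2[0] * v1[1]])
assert np.allclose(cross, np.cross(v1, v2))

# Anticommutativity and the null self-product.
assert np.allclose(np.cross(v2, v1), -cross)
assert np.allclose(np.cross(v1, v1), np.zeros(3))

print("dot =", dot, "cos(theta) =", round(cos_theta, 4))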
The projection or shadow of V1 on V2, shown in Figure 1.9(c), is a vector that has the same
direction as V2 and a magnitude |𝐕₁| cos θ. The concept of projection has a great application in
mechanics, as it serves to define the moment of a force V1 about a point O located at the
intersection of the directions of both vectors, whose arm length y = |𝐕₂| sin θ enters the
vector product of the two vectors V1 and V2. In addition to the concept of vector projections,
linear combinations and linear spans of vectors are possible. A vector y is said to represent a linear
combination of the set of vectors 𝐱₁, 𝐱₂, …, 𝐱ₖ if there exist k constants such that y can be
expressed as in Equation 1.14; the set of all such linear combinations is their linear span. The set
of vectors 𝐱₁, 𝐱₂, …, 𝐱ₖ is considered linearly dependent if Equation 1.15 holds with at
least one of the constants non-null.
Otherwise, the set of vectors 𝐱₁, 𝐱₂, …, 𝐱ₖ is said to be linearly independent.

𝐲 = a₁𝐱₁ + a₂𝐱₂ + ⋯ + aₖ𝐱ₖ [Eq. 1-14]

a₁𝐱₁ + a₂𝐱₂ + ⋯ + aₖ𝐱ₖ = 𝟎 [Eq. 1-15]
The following example illustrates a set of three linearly dependent vectors, given that any one of
them can be expressed as a linear combination of the other two:

𝐱₁ = (2, 1, 0)ᵀ, 𝐱₂ = (2, 3, 1)ᵀ, 𝐱₃ = (−2, 3, 2)ᵀ, then 3𝐱₁ − 2𝐱₂ + 𝐱₃ = 𝟎 or 𝐱₃ = −3𝐱₁ + 2𝐱₂
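A quick numerical check of this dependence uses the rank of the matrix whose columns are x1, x2, and x3:

import numpy as np

x1, x2, x3 = np.array([2, 1, 0]), np.array([2, 3, 1]), np.array([-2, 3, 2])

# The combination 3*x1 - 2*x2 + x3 vanishes, so the vectors are dependent.
assert np.array_equal(3 * x1 - 2 * x2 + x3, np.zeros(3, dtype=int))

# Equivalently, the matrix with these columns has rank 2 < 3.
A = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(A))  # -> 2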
A matrix is a rectangular array of real (ℝ) or complex (ℂ) numbers in horizontal rows and vertical
columns enclosed in brackets or parentheses. The matrix entries can also be functions taking values
in ℝ or ℂ. A matrix can be described by its row and column dimensions. A matrix with m rows
and n columns is an m × n matrix (read "m by n"). While a capital letter is generally used to
name a matrix, lowercase letters with double subscripts, such as aᵢⱼ in the ith row and jth column
of A in Equation 1.16, denote the matrix entries. Below the letter depicting a matrix, the dimension
of the matrix can be indicated in parentheses, as shown in Equation 1.16. For example, matrix A's
entry a₂₁, read "a sub two one," indicates the entry in the second row and the first column. A
general term is represented by aᵢⱼ; the notation indicates the entry in row i and column j (Dawkins
2007).

𝐀 (m × n) = [aᵢⱼ], i = 1, …, m; j = 1, …, n [Eq. 1-16]
Square matrix: If the number of rows m equals the number of columns n, A is said to be a square
matrix.
Identity matrix: An identity matrix is a square matrix that leaves any conformable matrix unchanged
when the two are multiplied. As provided in Equation 1.17, its main diagonal entries are all "1" while entries
elsewhere are zeros. It is usually denoted by Iₙ or I, where n is the matrix size. The matrix I acts
in matrix multiplication like the number 1 in ordinary multiplication.

𝐈 =
[1 0 ⋯ 0]
[0 1 ⋯ 0]
[⋮ ⋮ ⋱ ⋮]
[0 0 ⋯ 1] [Eq. 1-17]
Zero matrix: Denoted by 𝟎 (m × n) or simply by 0, a zero matrix is an m × n matrix whose entries
are all "0," as its name implies. Equation 1.18 is its matrix representation.

𝟎 (m × n) =
[0 0 ⋯ 0]
[0 0 ⋯ 0]
[⋮ ⋮ ⋱ ⋮]
[0 0 ⋯ 0] [Eq. 1-18]
Triangular and diagonal matrices: A triangular matrix is a square matrix denoted by "U" for upper
triangular when all the entries below the main diagonal are zeros, and by "L" for lower
triangular when all the entries above the main diagonal are zeros. Equation 1.19 represents both
matrices. A special case of a triangular matrix is a diagonal matrix, where all off-diagonal entries
uᵢⱼ or lᵢⱼ (i ≠ j) are zeros.

𝐔 =
[u₁₁ u₁₂ u₁₃ ⋯ u₁ₙ]
[0 u₂₂ u₂₃ ⋯ u₂ₙ]
[0 0 u₃₃ ⋯ u₃ₙ]
[⋮ ⋮ ⋮ ⋱ ⋮]
[0 0 0 ⋯ uₙₙ]

𝐋 =
[l₁₁ 0 0 ⋯ 0]
[l₂₁ l₂₂ 0 ⋯ 0]
[l₃₁ l₃₂ l₃₃ ⋯ 0]
[⋮ ⋮ ⋮ ⋱ ⋮]
[lₙ₁ lₙ₂ lₙ₃ ⋯ lₙₙ] [Eq. 1-19]
Symmetric matrix: A symmetric matrix A, such as the one given in the example below, is a square
matrix whose entries satisfy aᵢⱼ = aⱼᵢ ∀ i, j.
While A in the example below is a symmetric matrix, B is not, since its entries do not all verify the above
condition.

𝐀 =
[1 0 4]
[0 3 2]
[4 2 5]

𝐁 =
[1 0 4]
[0 3 2]
[4 1 5]
Equal matrices: Two matrices A and B of the same dimension (m × n) are equal only if their
corresponding entries aᵢⱼ and bᵢⱼ are identical for all i and j, as specified in Equation 1.20.

𝐀 = 𝐁 ⟺ aᵢⱼ = bᵢⱼ, ∀ i = 1, …, m; ∀ j = 1, …, n [Eq. 1-20]
Transpose: The transpose of an (m × n) matrix A is the (n × m) matrix 𝐀ᵀ
whose elements are aⱼᵢ. As defined, the transpose of A is obtained by swapping the rows of A
into columns in a way that the first row of A becomes the first column of 𝐀ᵀ, the second row of A
becomes the second column of 𝐀ᵀ, and so on. The following is an illustration of a 2 × 3 matrix
A whose transpose 𝐀ᵀ is a 3 × 2 matrix obtained by swapping the rows of A into columns.

𝐀 =
[9 2 1]
[3 0 4]

𝐀ᵀ =
[9 3]
[2 0]
[1 4]
Matrix Scalar Multiplication: A matrix A may also be multiplied by a constant c to obtain the
product matrix cA, whose elements are obtained by multiplying each entry of A by c as provided
in Equation 1.21.

c𝐀 =
[ca₁₁ ca₁₂ ⋯ ca₁ₙ]
[ca₂₁ ca₂₂ ⋯ ca₂ₙ]
[⋮ ⋮ ⋱ ⋮]
[caₘ₁ caₘ₂ ⋯ caₘₙ] [Eq. 1-21]
Matrix Addition/Subtraction and Product: Two matrices, A and B, of the same dimension (m × n)
can be added or subtracted. The resulting matrix 𝐀 ∓ 𝐁, with entries aᵢⱼ ∓ bᵢⱼ, is of the same order.
The product AB of an (m × n) matrix A and an (n × k) matrix B is obtained by
performing the inner product of each row i of A with each column j of B, as in Equation 1.22 for the
(i, j) entry:

(𝐀𝐁)ᵢⱼ = aᵢ₁b₁ⱼ + aᵢ₂b₂ⱼ + ⋯ + aᵢₙbₙⱼ [Eq. 1-22]

Let A and B be the below-provided matrices to apply the above definition of a matrix product.
Their product matrix operations are provided to illustrate the above description regarding the
inner products of rows and columns.

If 𝐀 (3 × 3) =
[3 1 −4]
[3 0 2]
[6 2 1]

and 𝐁 (3 × 1) =
[1]
[5]
[2]

Then 𝐀𝐁 (3 × 1) =
[3(1) + 1(5) + (−4)(2)]   [0]
[3(1) + 0(5) + 2(2)]    = [7]
[6(1) + 2(5) + 1(2)]      [18]
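As a quick check of Equation 1.22 on this example:

import numpy as np

A = np.array([[3, 1, -4],
              [3, 0, 2],
              [6, 2, 1]])
B = np.array([[1], [5], [2]])

# Each entry of AB is the inner product of a row of A with the column of B.
print(A @ B)  # -> [[0], [7], [18]]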
The renowned German mathematician and philosopher Gottfried Wilhelm von Leibniz introduced
the concept of the determinant and its notation (Weber and Arfken 2005, p. 165). Given an n × n
square matrix A, the determinant function of A, denoted by det 𝐀 or |𝐀|, is the sum of signed
products defined by the cofactor expansion of Equation
1.23. Each matrix 𝐀₁ⱼ is found by entirely discarding the entries in the first row and jth column of
A. Equally, 𝐀ₖⱼ can be obtained by deletion of the kth instead of the first row.

|𝐀| = Σⱼ₌₁ⁿ a₁ⱼ |𝐀₁ⱼ| (−1)¹⁺ʲ or |𝐀| = Σⱼ₌₁ⁿ aₖⱼ |𝐀ₖⱼ| (−1)ᵏ⁺ʲ for any row k [Eq. 1-23]

For larger matrices, it can be tedious to determine det(A) manually using Equation 1.23. In general,
if A is 2 × 2, the expression of Equation 1.24 below may be used to manually compute det(A),
as performed in the following instance illustrating all operations necessary to determine the
determinant of a 3 × 3 matrix.

|𝐀| = a₁₁ ∙ a₂₂ (−1)¹⁺¹ + a₁₂ ∙ a₂₁ (−1)¹⁺² = a₁₁ ∙ a₂₂ − a₁₂ ∙ a₂₁ [Eq. 1-24]
𝐀 =
[1 2 4]
[2 −1 3]
[5 4 1]

|𝐀| = 1 ∙ det[−1 3; 4 1] ∙ (−1)² + 2 ∙ det[2 3; 5 1] ∙ (−1)³ + 4 ∙ det[2 −1; 5 4] ∙ (−1)⁴
= 1(−1 − 12) − 2(2 − 15) + 4(8 + 5)
= 1(−13) − 2(−13) + 4(13) = 5(13) = 65
Rank: The row rank of a matrix represents the maximum number of its independent rows deemed as vectors. Likewise,
the column rank of a matrix is the maximum number of its independent column vectors. Below is
an example of the determination of the row and column ranks of a matrix, based on the
illustration provided earlier on linearly dependent vectors. For the matrix A below, the columns
written as vectors are the vectors 𝐱₁, 𝐱₂, and 𝐱₃ shown earlier to be linearly dependent. Thus, the column
rank of A is two.

𝐀 =
[2 2 −2]
[1 3 3]
[0 1 2]

with 𝐱₃ = −3𝐱₁ + 2𝐱₂, which can be verified entry by entry:
3(2) − 2(2) + (−2) = 0, 3(1) − 2(3) + 3 = 0, and 3(0) − 2(1) + 2 = 0.

The row vectors are also linearly dependent, with rows 1 and 2 linearly independent. Hence, the row
rank of matrix A is two, equal to the column rank. This result is not surprising, since a matrix's row
rank and column rank are always equal.
Nonsingular and singular matrices: A square matrix A is deemed
nonsingular if its determinant is different from zero. As a result, if A is nonsingular, its columns
deemed as vectors are linearly independent, and the only solution of Equation 1.25 is the zero vector:

𝐀 (n × n) 𝐱 (n × 1) = 𝟎 (n × 1) ⟺ 𝑥₁𝒂₁ + 𝑥₂𝒂₂ + ⋯ + 𝑥ₙ𝒂ₙ = 𝟎 [Eq. 1-25]

where 𝒂ᵢ is the ith column of A. For the condition in Equation 1.25 to hold, x must be an n × 1
zero matrix. Otherwise, A is considered a singular matrix. Equally, a square matrix is deemed
singular if its determinant equals zero.
Inverse: When multiplied with a given element, an element that produces the identity element is called an
inverse of the given element. In terms of matrices, if a matrix A is a nonsingular square matrix of
order n and a matrix B satisfies Equation 1.26 below,

𝐀𝐁 = 𝐁𝐀 = 𝐈 [Eq. 1-26]

where I is an identity matrix whose matrix notation is provided by Equation 1.17, then matrix B is
called the inverse of matrix A, designated as 𝐀⁻¹. In general, given a matrix A whose entries are
aᵢⱼ, the entries of its inverse are given by Equation 1.27,

(𝐀⁻¹)ᵢⱼ = (−1)ⁱ⁺ʲ |𝐀ⱼᵢ| / |𝐀| [Eq. 1-27]

where 𝐀ⱼᵢ is the matrix resulting from deleting the jth row and ith column of A.
The expressions of Equation 1.28 and Equation 1.29 are, respectively, general formulas necessary for the manual computation of the inverses of any 2 × 2 or 3 × 3 matrices. They are worth recalling here since manual calculations of inverses of matrices become tedious and cumbersome as the order of the matrix grows.

$$\mathbf{A}_{2\times 2} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \;\; |\mathbf{A}| \neq 0 \;\;\rightarrow\;\; \mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}$$ [Eq. 1-28]

Equation 1.29, its 3 × 3 counterpart, follows from Equation 1.27 in the same manner. From the expressions of A⁻¹ given by Equation 1.28 or Equation 1.29, it is noticeable that for A⁻¹ to exist, the determinant of A as provided in Equation 1.23 must be a non-zero number. In the following example, the 2 × 2 matrix A has a determinant, computed by means of Equation 1.24, equal to 8.

$$\mathbf{A} = \begin{bmatrix} 2 & 1 \\ -2 & 3 \end{bmatrix}, \quad |\mathbf{A}| = 8 \neq 0$$

$$\mathbf{A}^{-1} = \begin{bmatrix} 3/8 & -1/8 \\ 1/4 & 1/4 \end{bmatrix}$$
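A minimal NumPy sketch confirms the determinant and inverse above, as well as the defining property of Equation 1.26; the matrix is the one from the example.

```python
import numpy as np

# 2x2 matrix from the example above
A = np.array([[ 2, 1],
              [-2, 3]])

print(np.linalg.det(A))       # ~ 8
print(np.linalg.inv(A))       # -> [[0.375, -0.125], [0.25, 0.25]]
print(A @ np.linalg.inv(A))   # ~ identity, confirming Eq. 1-26
```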
Denoted by tr(A), the trace of an n × n square matrix whose entries are a_{ij} is the sum of its diagonal elements, as provided in Equation 1.30.

$$tr(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}$$ [Eq. 1-30]
Suppose that A is an n × n matrix, x is a non-zero vector from the set of real numbers Rⁿ or complex numbers Cⁿ, and that λ is any scalar so that they are all related through Equation 1.31.

$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x} \quad\text{or}\quad (\mathbf{A} - \lambda\mathbf{I})\,\mathbf{x} = \mathbf{0}$$ [Eq. 1-31]

Then λ is called an eigenvalue of A, and x the corresponding eigenvector associated with x. Both x and its corresponding λ occur in pairs (Dawkins 2007). The set of solutions of Equation 1.31 for a symmetric matrix A yields its spectral decomposition, provided in Equation 1.32 in terms of the normalized eigenvectors u₁, ..., uₙ.

$$\mathbf{A}_{n\times n} = \lambda_1\,\mathbf{u}_1\mathbf{u}_1^{T} + \lambda_2\,\mathbf{u}_2\mathbf{u}_2^{T} + \cdots + \lambda_n\,\mathbf{u}_n\mathbf{u}_n^{T}$$ [Eq. 1-32]

$$\mathbf{u}_i^{T}\mathbf{u}_i = 1, \;\forall\, i = 1, \cdots, n \qquad\qquad \mathbf{u}_i^{T}\mathbf{u}_j = 0, \; i \neq j, \;\forall\, i, j = 1, \cdots, n$$
Characteristic equation: For A's set of eigenvectors to contain vectors other than the zero vector, the matrix (A − λI) must be singular; that is, det(A − λI) = 0, as provided in Equation 1.33. The nth degree polynomial in λ provided in this latest equation is a different representation of that determinant, given in Equation 1.34.

$$P(\lambda) = \lambda^{n} + c_{n-1}\lambda^{n-1} + \cdots + c_1\lambda + c_0$$ [Eq. 1-34]

If λ₁, λ₂, ..., λₙ is the complete list of eigenvalues of A, including repeats, any λᵢ that occurs precisely once is called a simple eigenvalue. In contrast, any λᵢ that occurs k > 1 times is said to have a multiplicity of k.
Eigenvalues and Eigenvector Properties: If A is a triangular matrix such as the one provided in Equation 1.19, then its eigenvalues λ can be found by solving Equation 1.35, where they coincide with the diagonal entries of A. Let λ₁, λ₂, ..., λₙ be the complete list of all A's eigenvalues, including repeats. Then, Equation 1.37 expresses the determinant of A as the product of its eigenvalues, det(A) = λ₁λ₂⋯λₙ. Its trace may also be computed using Equation 1.38 instead of Equation 1.30.

$$tr(\mathbf{A}) = \lambda_1 + \lambda_2 + \cdots + \lambda_n$$ [Eq. 1-38]
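A small numerical check of Equations 1.37 and 1.38 can be sketched as follows; the 3 × 3 matrix below is arbitrary and illustrative only.

```python
import numpy as np

# Illustrative symmetric matrix (any square matrix would do)
A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])

eigvals = np.linalg.eigvals(A)
print(np.isclose(eigvals.sum(),  np.trace(A)))        # Eq. 1-38: trace = sum
print(np.isclose(eigvals.prod(), np.linalg.det(A)))   # Eq. 1-37: det = product
```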
A square matrix A is orthogonal if its rows, deemed as vectors, not only have unit lengths but also are mutually perpendicular. This is translated by the reciprocal results of Equation 1.39.

$$\mathbf{A}^{T} = \mathbf{A}^{-1} \iff \mathbf{A}\mathbf{A}^{T} = \mathbf{A}^{T}\mathbf{A} = \mathbf{I}$$ [Eq. 1-39]

Given the 3 × 3 matrix A below, the lengths of its rows and columns considered as vectors can be calculated using Equation 1.8 to verify that they are all of unit length. In addition, the mutual perpendicularity of any pair of rows or columns of matrix A may also be verified through their pairwise inner products.

$$\mathbf{A} = \begin{bmatrix} \tfrac{1}{2} & -\tfrac{\sqrt{3}}{2} & 0 \\ \tfrac{\sqrt{3}}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
With all the conditions met, it can be concluded that A is an orthogonal matrix. Therefore, by calculating the inverse A⁻¹ of matrix A using Equation 1.29 and the transpose Aᵀ of A, obtained by swapping the rows of A into columns (see Page 31), one can verify Equation 1.39.

$$\mathbf{A}^{-1} = \mathbf{A}^{T} = \begin{bmatrix} \tfrac{1}{2} & \tfrac{\sqrt{3}}{2} & 0 \\ -\tfrac{\sqrt{3}}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
1.4.2.12 Singular Value Decomposition and Singular Values
Given an m × k matrix A of real numbers, there exist two orthogonal matrices U and V of orders m × m and k × k, respectively, such that A factors as in Equation 1.40.

$$\mathbf{A}_{m\times k} = \mathbf{U}_{m\times m}\,\boldsymbol{\Lambda}_{m\times k}\,\mathbf{V}_{k\times k}^{T}$$ [Eq. 1-40]

where Λ is an m × k diagonal matrix whose entries λᵢ, denoted as the singular values of A, are nonnegative and satisfy

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq 0, \quad\text{with}\quad \lambda_i = 0, \;\forall\; \min(m, k) < i \leq k$$
Another way of expressing the singular value decomposition (SVD) of matrix A is in terms of its rank "r" as a matrix expansion. Namely, there exist r real positive coefficients λ₁, λ₂, …, λᵣ; r orthogonal m × 1 unit vectors u₁, …, uᵣ; and r orthogonal k × 1 unit vectors v₁, …, vᵣ, such that A may also be expressed as in Equation 1.42 in terms of its rank r and coefficients λᵢ.

$$\mathbf{A} = \sum_{i=1}^{r} \lambda_i\,\mathbf{u}_i\mathbf{v}_i^{T} = \mathbf{U}_r\,\boldsymbol{\Lambda}_r\,\mathbf{V}_r^{T}$$ [Eq. 1-42]

where the columns of $\mathbf{U}_r$, as below provided, are called the left-singular vectors, whereas the columns of $\mathbf{V}_r$, also as below provided, are called the right-singular vectors, and $\boldsymbol{\Lambda}_r$ is an r × r diagonal matrix of the λᵢ.

$$\mathbf{U}_r = \left[\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\right] \quad\text{and}\quad \mathbf{V}_r = \left[\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r\right]$$
Moreover, the product matrix $\mathbf{A}\mathbf{A}^{T}$ has eigenvalue-eigenvector pairs $(\lambda_i^2, \mathbf{u}_i)$, such that $\mathbf{A}\mathbf{A}^{T}$ can be expanded as in Equation 1.43, where the r real and positive numbers λᵢ² satisfy the conditions of Equation 1.44.

$$\lambda_1^2 \geq \lambda_2^2 \geq \cdots \geq \lambda_r^2 > 0, \quad\text{with}\quad \lambda_i^2 = 0 \;\text{for}\; r < i \leq m$$ [Eq. 1-44]

By combining the expressions of Equation 1.40, Equation 1.42, and Equation 1.43, one may derive a different expression for the SVD of the matrix A in terms of the orthogonal left and right singular vectors, as in Equation 1.45.

$$\mathbf{v}_i = \frac{1}{\lambda_i}\,\mathbf{A}^{T}\mathbf{u}_i$$ [Eq. 1-45]

As previously mentioned, the pairs $(\lambda_i^2, \mathbf{v}_i)$ are eigenvalues and eigenvectors of the matrix product $\mathbf{A}^{T}\mathbf{A}$. Consequently, the SVD assembles "m" orthogonal eigenvectors of $\mathbf{A}\mathbf{A}^{T}$ as U's columns, "k" orthogonal eigenvectors of $\mathbf{A}^{T}\mathbf{A}$ as V's columns, and the λᵢ as the positive square roots of the nonzero eigenvalues of $\mathbf{A}^{T}\mathbf{A}$.
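These relationships are easy to verify numerically; the following minimal sketch, with an arbitrary 3 × 2 matrix assumed for illustration, checks that the squared singular values of A coincide with the eigenvalues of AᵀA.

```python
import numpy as np

# Illustrative 3x2 matrix (values are arbitrary)
A = np.array([[3., 1.],
              [2., 2.],
              [0., 1.]])

U, s, Vt = np.linalg.svd(A)              # A = U @ diag(s) @ Vt   (Eq. 1-40)
print(np.sort(s**2))                     # squared singular values, ascending ...
print(np.linalg.eigvalsh(A.T @ A))       # ... match the eigenvalues of A^T A
```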
Given a p-dimensional vector x, a p × p real and symmetric matrix M is positive definite if the square of the statistical distance (d), defined in Section 1.5.8.2, satisfies the condition provided in Equation 1.46.

$$0 < d^2 = \mathbf{x}^{T}\mathbf{M}\,\mathbf{x}, \quad \forall\, \mathbf{x} \neq \mathbf{0}$$ [Eq. 1-46]

In the above inequality, the square of the distance d is known as the quadratic form of M. Quadratic forms, written as matrix products, and distances play a vital role in multivariate analysis. One may refer to a multivariate statistical analysis book such as Johnson and Wichern (2019) for further details.
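As a brief illustration of the quadratic-form condition of Equation 1.46, the following sketch checks positive definiteness via the eigenvalues; the 2 × 2 symmetric matrix is an assumption made for the example.

```python
import numpy as np

# A symmetric matrix is positive definite when all its eigenvalues are positive,
# which is equivalent to x' M x > 0 for every nonzero x (Eq. 1-46).
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(np.all(np.linalg.eigvalsh(M) > 0))   # -> True

x = np.array([1.0, -2.0])                  # any nonzero vector
print(x @ M @ x > 0)                       # quadratic form is positive
```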
1.5.1 Prelude
Statistics give meaning to a collection of numbers or observations. For that reason, they provide context for, or meaning to, existing observations. For instance, a college professor may survey students to ascertain their satisfaction with (or dissatisfaction with) the course. Indeed, statistics inform and are ingrained in one's life, and statistics are interpretable. Researchers can collect multiple variables simultaneously, such as several genes from human populations in hundreds of locations across a continent or country. Countless measurements are often taken, and procedures for organizing, summarizing, and making sense of these measurements have been developed. These procedures constitute descriptive statistics. As defined, descriptive statistics are methods for synthesizing, organizing, and making sense of data; the results may be presented graphically, tabulated (in tables), or reported as summary statistics (single values). Usually, when referring to data, each value represents a single measurement or observation, more commonly known as a score or raw score (SAGE). Inferential statistics, by contrast, serve to draw conclusions about a population from sample observations. The analysis considers the probability that an observed difference occurred by chance. The z-test is the simplest type of statistical test; it compares a sample to the population when the variable is a measured quantity (Norman and Streiner 2003). This section discusses descriptive statistics and statistical inference components and procedures, respectively.
As a discipline, statistics deals with collecting, organizing, analyzing, explaining, and displaying
data. When dealing with data, it is typical to specify the extent of the data. For example, is it a
sample or population data? Nevertheless, this section will cover all of these aspects of statistics in the same order.
The statistical population is quite different from the mainstream population except for census data
or Gallup polls (Norman and Streiner 2003). From an investigator's viewpoint, a population, also
called the universe, is a set of things, individuals, or data from which the investigator can draw a sample. A sample, in turn, is a subset of n elements drawn from a population X. The sample size n is determined by the number of elements it contains, and its sampled values (observations) are denoted x₁, x₂, ⋯, xₙ. Figure 1.10 below illustrates these concepts.
In most studies, such as the one involving multivariate analysis, the parameters of the population
(e.g., mean and variance) are unknown. To carry out such a study, the investigator would
"randomly" draw a set of a multivariate data sample—a portion—from the population to derive
the whole population characteristics necessary to make statistical inferences. Statistical inference refers to drawing conclusions about a population based on the "information contained in a sample" (Ramachandran and Tsokos 2014, p. 5). As discussed in the
subsequent section, employing representative samples eliminates biases and errors and ensures
fairness in making statistical inferences. Hence, randomness in data sampling is critical. In other
words, randomness guarantees that potential differences arising between the data sample and its
population are due to chance alone but not to biases or something the researcher may do to
influence the experiment's outcome. The following are a couple of examples to help better illustrate the concept.
Political polls: The subset of voters polled represents a sample rather than the entire population of voters.
Laboratory experiments: Each run is executed by a technician, who records the results after each run. An experiment's final measurements are a sample of all the measurements that could conceivably be made.
Nevertheless, there are two types of population in the literature: finite and infinite. The elements or units necessary to construct a random sample are selected without replacement for a finite population. As a result, Ramachandran and Tsokos (2014, p. 181) wrote that the resulting sample "is not a random sample and [the] Xᵢ's [elements, variables, or characteristics of the population] are not i.i.d. random variables." Whereas an infinite population, having a finite population correction factor of approximately one, yields sampled elements that may be treated as i.i.d. random variables.
This section briefly describes a preliminary plan of actions a researcher must prepare to draw
samples from a population of interest. This step is crucial in preventing sampling errors and
systematic biases and preparing reliable and accurate plans. While systematic bias is an error
caused by an inadequate sampling procedure (e.g., reporting data with biases), sampling error
refers to the random fluctuations in the sample estimates about the true population parameters
(Kothari, 2004). The goal is to devise a sampling method to produce more minor sampling errors
and control systematic biases. The literature on the topic suggests that there exist various sampling
methods. Accordingly, devising a sampling plan is recommended and usually results in selecting
the appropriate sampling method for the investigation. During the planning process, researchers
often consider the following points to decide the sampling method: type of population (finite or
infinite), sample size, parameters of interest, budgetary constraints, and relevant sampling
procedures. The following paragraphs describe only three main sampling designs that are relatively common in practice. Deliberate (non-probability) sampling allows a researcher to purposefully select specific items or units of the population to form a sample. This sampling procedure (e.g., quota sampling for market surveys) is convenient but depends heavily on the investigator's judgment. Probability sampling, also known as chance sampling, is based on the law of statistical regularity to select samples, as Kothari (2004) noted.
noted. Under this law, each sample will exhibit the same composition and characteristics as the
population of interest. In addition, each sample will have an equal chance of being selected. A
sample drawn this way is called a simple random sample. Although simple random sampling seems to present great advantages, such as significantly reducing the investigator's biases and allowing relatively simple analytical computations (e.g., of the sample size given an error level), it "may not be effective in all situations" (Ramachandran and Tsokos 2014, p. 9). Hence, considering other random sampling methods, such as systematic or stratified sampling, would be more appropriate in such situations.
Systematic Sampling
Under systematic sampling, samples are extracted at evenly spaced times. The ith item in the
sampling frame – a set of elements, units, or individuals to draw from to create a representative
sample – can be chosen only after an appropriate random start for the first item. Hence, this
sampling procedure requires some order in the population items in sampling the desired fraction
needed to create a systematic sample. Courtesy of Ramachandran and Tsokos (2014), below are the steps for drawing such a sample; a small code sketch follows the list.
Step 1: Itemize from 1 to N the items in the sampling frame (e.g., N = 1000).
Step 2: Set the sample size; let n (e.g., n = 100 or 10% of the population) be that size.
Step 3: Select p = N/n (e.g., p = 10).
Step 4: Randomly select a starting item among the first p items (e.g., item 4).
Step 5: Draw each pth element thereafter (e.g., the selected items would be 4, 14, 24, ..., 994).
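The following is a minimal sketch of these steps, assuming a frame of N = 1000 numbered items; the function name systematic_sample is hypothetical.

```python
import random

def systematic_sample(frame, n):
    """Draw a systematic sample of size n from a list of frame items."""
    p = len(frame) // n                 # Step 3: sampling interval
    start = random.randint(0, p - 1)    # Step 4: random start within first p items
    return frame[start::p][:n]          # Step 5: every p-th item thereafter

# Example matching the steps above: N = 1000, n = 100, p = 10
frame = list(range(1, 1001))
print(systematic_sample(frame, 100)[:5])   # e.g., [4, 14, 24, 34, 44]
```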
Strictly speaking, a systematic sample may not entirely be comparable to a random sample. Still,
it is reasonable to be treated as one and thus considered an improvement of a simple random sample
(Ramachandran and Tsokos 2014). This is because it provides individual population elements with
an equal probability of being selected to form a sample. In addition, it gives each combination of elements an identical joint probability of being drawn. Thanks to its simplicity, systematic sampling is widely used and appropriate when the items (variables) characterizing a population appear in random order within the sampling frame.
Stratified Sampling
Stratified sampling is suitable for a heterogeneous population that cannot be handled adequately by simple random or systematic sampling (Ramachandran and Tsokos 2014). Under the stratified sampling scheme, one subdivides the population of interest into homogeneous strata or subpopulations, then samples independently from each stratum, whose sizes may vary from one another, to form a sample. To draw a stratified sample, one proceeds as follows (a short sketch is provided after the list):
Step 1: Decide on the stratification factors relevant to the research and define their criteria.
Step 2: Partition the whole population into strata or subpopulations, not necessarily of equal size, according to the chosen criteria.
Step 3: Choose the number of items to sample from each stratum using simple random or systematic sampling. Note that this number may vary from stratum to stratum.
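Below is a minimal sketch of Steps 2-3 under simple random sampling within strata; the stratification rule and the function name stratified_sample are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum):
    """Sample independently within each stratum (sizes may differ by stratum)."""
    strata = defaultdict(list)
    for u in units:                        # Step 2: partition into strata
        strata[stratum_of(u)].append(u)
    sample = []
    for name, members in strata.items():   # Step 3: simple random sample per stratum
        sample += random.sample(members, n_per_stratum[name])
    return sample

# Hypothetical example: stratify the integers 0..999 by parity
units = range(1000)
print(len(stratified_sample(units, lambda u: u % 2, {0: 10, 1: 5})))  # -> 15
```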
Stratified samples are widely used and are appropriate in situations when at least one of the strata has a small prevalence in the population. Besides providing information on the entire population as well as on each stratum, and compared to the other two sampling schemes, stratified sampling produces accurate, reliable, and detailed information. While this section has provided helpful background information to formulate sound sampling schemes, one may refer to the cited authors' materials for deeper coverage of sampling theory.
A researcher often arranges multivariate data methodically to collect the data necessary for a study
successfully. Depending on the study's purpose or intended use, the researcher may explore
capturing data in an array format, as illustrated in Equation 1.47 or a tabular form, as presented in
Table 1-3. In either of the arrangements, n specifies the number of units or features, while p is the number of variables of interest for which the measurements x_{ij} on the ith unit of the jth variable are being recorded. In general, the measurements x_{ij} are stored in an array format similar to that of Equation 1.47, which retains the theoretical elements that characterize the univariate distributions of each variable X_j and their joint distribution (Everitt and Hothorn 2011). The sampling distribution of the variables X₁, X₂, ⋯, X_p as a collective is a critical topic in multivariate analysis and will be examined in Section 2.3.2.2.
Table 1-3: Tabular arrangement of multivariate data

Features    X₁      X₂      ⋯    X_j     ⋯    X_p
1           x₁₁     x₁₂     ⋯    x₁ⱼ     ⋯    x₁ₚ
2           x₂₁     x₂₂     ⋯    x₂ⱼ     ⋯    x₂ₚ
⋮           ⋮       ⋮            ⋮            ⋮
i           xᵢ₁     xᵢ₂     ⋯    xᵢⱼ     ⋯    xᵢₚ
⋮           ⋮       ⋮            ⋮            ⋮
n           xₙ₁     xₙ₂     ⋯    xₙⱼ     ⋯    xₙₚ

or,

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np} \end{bmatrix}$$ [Eq. 1-47]
The following is a plausible justification for the array format preference for encoding observed values on the p variables. Measuring the entire set of variables on each unit renders the variables' values interrelated, so examining each variable in isolation may obscure the inherent structure of the entire data set. In other words, analyzing each variable on its own may be a missed opportunity for the researcher attempting to uncover the primary characteristics of the multivariate data and identify any interesting "patterns" hidden within the data. Typically, multivariate statistical analysis reveals the entire data structure. A multivariate
statistical analysis may be defined as the simultaneous statistical analysis of a group of variables.
The primary objective is to improve univariate analyses of each variable performed independently
by incorporating information describing the variables' relationships (Everitt and Hothorn 2011).
1.5.4.1 Prelude
Statistical data visualization has long been used in science and technology by users from various fields. R.A. Fisher's first methods were diagrams, which he published in 1925. Since then, data visualization has made tremendous progress, thanks primarily to rapid advancements in computing, and it now helps users convey significant technical information to solve many problems confronting today's fast-paced society. Examples of visualization methods are bar graphs, Pareto charts, stem and leaf plots, pie charts, scatterplots, and probability plots. A visualization method, such as a graph, is a vital link whose purpose is first to encode quantitative and categorical information into a display medium, then let the viewer decode it through visual perception (Cleveland 1993). Still, as Cleveland (1993, p. 2) pointed out, "no matter how clever the choice of information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails."
Nevertheless, how one knows whether a given probability distribution is a good model for the data
is a critical question for statisticians or researchers. The reason is that several statistical models are
based on the assumption of a specific type of population distribution. As a result, statistical data
analysis to determine whether the data came from a particular probability distribution is commonly
used to validate such assumptions. Occasionally, the shape of the distribution can reveal
information about the underlying structure of the data. Hence, a graphical representation of the
data, say a histogram, may provide insight into the shape of the underlying distribution.
However, histograms are not always reliable predictors of distribution shape unless there is a
considerable sample size. Therefore, practitioners may use probability plots such as Q-Q and P-P
plots for small to moderate samples, which are more reliable than histograms. Probability plotting,
for example, is a subjective visual assessment of data that is used to determine whether a given
sample of data fits a conjectured distribution (Wilk and Gnanadesikan 1968). Even though there is substantial literature on graphical techniques for statistical data analysis based on the empirical cumulative distribution function and its implications, the following subsections are devoted to Q-Q and P-P plots.
This section aims to provide background information on histograms and the frequency distribution
tables that serve as histograms' prerequisites. In the eighteenth century, the histogram was invented
as a data analysis tool for summarizing data (Chambers et al. 2018). As described in the following
sequel, a histogram condenses a data set into a compact image that illustrates the location of the
mean and modes of the data and the data's variation, mainly the range. In addition, it serves to
deduce patterns from data. A histogram is an excellent aggregate graph of a single variable that should always be used as a starting point for determining the variability in the data (Ramachandran and Tsokos 2014).
Frequency Distributions
A frequency distribution (or table) summarizes data more concisely than a stem-and-leaf diagram.
To create a frequency distribution, divide the data range into intervals, commonly referred to as
class intervals, cells, or bins. The bins should be similar in width to enhance the visual information
in the frequency distribution. As directed by Chambers et al. (2018, p. 41), for more information
on the topic, consult Diaconis and Freedman (1981) and Scott (1979) for their exciting reading on
optimal procedures for choosing the interval width. Nonetheless, some discretion must be exercised when selecting the number and width of the bins.
The number of bins is determined by the number of observations and the data's degree of scatter
or dispersion. A frequency distribution with too few or too many bins will be uninformative.
Choosing between five and twenty classes is a general guideline from various authors on the
subject. Montgomery and Runger went on to say that the number of bins should increase with n
and be chosen to be roughly equal to the square root of the number of observations, which often
works well in practice. Finally, the goal is to use enough classes to show the variation in the data,
but not so many that many of the classes have only a few data points (Ramachandran and Tsokos
2014). The frequency distribution table contains the relative frequency distribution in addition to the observed frequencies of the classes. The observed frequency f_i of a given bin, say i, divided by the total number of observations n, represents the bin's relative frequency; that is the ratio f_i/n. The bin's cumulative relative frequency, provided in Equation 1.48, is defined as the sum of the relative frequencies of all bins up to and including bin i.

$$F_i = \sum_{j=1}^{i} f_j / n$$ [Eq. 1-48]
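The computation of a frequency distribution can be sketched in a few lines of NumPy; the data values are hypothetical, and the five-bin choice merely follows the guidance above.

```python
import numpy as np

# Hypothetical measurements; any 1-D data set works here
data = np.array([2.1, 2.5, 2.7, 3.0, 3.1, 3.4, 3.4, 3.8, 4.2, 4.6])

counts, edges = np.histogram(data, bins=5)   # observed frequencies f_i per bin
rel_freq = counts / data.size                # relative frequencies f_i / n
cum_rel_freq = np.cumsum(rel_freq)           # cumulative relative frequency (Eq. 1-48)
for i in range(len(counts)):
    print(f"[{edges[i]:.2f}, {edges[i+1]:.2f})  f={counts[i]}  "
          f"f/n={rel_freq[i]:.2f}  F={cum_rel_freq[i]:.2f}")
```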
Histograms
A histogram is a graph where the horizontal axis represents the classes (bins), and the vertical axis represents frequencies, relative frequencies, or percentages. Histograms may have equal or uneven bins. A caveat of utilizing equal rather than unequal bin widths is that if the data contain a few extreme observations or outliers, employing a few equal-width bins results in practically all observations falling within a few bins. On the other hand, using many equal-width bins results in many bins with zero frequency. As a guideline, when the bins differ in width, the rectangle's area, not the height, must be proportional to the bin frequency. This guideline entails that the rectangle's height equals the bin frequency divided by the bin width. In addition to quantitative data, frequency distributions and histograms can visualize qualitative or categorical data. Typically, the width of the bins is equal in this scenario (Montgomery and Runger 2007). Either way, the steps for creating a histogram are the same:
Step 1: Mark the bins on the horizontal scale.
Step 2: Indicate and name the frequencies or relative frequencies on the vertical scale.
Step 3: Draw a rectangle over each bin with a height equal to the frequency (or relative frequency) of that bin.
While histograms communicate well to general audiences, they have several limitations as a data analysis tool (Chambers et al. 2018). Choosing the appropriate interval width and the number of bins, as discussed previously, is one of the problems. Montgomery and Runger (2007, p. 205) demonstrated this point with an experiment in Minitab. They varied either or both the number and the width of the bins for the same data on the compressive strength of 80 aluminum-lithium alloy specimens. They found "that histograms may be relatively sensitive to the number of bins and their width." In other words, for small data sets, the appearance of histograms might alter substantially if either or both the number or width of the bins are changed. Additionally, histograms are more reliable when used with larger data sets, ideally of 75 to 100 observations or greater. Figure 1.11(a) and (b) illustrate one of the histograms and one of the cumulative distribution plots of the compressive strength data from their experiments, respectively.
Figure 1.11: (a) Histogram of compressive strength for 80 aluminum-lithium alloy specimens; (b) cumulative distribution plot of the compressive strength data.
As indicated previously, Figure 1.11(b) depicts the compressive strength data's cumulative
frequency plot. The height of each bar represents the total number of observations that are less
than or equal to the bin's upper limit. Cumulative distributions are also helpful when interpreting
data. For instance, one may deduce immediately from the exact figure that around 70 observations
(out of 80) are less than or equal to 200 psi. In other words, 87.5% of the time, an observation x selected from the dataset will be less than or equal to 200 psi. A histogram can be defined as an approximation to the probability density function f(x). The bar area represents the relative frequency
(percentage) of the measurements in each histogram interval. Thus, the relative frequency can be
viewed as a probability estimate for whether a measurement falls within an interval. Similarly, the
area beneath f(x) represents the true probability that a measurement falls in the corresponding interval.
Moreover, as previously stated, histograms may present some challenges. However, when the sample size is sufficiently large, a histogram can be a relatively reliable indicator of the general distribution of the population of measurements from which the sample was selected. When data are symmetrical, the mean and median coincide in most cases. Additionally, the mean, median, and mode all coincide if such data are also unimodal. However, the mean, median, and mode do not coincide when the data are skewed (asymmetric, having a long tail to one side). Typically, given a right-skewed distribution, the mean, median, and mode satisfy mode < median < mean. The three metrics instead satisfy mode > median > mean with a left-skewed distribution. Refer to Montgomery and Runger's
book (2003) for more information and excellent illustrations on this subject.
The Q-Q plot is a highly effective visualization technique frequently used to produce a graphical
representation of the true pdf from which a given data may have originated. This plot represents
the quantiles of the empirical distribution of the provided data versus the quantiles of the
hypothesized true pdf under test graphically. Accordingly, it is referred to as a theoretical Q-Q plot
due to its distinction from an empirical Q-Q plot. If the resulting graph of these two distributions
is linear, the estimated pdf fits the given data reasonably well (Ramachandran and Tsokos 2014).
This result is Wilk and Gnanadesikan's brilliant finding, and their invention could not be more
straightforward or elegant than comparing one distribution's quantiles against the other's quantiles
(Cleveland 2003).
To describe the construction of a Q-Q plot, let x₁, x₂, ⋯, xₙ be the "raw data" and F(x) be the cumulative distribution function (CDF) of the theoretical distribution in question. The quantile function Q is then defined through Equation 1.49; the empirical quantile function is defined analogously from the data.

$$F\big(Q(q)\big) = q \quad\text{(a)} \qquad\qquad Q(q) = F^{-1}(q) \quad\text{(b)}$$ [Eq. 1-49]

In other words, a portion q of the probability of the distribution happens for x values less than or equal to Q_t(q), just as a fraction q of the data is less than or equal to Q_e(q). The subscripts "e" and "t" differentiate theoretical from empirical versions (Chambers et al. 1983). Now, construct a Q-Q plot through the following steps.
Step 1: Rank the observations in the sample from smallest to largest. That is, the sample x₁, x₂, ⋯, xₙ is arranged as x₍₁₎ ≤ x₍₂₎ ≤ ⋯ ≤ x₍ₙ₎, where x₍₁₎ is the smallest observation, x₍₂₎ the second smallest observation, and so forth, and x₍ₙ₎ is the largest observation. Note that when the values of the x₍ⱼ₎ are distinct, exactly j observations are less than or equal to x₍ⱼ₎. Theoretically, this is always valid when the sampled values are of the continuous type assumed in most cases.
Step 2: Compute the cumulative fractions p_j = (j − 0.5)/n at which the quantiles are evaluated.
Step 3: Use Equation 1.50 to estimate the values x̂₍₁₎, x̂₍₂₎, ⋯, x̂₍ₙ₎ of the random variable X at the fractions p_j; these are the theoretical quantiles.

$$\hat{x}_{(j)} = F^{-1}\!\left(\frac{j - 0.5}{n}\right), \quad j = 1, 2, \cdots, n$$ [Eq. 1-50]

Step 4: Plot the pairs of theoretical and observed quantiles. Some computer programs flip the axes, putting the observed quantiles x₍ⱼ₎ on the vertical axis and the theoretical quantiles x̂₍ⱼ₎ on the horizontal. The interpretation remains the same in either case.
Step 5: Examine the "straightness" of the Q-Q plot to see how the points deviate from a straight
line. It is advantageous to draw a line by concentrating on spots close to the middle of a Q-Q plot
rather than on the plot’s extreme left and right. The data set conforms to the predicted probability
distribution if the overall pattern is nearly linear. On the other hand, the data are skewed and do not conform to the expected probability distribution if the general pattern contains curves or sharp bends. Chambers et al. (1983) and Wilk and Gnanadesikan (1968) provide guidelines for identifying and interpreting Q-Q plots. Still, the following is a quick tip on drawing the straight line “chosen
subjectively.” Montgomery and Runger (2007, p. 213) suggested a rule of thumb which “is to draw
the line approximately between the 25th and 75th percentile points.” By doing so, if the pairs of
points 𝑥 , 𝑥 , lie very nearly along a straight line, then the notion that the sample data arise
from the hypothesized distribution would not be rejected. In other words, if the plotted points
deviate substantially from the straight line, the hypothesized model is not suitable.
For illustration, one may consider a well-illustrated example courtesy of Montgomery and Runger (2007, p. 215) on batteries' usage times in a portable personal computer. Below are its Q-Q plot and the necessary calculations based on ten (n = 10) observations of the effective battery service life (x₍ⱼ₎) measured in minutes of battery usage. Measurements and calculations following the five steps outlined above are provided in Table 1-4. Alternatively, Figure 1.12 illustrates the Q-Q plot of pairs (x₍ⱼ₎, z_j), where z_j represents the standardized normal scores (see page 94).
In assessing the closeness of the points to the straight line drawn on either one of the Q-Q plots in
Figure 1.12, it is apparent that most of the points can be covered by an imaginary “fat pencil”
placed along the straight line. With the points passing the fat pencil test, one can conclude that the
Gaussian distribution is indeed a suitable model for the data being studied. The authors also
provide other Q-Q plots showing a departure from the normal distribution. For more illustrations,
one may also refer to the book by Ramachandran and Tsokos (2014).
Table 1-4: Battery service life data and standardized normal scores

j     x₍ⱼ₎    (j − 0.5)/10    z_j
1     176     0.05            -1.64
2     183     0.15            -1.04
3     185     0.25            -0.67
4     190     0.35            -0.39
5     191     0.45            -0.13
6     192     0.55             0.13
7     201     0.65             0.39
8     205     0.75             0.67
9     214     0.85             1.04
10    220     0.95             1.64
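The z_j column of Table 1-4 can be reproduced with SciPy under the normality hypothesis; this minimal sketch implements Steps 1-3 above using the battery lifetimes.

```python
import numpy as np
from scipy.stats import norm

x = np.sort([176, 183, 185, 190, 191, 192, 201, 205, 214, 220])  # Step 1
n = x.size
p = (np.arange(1, n + 1) - 0.5) / n     # Step 2: cumulative fractions
z = norm.ppf(p)                         # Step 3: theoretical normal quantiles
print(np.round(z, 2))                   # matches the z_j column of Table 1-4
# Step 4: plotting the pairs (z, x) yields the Q-Q plot of Figure 1.12
```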
A P-P plot is a graphical tool used to determine how well a given data set fits a specified probability distribution.
This figure compares the given data's empirical cumulative probability distribution to the assumed
true cumulative probability distribution functions. If the plot of these two distributions is
approximately linear, it implies that the hypothesized true pdf fits the observed data reasonably
well (Ramachandran and Tsokos 2014). To illustrate how a P-P plot is constructed, consider a
random variable X and its assumed true cumulative distribution function 𝐹 𝑥 . Then, let
𝑥 , 𝑥 , ⋯ , 𝑥 be a random sample of X. Then, one can create a P-P plot by following these steps.
Step 1: Rank the observations in the sample from smallest to largest. That is, the sample x₁, x₂, ⋯, xₙ is arranged as x₍₁₎ ≤ x₍₂₎ ≤ ⋯ ≤ x₍ₙ₎.
Step 2: Compute the empirical cumulative probabilities p̂_j = (j − 0.5)/n, for j = 1, 2, ⋯, n.
Step 3: Calculate through Equation 1.52 the theoretical cumulative probabilities p₁, p₂, ⋯, pₙ.

$$p_j = F_X\!\left(x_{(j)}\right) = P\!\left(X \leq x_{(j)}\right) \approx \hat{p}_j, \quad \forall\, j = 1, 2, \cdots, n$$ [Eq. 1-52]

Step 4: Plot the pairs (p̂_j, p_j).
Step 5: Examine the "straightness" of the P-P plot to see how the points deviate from a straight line. Everything described previously regarding the interpretation of Q-Q plots also applies to P-P plots.
A couple of probability plots borrowed from different sources have been provided for illustration.
The P–P plot for a sample size of m = 200 and n = 50,000 markers is depicted in Figure 1.13(a) on
the left of the panel, derived from Patterson et al. (2006). They found the fit to be excellent for
demonstrating the Johnstone normalization's appropriateness. Whereas Figure 1.13(b) on the right
of the panel, derived from Dupuis (2010), depicts the P-P plot for the state of Nebraska, where X
represents the duration of the dry period, and Y represents the duration of the successive wet
period. As can be seen from the graph, model 1 based on X and Y does not entirely capture the
observed behavior.
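Analogously to the Q-Q sketch, the P-P construction may be sketched as follows; the sample is synthetic, and the standard normal is assumed as the hypothesized CDF.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample; the hypothesized CDF here is the standard normal
x = np.sort(np.random.default_rng(1).normal(size=50))
n = x.size
p_hat = (np.arange(1, n + 1) - 0.5) / n   # Step 2: empirical probabilities
p     = norm.cdf(x)                       # Step 3: theoretical probabilities (Eq. 1-52)
# Steps 4-5: plotting (p_hat, p) should hug the 45-degree line under a good fit
print(np.round(np.abs(p - p_hat).max(), 3))
```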
After discussing some graphical and tabular ways for describing data sets in the prior section, this
section discusses certain numerical features of a collection of measurements. For example, assume
that one has a sample consisting of the values a, b, and c. This data set exhibits various
characteristics, including central tendency and variability. The sample's mean, median, or mode
measures central tendency, while the sample variance, standard deviation, or interquartile range
estimates dispersion or variability. Please note that the formulae and subjects discussed in this section are available in most statistics literature (e.g., Ramachandran and Tsokos 2014).
Given a sample of n observations x₁, x₂, ⋯, xₙ, the sample mean, also known as the empirical average, is provided by Equation 1.53.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$ [Eq. 1-53]
As a measure of the central location of the data, the sample mean is heavily influenced by extreme values or outliers. However, the trimmed mean is a more robust measure of central location, as it is relatively unaffected by outliers. How is the trimmed mean determined? Given 0 ≤ α ≤ 1, one can determine a 100α% trimmed mean as follows: (1) order the data, (2) discard the data values within the lowest and highest 100α percentages, (3) determine the mean of the remaining data values. The notation for the 100α% trimmed mean is x̄_{tα}. To illustrate the trimmed mean concept, with ten ordered values and α = 0.10, the smallest and the largest observations are discarded and the remaining eight values are averaged, as in the sketch below.
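A sketch with SciPy's trim_mean follows; the ten values and the outlier are hypothetical.

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical data with one outlier
x = np.array([12, 14, 15, 15, 16, 17, 18, 18, 19, 95])
print(x.mean())             # ordinary mean, pulled up by the outlier (23.9)
print(trim_mean(x, 0.10))   # 10% trimmed mean: drops 12 and 95, averages the rest (16.5)
```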
A sample median is a central value in a set of data; that is, the value that partitions the data set into two groups of the same size. Let x₍₁₎, x₍₂₎, ⋯, x₍ₙ₎ be the ranked observations x₁, x₂, ⋯, xₙ in increasing order, from small to large. If n is odd, the median is the jth value x₍ⱼ₎ with j = (n + 1)/2. Otherwise, n is even, and there are two values of x₍ⱼ₎ equally close to the middle. Then, the median is their average, as provided in Equation 1.54.

$$\text{median} = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$$ [Eq. 1-54]
Unlike the mean, the median is much less susceptible to outliers in data.
A sample mode represents the data value that occurs the most frequently. As a result, it shows
where the data tend to be the most concentrated. However, as Ramachandran and Tsokos (2014)
wrote, if all data values are distinct, the data set has no mode by definition.
There are two quantiles worth mentioning, as they measure the spread of the data set. Those quantiles are the lower and upper quartiles, denoted by the abbreviations Q(.25) or Q₁ and Q(.75) or Q₃, respectively. They represent 25% and 75% of the data, respectively. The interquartile range (IQR) is the distance between the first and third quartiles, Q₃ − Q₁. It can be used to determine the spread of the middle half of the data and to flag potential outliers.
The sample variance, shorthand s² or var, is given by Equation 1.55 below provided.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$ [Eq. 1-55]

The sample standard deviation, denoted s, is defined as the square root of the variance s². Both the sample variance s² and the sample standard deviation s are measurements of the variability or "scatteredness" of data values in the vicinity of the sample mean x̄. The greater the variance, the wider the spread. It is worth noting that both s² and s are nonnegative.
The skewness of the same sample can be computed using Equation 1.56 below provided.

$$g_1 = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$$ [Eq. 1-56]
The kurtosis of the same sample can be computed using Equation 1.57 below provided.

$$k_1 = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4$$ [Eq. 1-57]
To interpret the values of g₁ and k₁, one may refer to page 90, where both terms are discussed.
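The four summary statistics above may be sketched with SciPy as follows; note, as an assumption of this sketch, that software bias-correction conventions differ slightly from the (n − 1)(n − 2) factors of Equations 1.56 and 1.57, and the data are hypothetical.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical sample
x = np.array([176, 183, 185, 190, 191, 192, 201, 205, 214, 220.0])

print(x.mean())               # sample mean (Eq. 1-53)
print(x.var(ddof=1))          # sample variance with the 1/(n-1) factor (Eq. 1-55)
print(skew(x, bias=False))    # bias-corrected sample skewness
print(kurtosis(x, bias=False))  # bias-corrected excess kurtosis
```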
In the classical sense, quantile is synonymous with percentile. On the one hand, the term
"percentile" refers to any of the 99 numbered points dividing an ordered collection of scores into
100 segments, each containing one-hundredth of the total scores. The percentile of a particular
number, say x, is determined by the percentage of values less than x. For example, a test score in
the 90th percentile is greater than 90% of the available scores but less than 10% of the remaining
scores. Quantile, on the other hand, is a phrase that refers to one of the groups of values in a variate
(random variable) that divides either the units (elements) of a sample into subgroups of the same
size and adjacent values or a probability distribution into elementary distributions of equal
probability. In other terms, a quantile is a value that divides a set of data into identical proportions.
Returning to the previous example, the 0.90 quantile of a set of data is the value that separates the data into two classes or groups such that a proportion 0.90 of the observed values falls below and a fraction 0.10 of the observed values falls above that value. It is worth noting that the main distinction between percentile and quantile is that the former refers to a percentage of the data set, while the latter refers to a fraction of the data set.
Because the terms percentile and quantile are synonymous, the focus will exclusively be on the
quantile. Quantiles may present some difficulties when computing them from a set of data, as
implied by their definition. For instance, suppose a data set contains ten observations. One can only split off the fractions 0.1, 0.2, …, 0.9 of the data. If a fraction other than those provided, say 0.33, must be split off, there will be no value that separates off a fraction of exactly 0.33.
Additionally, suppose one chooses to locate the split point at the nearest observation. They may
be unsure whether to count the observation in the lower or upper part of the observed data set's
scale. Practitioners employ a different but more practical operational definition of quantile to
circumvent these difficulties. The following step-by-step process is necessary for this operational definition of quantile (a short sketch follows).
Step 1: Consider first, for this new definition of quantile, a set of n raw data x₁, x₂, ⋯, xₙ.
Step 2: Order the data and associate the ordered observations x₍ᵢ₎ with the fractions p_i = (i − 0.5)/n, with i = 1, ⋯, n.
As a result, the data's quantiles Q(p_i) are simply the ordered observations themselves, x₍ᵢ₎. Several authors have proposed different expressions for the p_i; see, for example, Wilk and Gnanadesikan (1968).
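A minimal sketch of this operational definition follows, reusing the battery data of Table 1-4 for illustration.

```python
import numpy as np

x = np.sort([176, 183, 185, 190, 191, 192, 201, 205, 214, 220])
n = x.size
p = (np.arange(1, n + 1) - 0.5) / n   # Step 2: fractions p_i = (i - 0.5)/n
# Each ordered observation x_(i) is the empirical quantile Q(p_i)
for pi, xi in zip(p[:3], x[:3]):
    print(f"Q({pi:.2f}) = {xi}")
```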
Turning to multivariate data, given n observations on each of the p variables of X, the sample mean of the kth variable, also known as its empirical average, is provided by Equation 1.58.

$$\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}, \quad k = 1, 2, \ldots, p$$ [Eq. 1-58]

The sample variance, which measures the spread of the observations of the kth variable of X around its mean, is given by Equation 1.59.

$$s_k^2 = s_{kk} = \frac{1}{n}\sum_{j=1}^{n}\left(x_{jk} - \bar{x}_k\right)^2, \quad k = 1, 2, \ldots, p$$ [Eq. 1-59]

Another measure of spread is termed the sample covariance s_{ik}. As defined in Equation 1.60, the sample covariance assesses the spread pairwise between the ith and kth variables of X.

$$s_{ik} = \frac{1}{n}\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_k\right), \quad i = 1, 2, \ldots, p;\; k = 1, 2, \ldots, p$$ [Eq. 1-60]
The square root of the variance s_{kk} is known as the sample standard deviation, that is, √s_{kk}. It is worth noting that when i equals k, the sample covariance reduces to the variance.
The final descriptive statistic is the sample correlation coefficient or Pearson's product-moment correlation coefficient, which measures the linear association between two variables. For the ith and kth variables, the sample correlation coefficient r_{ik} is given by Equation 1.61 below.

$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}} = \frac{\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_k\right)}{\sqrt{\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)^2}\,\sqrt{\sum_{j=1}^{n}\left(x_{jk} - \bar{x}_k\right)^2}}$$ [Eq. 1-61]

where i = 1, 2, …, p and k = 1, 2, …, p.
While both r_{ik} and s_{ik} are adequate for determining the linear association of two variables, they may be less useful for other types of associations. Moreover, they can be unreliable when observations contain outliers, suggesting associations when there is hardly any. Therefore, questionable observations must be detected and corrected if needed. The following are some properties of the sample correlation coefficient, simply denoted as r. First, values of r are inclusively bounded between −1 and +1. Second, an r value of "0" signifies a lack of linear association between the variables, whereas a positive or negative sign of r indicates the direction of their association. Lastly, the value of r remains the same whether the factor 1/n or 1/(n − 1) is chosen to calculate s_{ii}, s_{kk}, and s_{ik}. More on sample correlation coefficients and descriptive statistics can be found in any statistics book, such as the book by Johnson and Wichern (2019).
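For instance, NumPy computes the sample covariance and correlation matrices directly; the 5 × 3 data matrix below is hypothetical.

```python
import numpy as np

# Hypothetical n x p data matrix (n = 5 units, p = 3 variables)
X = np.array([[4.0, 2.0, 0.60],
              [4.2, 2.1, 0.59],
              [3.9, 2.0, 0.58],
              [4.3, 2.1, 0.62],
              [4.1, 2.2, 0.63]])

S = np.cov(X, rowvar=False)        # sample covariance matrix (1/(n-1) factor)
R = np.corrcoef(X, rowvar=False)   # correlation matrix; unaffected by the factor choice
print(np.round(S, 4))
print(np.round(R, 2))
```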
Researchers usually rely on mathematical models to study and simulate various real-world phenomena. Most often, the formulated model serves to draw random samples X₁, X₂, ⋯, Xₙ from the population of interest. The values of the samples, known as observations or measurements, usually are denoted by x₁, ⋯, xₙ and represent values of some attribute of interest. For
example, these measurements could represent subway arrival times, household electricity
consumption, antenna spectrum sensing in cognitive radios, etc. In the field of probability and
statistics, to understand the behavior of these phenomena, one must first identify the probability
distribution from which the given data are drawn (Ramachandran and Tsokos 2014). This is
supported by the fact that, as stated by Chambers et al. (1983, p. 191), "at the heart of probabilistic statistical analysis is the assumption that a set of data arises as a sample from a distribution in some class of probability distributions."
In any case, there are several reasons for making distributional assumptions about data. First, if a
set of data can be described as a sample from a specific theoretical distribution, say a normal
distribution, then the data can be described more compactly. For example, in the normal case, the
data can be succinctly described by providing the mean and standard deviation and stating that the
normal distribution well approximates the data's empirical (sample) distribution. The use of
distributional assumptions can also lead to statistical procedures. The assumption that normal
probability distributions generate data, for example, leads to an analysis of variance and least
squares. Third, the assumptions enable one to characterize the sampling distribution of statistics
computed during the analysis, drawing conclusions and making probabilistic statements about
unknown aspects of the underlying distribution. For example, assuming the data is a sample from
a normal distribution, one can use the t-distribution to calculate confidence intervals for the
theoretical distribution's mean. A fourth reason for making distributional assumptions is that understanding the distribution of a data set can sometimes shed light on the physical mechanisms that generated the data.
Analyses based on specific data distributional assumptions are invalid if the premises are not met
to a reasonable degree. According to Chambers et al. (1983), “Garbage in, garbage out.” When
attempting to validate the assumption about the distribution of the sampled data, one is trying to
verify if the empirical distribution can sufficiently be approximated by the assumed one. Clearly
defining the task prompts the investigator to take action to shed light on the issue. For instance, it may encourage the investigator to look closely at how the empirical distribution of a set of data compares with the hypothesized theoretical distribution.
In the previous sections, graphical displays served to validate distributional assumptions about data; in practice, however, investigators use goodness-of-fit testing to complement them. The testing
serves to identify a probability distribution function that would likely characterize the behavior of
the data or the phenomenon of interest. Due to its importance and relevancy to the current study,
this section discusses a couple of statistical tests (methods). These tests are the chi-square and
Kolmogorov-Smirnov goodness-of-fit tests which are practical to use in determining how well the
data fits a specific probability distribution necessary to achieve one’s goal of identifying the
underlying probability distribution of a given data set. However, it should be noted that other
methods exist that do not rely on population distributional assumptions. These methods are known
as nonparametric or distribution-free tests. While the nonparametric tests are beyond the scope of
this study, to further one’s knowledge, refer to Ramachandran and Tsokos (2014) or Soong (2004),
for instance.
To study an unknown phenomenon's behavior, one must first collect a random sample of data via experiments or other means and test whether its distribution fits a known probability distribution well. Pearson's (1900) chi-squared (χ²)
probability distribution. Pearson's (1900) chi-squared (𝜒2 ) goodness-of-fit test is one of the most
popular and versatile tests designed for this purpose (Soong 2004). For instance, in applied
statistics, one may refer to McAssey (2013)’s paper for applications to multivariate distributions
with known hypothesized distribution functions. Nevertheless, when conducting a Pearson's chi-square goodness-of-fit test (χ²-test), it is commonly assumed that the hypothesized distribution of the population X of interest is known. The χ²-test is based on the test statistic Q², which measures the difference between a frequency graph (e.g., a histogram) built from the sample values and one built from the hypothesized CDF F₀(x) of the population X with known parameters [for unknown parameters, see, e.g., Soong (2004)]. One may divide the sampled data range of the population X into k mutually exclusive intervals (classes or bins) I₁, ⋯, I_k and let O_j be the number of values of X falling into I_j, with j = 1, 2, ⋯, k. Note that O_j is also referred to as the jth observed frequency, from which the empirical probability p̂_j of the jth interval is estimated as in Equation 1.62.

$$\hat{p}_j = \frac{O_j}{n} \approx \mathbb{P}\left(X \in I_j\right), \quad j = 1, 2, \cdots, k$$ [Eq. 1-62]
Moreover, if F₀(x) denotes the hypothesized (expected) CDF of the data being tested, then the theoretical probability p_j associated with the jth interval can be determined using Equation 1.63.

$$p_j = \mathbb{P}\!\left(X \in I_j \,\middle|\, F_0\right) = F_0(x_u) - F_0(x_l) = \frac{E_j}{n}, \quad j = 1, 2, \cdots, k$$ [Eq. 1-63]

where E_j is the jth expected (theoretical) frequency for the jth interval, expressed in terms of F₀ evaluated at the upper (x_u) and lower (x_l) bounds (limits) of the interval, and n is the sample size.
Then, the test statistic Q², provided in Equation 1.64, is a measure of deviation between the observed and expected outcome frequencies, expressed as the sum of the differences between the observed and expected outcome frequencies (counts of observations), each squared and divided by the expectation.

$$Q^2 = \sum_{j=1}^{k} \frac{\left(O_j - E_j\right)^2}{E_j}$$ [Eq. 1-64]
Theorem: Assuming that the hypothesis H₀ below is valid, the distribution of the test statistic Q² approaches a chi-square distribution with k − 1 degrees of freedom as n grows large.
H₀: The hypothesized CDF F₀(x) is the true CDF of the population X,
Versus
H₁: The hypothesized CDF F₀(x) is not the true CDF of the population X.
Through the above theorem, a test of the hypothesis H₀ versus H₁, also known as a significance test, can be formulated by assigning a probability α of a type-I error. As discussed in Section 1.8, one commits a type I error when one rejects the null hypothesis even though it is in fact true. Accordingly, if one aims to attain a type-I error probability of α, the χ²-test recommends rejecting the one-sided hypothesis test each time Q² satisfies the criterion given by Equation 1.65 below.
$$\mathbb{P}\!\left(Q^2 > \chi^2_{\alpha,\,k-1}\right) = \alpha$$ [Eq. 1-65]

Otherwise, accept the hypothesis. For the convenience of calculations, given the probability α of the type-I error, also referred to as the significance level, one can use a χ² lookup table to find the value χ²_{α,k−1} corresponding to the test statistic. On the lookup table [e.g., see Montgomery and Runger (2007, p. 655)], χ²_{α,k−1} corresponds to the upper 100α percentage point of the chi-square distribution with k − 1 degrees of freedom, as shown in Figure 1.14. In practice, the most common alpha values are 0.001, 0.01, and 0.05. A p-value between 5% and 1% is regarded as almost significant, and a value between 1% and 0.01% as significant. Equation 1.65 represents the critical region, also known as the rejection region, defining the set of values of the test statistic for which the null hypothesis is rejected.
Next, the following step-by-step procedures are necessary for performing a goodness-of-fit test.
Step 1: Divide the range of values of the random variable X into k non-overlapping intervals I₁, ⋯, I_k. Let O_j be the total number of the values of the n samples that fall in the interval I_j, with j = 1, 2, ⋯, k.
Step 2: Calculate each theoretical probability p_j, as in Equation 1.63, associated with each interval j in terms of F₀.
Step 3: Compute the expected frequencies E_j = n·p_j and the test statistic Q² as in Equation 1.64.
Step 4: Select a value of α to construct the critical region as specified in Equation 1.65.
Step 5: Conduct the χ²-test below under the assumption that n is sufficiently large (roughly n > 50).
H₀: F(x) = F₀(x)
Versus
H₁: F(x) ≠ F₀(x)
Step 6: Reject the hypothesis H₀ if Q² > χ²_{α,k−1}, and conclude that the data does not follow or fit the specified probability distribution. Otherwise, accept H₀ and deduce that the data set fits the specified distribution.
It is worth noting that if one uses statistical software, the software may compute a p-value based on the test statistic Q²; H₀ is then accepted at all significance levels α less than the determined p-value. For more information on the χ²-test, including illustrations, one may refer to Soong (2004) or Ramachandran and Tsokos (2014), among others.
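For instance, the six steps above can be sketched with SciPy's chisquare routine; the sample, the choice of bins, and the standard normal hypothesis below are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chisquare, norm

rng = np.random.default_rng(7)
x = rng.normal(loc=0, scale=1, size=200)    # hypothetical sample, n = 200

# Steps 1-3: bin the data and compute expected counts under F0 = standard normal
edges = np.array([-np.inf, -1.0, -0.5, 0.0, 0.5, 1.0, np.inf])
O, _ = np.histogram(x, bins=edges)          # observed frequencies O_j
p = np.diff(norm.cdf(edges))                # p_j = F0(x_u) - F0(x_l)  (Eq. 1-63)
E = x.size * p                              # expected frequencies E_j

# Steps 4-6: Q^2 and its p-value with k - 1 degrees of freedom (Eq. 1-64, 1-65)
Q2, pval = chisquare(O, E)
print(Q2, pval)                             # accept H0 when pval exceeds alpha
```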
The Kolmogorov-Smirnov goodness-of-fit test, denoted as the K-S test, is a test regarding a statistic used to measure the deviation of the empirical cumulative histogram from the assumed cumulative distribution function. It has found applications in fields such as chemometrics, the science of deriving information from chemical systems by data-driven methods. For instance, Saccenti et al. (2011, p. 648) assessed the equivalence of the empirical equipercentile function with a hypothesized distribution function. Nevertheless, to define the test statistic of the K-S test, let x₁, ⋯, xₙ be the set of realizations of the random samples X₁, X₂, ⋯, Xₙ drawn from the population of interest X. Let F(x) be the unknown true CDF of X, and F₀(x) be the assumed CDF of X, whose parameters are entirely specified.
From the set of observed data x₁, x₂, ⋯, xₙ, a cumulative histogram can be plotted in three steps:
Step 1: Rank the observations in the sample from smallest to largest. That is, the sample x₁, x₂, ⋯, xₙ is arranged as x₍₁₎ ≤ x₍₂₎ ≤ ⋯ ≤ x₍ₙ₎.
Step 2: Evaluate the observed distribution function F̂(x₍ⱼ₎) = j/n at each ordered observation, for j = 1, 2, ⋯, n. In this case, let F(x) without a hat represent the unknown CDF of X, which is being tested for equality with F₀.
Step 3: Plot F̂(x) as a step function of x.
The below equation defines the test statistic to consider in this case.

$$D_2 = \max_{1 \leq j \leq n}\left|\hat{F}\!\left(X_{(j)}\right) - F_0\!\left(X_{(j)}\right)\right| = \max_{1 \leq j \leq n}\left|\frac{j}{n} - F_0\!\left(x_{(j)}\right)\right|$$ [Eq. 1-66]
where X₍ⱼ₎ represents the jth-order statistic of the sample. As defined, D₂ thus quantifies the maximum of the absolute values of the n differences between the observed and hypothesized CDFs evaluated at the observed samples. When the parameters in the theorized distribution must be estimated, the values of F₀(X₍ⱼ₎) are calculated using the distribution's estimated parameter values. While obtaining the distribution of D₂ analytically is complicated, its distribution function at various values can be computed numerically and tabulated. It can be demonstrated that the distribution of D₂ is independent of the hypothesized F₀ and depends only on n, the sample size (e.g., see Massey, 1951). At this point, the K-S test becomes similar to the
χ²-test. At a given significance level α, the operating rule is to reject the hypothesis H₀ stated below whenever the sample value d₂ of the statistic exceeds the critical value c_{n,α}.
H₀: The theoretical CDF F₀(x) is the true CDF F(x) of the empirical data,
Versus
H₁: The theoretical CDF F₀(x) is not the true CDF F(x) of the empirical data.
In this case, d₂ is the sample value of D₂, and c_{n,α}, as given in Equation 1.67, represents the critical value of the maximum absolute difference between the sample and population CDFs of Equation 1.66.

$$\mathbb{P}\!\left(D_2 > c_{n,\alpha}\right) = \alpha$$ [Eq. 1-67]

The values of c_{n,α} for α = 0.001, 0.01, and 0.05 can be found numerically and tabulated as functions of n [e.g., see Soong (2004, p. 372)]. For instance, for larger n (n > 40) and α = 0.05,

$$c_{n,\alpha} = \frac{1.36}{\sqrt{n}}$$
The following is a step-by-step procedure for carrying out the K-S test:
Step 1: Rank the observations in the sample from smallest to largest.
Step 2: Compute the observed distribution function at each ordered observation using Equation 1.68.

$$\hat{F}\!\left(x_{(j)}\right) = \frac{j}{n}, \quad j = 1, 2, \cdots, n$$ [Eq. 1-68]

Step 3: Using the hypothesized distribution, compute the theoretical distribution function F₀(x) at each x₍ⱼ₎. If necessary, the distribution F₀'s parameters are estimated from the data.
Step 4: Work out the differences between the empirical and assumed CDFs evaluated at each x₍ⱼ₎, as in Equation 1.69.

$$\left|\hat{F}\!\left(x_{(j)}\right) - F_0\!\left(x_{(j)}\right)\right|, \quad j = 1, 2, \cdots, n$$ [Eq. 1-69]

Step 5: Compute d₂, the sample value of D₂, using Equation 1.66. Note that plotting F̂(x) and F₀(x) as functions of x and noting the location of the maximum by examination may save time.
Step 6: Select a value of α and find c_{n,α} by using a table such as Table 1-5 hereby provided. One then rejects H₀ whenever d₂ > c_{n,α}.
Table 1-5: Kolmogorov-Smirnov Test - Critical Values between Data Sample and
Hypothesized CDFs
Courtesy of Soong (2004, p. 372)
It is worth noting the significant differences between this test and the χ²-test. The K-S test is valid for all values of n, whereas the χ²-test is a large-sample test. Furthermore, the K-S test employs unaltered and unaggregated sample values, whereas data lumping is required in the execution of the χ²-test. On the downside, the K-S test is only strictly valid for continuous distributions. It should also be mentioned that the available tabulated c_{n,α} values are based on a completely specified hypothesized distribution. There is no rigorous method of adjustment available when the parameter values of the assumed CDF must be estimated. In these cases, the only thing that can be said is that the resulting test tends to be conservative.
To see an example of the K-S test in action, see Ramachandran and Tsokos’ (2014) book, which
includes an in-depth example. By following and illustrating all of the steps outlined herein, the authors make it easy for one to obtain a hands-on hypothesis-testing approach. Additionally, one
may refer to another well-illustrated example by Soong (2004). Finally, see Saccenti et al. (2011)
and Herzog et al. (2007) for chemometrics and structural equation modeling for real-world
applications, respectively.
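As a minimal sketch of the K-S procedure, SciPy's kstest computes D₂ and its p-value directly; the sample below and its fully specified normal hypothesis are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(11)
x = rng.normal(loc=5.0, scale=2.0, size=60)    # hypothetical sample, n = 60

# Fully specified hypothesized CDF (parameters known, not estimated)
d2, pval = kstest(x, norm(loc=5.0, scale=2.0).cdf)
print(d2, pval)                                # reject H0 at level alpha if pval < alpha

# Large-sample critical value for alpha = 0.05: c = 1.36 / sqrt(n)
print(d2 > 1.36 / np.sqrt(x.size))
```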
Practitioners in various fields, especially in probability and statistics, use covariance matrices to describe the joint variability of multiple variables; covariance matrices are essential objects for multivariate statistical analysis. In most methods employed in multivariate statistical analysis, the underlying structure of the population units is unknown or assumed. Accordingly, theoretical covariance matrices play an enormous role by serving as objects to describe the true interdependency structure of the units of the underlying population (Bejan 2005). Still, through statistical measurements collected from a population, the sample covariance matrices typically help clarify, to "some extent," the interdependence structure present in the population items.
Consider the random vector X = (X₁, ⋯, X_p)ᵀ. Intuitively, one may see a covariance matrix as representing the dispersion of an observed vector x whose entries are the realizations of the p random variables X₁, ⋯, X_p. In p-dimensional space, the elements of x can be interpreted as the coordinates of a point in the space. In addition, the distance of the point x = (x₁, x₂, …, x_p)ᵀ to the origin of the space can be measured in units of the standard deviation, the square root of the variance, or a scaled difference between units. In this manner, the inherent uncertainty or variability in the observations is accounted for. In addition, points of similar related "uncertainty" should be considered as if they were at an equal distance from the space origin.
Many multivariate techniques are based on the simple concept of distance. This section aims to define the Euclidean distance and the non-ordinary statistical distance. Let the points P and Q, whose coordinates are defined by Equation 1.70(b), be represented in the Euclidean plane of origin O, as illustrated in Figure 1.15(a). Equation 1.70(a) depicts the straight-line distance between point P and origin O using the Pythagorean theorem, while Equation 1.70(d) represents the expression of the distance between the p-dimensional points P and Q.

$$d(O, P) = \sqrt{x_1^2 + x_2^2}, \quad P = (x_1, x_2), \quad O = (0, 0) \quad\text{(a)}$$
$$d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2} \quad\text{(d)}$$ [Eq. 1-70]

Figure 1.15: (a) Straight-line (Euclidean) distance between the origin O and a point P; (b) locus of points at a constant statistical distance.
By transforming the coordinates of P and Q, the formulas provided above can be expressed in terms of statistical distances. One may perform this transformation in such a way that the variability in the x₁ direction is identical to the one in the x₂ direction, and the measurements or values of x₁ and x₂ vary independently, to apply the Euclidean distance. In other words, all points P located at a constant squared distance from Q must lie on a hyperellipsoid centered at Q whose major and minor axes are parallel to the coordinate axes, as shown in Figure 1.15(b) for the case of two-dimensional points. As Johnson and Wichern (2019) derived, Equation 1.71(a) through Equation 1.71(c) give the resulting statistical distance between the points P and Q.

$$d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}} \quad\text{(a)}$$
$$Q = (y_1, y_2, \cdots, y_p) \quad\text{(c)}$$ [Eq. 1-71]

where s₁₁, s₂₂, …, and s_pp in Equation 1.71(a) are the sample variances constructed from the measurements on x₁, x₂, ⋯, x_p.
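A small sketch contrasts the two distances of Equations 1.70(d) and 1.71(a); the points and sample variances below are hypothetical.

```python
import numpy as np

# Hypothetical points and sample variances for p = 3 variables
P = np.array([4.0, 2.0, 0.6])
Q = np.array([3.5, 2.4, 0.5])
s = np.array([0.25, 0.04, 0.01])                   # s_11, s_22, s_33

euclidean   = np.sqrt(((P - Q) ** 2).sum())        # Eq. 1-70(d)
statistical = np.sqrt((((P - Q) ** 2) / s).sum())  # Eq. 1-71(a)
print(euclidean, statistical)
```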
1.6.1 Preface
In a broader sense, probability refers to the branch of mathematics concerned with the study of the likelihood of events. It applies daily to assessing risks and
devising mathematical models for calculations and predictions. More specifically, the term
probability, a probability measure, quantifies the possibility or chance (e.g., 25%) that a random
event (e.g., rain) will occur. Thus, it is a statement about the likelihood of an event. The probability
assigns to this event a value from the interval [0,1] to the outcome as a ratio in percentage. As
defined, one may interpret the probability of an outcome as a subjective degree of belief under
which the outcome will occur. Another interpretation may be that probability depends on the
theoretical model of repetitive replicas of the random experiment. To compute the probability of
an event, one needs first to define the sample space (Ω), which represents the collection of all
outcomes or results on any given attribute or variable of the population in question. The rule of
thumbs when assigning probabilities to outcomes making up the sample space is that one of the
probabilities of all results must sum up to one (Montgomery and Runger 2007). One can construct
subsets of Ω, known as events, from the sample space. An event represents a set of the experiment's possible outcomes given an investigation; one would say that the event has occurred if the experiment's outcome lies within it. The collection ℱ of all such events is closed under complements and countable unions.
A sample space can either be discrete or continuous. When the set of its results is finite or countably infinite, then Ω is discrete. Whereas, when the set of its outcomes falls in an interval of real numbers, then Ω is continuous. A probability space (Ω, ℱ, ℙ) is a triple consisting of the sample space Ω, the collection ℱ of subsets of Ω (the events), and the probability measure ℙ. The following sections will discuss this topic in depth. The purpose, for now, is to provide some background information on probability distributions for random variables. While Section 1.6.2 will be concerned with univariate random variables, Section 1.6.3 will focus on multivariate random variables, considering the latter as a generalization of the former.
Everyday life is full of endless systems. An investigator trying to understand any of them can create a mathematical model to experiment with the system's events. Usually, the investigator uses random variables to analyze the system, and the results may serve in other applications. For example, in many experiments involving probability, knowing the expression of the function that relates an experiment to its possible outcomes is more informative than the outcome itself. A random variable is that function, assigning a numerical value to each possible outcome after the realization of each experiment contained in the sample space Ω. For instance, in a coin-tossing example where Ω represents the set of all tails and heads, X may define the number of times the coin shows a tail. In this case, X is a univariate discrete random variable, which differs from a continuous variable. In either case, the probability distribution of a random variable describes how probability is distributed over its possible values (Montgomery and Runger 2007). Concerning both types of random variables, the following sections describe the probability distributions of each variable and the parameters used to summarize distributions.
Let Ω be a discrete sample space and ℱ a subset of Ω containing the outcomes of a given experiment. In addition, let X be the random variable representing the outcomes contained in ℱ, taking the discrete values x_1, x_2, ⋯ with respective probabilities p_1, p_2, ⋯. Each p_i represents the distribution of the random variable X at its discrete value x_i. From the given probability information on X, one may formulate the probability of X in terms of a function as the probability that X takes on a value smaller than or equal to a preselected value “x”:

F(x) = P(X \le x) = \sum_{x_i \le x} p_i   [Eq. 1-72]
where F(x) represents X's cumulative distribution function (CDF), satisfying the following conditions:

0 \le F(x) \le 1   (a)   [Eq. 1-73]

\text{If } x \le y \text{, then } F(x) \le F(y)   (b)
It is customary to associate F(x), defined earlier by Equation 1.73, with a probability mass function (PMF) f(x) as defined by Equation 1.74(c) and satisfying the conditions provided in Equation 1.74(a), Equation 1.74(b), and Equation 1.74(d), which ensure the conservation of the total probability mass of the system represented by X.

f(x_i) \ge 0   (a)

\sum_i f(x_i) = 1   (b)   [Eq. 1-74]

f(x_i) = P(X = x_i) = p_i   (c)

F(x) = P(X \le x) = \sum_{x_i \le x} f(x_i)   (d)
For an illustration of the relationship between F(x) and f(x), one may refer to Figure 1.16.
Meanwhile, the literature contains countless examples of discrete probability distributions. Among the classic ones is the discrete uniform distribution, where X is the simplest discrete variable assuming a finite number n of possible values, each with equal probability f(x_i) = 1/n. For more examples on this topic and its applications to various fields, one may refer to a probability text such as Tijms (2007).
The expected value of the discrete variable X, denoted by E(X) or E[X], represents the mean μ of X. Its expression is provided in Equation 1.75 as a weighted average of all possible outcomes of X:

\mu = E[X] = \sum_i x_i p_i   [Eq. 1-75]
Note that E[X] is not necessarily a value that X can assume and may differ from the most probable value of X. In addition, if one is interested in the expected value of a function of X, such as X^r, Equation 1.76 serves to determine the resulting expected value E[X^r], referred to as the rth moment of X about the origin:

\mu_r' = E[X^r] = \sum_i x_i^r f(x_i) = \sum_i x_i^r p_i   [Eq. 1-76]
Also of interest as summaries of the probability distribution of X are the variance var(X) and the standard deviation σ of X. As defined in Equation 1.77(b), σ represents the square root of the variance:

var(X) = \sigma^2(X) = E[(X - \mu)^2] = \sum_i (x_i - \mu)^2 p_i   (a)   [Eq. 1-77]

\sigma = \sqrt{var(X)}   (b)
Both var(X) and σ measure the spread in the values of X. It is worthwhile noting that the variance of X represents the expected value of the random variable (X − μ)². In addition, the variance σ²(X) does not have the same dimension as the values of X; rather, its dimension is the square of the dimension of the values taken by X, while σ shares the dimension of X. Skewness and kurtosis are other well-known measures of observed data distributions that can be determined using Equation 1.82 and Equation 1.83.
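As a worked illustration of Equations 1.75 through 1.77, the short Python sketch below (with an illustrative distribution, the number of tails in three fair coin tosses) computes the mean, variance, and standard deviation of a discrete random variable:

```python
import numpy as np

# Discrete distribution: values x_i with probabilities p_i summing to one.
x = np.array([0, 1, 2, 3])
p = np.array([0.125, 0.375, 0.375, 0.125])   # tails in 3 fair coin tosses

mu = np.sum(x * p)                # Eq. 1-75: mean as a weighted average
var = np.sum((x - mu) ** 2 * p)   # Eq. 1-77(a): variance
sigma = np.sqrt(var)              # Eq. 1-77(b): standard deviation
print(mu, var, sigma)             # 1.5, 0.75, 0.866...
```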
Unlike the discrete random variable of the previous section, a continuous random variable has a distinctly different distribution because its possible values are uncountable. Because an interval of real numbers represents the range of the values of X, one may regard this range as a continuum. Continuous distributions find extensive applications in various fields, including engineering, where they describe physical systems. For instance, consider the load density of a beam carrying a uniform load between points a and b (see Lindeburg 2011, p. A-120, Montgomery and Runger 2007, p. 99). As Montgomery and Runger (2007) wrote, the integral of the density function over the interval [a, b] represents the total loading in this interval. One may interpret the integral as the summation of the infinitely many loading points over the interval [a, b]. Similarly to the discrete case, a density function f(x) and a probability distribution function F(x) serve to describe the probability of a continuous random variable X between the reals a and b. Both f(x) and F(x) satisfy the properties given in Equation 1.78(a) through Equation 1.78(d).
f(x) \ge 0   (a)

\int_{-\infty}^{\infty} f(x)\, dx = 1   (b)   [Eq. 1-78]

F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\, du, \quad \forall\, x \in \mathbb{R}   (c)

P(a \le X \le b) = F(b) - F(a) = \int_{a}^{b} f(x)\, dx, \quad \forall\, a, b \in \mathbb{R}   (d)
Figure 1.16: Probability Distribution and Mass (resp. Density) Functions Illustrations for
Discrete (resp. Continuous) Random Variables
Adapted from Montgomery and Runger (2007, p. 65, 99)
Let X be a continuous random variable equipped with a PDF f(x) and a CDF F(x) as defined in Equation 1.78. The expected value E[X] of X, which also corresponds to the mean μ of X, is defined by Equation 1.79.
\mu = E[X] = \int_{-\infty}^{\infty} x f(x)\, dx   [Eq. 1-79]
Similarly, the expected value of the function X^r, also referred to as the rth moment about the origin of X, is given by Equation 1.80:

\mu_r' = E[X^r] = \int_{-\infty}^{\infty} x^r f(x)\, dx   [Eq. 1-80]
In addition, when it exists, the rth moment about the mean μ of X, also called the rth central moment, is defined analogously; for r = 2, it yields the variance, as in Equation 1.81:

var(X) = \sigma^2(X) = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx   [Eq. 1-81]
While the mean μ and standard deviation σ are useful descriptive statistics for locating the center and describing the spread or dispersion of the probability density function f(x), they do not give a detailed description of the distribution. For instance, two distributions may have identical means and variances and yet differ (Ramachandran and Tsokos 2014). Thus, to characterize the probability distribution of a continuous random variable more accurately, the higher or rth moments defined earlier serve to introduce two further measures, skewness and kurtosis.
The third standardized moment about the mean μ, as described in Equation 1.82, is referred to as the skewness coefficient:

\alpha_3(X) = \frac{E[(X - \mu)^3]}{\sigma^3(X)} = \frac{\mu_3}{\mu_2^{3/2}}   [Eq. 1-82]
Note that skewness measures a density function's asymmetry (lack of symmetry) around its mean. A distribution, or data set, is said to be symmetric if it appears identical to the left and right of its center point. If α₃(X) = 0, the distribution is symmetric around the mean; if α₃(X) > 0, the distribution has a longer right tail; and if α₃(X) < 0, the distribution has a longer left tail. Thus, a nonzero α₃(X) signals a skewed distribution.
The kurtosis, defined as the standardized fourth moment about the mean μ, is given by Equation 1.83:

\alpha_4(X) = \frac{E[(X - \mu)^4]}{\sigma^4(X)}   [Eq. 1-83]
In addition, kurtosis reflects the weight of the tails of a distribution. When the excess kurtosis relative to a normal distribution is positive, there are more observations in the tails than under normality; when it is negative, there are fewer. Leptokurtic distributions have substantial tails, platykurtic distributions have negligible tails, and mesokurtic distributions have the same kurtosis as a normal distribution. Among continuous distribution functions, the four examples provided in this section have been selected for relevance to the proposed study. They are also among the most extensively used distributions in applications across different fields. The four identified distributions are the uniform, triangular, normal, and beta distributions. The following paragraphs cover each distribution by defining its CDF and PDF and the summary values described in Section 1.6.2.2, devoted to continuous random variables. In addition, each paragraph outlines a few examples of their applications to construction engineering and management. For additional information on these or other distributions, see a probability and statistics book such as those by Tijms (2007) or Montgomery and Runger (2007).
Uniform Distribution
A continuous random variable X is said to exhibit a uniform distribution over the interval [a, b] if its probability density function f(x) and cumulative distribution function F(x) are as given in Equation 1.84(a) and Equation 1.84(b). Equation 1.84(c) and Equation 1.84(d) provide the distribution's mean and variance.

f(x) = \begin{cases} \dfrac{1}{b - a}, & \forall\, x \in [a, b] \\ 0, & \forall\, x \notin [a, b] \end{cases}   (a)

F(x) = \begin{cases} 0, & \forall\, x < a \\ \dfrac{x - a}{b - a}, & \forall\, a \le x \le b \\ 1, & \forall\, x > b \end{cases}   (b)   [Eq. 1-84]

\mu = E[X] = \frac{a + b}{2}   (c)

var(X) = \frac{(b - a)^2}{12}   (d)
Provided below in Figure 1.17, from left to right, are the graphical representations of f(x) and F(x) on [a, b].
Figure 1.17: Uniform Probability Density and Distribution Functions’ Graphs
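A quick numerical check of Equation 1.84(c) and (d) can be sketched in Python (the endpoints a and b below are illustrative); the sampled moments should approach the closed-form mean and variance:

```python
import numpy as np
rng = np.random.default_rng(0)

a, b = 2.0, 10.0
x = rng.uniform(a, b, size=100_000)

# Sample moments approach Eq. 1.84(c)-(d) as the sample grows:
print(x.mean(), (a + b) / 2)         # ~6.0 vs 6.0
print(x.var(), (b - a) ** 2 / 12)    # ~5.33 vs 5.33
```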
The uniform distribution is frequently used to simulate quantities that vary randomly and assume
values between a and b with little knowledge outside of a and b (Tijms 2007). For example, in
construction management and scheduling, when deciding which distribution to employ for project
activity durations, Williams (1992) noted that the uniform distribution is only relevant in rare
instances. Among those uncommon occurrences, this investigation discovered a small number of
studies that used the uniform distribution. De Reyck and Herroelen (1996) investigated the
potential use of both the coefficient of network complexity (CNC) and the complexity index on
networks with activity periods taken from the uniform distribution in the range [1, 10]. Dodin and
Elmaghraby (1985) used randomly generated networks with discrete and uniform distributions of
activities to approximate the criticality indices of activities in PERT Networks. Liu et al. (2006)
suggested an evolutionary method for minimizing the overall weighted tardiness of all jobs to be
performed using a single machine with random machine breakdowns. The algorithm was evaluated
using job parameters such as work weights and job release time produced from discrete uniform
distributions. Mehta (1999) used the same idea: job processing times are produced from a discrete
uniform distribution. For the activities required by software that generates project networks, Agrawal et al. (1996) employed uniform distributions over specified ranges to randomly assign activity durations.
Triangular Distribution
The behavior of a random variable X is said to follow a triangular distribution if its probability density function f(x) and cumulative distribution function F(x) over an interval [a, b] with mode m are respectively given by Equation 1.85(a) and Equation 1.85(b). Equation 1.85(c) and Equation 1.85(d) provide the distribution's mean and variance.

f(x) = \begin{cases} \dfrac{2(x - a)}{(m - a)(b - a)}, & \forall\, a \le x \le m \\ \dfrac{2(b - x)}{(b - m)(b - a)}, & \forall\, m < x \le b \\ 0, & \forall\, x \notin [a, b] \end{cases}   (a)

F(x) = \begin{cases} 0, & \forall\, x < a \\ \dfrac{(x - a)^2}{(m - a)(b - a)}, & \forall\, a \le x \le m \\ 1 - \dfrac{(b - x)^2}{(b - a)(b - m)}, & \forall\, m < x \le b \\ 1, & \forall\, x > b \end{cases}   (b)   [Eq. 1-85]

\mu = E[X] = \frac{1}{3}(a + b + m)   (c)

var(X) = \frac{1}{18}(a^2 + b^2 + m^2 - ab - am - bm)   (d)
Provided below in Figure 1.18, from left to right, are the graphical representations of f(x) and F(x) on [a, b].
The carefully chosen values a, m, and b are the three parameters of the triangular distribution. Figure 1.18, from left to right, depicts the triangular probability density f(x) and distribution F(x) functions; the density increases on the interval [a, m] and decreases on [m, b]. The triangular distribution is often recommended for modeling purposes when there is only a little information on the variable of interest, such as its likely lowest value a, mode m, and highest value b. Unlike the uniform distribution, the triangular distribution can be skewed positively or negatively. A positively skewed curve has its high curvature to the left and a tail to the right; conversely, a negatively skewed curve has its curvature to the right and tail to the left (Holliday et al. 2008).
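The closed-form summaries in Equation 1.85(c) and (d) can be checked numerically; the following hedged Python sketch uses SciPy's triangular distribution, whose shape parameter c = (m − a)/(b − a) re-expresses the mode (the parameter values are illustrative):

```python
from scipy import stats

a, m, b = 1.0, 4.0, 10.0          # lower limit, mode, upper limit
c = (m - a) / (b - a)             # SciPy's shape parameter for the mode
X = stats.triang(c, loc=a, scale=b - a)

# Agreement with Eq. 1.85(c)-(d):
print(X.mean(), (a + b + m) / 3)                                      # 5.0, 5.0
print(X.var(), (a**2 + b**2 + m**2 - a*b - a*m - b*m) / 18)           # 3.5, 3.5
```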
Let a and b be real numbers. A continuous variable X is said to have a normal distribution if its probability density f(x) and cumulative probability F(x) functions are as given by Equation 1.86(a) and Equation 1.86(b). In both equations, the constants µ and σ (positive) represent the distribution's parameters, the mean and standard deviation, whose roles as moments appear in Equation 1.86(d) and Equation 1.86(e).

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty   (a)

F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\, du   (b)   [Eq. 1-86]

P(a \le X \le b) = \int_{a}^{b} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx, \quad \forall\, a, b \in \mathbb{R}   (c)

E[X] = \mu   (d)

var(X) = \sigma^2   (e)
Provided below in Figure 1.19, from the left (a) to the right (b), are illustrations of the graphs of f(x) and F(x). As shown, the normal density curve is symmetric around its mean µ, where it peaks; its median and mode are identical and coincide with the mean µ. In addition, any linear combination of normally distributed random variables is again a normally distributed variable (Tijms 2007, p. 144). For example, the standardized random variable Z of the normally distributed random variable X is again normally distributed:

Z = \frac{X - \mu}{\sigma}, \quad E[Z] = 0, \quad var(Z) = 1   [Eq. 1-87]

Z has a zero mean and unit variance. As defined, Z represents the distance of X from its mean μ in units of its standard deviation σ, as depicted in Figure 1.19(c) below. As Montgomery and Runger (2007, p. 113) said, using Z instead of X is “the key step to calculate a probability for an arbitrary normal random variable” such as X. For instance, Figure 1.19(d) depicts such a probability for a measured current in milliamps (mA).
Figure 1.19: Graphs of the Normal Probability Density and Distribution Functions
To find this probability, one first calculates the corresponding z-value (z = (x − μ)/σ) obtained through the standardization of x via Equation 1.87, then uses a lookup table (e.g., Johnson and Wichern 2019) to read the probability, represented by the unshaded area in Figure 1.19(d). Any plot of f(x), as in Figure 1.19(a), is known as the bell or Gaussian curve, named after Carl Friedrich Gauss (1777-1855), the famous mathematician credited with its discovery. The normal curve is a sort of natural law, whose formulation was made possible in part by the three well-known mathematical constants √2, π = 3.141…, and e = 2.718… (Tijms 2007). In addition, from its additive property, the central limit theorem was derived (see Section 2.3.2.3, page 153). The applications of the normal distribution are countless across different fields. Its popularity is undoubtedly due to its universality, first acknowledged by the Belgian statistician Adolphe Quetelet (1796-1874) while fitting large amounts of data collected from different areas of science. Owing to this universality, many in the eighteenth and nineteenth centuries considered it a God-given law (Tijms 2007). The following are a few applications of the Gaussian distribution: the study of the heights of randomly chosen persons, the annual rainfall in a specific area, and the time between occurrences of successive events. In construction scheduling, however, the normal distribution is not favored among practitioners because of the probability of durations being negative, as Williams (1992) noted.
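Since the z-value of Equation 1.87 reduces any normal probability to the standard normal CDF Φ, a small self-contained Python sketch (with illustrative parameters; the function name is hypothetical) can replace the lookup table:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Phi((x - mu)/sigma) via the error function, replacing a lookup table."""
    z = (x - mu) / sigma                    # standardization, Eq. 1-87
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# P(9 <= X <= 13) for X ~ N(10, 2^2), per Eq. 1.86(c):
mu, sigma = 10.0, 2.0
print(normal_cdf(13, mu, sigma) - normal_cdf(9, mu, sigma))   # ~0.625
```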
The beta family of distributions is traced to 1676 in a letter from Sir Isaac Newton to Henry Oldenburg. Because they can be fitted to nearly any data representing a system, beta distributions have been used extensively in statistical theory and practice for over a century in diverse fields (Nadarajah and Kotz 2007). The behavior of a continuous random variable X governed by a beta density with positive shape parameters α and β is described by the probability density f(x) and distribution F(x) functions given by Equation 1.88(a) and Equation 1.88(b), with mean and variance as in Equation 1.88(c) and Equation 1.88(d).
f(x) = \begin{cases} c\, x^{\alpha - 1}(1 - x)^{\beta - 1}, & \forall\, 0 \le x \le 1 \\ 0, & \forall\, x \notin [0, 1] \end{cases}   (a)

F(x) = \begin{cases} 0, & \forall\, x < 0 \\ \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \displaystyle\int_{0}^{x} u^{\alpha - 1}(1 - u)^{\beta - 1}\, du, & \forall\, 0 \le x \le 1 \\ 1, & \forall\, x > 1 \end{cases}   (b)   [Eq. 1-88]

\mu = E[X] = \frac{\alpha}{\alpha + \beta}   (c)

var(X) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}   (d)

where the parameter c is expressed in terms of the gamma function Γ, whose values are tabulated:

c = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}   (a)   [Eq. 1-89]

\Gamma(a) = \int_{0}^{\infty} e^{-y}\, y^{a - 1}\, dy, \quad \forall\, a > 0   (b)
Based on the shape parameters α and β, the graph of the beta density function can take a wide range of different shapes, as depicted in Figure 1.20 (Tijms 2007) for different values of the pair (α, β) in panels (a) and (b) of this figure. Note that in the extreme case α = β = 1, the beta PDF reduces to the constant f(x) = 1 on [0, 1], i.e., the uniform distribution.
Nevertheless, thanks to its versatility over a finite interval, the beta distribution models many
physical quantities (Soong 2004), including random proportions (Tijms 2007). Areas of
application include tolerance limits, quality control, and reliability (Soong 2004, p. 223). However,
construction engineering and management practitioners seem not to favor this distribution. Indeed,
this is due to its four degrees of freedom which render the beta distribution complex to understand
and its parameters challenging to determine, as Williams (1992) asserted. He said, “If the
distribution should be unlimited, it is not applicable, but where it can be limited, the beta fails on
its understandability.” Yet, Schexnayder et al. (2005) used the beta distribution to develop a PDF
for construction simulation modeling applications. Their proposed technique assumed the existence of a ratio relating the 75th percentile to the mode of a given set of duration data.
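As a numerical illustration of Equation 1.88(c) and (d), the following Python sketch (with illustrative shape parameters) compares SciPy's beta distribution moments against the closed forms:

```python
from scipy import stats

alpha, beta_ = 2.0, 5.0
X = stats.beta(alpha, beta_)

# Eq. 1.88(c)-(d):
print(X.mean(), alpha / (alpha + beta_))                                     # 0.2857...
print(X.var(), alpha * beta_ / ((alpha + beta_)**2 * (alpha + beta_ + 1)))   # 0.0255...
```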
Chi-Square Distribution (χ²)
As Montgomery and Runger (2007, p. 131) wrote, “the chi-squared distribution is a special case of the gamma distribution.” Although not covered in this section, the gamma probability distribution has seen wide use; in hydrologic engineering, for example, it has served to estimate flood quantiles in hydraulic design (Ashkar and Ouarda 1998). Like its parent distribution, the chi-square distribution is extensively used in various fields thanks to its usefulness in statistical inference and confidence intervals. Hence, including it as an example is worthwhile. To define the χ² distribution, Equation 1.90 provides the expression of its PDF with n degrees of freedom as well as the expressions of its mean μ and variance var(X) (Ramachandran and Tsokos 2014).
f(x) = \begin{cases} \dfrac{1}{2^{n/2}\, \Gamma(n/2)}\, x^{n/2 - 1} \exp(-x/2), & \forall\, 0 < x < \infty \\ 0, & \forall\, x \le 0 \end{cases}   (a)   [Eq. 1-90]

\mu = E[X] = n   (b)

var(X) = 2n   (c)
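A short Python sketch (the degrees of freedom are chosen arbitrarily) confirms the moments in Equation 1.90(b) and (c) and shows the distribution's typical inferential use, reading off a critical value for a right-tailed test:

```python
from scipy import stats

n = 8                          # degrees of freedom (illustrative)
X = stats.chi2(n)
print(X.mean(), X.var())       # Eq. 1.90(b)-(c): 8.0 and 16.0

# Critical value for a right-tailed test at alpha = 0.05:
print(X.ppf(0.95))             # ~15.507
```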
So far, the sections from Section 1.6.2 onward have focused solely on probability distributions of univariate random variables. The following sections discuss the probability distributions of multivariate random variables, whose applications abound in various probability- and statistics-related fields. As addressed in Section 1.6.3, multivariate random variables arise when an investigator studies a system characterized by more than one univariate random variable. Accordingly, one may view this section as a generalization of the previous one in the sense that concurrent realizations of the p random variables can be determined in terms of the joint CDF F and PDF f of the p variables. One may view the given set as a p × 1 random vector whose components X_1, ⋯, X_p each have their own marginal probability distributions f_j(x_j) and F_j(x_j), and likewise their own marginal moments.
For simplification, let the p random variables be real-valued. Accordingly, for each set of real numbers x_1, ⋯, x_p and using Equation 1.91(c), one can calculate ℙ as the probability of the p real numbers falling in a set ℱ in the p-dimensional Euclidean space. In addition, given the continuous function F, through applications of operational calculus involving differential operators (e.g., see Anderson 2003), one can compute the density function f as in Equation 1.91(b):

f(x_1, \cdots, x_p) = \frac{\partial^p F(x_1, \cdots, x_p)}{\partial x_1 \cdots \partial x_p}   (b)   [Eq. 1-91]
Moreover, if the random variables X_1, X_2, ⋯, X_p are statistically independent, then one may express Equation 1.91(a) and Equation 1.91(b) in terms of the marginal probability distributions:

F(x_1, \cdots, x_p) = F_1(x_1) \cdots F_p(x_p) = \mathbb{P}(X_1 \le x_1) \cdots \mathbb{P}(X_p \le x_p)   (a)   [Eq. 1-92]

f(x_1, \cdots, x_p) = f_1(x_1) \cdots f_p(x_p)   (b)
Whether the p variables X_1, X_2, ⋯, X_p are discrete or continuous, one may substitute each marginal distribution into Equation 1.74 or Equation 1.78 to find the joint probability density and distribution functions of the p variables as a collective. In the meantime, one may need to describe the behavior of any pair of variables X_i and X_j by means of their joint probability function, which captures the linear association between the variables in terms of their covariance σ_ij, as provided in Equation 1.93.
\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)], \quad \forall\, i, j = 1, \cdots, p

= \iint (x_i - \mu_i)(x_j - \mu_j)\, f_{ij}(x_i, x_j)\, dx_i\, dx_j, if X_i and X_j are continuous random variables with joint PDF f_{ij}(x_i, x_j)   (a)   [Eq. 1-93]

= \sum_{x_i} \sum_{x_j} (x_i - \mu_i)(x_j - \mu_j)\, p_{ij}(x_i, x_j), if X_i and X_j are discrete random variables with joint PMF p_{ij}(x_i, x_j)   (b)

\sigma_{ii} = E[(X_i - \mu_i)^2] = var(X_i), \quad \forall\, i = 1, \cdots, p
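Because Equation 1.93 underlies the sample covariance matrices used later in this study, a minimal Python sketch may help; the synthetic data below are illustrative and not the study's actual data pipeline:

```python
import numpy as np
rng = np.random.default_rng(1)

# n observations on p correlated variables (rows = samples, columns = variables).
n, p = 500, 3
A = rng.normal(size=(p, p))
X = rng.normal(size=(n, p)) @ A.T       # mixing induces cross-covariances

Xc = X - X.mean(axis=0)                 # center each column at its sample mean
S = Xc.T @ Xc / (n - 1)                 # sample estimates of sigma_ij
print(np.allclose(S, np.cov(X, rowvar=False)))   # True: matches NumPy's estimator
```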
1.7.1 Preface
To understand certain social or physical phenomena, a researcher would often collect data on more
than one attribute. In probability and statistics, as in any other field of application, multivariate data originate when a researcher experiments to observe and record the values of numerous random variables on several subjects or units. Usually, the researcher performs an
investigation as part of a study intended to understand the relationships between the variables of
interest. Consequently, the researcher decides on the number of variables p and the number n of
observations to collect on each variable. For illustration, Table 1-6 below identifies n and p for
studies performed in five different fields and provides an example for each study area.
Hence, a systematic multivariate data compilation is crucial for successfully recording the data necessary for the investigation. It is then imperative to record each variable on each specific item or experimental unit (Johnson and Wichern 2019). For that reason, to display the n values observed on the p variables set forth for the study, the researcher would consider one of the approaches provided in Section 1.6.3 to tabulate the data. In addition, he would indicate the extent of the data, i.e., the values of n and p.
As seen in the above table, some of the n and p values can be quite large, consistent with the well-known contemporary fact that multivariate analysis deals with considerable amounts of data tabulated in matrices. Thus, in addition to the number of variables and sample sizes, a researcher would, as Johnson and Wichern (2019) indicated, choose an acceptable multivariate method according to the study's objective:
Sorting and grouping: variables or objects are grouped according to defined classification rules.
Investigation of the dependence among variables: the nature of the relationships between variables is examined.
Prediction: relationships between variables serve to predict the values of one or more variables.
Hypothesis construction and testing: specific statistical hypotheses, formulated in terms of population parameters, are constructed and tested.
Applicable to this study, the following rules are set regarding vectors and matrices used throughout the remaining sections of this manuscript. The rules include:
Rule 2: When a superscript T is added to a vector or matrix, it indicates its transpose (e.g., 𝑿^T).
Rule 4: Vectors or matrices at the sample or observation level will be denoted by lowercase letters.
Rule 6: The notation x_ij will be used to denote the observed value of the jth variable on the ith observation.
Making inferences about the underlying characteristics of a population, assumed large and with unknown parameters, based on a random sample drawn from it is called statistical inference (Norman and Streiner 2003, Upton and Cook 2014). In statistics, researchers often formulate a null hypothesis they presume to be false and then seek evidence to reject it. A researcher is required to define both a null hypothesis H₀ and an alternative hypothesis H₁ to employ this process, also known as null hypothesis significance testing (NHST). The researcher draws data from the population to compute the necessary test statistic, which serves to decide whether the observed evidence rejects H₀. In this process, the researcher sets the critical regions of rejection of H₀ in terms of the test significance level α.
Alternatively, if the gap between the observed and predicted values is statistically significant, the researcher rejects H₀, accepting the risk of committing a type I error. The decision is usually based on a p-value determined from the observed data (Rebba et al. 2006). Later in this section, one will notice, as Warner (2012) pointed out, that the NHST implies the researcher draws samples from a well-defined population so that the derived test statistics are reasonable estimates for the actual population in question. Ronald Fisher was the first to introduce the p-value concept in the 1920s as “an index of the evidence against the null hypothesis” (Vidgen and Yasseri 2016). As a guide to carrying out an NHST, the following development provides the necessary steps, defining all the terms introduced in this paragraph.
Although the null distribution is often unknown or assumed, the first step consists of constructing a statistical model that characterizes the distribution of the population being studied. This is necessary to ensure that chance or random processes alone would determine the outcomes of the required experiments. When the null hypothesis is true, the probability distribution of the test statistic (the single numerical quantity to which sample data are reduced so they can be used to test the hypothesis) is called the null distribution, or the distribution under the null hypothesis. In an NHST, the observed test statistics or results are compared against the distribution under the null hypothesis, and the probability of obtaining those results is thereby determined.
The alternative hypothesis is a contradictory assertion to the null hypothesis about the population of interest. Both alternatives are mutually exclusive and are used to make statistical conclusions on whether to reject or accept the null hypothesis. The hypothesis is said to be "null" because it often states a status quo belief referring to "no difference" or "no effect" resulting from a change or improvement in the population of interest. Conversely, the alternative hypothesis refers to a case of an actual difference or effect (Lindeburg 2011). For instance, after making changes to a course syllabus, an instructor may be interested in detecting the impact of the change in terms of the difference between students' scores.
Depending on its purpose, a test will be denoted as a two-tailed test, or simply a nondirectional test, if the researcher wishes to demonstrate a "no change" or "no effect" outcome (e.g., H₀: μ = 12 and H₁: μ ≠ 12 for no change in the students' scores). Otherwise, depending on the direction of the change, it will be denoted as a directional one-tailed test, which could be left-tailed (e.g., H₀: μ = 12 and H₁: μ < 12 for a decrease) or right-tailed (e.g., H₁: μ > 12 for an increase in the students' scores). Table 1-7 below, whose information derives from Warner (2012), summarizes the decision rules to reject or accept H₀ when testing the alternative hypothesis H₁.
As one may notice, the term “tail” indicates the extreme regions of sample distributions, where the
observed values lead to the rejection of 𝐻 . Accordingly, asymmetric distributions that possess a
single tail, such as the chi-squared distribution, are often used in one-tailed tests. However,
symmetric distributions, such as the normal distribution, are used for two-tailed tests.
In general, the results of experiments are seldom 100% certain, so when designing hypotheses for testing, researchers account for the risk of being wrong. Therefore, before conducting a hypothesis test, the researcher sets the threshold at which to either reject or accept the null hypothesis H₀. That threshold is known as the significance level α, or the “alpha risk or producer risk, as well as the probability of type I error” (Lindeburg 2011, p. 11-14). One commits a type I error by rejecting the null hypothesis when it is in fact true. Conversely, if the null hypothesis is actually false and the researcher fails to reject it, the researcher commits a type II error. However, the researcher makes the right decision with a confidence level of either (1 − α), when not rejecting H₀ while it is true, or (1 − β), when rejecting H₀ while it is false. The term 1 − β is known as the power of the test.
Table 1-8 summarizes all the scenarios described above, where the probability β of committing a type II error is determined by the distribution of the test statistic under the alternative hypothesis H₁ (Warner 2012).
A significance level of 0.05 (5%) or 0.01 (1%) is customary in hypothesis testing. For instance, when designing a one-tailed hypothesis test, selecting a 5% significance level implies that a rejection of the null hypothesis H₀ would have around a 5% chance of being wrong. In other words, the results would be correct at a confidence level of 95% and said to be significant (Lindeburg 2011).
This step consists of drawing a random sample from the population of interest and then calculating the sample's test statistic value. This value will vary between samples according to the distribution under the null hypothesis H₀. For example, in a two-tailed test where the observed data X are normally distributed, one can calculate the test statistic z using Equation 1.94:

z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}   [Eq. 1-94]

where µ and σ represent the population mean and standard deviation, n is the sample size, and X̄ is the sample mean.
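The mechanics of this step and the p-value comparison that follows can be condensed into a few lines; the following hedged Python sketch (the function name and sample values are illustrative) computes the z statistic of Equation 1.94 and its two-tailed p-value:

```python
from math import erf, sqrt
import numpy as np

def z_test_two_tailed(x, mu0, sigma):
    """Two-tailed one-sample z-test (Eq. 1-94) with known population sigma."""
    n = len(x)
    z = (np.mean(x) - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # p-value from Phi(|z|)
    return z, p

rng = np.random.default_rng(2)
sample = rng.normal(loc=12.4, scale=2.0, size=40)     # hypothetical scores
print(z_test_two_tailed(sample, mu0=12.0, sigma=2.0))
```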
Under the assumption that the null hypothesis H₀ holds, as in most cases, a p-value (or simply p) represents the probability ℙ of getting a test statistic value t deemed extreme compared to the mainstream of observed test results. In other words, a p-value serves to verify a statistical hypothesis by quantifying the statistical significance of the evidence, the evidence being represented by the actually observed value. Accordingly, given an observed value t of a test statistic T, one may define a p-value as the probability of realizing a value at least as “extreme” as t under the assumption that H₀ is true. Depending on the direction of the statistical hypothesis test, Equation 1.95(a) through Equation 1.95(c) give the mathematical expressions of p:

p = \mathbb{P}(T \ge t \mid H_0) \quad \text{(right-tailed test)}   (a)

p = \mathbb{P}(T \le t \mid H_0) \quad \text{(left-tailed test)}   (b)   [Eq. 1-95]

p = \mathbb{P}(|T| \ge |t| \mid H_0) \quad \text{(two-tailed test)}   (c)

Using Equation 1.91(a), one may express any of the three above expressions of p in terms of a CDF.
As expressed, the direct computation of a p-value may be complex; thanks to available statistical software and lookup tables, however, one can determine p-values quickly. Being a probability, a p-value takes values between zero and one, inclusive. The rule of thumb when interpreting p-values is that a very small p-value indicates an extreme observed event, one unlikely to occur under H₀. In other words, a very small p-value constitutes strong evidence against H₀, and the result is declared statistically significant whenever p falls below the significance level α chosen by the researcher.
Although p-values are necessary for testing hypotheses, some researchers (e.g., Hwang et al. 1992,
Kim and Bang 2016, Vidgen and Yasseri 2016) have argued that they have logical flaws leading
to misleading conclusions (Rebba et al. 2006). Accordingly, being aware of its deficiencies,
including misconceptions and limitations, is as crucial as using them. Among the misconceptions
that even an inexperienced researcher may be unaware of is that a p-value is not the probability
that H₀ is true or H₁ is false, or vice versa. Bayesian hypothesis testing deals with such inquiries, where, for instance, one seeks to assess, as Rebba et al. (2006, p. 169) wrote, “how much evidence is there for the null hypothesis or for the model being right.” This quest corresponds to determining the probability ℙ(H₀ | X) of the hypothesis H₀ given the observed data X, a quantity that a p-value does not measure.
In a one-sided test (resp. a two-sided test), one compares the p-value computed in step 5 against the significance level α (resp. α/2). Depending on the direction of the test, one then decides to either accept or reject H₀. The four decision rules provided in Table 1-7 may serve as a guide. Occasionally, one may fail to reject H₀; if so, that does not constitute evidence that H₀ is true, but merely that there is insufficient evidence to reject it. Be wary that hypothesis testing can be frustrating when no samples yield the desired result after many trials. For that reason, in addition to using p-values, researchers sometimes employ confidence intervals (CI) as p-value companions. As Kim and Bang (2016, p. 5) wrote, the two are “complementary in the sense that the p-value quantifies how ‘significant’ the association/difference is while the CI quantifies the” magnitude of that association or difference.
The sections that follow provide an overview of each chapter in this dissertation.
The introduction chapter is required for any research study since it lays the groundwork for the
task. In other words, the introductory chapter should serve as a repository for the background data
needed to comprehend the development of more advanced concepts in later chapters. The reason
is that this research study allows for the adaptation and implementation of concepts from
disciplines other than construction and engineering—the background resources for this research
study address concepts from applied mathematics and probability. Furthermore, the chapter covers
random vectors and matrices as inputs to multivariate techniques such as principal component
analysis, which will be explored in the third chapter. Because principal component analysis (PCA)
relies on hypothesis formulation to make inferences, the chapter also covers descriptive and inferential statistics. As a result, the requirements for probability distributions and random variables are covered in this chapter.
Furthermore, it provides context for multivariate data and their analysis. Finally, the chapter finishes with an overview of the chapters to follow. However, before its framework can be used to develop new methods, the notion of project scheduling and its accompanying planning and management techniques must be introduced.
The topic of this chapter is ambitious because, if completed, it will pave the way for new methods underutilized in project scheduling and management, though the subject is also daunting in and of itself. Nonetheless, this topic was motivated by a Czech physicist who discovered a typical pattern after charting thousands of collected bus departure times at a Cuernavaca bus stop and finding the bus timings to share the same behavior pattern (Wolchover 2014). This characteristic was nicknamed
"universality" by scientists, and it is frequently observed when random matrix theory (RMT) is
used to investigate the behaviors of complex systems. As a result, the goal is to adapt and use the
notion that has proven successful in random matrix theory applications to construction project
networks. This is because the project network schedule comprises nodes representing activities
linked by precedence connections. The interactions between the many parties involved in
developing a project allow it to fall into the category of complex systems. Modern mathematicians
have used RMT to simulate complex systems and analyze their behavior by employing the
eigenvalues of covariance matrices that describe a system. Because of these parallels, the new
universal law based on the Tracy-Widom distribution is expected to reveal insights into the
behavior of project networks. The upcoming literature review should aid in learning more about
the topic and establishing parallels between project network timelines and fields of application for
this new universal law. This will assist in its application to project networks.
The assumption that everything in nature, including things and persons, is governed by laws leads
one to believe there must be a universal law governing project network schedules. If this universal
law exists, it may aid project planning and management. Because the activities that comprise the
project contribute to defining the total project cost and time, this universal law or method of project
scheduling may be valuable to project managers, engineers, and the academic community. PCA
might be applied to project network schedules if the universality employed in RMT to describe the
behavior of complex systems could also be used to describe and model project network schedules. All of these subjects are expected to be covered in Chapter 3, which will then use any significant discoveries to design procedures for applying PCA, a powerful multivariate statistical analysis tool used for data reduction. Data reduction is typically performed by analyzing the principal components of specific matrices to select only a few variables describing a complex system.
The research organization described in the preceding section brings this chapter to a close.
Nevertheless, it has accomplished its intended goal of providing background information for
further use in the subsequent chapters and better preparing readers for the challenges in the next
section.
Adeli, H., and Karim, A. (2001). Construction scheduling, cost optimization and
management. CRC Press.
Agrawal, M. K., Elmaghraby, S. E., and Herroelen, W. S. (1996). "DAGEN: A generator of
testsets for project activity nets." Eur.J.Oper.Res., 90(2), 376-382.
Anderson, T. W. (2003). An introduction to multivariate statistical analysis. Wiley-
Interscience, Hoboken, N.J.
Ashkar, F., and Ouarda, T. B. (1998). "Approximate confidence intervals for quantiles of gamma
and generalized gamma distributions." J.Hydrol.Eng., 3(1), 43-51.
Bagaya, O., and Song, J. (2016). "Empirical study of factors influencing schedule delays of
public construction projects in Burkina Faso." J.Manage.Eng., 32(5), 05016014.
Baik, J., Borodin, A., Deift, P., and Suidan, T. (2006). "A model for the bus system in
Cuernavaca (Mexico)." Journal of Physics A: Mathematical and General, 39(28), 8965.
BD+C Staff. (2018). "LIVE STREAMING: Watch select Accelerate Live! talks today." Building
Design & Construction.
Bejan, A. (2005). "Largest eigenvalues and sample covariance matrices. Tracy-Widom and
Painlevé II: computational aspects and realization in S-plus with applications." Preprint:
http://www.vitrum.md/andrew/MScWrwck/TWinSplus.pdf.
Holliday, B., Cuevas, G. J., Luchin, B., Carter, J. A., Marks, D., and Day, R.
(2008). Algebra 2. Glencoe/McGraw-Hill.
Castellana, M., and Zarinelli, E. (2011). "Role of Tracy-Widom distribution in finite-size
fluctuations of the critical temperature of the Sherrington-Kirkpatrick spin glass." Physical
Review B, 84(14).
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). "Graphical Methods
for Data Analysis."
Cleveland, W. S. (1993). Visualizing data. AT & T Bell Laboratories, Murray Hill, New Jersey.
Cleveland, W. S., and McGill, R. (1987). "Graphical Perception: The Visual Decoding of
Quantitative Information on Graphical Displays of Data." Journal of the Royal Statistical
Society.Series A.General, 150(3), 192-229.
Dawkins, P. (2007). "Linear Algebra." (accessed Sept. 2, 2017).
Abstract
A methodology based on Random Matrix Theory (RMT) has been proposed to aid in the
investigation of project network schedule underlying behavior. The methodology is based on three
assumptions. The first assumption demands that the probabilistic activity durations have an
identical triangular distribution with known parameters. A repetitive joint sampling of activity
durations serves to create a sample data matrix 𝑿 using one of the 13 strategies identified as
suitable for translating a project network of size p into a random matrix utilizing its dependency
structure matrix. This joint sampling distribution was unknown. Yet, it served to figuratively draw
each of the n rows of 𝑿 . The second assumption is that the Tracy-Widom (TW1) limit law is
the natural distribution of each row of 𝑿 's sampling. The interactions between various parties
involved in managing and constructing projects, defined in pairwise linked activities, cause a
project network schedule to fall under complex systems marked by a phase transition with a tipping
point supporting this assumption. In addition, the striking similarities discovered in this study
between the fields of application of the TW1 distribution and those of project scheduling support
this assumption. The last assumption is that a project network schedule with sufficient correlation
in its structure, like that of complex systems, can be investigated within the framework of RMT.
This assumption enabled the application of RMT’s universal results, such as those associated with the TW distributions, to the study of project schedules’ underlying behavior. In RMT, the appropriately scaled eigenvalues of sample covariance matrices serve as test statistics for such a study.
As a result, a carefully engineered sample covariance matrix 𝑺 was developed, and a couple of standardization approaches (Norm I and Norm II) for its eigenvalues were identified. Both standardization
approaches relate to the universality of the TW limit law, which many authors have extended (e.g.,
Soshnikov 2002, Péché 2008) to a broad class of matrices that are not necessarily Gaussian under
relaxed assumptions. Although some of these assumptions have been eased, others must still be
met. Among these extra requirements, the formulation of 𝑺 was chosen. Its formulation is based on n samples of p centered and scaled early finish (EF) times of the project network activities in question, whose distributional assumption is to be tested at a significance level α, with values such as 5% and 10%.
MATLAB and Visual Basic facilitated the various data preparation steps and the empirical study of a handful of project networks. Thirty-five networks of diverse sizes and complexity were chosen from the study's 2040 benchmark networks obtained from the Project Scheduling Problem Library (PSPLIB). The project networks ranged from 30 to 120 activities, with restrictiveness (RT) values
indicating their complexity. Project network schedules were computed through Kelley's (1961) forward and backward passes to determine activities' early finish (EF) times. The resulting largest eigenvalue of 𝑺 revealed three exciting conclusions. First, the scatterplot of 100 pairs of the normalized largest eigenvalue l₁ of 𝑺 and the sample size n revealed a distinct and consistent trend. The pattern is a concave upward curve, like the stress-strain curve used in materials science and engineering. The curve steepens to the left and flattens to the right as n increases. Surprisingly, networks of varying sizes and complexity showed the same pattern.
The deviations Δ_μ of the empirical means of l₁ from the mean μ_TW of the TW distribution were determined using the same 100 outputs. They enabled the graphing of scatterplots of sample size n against Δ_μ. The resulting pattern highlighted the association between n and the l₁ generated from probabilistic project schedules. The deviations Δ_var between the variances of l₁ and var_TW were calculated similarly. The resulting pattern, consistent across networks, helped determine an optimum sample size n that would maximize variance in a project network schedule's sampled data. This sample size corresponds to the abscissa at the mean deviation curve's intersection with the horizontal axis (n-axis). One may view this ideal sample size as analogous to a stress curve's yield point or the required pixel count for high-quality printing. The optimum sample size n_opt was found to be related to the network size p but not its RT value. Moreover, any network that meets the study's criteria has an n_opt. Also, α is not required in the expression of 𝑺, so a project network's n_opt can be determined in advance.
Subsequently, the derived n_opt was used in a series of 1000 simulations to validate the distributional assumption on activity durations. The test statistics for the K-S test were the normalized first- through fourth-largest eigenvalues l₁, l₂, l₃, and l₄ of the matrix 𝑺. According to a comparison of results from both normalization approaches, Norm II (Baik et al. 1999, Johansson 1998) may be better suited to studying project network scheduling behavior than Norm I (Johnstone 2001). Under Norm I, 18 of the 35 project networks validated the null hypothesis when using two of the normalized eigenvalues of their matrices 𝑺. Norm II supported the null hypothesis for 19 of the 21 networks evaluated when using two of the normalized eigenvalues of the matrices 𝑺. The null hypothesis states that the Tracy-Widom distribution of order 1 is the natural limiting distribution of the joint sampling of project network activity durations. This conclusion is significant, and perhaps expected with the use of the CPM, since Baik et al. (1999) introduced Norm II while studying the length of the longest increasing subsequence of random permutations, which is governed by a TW limit law.
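For intuition, the mechanics of such a K-S test can be sketched in Python. SciPy ships no Tracy-Widom law, so in practice a tabulated TW1 CDF would be supplied as the callable; a normal CDF with TW1's approximate mean (≈ −1.21) and variance (≈ 1.61) stands in below purely to show the mechanics, and the data are synthetic stand-ins rather than the study's eigenvalues:

```python
import numpy as np
from scipy import stats

def candidate_cdf(x):
    # Placeholder for a tabulated TW1 CDF; a normal CDF stands in here.
    return stats.norm.cdf(x, loc=-1.21, scale=np.sqrt(1.61))

rng = np.random.default_rng(4)
l1_samples = rng.normal(loc=-1.21, scale=np.sqrt(1.61), size=1000)  # stand-in data

stat, p_value = stats.kstest(l1_samples, candidate_cdf)
print(stat, p_value)   # the null is not rejected at alpha = 0.05 when p > 0.05
```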
Nevertheless, the empirical and theoretical distribution plots' agreement was displayed and
compared using Q-Q plots and histograms. The graphs corroborated the limiting distributional
assumption. Furthermore, the networks' Q-Q plots showed that appropriately rescaling and recentering the mth largest eigenvalue of the matrix 𝑺 would increase their test performance. In sum, the extensive empirical investigation validated evidence of a covariance structure 𝑺 in project network schedules. In addition, the study discovered a universal pattern for project network schedules, which assisted in establishing the Tracy-Widom distribution of order 1 as the natural limiting joint distribution of project activity durations, together with the optimum sample size establishing this distribution.
2.1 Introduction
The construction of engineering projects is a complex task that requires systematic planning to
complete projects on time, within allocated budgets, and without delays. For project managers, identifying the origin of delays is crucial to assigning their cost to the appropriate party. Various authors have performed analyses to identify their causes and have proposed solutions that mitigate their effects. Scheduling techniques employ activity durations that are determined deterministically, probabilistically, or by simulation. Three techniques have historically dominated deterministic scheduling: bar charts, introduced in the 1700s (Priestley 1764); the Critical Path Method (CPM), introduced in the late 1950s by Kelley and Walker (1959, 1989); and the Linear Scheduling Method, whose earlier versions derived from line-of-balance (Lumsden 1968) and flowline (Nezval 1958).
The CPM uses a three-step technique: a forward pass, a backward pass, and a comparison. The last step determines activity floats, i.e., by how much activities' start dates can be delayed without impacting the project's total duration; critical activities have zero float. "Its seductive simplicity has earned CPM much criticism, e.g., not facilitating planning and using unrealistic fixed durations" (Jaafari 1984).
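Since the forward pass directly produces the early finish (EF) times sampled throughout this study, a minimal sketch may help; the following Python fragment (not the author's implementation; the activity names and durations are hypothetical) computes EF times over an acyclic precedence network:

```python
def forward_pass(durations, predecessors):
    """CPM forward pass: EF(j) = max over predecessors i of EF(i), plus d(j).
    durations: {activity: duration}; predecessors: {activity: [preds]} (acyclic)."""
    ef = {}
    def visit(j):
        if j not in ef:
            es = max((visit(i) for i in predecessors.get(j, [])), default=0.0)
            ef[j] = es + durations[j]   # early start plus own duration
        return ef[j]
    for j in durations:
        visit(j)
    return ef

durations = {"A": 3, "B": 2, "C": 4, "D": 1}
predecessors = {"C": ["A", "B"], "D": ["C"]}
print(forward_pass(durations, predecessors))   # C finishes at 7, D at 8
```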
Linear Scheduling Method (LSM, Johnston 1981) or linear scheduling (Lucko et al. 2014) models
activities as progress curves on a two-dimensional space (work against time). Not widely used
worldwide (Seppänen 2009, Kemmer 2006), it has the weakness of stressing flow over the network
structure. Due to the criticisms of using fixed durations in deterministic scheduling, the Program Evaluation and Review Technique (PERT) was introduced in 1958 (Malcolm et al. 1959) as a probabilistic alternative. PERT is based on three parallel CPM estimates and is designed to handle variability in activity durations through an assumed beta distribution, which prompted criticisms immediately after its introduction and thereafter (van Slyke 1963). Simulation approaches to determining inputs to construction schedules have many advantages, including creating many observations, such as in a Monte Carlo Simulation (MCS). Its applications are countless across scientific areas, such as risk analysis in project scheduling (Chantaravarapan et al. 2004). However, although an MCS is helpful for statistical analysis before an eigenvalue analysis, no system-level patterns can be derived from an MCS alone. As Lee et al. (2013) stated, simulation quality depends on the probability distribution used for sampling durations. If actual data are not available, uncertainty must be assessed (Schexnayder et al. 2005). Construction schedules are no exception.
As a result, the primary goal of this study is to determine whether and how the recently discovered Tracy-Widom universal law applies to construction project network schedules. A Quanta Magazine article (Wolchover 2014) exemplifies its attraction: "At the Far Ends of a
New Universal Pattern: A potent theory has developed describing a mystery statistical law that
develops throughout science and mathematics." The Tracy-Widom distribution, introduced in the
field of random matrix theory (RMT) by C. Tracy and H. Widom (1993, 1994), is a continuous
mathematical function that characterizes the behavior of a system of independent but randomly
interacting entities that approaches a tipping point. In other words, it emerges when a phase
transition occurs between the two phases of a system's weakly linked versus strongly coupled
components (Majumdar and Schehr 2014). This behavior progresses from 'insufficient' to 'sufficient' coupling among the system's components. The steeper left tail of the lopsided curve, proportional to a power of the number of system components N, describes the system's energy during the strong-coupling phase. On the other hand, the right side illustrates the system's energy in the weak-coupling phase, based solely on N. It assures
continuity during phase transitions. As a result, it represents a second-order phase transition as per
the physicist Ehrenfest's (1880-1933) classification (Kadanoff 2006; Sauer 2017). This phase
transition type is characterized by correlation length and power-law decay of correlation near the
criticality zone. Nevertheless, since its discovery, the distribution has appeared in various systems.
Can the interdependency structure established by the numerous links between project activities in
a project network schedule represent evidence of a covariance structure that may aid in the study
of their behavior as systems characterized by phase transitions with a tipping point? The question
is whether such a structure of population covariance exists. If such a structure exists, it may aid in
examining a project network schedule with linked activities that can shift from stable to unstable
(typical of delay) in the manner of a Tracy-Widom universality class system. Furthermore, around
the tipping point, the Tracy-Widom distribution may reflect a natural distribution for project
activity durations, which may prevent delays. Can the universality found in the minuscule margins
of the TW distribution characterizing the behaviors of systems with second-order phase transitions
from strong-coupling to weak-coupling phase help explain the behavior of construction project
network schedules with analogous phase transitions? As a result, the primary purpose of this chapter is to use the Tracy-Widom distribution's universality (with regard to its phase transition) to investigate the behavior of such schedules. Thus, finding the natural distribution of project networks' durations in a critical zone would help prevent delays.
Because of its pioneering nature, the following actions are planned to complete the study for this
chapter. This chapter begins with a survey of the literature to provide a current understanding of
the Tracy-Widom distributions. It does so by analyzing and evaluating supporting results and
contributions. The chapter will then present a research strategy for reaching the chapter's primary
goal. The study objectives, data collection and preparation, and any model development to execute
any pertinent approach from the literature evaluation will all be part of the research methodology.
Subsequently, the research findings and conclusions will be provided. Finally, this chapter will
formulate the research contributions to the body of knowledge and make recommendations for future research.
The advancement of technology has brought the world an enormous volume of complex data generated by many devices in various disciplines, including but not limited to genomics,
communications, sciences, and economics. In most cases, researchers seek to study and plot such
data in a nominal high-dimensional coordinate system going beyond the limit of classical
multivariate analysis. Hence, random matrix theory (RMT) has become a prominent framework
allowing researchers to probe conceptual questions associated with the multivariate analysis of
high-dimensional data. According to Ergun (2007), statisticians first introduced the RMT in the
early 1900s. However, it saw significant progress in the hands of physicists such as Wigner, Mehta,
Gaudin, and Dyson throughout the 1960s. The RMT is a method used to study the statistical
behavior of large complex systems through the definition of an ensemble necessary to account for
all possible laws of interaction occurring within the system (Ergun 2007). Topics that have
emerged from the researchers’ inquiries in various fields of mathematics and physics include
universality, which significantly impacts the techniques developed to solve problems associated
with high-dimensional data. The term universality perhaps appears to originate from statistical
mechanics, where one can describe critical phenomena and phase transitions of the second order
in a small region of critical or transition points independently of all except a few properties of the
model (e.g., size, symmetry of interaction). Typically, the same simplified description serves to express phase transitions in entirely different materials, such as a liquid-gas transition near the critical point or a ferromagnet as it approaches the Curie temperature (Pastur and Shcherbina 2011). As Ergun (2007, p. 3) stated,
“The reason for such universality is not so clear but may be an outcome of a law of large numbers” acting beneath the structure of systems exhibiting universality. Thanks to universality, a random matrix may serve to describe the behavior of a system, its spectrum acting like a signature assuring evidence of the system's underlying structure.
Areas of applications of the RMT include statistical physics of disordered systems (Stein 2004),
finance (Ledoit 2004, Frahm 2004), telecommunication networks (Taherpour et al. 2010),
electrical and computer engineering (Liao et al. 2020), and number theory (Jakobson et al. 1999).
One usually solves problems in these fields by applying multivariate analysis techniques such as
PCA, hypothesis testing, regression analysis, and covariance estimation. In most cases, resolving
these problems involves studying the bulk or edge spectrum of eigenvalues of random matrices
used to represent the investigated systems. While the bulk spectrum focuses on exploring the local
properties of eigenvalues or the interactions between neighboring eigenvalues (Péché 2008), the
edge spectrum is concerned with studying the behavior of extreme eigenvalues. However, the study of the largest eigenvalue λ_max of large random matrices has drawn particular attention among scholars due to its suitability for addressing questions associated with its fluctuations.
The questions of extreme value statistics “arise naturally in the statistical physics of complex and disordered systems” (Majumdar and Schehr 2014, p. 2). The work of May (1972) on probing the stability of large complex ecosystems is the first known direct application of the statistics of λ_max (Majumdar and Schehr 2014). Through his work, May (1972) found that these systems, connected
at random, are stable until they reach some critical level of connectance—the proportion of links
per species—as it increases to become unstable abruptly. By their behaviors, construction project
network schedules are like complex systems. The intricacy and pairwise links (thousands in large
projects) between activities making up a project network schedule may help explain this similarity.
Examples of such complex systems include the New York City subway system (Jagannath and Trogdon 2017), bus departure times in Cuernavaca (Krbálek and Seba 2000), and the stability of large ecosystems (May 1972). The parallels between such systems and project networks will allow this research study to adapt and adopt their techniques for studying the behaviors of project network schedules. Among those techniques is the hypothesis testing technique widely
used to probe evidence of a covariance structure at the edge spectrum. Similarly, these techniques
may be suitable for probabilistic construction project schedules deemed complex. Proving that project schedules exhibit universality would mean that they could be modeled, and their behaviors studied, like many other real-world systems. To devise an approach that will help establish universality in project networks, it is necessary first to provide sufficient background literature on the topic of RMT as it applies to the study of the behavior of complex systems. Thus, the following sections successively introduce random matrix ensembles, random matrix formulation, bulk spectrum behaviors, edge spectrum behaviors, some applications of the universality theorems, and numerical aspects.
In probability theory and statistics, when attempting to use random matrices (matrices filled with random numbers) to model complex systems or random processes, various ensembles come into play depending on the type of investigation. Usually, an investigator would first define the random variables that approximate the random process of interest. Next, the investigator would use these random variables to construct an n × n random matrix. The resulting matrices would then form a group of random matrices known as an ensemble. Following these three steps creates a random matrix model (RMM).
As defined in the literature, an RMM represents a triple (Ω, ℱ, ℙ), where Ω is a set, ensemble, or group of all possible matrices of interest, ℱ is a σ-algebra of subsets of Ω closed under complements and countable unions, and ℙ is the probability measure defined on Ω. When the group Ω is compact or locally compact (endowed with a topological space, e.g., a Euclidean space, that is closed and bounded), the Haar measure ℙ is often used in mathematical analysis to assign, in terms of probabilities, an invariant (see Section 2.3.1.2) volume to subsets of Ω. Meanwhile, the literature on random matrix ensembles suggests the existence of multiple ensembles. However, the following sections only introduce a few notable ones: the Wigner ensemble, the Gaussian β-Ensembles, and the Wishart ensemble. For other ensembles, one may consult the literature.
The Wigner random matrix ensemble is one of the most celebrated ensembles in RMT. This ensemble was named after the physicist Eugene Wigner, who first introduced random matrices in nuclear physics in the early 1950s while studying the statistics of energy levels of systems of particles in terms of the eigenvalues of matrices randomized from some ensembles (Péché 2008). By
considering the lines representing the spectrum of a heavy atom nucleus, Wigner hypothesized that
the spacings between successive wavelengths in the spectrum of nuclei of heavy atoms should be
analogous to the spacings between the eigenvalues of a random matrix and should depend solely on
the symmetry class of the complex nuclei (Mehta 2004). As defined in the literature, complex nuclei consist of
many strongly interacting particles, making their spectral properties nearly impossible to describe
using rudimentary computations. Remarkably, Wigner's approach led to the discovery that
revolutionized the study of the spectral properties of complex nuclei. This discovery allowed one
to identify the local statistical behavior of energy levels of charged particles, considered in a simple
sequence.
Except for the simple sequence restriction requiring all levels to have the same spin, parity, or any
other strictly conserved quantities resulting from the symmetry of the system, there were no other
restrictions on the random elements of H sampled from a known distribution (e.g., Gaussian)
(Mehta 2004). Following the attempts by various researchers (e.g., Porter and Rosenzweig, Dyson,
Moore) to prove the correctness of Wigner's hypothesis, the following paragraph summarizes the
generally agreed-upon details and requirements on H's entries found in the literature (e.g., Péché 2008). While
empirical analysis (e.g., Monte Carlo) validated Wigner's hypothesis, significant findings resulted
from theoretical studies conducted for the same purpose. For instance, as Mehta (2004, p. 5) wrote,
"From a group theoretical analysis Dyson found that an irreducible ensemble of matrices, invariant
under a symmetry group G, necessary belongs to one of the three classes, named by him,
orthogonal, unitary, and symplectic.”). Section 2.3.1.2 will expand more on these three classes of
matrices.
Nevertheless, the elements of the Wigner ensemble are n × n complex Hermitian (resp. real
symmetric) matrices 𝑯 = (H_ij) whose entries above the diagonal are i.i.d. complex (resp. real)
random variables with a centered probability distribution 𝜇 (resp. µ) on ℂ (resp. ℝ) that has a
finite variance σ². Additionally, their diagonal entries H_ii are i.i.d. real random variables
independent of the off-diagonal entries. Because 𝑯 is complex Hermitian (resp. real symmetric),
its eigenvalues λ_i exist and are all real numbers. Therefore, by defining suitable functions (e.g.,
level spacings) or some appropriate quantity, one may determine the statistical behavior of its
spectrum.
The distribution law of level spacings is known as the Wigner Surmise. One may refer to Mehta
(2004) or Basor et al. (1992) for its formulation and derivation in terms of energy levels E as well
as its applications (e.g., the zeros of the Riemann zeta function). In thermodynamics, the same
joint density of eigenvalues can be read as the Boltzmann weight, at an equilibrium temperature T,
of the potential energy of a Coulomb gas of pairwise interacting charges, where λ_i represents the
position of the ith charge on a line interacting via a two-dimensional Coulomb force. That
expression consists of a couple of terms. While one term, accounting for pairwise repulsions, tends
to outspread the charges, the other term, accounting for the external harmonic potential, tends to
aggregate charges around the origin. Thus, both terms represent the competitions taking place in
the system of charges, which eventually stabilize into an equilibrium configuration on average
(Mehta 2004, Majumdar 2014). In that equilibrium configuration, the average joint density ρ of
the charges can be expressed in terms of angular brackets, representing an ensemble average, as
in Equation 2.1:

\rho(\lambda) = \frac{1}{n} \left\langle \sum_{i=1}^{n} \delta(\lambda - \lambda_i) \right\rangle   [Eq. 2-1]
As Mehta and Gaudin (1960) pointed out, the interactions are so numerous and complex in heavy
nuclei that the use of the average density in explaining average properties, like the level density
and the distribution of level spacings, is common, enabling the application of statistical theories.
The rapid development in RMT, leading to the so-called "new kind of statistical mechanics,"
allowed a researcher to study a complex system as if it were a black box: one would ignore the
knowledge of the system's inherent characteristics and substitute it with an ensemble of systems
in which all possible laws of interaction were equally probable (Ergun 2007, Mehta 2004). To
institute the mathematical rigor ensuring equal probability of the laws of interaction for the new
statistical theories of random matrices, Dyson derived three universal classes for all random
matrices. These three classes of ensembles, ushered in by Dyson in the attempt to unify all the
statistical theories of random matrix mechanics, are the orthogonal ensemble (OE), unitary ensemble
(UE), and symplectic ensemble (SE). When the entries of the Wigner matrices H are Gaussian
distributed, one refers respectively to the three classes as GOE, GUE, and GSE (Ergun 2007).
Because of the great interest in the eigenvalue distributions of large random matrices H,
depending on the probability measure ℙ used to sample these matrices' entries, "one can extract
different sub-ensembles from the ensemble of Wigner matrices." (Bejan 2005, p. 8). Among them
are the Gaussian (e.g., see Mehta 2004), circular (e.g., see Mehta 2004), deformed (e.g., see
Johansson 2007, Péché 2008), disordered (e.g., see Bohigas et al. 2009), and Ginibre ensembles
(Collins et al. 2014, Mays 2013, Ginibre 1965). Although Mehta (2004, p.4) observed that the
Gaussian ensembles are equivalent to the circular ensembles for large orders, this section focuses
only on the classical Gaussian ensembles, also referred to as Gaussian β-ensembles, not only
because these ensembles are the most common ones but also because they are relevant to the scope
of the current work. The Gaussian β-ensembles are probability spaces on n-tuples of eigenvalues
𝒍 = (l_1, ⋯, l_n), with joint density functions ℙ_{nβ} (Dieng 2005, Dieng and Tracy 2011) defined
as in Equation 2.2, derived from Mehta (2004, p.58) and provided below:

\mathbb{P}_{n\beta}(l_1,\cdots,l_n) = \mathbb{P}_{n\beta}(\boldsymbol{l}) = C_{n\beta}\,\exp\!\left(-\beta\left[\frac{1}{2}\sum_{j=1}^{n} l_j^{2} - \sum_{1\le j<k\le n}\ln\lvert l_j - l_k\rvert\right]\right)   [Eq. 2-2]

with normalization constant

C_{n\beta}^{-1} = (2\pi)^{n/2}\,\beta^{-n/2-\beta n(n-1)/4}\prod_{j=1}^{n}\frac{\Gamma(1+\beta j/2)}{\Gamma(1+\beta/2)}
and the l_1, ⋯, l_n are eigenvalues of matrices randomly selected from the corresponding family
of distributions. Because the expression of the law ℙ_{nβ} depends on the same β introduced in
Section 2.3.1.1, this distribution law has a physical meaning. Depending on the argument of the
weight function exp in Equation 2.2, the normalization constant C_{nβ} may differ from one author
to another. For other formulations, see Dumaz and Virág (2013) and Borot et al. (2011). When
β = 1, 2, 4, this family of β-ensembles corresponds to the GOE, Gaussian Unitary Ensemble (GUE),
and Gaussian Symplectic Ensemble (GSE), respectively. For other specific values of β, see Mehta
(2004) and Its and Prokhorov (2020). Additionally, for further interest in these ensembles, one
may refer to the works of Forrester (2005), Mehta (2004), Ergun (2007), and Tracy and Widom
(1994, 1996).
Nevertheless, these three classical compact groups are known as invariant ensembles since one
can explicitly compute the joint densities of their distribution laws ℙ with regard to the Haar
measure of the groups. Moreover, it is worth adding that invariant ensembles, as Pastur and
Shcherbina (2011, p.3) stressed, "play an important role in [RMT] and its applications since the
early 1960s when Dyson introduced them…to determine the basic Gaussian ensembles [GOE,
GUE, and GSE] and their circular analogs (COE, CUE, and CSE…)." The following are brief
introductory paragraphs on each of the three β-ensembles: GOE, GUE, and GSE.
The GOE, denoted O_n, is a probability space on the set Ω = ℋ_n of n × n real symmetric Wigner
random matrices 𝑯 = (H_ij) whose Gaussian entries (up to a choice of mean and variance σ²) satisfy:
(i) the real entries H_ii, with 1 ≤ i ≤ n, on the diagonal of H are i.i.d. 𝒩(0, σ²), and the entries H_ij,
with 1 ≤ i < j ≤ n, above the diagonal are i.i.d. 𝒩(0, σ²/2);
(ii) the probability ℙ(𝑯)d𝑯 that a system of O_n will belong to the infinitesimal volume d𝑯 is
invariant under orthogonal transformations 𝑯 → 𝑶ᵀ𝑯𝑶 of H. That is, the density of the
probability measure ℙ(𝑯) is invariant with respect to all orthogonal matrices 𝑶 (which are
invertible, i.e., 𝑶⁻¹ = 𝑶ᵀ).
The GUE, denoted U_n, is a probability space on the set Ω of n × n complex Hermitian Wigner
random matrices 𝑯 whose Gaussian entries satisfy:
(i) the real ℜ(H_ij) and imaginary ℑ(H_ij) parts of each entry H_ij above the diagonal of H are
real i.i.d. 𝒩(0, σ²); for reference, see Dieng (2005, p.128) and Bejan (2005, p. 9);
(ii) the probability ℙ(𝑯)d𝑯 that a system of U_n will belong to the infinitesimal volume d𝑯 is
invariant under unitary transformations. That is, the density of the probability measure ℙ(𝑯)
is invariant for all unitary transformations 𝑯 → 𝑼*𝑯𝑼.
The GSE, denoted Sp(n) or Sp_n, is the probability space on the set Ω of 2n × 2n self-dual
Hermitian Wigner-like random matrices Q. These matrices are self-dual and written as an n × n
quaternionic matrix 𝑯 = (H_jk) whose quaternion Gaussian elements H_jk = \begin{pmatrix} a & b \\ c & d \end{pmatrix}
can be expanded on the quaternion basis as in Equation 2.3:

q = \frac{a+d}{2}\,e_0 + \frac{a-d}{2i}\,e_1 + \frac{b-c}{2}\,e_2 + \frac{b+c}{2i}\,e_3   [Eq. 2-3]

where e_0, e_1, e_2, e_3 form the basis of an algebra of dimension four over ℝ satisfying the
following conditions: e_0² = e_0; e_i e_0 = e_0 e_i = e_i; e_i² = −e_0; and e_i e_j = −e_j e_i = e_k,
where 1 ≤ i, j, k ≤ 3 and (i, j, k) is an even permutation of (1, 2, 3). Through elaborate reasoning
found in Dieng (2005, p. 129-136) or Mehta (2004), 𝑯 can be brought to the 2n × 2n block form
of Equation 2.4:

\boldsymbol{H} = \begin{pmatrix} \boldsymbol{A} & \boldsymbol{B} \\ -\bar{\boldsymbol{B}} & \bar{\boldsymbol{A}} \end{pmatrix}   [Eq. 2-4]

For more on the development leading to the expressions of Equation 2.3 and Equation 2.4, one
may refer to these same references. Nevertheless, up to a choice of mean and variance σ², the
following are the conditions on the GSE:
(i) the quaternion elements H_jk are centered Gaussian quaternion random variables, up to the
self-duality constraint;
(ii) the probability ℙ(𝑯)d𝑯 that a system of Sp_n will belong to the infinitesimal volume d𝑯 is
invariant under symplectic transformations. That is, the density of the probability measure
ℙ(𝑯) is invariant with respect to all symplectic transformations.
In all three of the above cases of the β-ensemble, the invariance restriction on ℙ(𝑯)d𝑯 requires

P(\boldsymbol{H}) = \exp\!\left(-a\,\mathrm{tr}(\boldsymbol{H}^{2}) + b\,\mathrm{tr}(\boldsymbol{H}) + c\right)

where a is a real and positive number, and b and c are real numbers.
And the volume restriction on ℙ(𝑯)d𝑯 requires, as Mehta (2004) wrote, d𝑯 to factor as follows:

d\boldsymbol{H} = \prod_{i \le j} dH_{ij}, for β = 1
d\boldsymbol{H} = \prod_{i} dH_{ii} \prod_{i<j} d\,\Re(H_{ij})\, d\,\Im(H_{ij}), for β = 2
d\boldsymbol{H} = \prod_{i} dH_{ii}^{(0)} \prod_{i<j} \prod_{k=0}^{3} dH_{ij}^{(k)}, for β = 4
The Wishart ensemble is named after its inceptor Wishart (1928), whose pioneering work laid
down the foundation of the theory of random matrices (Péché 2008, Paul and Aue 2014). In
classical RMT, to define a Wishart matrix, one needs first to specify a couple of sequences of
integers n and p, where n represents the sample size and p = p(n) the number of variables or
dimensions of the sample, selected in such a way that p → ∞ as n → ∞ and lim_{n→∞} p/n = γ ∈ (0, ∞).
Next, one needs to create a random matrix 𝑿, known as a sample data matrix, whose column
entries X_ij, j = 1, ⋯, p, are i.i.d. complex (or real) random variables with a centered probability
distribution 𝜇 (resp. µ) on ℂ (resp. ℝ) that has a finite variance σ². Then, one can form the
sample covariance matrix

\boldsymbol{S} = \frac{1}{n}\boldsymbol{X}^{*}\boldsymbol{X} \quad \left(\text{resp. } \boldsymbol{S} = \frac{1}{n}\boldsymbol{X}^{T}\boldsymbol{X}\right)   [Eq. 2-5]
𝑺 is known as a white and uncentered sample covariance matrix (see Frahm 2004, p.102 for an
explanation). With the true population covariance matrix 𝜮 of the samples assumed positive
definite, one can compute the more general form of 𝑺. This matrix, denoted by 𝑺(𝜮), is expressed as

\boldsymbol{S}(\boldsymbol{\Sigma}) = \frac{1}{n}\boldsymbol{\Sigma}^{1/2}\boldsymbol{X}^{*}\boldsymbol{X}\,\boldsymbol{\Sigma}^{1/2} \quad \left(\text{resp. } \boldsymbol{S}(\boldsymbol{\Sigma}) = \frac{1}{n}\boldsymbol{\Sigma}^{1/2}\boldsymbol{X}^{T}\boldsymbol{X}\,\boldsymbol{\Sigma}^{1/2}\right)   [Eq. 2-6]
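To make the construction concrete, the following is a minimal MATLAB sketch (an illustration under assumed values, not code from this study) that forms the white matrix 𝑺 of Equation 2.5 and, for a hypothetical positive definite 𝜮, the general matrix 𝑺(𝜮) of Equation 2.6 in the real case.

n = 200; p = 20;
X = randn(n, p);                     % centered real data with unit-variance entries
S = (X'*X)/n;                        % white, uncentered sample covariance (Eq. 2-5)
Sigma = 0.3*ones(p) + 0.7*eye(p);    % a hypothetical positive definite covariance
Sig12 = sqrtm(Sigma);                % symmetric square root Sigma^(1/2)
SSigma = (Sig12*(X'*X)*Sig12)/n;     % general form S(Sigma) (Eq. 2-6)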
Note that, by its matrix construction, the Wishart ensemble represents a subgroup of a larger class
of covariance matrices. However, when the column entries of X are sampled from a normal
distribution, the resulting matrix is called a Wishart matrix. In addition, when 𝜮 = 𝑰 and X's
entries are complex (resp. real) i.i.d. Gaussian, the Wishart ensemble is called the LUE (resp.
LOE), where 'L' stands for Laguerre. Alternatively, for double Wishart matrices, beyond the scope
of this study but found in the literature mostly related to canonical analysis (e.g., Johnstone 2006
and 2009), the resulting Wishart ensembles are the JUE (resp. JOE), where 'J' stands for Jacobi
(Paul and Aue 2014).
The name Wishart matrix originated from the fact that if the columns' entries of X are i.i.d.
according to 𝒩(𝟎, 𝜮) and n ≥ p, then 𝓢 "is said to have Wishart distribution with n degree of
freedom and covariance matrix 𝜮," denoted as 𝓢 ~ W_p(n, 𝚺) (Dieng and Tracy 2011, p. 4).
Nonetheless, it is customary to center the columns of X by removing the mean 𝝁; without loss of
generality, researchers usually assume 𝝁 = 𝟎. Nevertheless, the density function of this
distribution, derived for n greater than p − 1 (e.g., see Johnstone 2006), is

f(\mathcal{S}) = C_{n,p}\,\lvert\boldsymbol{\Sigma}\rvert^{-n/2}\,\lvert\mathcal{S}\rvert^{(n-p-1)/2}\exp\!\left(-\frac{1}{2}\mathrm{tr}\!\left(\boldsymbol{\Sigma}^{-1}\mathcal{S}\right)\right)   [Eq. 2-7]

Since its inception, the Wishart model W_p(n, 𝜮) has become the focal point of various studies in
multivariate statistical analysis. Perhaps regarding its popularity, as Johnstone (2006, p.5) wrote,
"this idealized model of independent draws from a Gaussian is generally at best approximately
true—but we may find some reassurance in the dictum" that all models are wrong, but some are
useful.
Focusing exclusively on real sample covariance matrices will simplify the remaining development
of this section. Additionally, this focus is justified because the current research study deals solely
with project activity durations, which are real numbers. Moreover, the general case of 𝒮 will be
used, since one can reorganize the expression of 𝑺 to obtain an expression like 𝒮. Now, in the
spectral decomposition of 𝒮, the entries of the diagonal matrix L are its eigenvalues l_i, and the
orthogonal 1 × p unit vectors 𝒖_i represent the columns of its eigenvector matrix. One can show
that the singular values d_i of X are the square roots (d_i = √l_i) of the eigenvalues l_1, ⋯, l_r of 𝒮,
where r is the rank of X such that r ≤ min(n, p). In Equation 2.9, the singular value decomposition
of X, the columns of the corresponding orthogonal factor are likewise 1 × p unit vectors, and the
entries of the r × r diagonal matrix D are the d_i.
In general, as Bejan (2005, p. 10) wrote, if a sample covariance matrix A possesses a density
function f, such as the one provided in Equation 2.7 or the one expressed in terms of weight
functions (e.g., see Johnstone 2006 and Paul and Aue 2014), then Equation 2.10 below provides
the joint density of its ordered eigenvalues l_1 > l_2 > ⋯ > l_p:

\frac{\pi^{p^{2}/2}}{\Gamma_p(p/2)}\prod_{i<j}(l_i - l_j)\int_{\mathcal{O}_p} f\!\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{T}\right) d\boldsymbol{H}   [Eq. 2-10]

Here, 𝒪_p is the orthogonal group of p × p matrices (see Section 2.3.1.2), and Γ_p denotes (see
Bejan 2005) the multivariate gamma function

\Gamma_p(z) = \pi^{p(p-1)/4}\prod_{k=1}^{p}\Gamma\!\left(z - \frac{k-1}{2}\right), \qquad \Re z > \frac{p-1}{2},   [Eq. 2-11]

where d𝑯 represents the Haar invariant probability measure on 𝒪_p, normalized in such a way that

\int_{\mathcal{O}_p} d\boldsymbol{H} = 1.
Then the joint density of the eigenvalues of A is given by Equation 2.12 (see Bejan 2005, Johnson
and Wichern 2019) as follows:

\frac{\pi^{p^{2}/2}\,2^{-np/2}\,(\det\boldsymbol{\Sigma})^{-n/2}}{\Gamma_p(n/2)\,\Gamma_p(p/2)}\prod_{i=1}^{p} l_i^{(n-p-1)/2}\prod_{i<j}(l_i - l_j)\int_{\mathcal{O}_p}\exp\!\left(-\frac{1}{2}\mathrm{tr}\!\left(\boldsymbol{\Sigma}^{-1}\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{T}\right)\right) d\boldsymbol{H}   [Eq. 2-12]
Because the gamma function Γ is related to the famous χ² test introduced by Pearson (1900), it
is worth mentioning that when n → ∞, the sampling distribution in Equation 2.12 converges to the
χ² distribution (e.g., see Dieng and Tracy 2011).
Note that 𝓢 = n𝑺_n, which, up to the constant 1/(n − 1), coincides theoretically with the unbiased
estimator of the population covariance matrix 𝜮. In most studies, 𝜮 is unknown. Accordingly, a few exciting
questions emerge from the study of the eigenvalues of the Wishart ensemble, especially when
n → ∞. One of them concerns how the eigenvalues of a sample covariance matrix S, also
called by Baik et al. (2005) the "sample eigenvalues," are related to the population eigenvalues
(from 𝜮). To answer this intriguing question, the following excerpt from Johnstone (2001, p.2)
perhaps provides a clue: "A basic phenomenon is that the sample eigenvalues l_i are more spread
out than the population eigenvalues λ_i." He goes on to add that the spreading effect is more
pronounced in the null case (𝜮 = σ²𝑰). The following sections attempt to answer this question by
drawing on the extensive literature on RMT. Before that, and in light of their significance and
relevance to this study, it is necessary to expand more on the subject of the Wishart matrices.
Due to the importance of Wishart matrices and models in RMT and their relevance to this study,
this section will elaborate on the subject of Wishart matrices and models discussed in the preceding
section. Moreover, it is necessary to understand the results derived from this class of matrices and
apply them to solve this study’s problems. Accordingly, the following subsections, devoted to the
Wishart matrices, define the basic terms associated with their formulation, state their well-known
properties and results, and conclude with a summary table of the elements of a Wishart model.
2.3.2.1 Definitions
One may refer to Section 1.8.3 for the notations and symbols used throughout this chapter. Let a
p-variate population 𝒳 be characterized by 𝝁 and 𝜮, its mean vector and
covariance matrix, respectively. With this population, one may create a simple model described
by the population parameters 𝝁 and 𝜮. In terms of the population variables and parameters,
Equation 2.14(a) establishes the equivalence between 𝜮 and the variance-covariance matrix of 𝒳,
which is a square and symmetric p × p matrix whose diagonal σ_ii and off-diagonal σ_ik (i ≠ k) entries,
with i, k = 1, ⋯, p, are respectively the variances of the variables 𝑿_i and the covariances between
pairs of variables 𝑿_i and 𝑿_k. Equation 2.14(b) and Equation 2.14(c) provide respectively the
expressions of σ_ii and σ_ik:

\boldsymbol{\Sigma} \equiv \mathrm{Cov}(\mathcal{X}) \triangleq \mathbb{E}\!\left[(\boldsymbol{X}-\boldsymbol{\mu})(\boldsymbol{X}-\boldsymbol{\mu})^{T}\right] \qquad (a)
\sigma_{ii} = \mathbb{E}\left(X_i - \mu_i\right)^{2} \equiv \mathrm{var}(X_i) \qquad (b)
\sigma_{ik} = \mathbb{E}\!\left[(X_i - \mu_i)(X_k - \mu_k)\right] \equiv \mathrm{cov}(X_i, X_k) \qquad (c)
[Eq. 2-14]
Now, let 𝓧_1, 𝓧_2, ⋯, 𝓧_n be a random sample of size n drawn independently from the p-variate
normal population 𝒩_p(𝝁, 𝜮). From the n drawn samples, one can define, as in Equation 2.15, an
n × p random matrix 𝒳 whose rows are the random vectors 𝑿_1ᵀ, 𝑿_2ᵀ, ⋯, 𝑿_nᵀ:

\mathcal{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}   [Eq. 2-15]
This matrix will record future realizations of the random vector 𝒳. Subsequently, once the random
matrix 𝒳 is observed, one can create, as provided in Equation 2.16, a sample data matrix
𝑿 = (x_ij), i = 1, ⋯, n, j = 1, ⋯, p, with rows 𝐱_i = (x_{i1}, x_{i2}, ⋯, x_{ip}):

\boldsymbol{X}_{n \times p} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{bmatrix}   [Eq. 2-16]
Through Equation 2.16, one may view each row i of the matrix 𝑿 as a row vector 𝐱_i of
observations. Provided n ≥ p, by multiplying the matrix X by its transpose 𝑿ᵀ, one obtains a
Wishart real matrix 𝑨 = 𝑿ᵀ𝑿 referred to as the sample covariance matrix of X. The matrix A is
said to possess a Wishart distribution with degrees of freedom n, denoted as 𝑨 ~ W_p(n, 𝜮). When
this distribution exists, A is positive definite if the square of the statistical distance (d) satisfies the
condition in Equation 2.17. In that inequality, the square of the distance d is the quadratic form of
M. Quadratic forms, computed as matrix products, and distances play an essential role in
multivariate analysis. One may refer to a multivariate statistical analysis book such as Johnson
and Wichern (2019) for further details.
From the observations on the p variables 𝑿_1, 𝑿_2, ⋯, 𝑿_p, one can compute the actual values of
σ_ii and σ_ik in terms of the sample variance and covariance s_ii and s_ik, as given by Equation 1.56
and Equation 1.57, respectively, and then record them in a matrix 𝑺_n as in Equation 2.18 below:

\boldsymbol{S_n} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1k} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2k} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ s_{i1} & s_{i2} & \cdots & s_{ik} & \cdots & s_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pk} & \cdots & s_{pp} \end{bmatrix}   [Eq. 2-18]

The sample covariance matrix 𝑺_n is a biased estimator of 𝜮, whereas the one without a subscript,
S, is unbiased:

\boldsymbol{S} = \frac{n}{n-1}\boldsymbol{S_n}, \qquad s_{ik} = \frac{1}{n-1}\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_k\right), \quad i = 1, \cdots, p,\; k = 1, \cdots, p   [Eq. 2-19]
where 𝒙̄ = (x̄_1, x̄_2, ⋯, x̄_p), in Equation 2.19, represents the observed mean vector of X with each
x̄_i as defined in Equation 1.50. In many multivariate test statistics, the definition of the sample
covariance matrix with the divisor n − 1 is the one used.
In retrospect to Section 1.6.6.4, by standardizing each s_ik in the expression of Equation 2.18
to obtain the sample correlation coefficient, also known as Pearson's product-moment r_ik
defined in Equation 1.58, one can create a matrix R defined by Equation 2.20. The resulting
matrix R represents the sample correlation coefficient matrix, or simply the sample correlations
(Johnson and Wichern 2019).
\boldsymbol{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1k} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2k} & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ r_{i1} & r_{i2} & \cdots & 1 & \cdots & r_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pk} & \cdots & 1 \end{bmatrix}   [Eq. 2-20]
Note that by creating a diagonal matrix D whose entries are the diagonal entries s_ii of S, one can
express R compactly as

\boldsymbol{R} = \boldsymbol{D}^{-1/2}\,\boldsymbol{S}\,\boldsymbol{D}^{-1/2}, \qquad r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}}   [Eq. 2-21]
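As an illustration, the following minimal MATLAB sketch (with a hypothetical data matrix X) standardizes a sample covariance matrix S into the sample correlation matrix R of Equation 2.21.

X = randn(50, 4);                 % hypothetical data: n = 50 observations, p = 4 variables
S = cov(X);                       % unbiased sample covariance (divisor n - 1)
Dhalf = diag(sqrt(diag(S)));      % D^(1/2): square roots of the diagonal entries s_ii
R = Dhalf \ S / Dhalf;            % R = D^(-1/2) S D^(-1/2), r_ik = s_ik/sqrt(s_ii*s_kk)
% R agrees with corrcoef(X) up to rounding error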
2.3.2.2 Wishart Distribution: Sampling Distribution Law of the Sample Mean 𝑿̄ and Covariance Matrix S
Since their inception, scholars have intensively employed Wishart random matrices as a
framework for developing various multivariate analysis methods. This class of matrices certainly
owes its popularity to its classical results, particularly those formulated for real matrices under the
Gaussian assumption. As mentioned earlier in Section 2.3.1.3, the Wishart distribution is named
after its inceptor and is known as the distribution law of sample covariance matrices. In other
words, it represents the joint distribution law of independent and repetitive sampling of
multivariate random variables. Each sampled random variable 𝒁_j is multivariate normally
distributed, and the Wishart matrix is formed from the sum of the outer products of the variables
𝒁_j. As defined, it represents the joint distribution of the m independent products 𝒁_j𝒁_jᵀ.
Because of its complex form, which renders its evaluation intricate, the general expression of the
probability density of the Wishart distribution W_p(∙ | 𝜮) is rarely evaluated directly. However, it
is still interesting to provide its expression to help understand any other distribution function
derived from it. To define this density function, let 𝑨 be the positive definite matrix

\boldsymbol{A}_{p \times p} = \boldsymbol{X}^{T}_{p \times n}\,\boldsymbol{X}_{n \times p}   [Eq. 2-22]

formulated from n independent samples, with n greater than the number of variables p. Under these
conditions, there exists a joint density function W_p(𝑨 | 𝜮). This function, whose expression is
given in Equation 2.23, represents the joint density function of the independent samples, which
happen to represent the observations of the random row vectors 𝑿_1, 𝑿_2, ⋯, 𝑿_n:

W_p(\boldsymbol{A} \mid \boldsymbol{\Sigma}) = \frac{\lvert\boldsymbol{A}\rvert^{(n-p-1)/2}\, e^{-\mathrm{tr}(\boldsymbol{A}\boldsymbol{\Sigma}^{-1})/2}}{2^{np/2}\,\pi^{p(p-1)/4}\,\lvert\boldsymbol{\Sigma}\rvert^{n/2}\prod_{i=1}^{p}\Gamma\!\left(\frac{n+1-i}{2}\right)}   [Eq. 2-23]
Using the above expression, one can demonstrate the following basic properties of the Wishart
distribution.
(i) the sampling mean 𝑿̄, as expressed below in Equation 2.24, follows the normal distribution
𝒩_p(𝝁, (1/n)𝜮):

\bar{\boldsymbol{X}} = \frac{\boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n}{n}   [Eq. 2-24]

(ii) the sampling covariance matrix (n − 1)𝑺, as in Equation 2.25, has a Wishart distribution with
n − 1 degrees of freedom, and S is an unbiased estimator of 𝜮:

(n-1)\,\boldsymbol{S}_{p \times p} = \left(\boldsymbol{X} - \boldsymbol{1}\bar{\mathbf{x}}^{T}\right)^{T}\left(\boldsymbol{X} - \boldsymbol{1}\bar{\mathbf{x}}^{T}\right)   [Eq. 2-25]

(iii) the sampling mean 𝑿̄ and covariance matrix 𝑺 are independent and sufficient statistics.
According to Johnson and Wichern (2019), sufficient statistics mean that all the information
about the population mean 𝝁 and covariance matrix 𝜮 can be found in the observations of the
sampling mean 𝐱̄ and covariance matrix S, regardless of the sample size n. However, "this generally
is not true for nonnormal populations." (Johnson and Wichern 2019, p.173).
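A short simulation can make properties (i) and (ii) tangible. The following minimal MATLAB sketch (with hypothetical 𝝁 and 𝜮; mvnrnd requires the Statistics and Machine Learning Toolbox) draws a sample and forms the statistics of Equation 2.24 and Equation 2.25.

mu = [1; 2]; Sigma = [2 0.5; 0.5 1];   % hypothetical population parameters
n = 1000;
X = mvnrnd(mu', Sigma, n);             % n independent draws from N_p(mu, Sigma)
xbar = mean(X)';                       % sampling mean, approximately N_p(mu, Sigma/n)
S = cov(X);                            % unbiased estimator of Sigma
A = (n - 1)*S;                         % (n-1)S, Wishart-distributed per property (ii)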
Because of the similarities between the distribution law of the sampling mean 𝑿̄ of Wishart real
matrices and the celebrated law of large numbers and central limit theorem, it is worthwhile to
restate both results here.
Because they are well known and intensively used in statistics, this section reminds readers of the
law of large numbers and the central limit theorem. To state the law of large numbers, let
𝑿_1, 𝑿_2, ⋯, 𝑿_n be independent observations from a population with mean 𝔼(𝑿_j) = 𝝁. The law's
important result, whose direct consequence matters here, is that the sample mean 𝑿̄ converges in
probability to 𝝁 as n grows large. The central limit theorem, in turn, considers independent
observations drawn from a population with mean 𝝁 and finite covariance matrix 𝜮. Then, for large
sample sizes, with n large relative to p, √n(𝑿̄ − 𝝁) has an approximate 𝒩_p(𝟎, 𝜮) distribution. As
for the previous results, one can find the proof of this result in the book by Johnson and Wichern
(2019, p. 176). From the central limit theorem, the following result also holds:
n(𝑿̄ − 𝝁)ᵀ𝜮⁻¹(𝑿̄ − 𝝁) is approximately distributed according to the chi-square distribution with p
degrees of freedom, denoted χ²_p, for n − p large, because each component of √n 𝜮^{−1/2}(𝑿̄ − 𝝁),
with i = 1, ⋯, p, represents an independent 𝒩(0, 1) random variable (Johnson and Wichern 2019).
A normal population with covariance matrix 𝜮 = 𝑰 and zero mean vector (𝝁 = 𝟎) is referred to as
the "null case." The set of sample covariance matrices 𝓢 = 𝑿ᵀ𝑿 constructed from sample data
matrices 𝑿 ~ 𝒩(𝟎, 𝜮) drawn from the "null case" population is known as a white Wishart ensemble
and denoted by W_p(n, 𝜮). The term "null case" is, as Johnstone (2001) noted, "in analogy with
time-series settings where a white spectrum is one with the same variance at all frequencies." In
nuclear physics, the spectral properties of white Wishart matrices, mainly the white Wishart
ensemble of complex matrices, are of long-standing interest.
Generally, it is tricky to evaluate the integral in Equation 2.12. However, in the null case
(𝜮 = σ²𝑰), the integral simplifies as in Equation 2.26:

\int_{\mathcal{O}_p} \mathrm{etr}\!\left(-\frac{1}{2}\boldsymbol{\Sigma}^{-1}\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{T}\right) d\boldsymbol{H} = \exp\!\left(-\frac{1}{2\sigma^{2}}\sum_{i=1}^{p} l_i\right)   [Eq. 2-26]

so that the joint density of the eigenvalues in Equation 2.12 reduces to Equation 2.27:

\frac{\pi^{p^{2}/2}\,2^{-np/2}\,(\det\boldsymbol{\Sigma})^{-n/2}}{\Gamma_p(n/2)\,\Gamma_p(p/2)}\exp\!\left(-\frac{1}{2\sigma^{2}}\sum_{i=1}^{p} l_i\right)\prod_{i=1}^{p} l_i^{(n-p-1)/2}\prod_{i<j}(l_i - l_j)   [Eq. 2-27]
For more on the density of the eigenvalues of Wishart real matrices, including the non-null
case (𝜮 ≠ σ²𝑰) with 𝝁 = 𝟎 and other properties of W_p(n, 𝜮), one may refer to the works of
Muirhead (2009), Anderson (2003), Bejan (2005), Dieng and Tracy (2011), and Paul and Aue
(2014).
From all the elements introduced in the previous section on Wishart real matrices, one may create
a Wishart model by defining the triplet (Ω, ℙ, ℱ) in terms of random samples 𝓧, as summarized
in tabular form.
Suppose Y is a p × p random complex Hermitian (resp. real symmetric) matrix. As defined, the
eigenvalues of Y exist, are always real numbers, and can be sorted and ordered as λ_1 ≥ λ_2 ≥
⋯ ≥ λ_p. Thus, the empirical spectral distribution (ESD), also known as the spectral statistics of
the eigenvalues of Y, is defined as

G_{\boldsymbol{Y}}(t) = \frac{1}{p}\sum_{i=1}^{p}\mathbf{1}\{\lambda_i \le t\}, \qquad \forall t \in \mathbb{R}   [Eq. 2-28]
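In computational terms, the ESD is simply a step function over the sorted eigenvalues. The following minimal MATLAB sketch (with an arbitrary symmetric test matrix) evaluates G_Y(t) of Equation 2.28.

Y = randn(200); Y = (Y + Y')/2;    % an arbitrary real symmetric test matrix
lam = eig(Y);                      % all eigenvalues are real
G = @(t) mean(lam <= t);           % G_Y(t) = (1/p) #{i : lambda_i <= t} (Eq. 2-28)
G(0)                               % fraction of eigenvalues at or below zero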
As Frahm (2004, p. 100) pointed out, "an eigenvalue of a random matrix is random but per se not…"
Nevertheless, the asymptotic behavior of the ESD, reasonably well understood by now, plays a crucial
role in studying the behavior of random matrices in RMT. For example, as the dimension p of
the matrix increases, one of the key questions that usually emerges is: does the ESD of a random
matrix converge to a nonrandom limiting distribution?
In the context of the two fundamental classes of random matrices introduced in the previous
sections, the celebrated semicircle law for the Wigner matrices and its counterpart, the
Marchenko-Pastur law for the Wishart matrices, help answer this question in part, as they are
concerned with the bulk spectrum—properties of the whole set of eigenvalues. Hence, answering
the same question from the edge spectrum, or extreme eigenvalues', perspective will be crucial to
incorporating the entire range of eigenvalues. Accordingly, the following sections state both laws
in turn.
Considered one of the greatest discoveries in physics, the Wigner semicircle law paved the way
for developing a new field of mathematics and physics devoted to studying quantum chaos and
disordered systems. It is a significant law discovered by Wigner (1952, 1958) while studying the
energy levels of complex nuclei. Let H be an n × n Wigner random matrix whose entries are drawn
from a centered distribution that is independent of n and has a finite variance σ² (moment of order
2). Then, as n tends to infinity, it is well known that the ESD G_H of H, with eigenvalues scaled
by √n, approaches an n-independent limiting law g_sc, which takes a semicircular shape on the
compact subset [−2σ, 2σ] of the real line. Equation 2.29 provides the often-represented expression
of the Wigner semicircle law:

g_{\boldsymbol{H}}(t) = \frac{dG_{\boldsymbol{H}}(t)}{dt} \xrightarrow[n \to \infty]{} g_{sc}(t) = \frac{1}{2\pi\sigma^{2}}\sqrt{4\sigma^{2} - t^{2}}\;\mathbf{1}\{\lvert t\rvert \le 2\sigma\}   [Eq. 2-29]

When 𝜇 = µ = 0 and σ² = 1, Equation 2.29 becomes Equation 2.30, provided below:

g_{sc}(t) = \frac{1}{2\pi}\sqrt{4 - t^{2}}\;\mathbf{1}\{\lvert t\rvert \le 2\}   [Eq. 2-30]
Equation 2.29 represents the limit law for the empirical distribution function of the eigenvalues
of the random matrix H considered as not being normalized (Frahm 2004). Accordingly,
depending on the normalization of the random matrix, or the scaling of its eigenvalues so that they
lie in a compact set, this law takes various expressions. Among the other expressions of the Wigner
semicircle law found in the literature is the one provided in Equation 2.31 below (see Tracy and
Widom 1992):

g(t) = \frac{2}{\pi}\sqrt{1 - t^{2}}\;\mathbf{1}\{\lvert t\rvert \le 1\}   [Eq. 2-31]
For instance, to illustrate the Wigner semicircle law in MATrix LABoratory (MATLAB), one can
easily use the function "randn" to generate 25 symmetric 100 × 100 matrices whose entries are
i.i.d. Gaussian random variables of mean 0 and variance one and plot the density of their
normalized eigenvalues in the form of the histogram depicted in Figure 2.2(a). In addition, to verify
the universality of this law, one can replace the Gaussian distribution with the uniform distribution
on [−1, 1] ("rand") and plot the density of the normalized eigenvalues to obtain the histogram in
Figure 2.2(b).
Figure 2.2: Illustrations of the Wigner Semicircle Law: Random Matrices with (a) Normally and (b) Uniformly Distributed Entries (Adapted from Tracy and Widom 1992)
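For reproducibility, the following is a minimal MATLAB sketch of the experiment just described; the normalization (G + Gᵀ)/√2 keeps the off-diagonal entry variance at one so that the limiting support matches Equation 2.30.

nMat = 25; n = 100; ev = [];
for k = 1:nMat
    G = randn(n);                          % i.i.d. N(0,1) entries
    H = (G + G')/sqrt(2);                  % symmetrize; off-diagonal variance 1
    ev = [ev; eig(H)/sqrt(n)];             % eigenvalues scaled by sqrt(n)
end
histogram(ev, 40, 'Normalization', 'pdf'); hold on
t = linspace(-2.2, 2.2, 400);
plot(t, sqrt(max(4 - t.^2, 0))/(2*pi), 'LineWidth', 1.5)   % semicircle, Eq. 2-30
hold off
% Universality check: replace randn(n) with sqrt(3)*(2*rand(n) - 1), which keeps
% the entry variance equal to one, and the same semicircle emerges.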
Deemed an analog of the Wigner semicircle law for sample covariance matrices, the following is
a significant result proved by Marchenko and Pastur (1967). To state this result for the white (or
general) case of sample covariance matrices, let 𝑿 be an n × p random matrix whose entries
X_ij, i = 1, ⋯, n; j = 1, ⋯, p, and associated sample covariance matrix 𝑺 are defined in Table 2-2.
In addition, one may consider the two sequences of integers n, for the sample size, and p(n), for
the dimension, such that p/n → γ ∈ (0, ∞). Then, as n → ∞, the ESD of the sample covariance
matrix 𝑺 almost surely converges in distribution to the Marchenko-Pastur law denoted by G_MP (see
Johnstone 2001, Frahm 2004, Péché 2008, Paul and Aue 2014). If G_MP has a p.d.f. g_MP, then one
may write:

g_{MP}(t) = \frac{dG_{MP}(t)}{dt} = \frac{\sqrt{(b - t)(t - a)}}{2\pi\gamma t \sigma^{2}}\;\mathbf{1}\{a \le t \le b\}   [Eq. 2-32]

where a = σ²(1 − √γ)² and b = σ²(1 + √γ)².
As Paul (2014) noted, and as shown in Figure 2.3 below, the implication of this result is the
concentration of the bulk of the eigenvalues around t = 1 (when σ² = 1) and the increase in the
spread (b − a) as the ratio γ = p/n increases from 0 to 1 (see the inset in Figure 2.3). The larger γ
is, the more spread out the eigenvalues are; conversely, when γ → 0, both the largest (b) and the
smallest (a) edges of the support of the spectrum of S converge toward 1.
Figure 2.3: Density g_MP(t) of the Marchenko-Pastur Law, with the spread b − a indicated
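The following minimal MATLAB sketch (white case, σ² = 1, with hypothetical n and p) reproduces the comparison in Figure 2.3 by overlaying the eigenvalues of a sample covariance matrix with the Marchenko-Pastur density of Equation 2.32.

n = 2000; p = 500; gam = p/n;        % aspect ratio gamma = p/n
X = randn(n, p);                     % i.i.d. N(0,1) sample data matrix
S = (X'*X)/n;                        % p-by-p sample covariance matrix (Eq. 2-5)
ev = eig(S);
a = (1 - sqrt(gam))^2; b = (1 + sqrt(gam))^2;   % support edges for sigma^2 = 1
histogram(ev, 40, 'Normalization', 'pdf'); hold on
t = linspace(a, b, 400);
plot(t, sqrt((b - t).*(t - a))./(2*pi*gam*t), 'LineWidth', 1.5)  % Eq. 2-32
hold off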
Over the years, considerable literature has rediscovered and extended this limiting law (Johnstone
2001). For instance, the law has been extended to a much broader class of random sample
covariance matrices, not only those built from a square sample data matrix (Frahm 2004). In
addition, the law remains applicable even if the sample size n is smaller than the number of
variables p (Frahm 2004), that is, when γ > 1.
The study of the asymptotic behaviors or properties of the largest eigenvalues of the Wigner
random matrices and sample covariance matrices has several other exciting applications (e.g., see
Patterson et al. 2006 for application to genetics, Plerou et al. 2002 for applications to mathematical
finance). The Wigner semicircle and Marchenko-Pastur laws can be perceived as so-called
universality results. They both apply to the study of global statistics of the spectrum of random
matrices (e.g., rates of convergence, large deviations) under the assumption of a finite variance of
the matrix entries. One may refer to Bai and Yin (1988) and Péché (2003) for more on the topic.
Nevertheless, as with the Central Limit Theorem (see Section 2.3.2.3), the asymptotic global
behavior of the eigenvalues of Wigner random matrices and sample covariance matrices is not
contingent on the characteristics of the sampling distribution law of the matrices' entries beyond
𝜮 and σ² (Péché 2008). In addition, this asymptotic global behavior has already been established
by various authors.
Accordingly, the following sections will only be concerned with local properties of the spectrum
of large random matrices; more specifically, with the asymptotic behavior of the largest
eigenvalues of random matrices, emphasizing their asymptotically universal properties. Why such
a focus? For many scholars, such as Péché (2008) and Paul and
Aue (2014), the first motivation originates from mathematical statistics, which is concerned with
the largest eigenvalues of sample covariance matrices (high-dimensional data). For example, in
PCA, the behavior of the principal components as the number of variables 𝑝 → ∞, and the sample
size n is kept fixed is now well known. Unlike the traditional assumptions, the current trend has
been toward studying the case where n is of the same magnitude as p. The second motivation is
that the limiting behavior of the largest eigenvalues of the non-white Wishart random matrices
(S(𝛴)) is crucial for testing hypotheses on the population covariance matrix 𝛴. For instance, when
the tests involve the null hypothesis H0 and its alternative H1, one may propose a test of H0 based
on the largest sample eigenvalue.
Before providing the details on the asymptotic properties of the largest eigenvalues, it is important
to make the following observation. If b denotes the common top edge of the support of the Wigner
semicircle and Marchenko-Pastur distributions, then one can, as given by Equation 2.33, derive
that almost surely

\liminf_{n \to \infty} \lambda_{\max} \ge b   [Eq. 2-33]
As a result, the following significant questions arise: Would the largest eigenvalues indeed
converge to b? What, then, is the joint limiting distribution of the largest eigenvalues? Is this
limiting distribution a universal law like the Wigner semicircle or the Marchenko-Pastur law?
Finally, if this is the case, what is the class (ensemble) of this limiting distribution law? Responses
to these questions are provided in the following sections.
This section discusses the joint limiting distribution of the largest eigenvalues of matrices
belonging to the Gaussian ensembles (GOE, GUE, and GSE) and a broader class of matrices. The
theorems chosen from various authors are noteworthy results that answer several of the questions
raised at the close of Section 2.3.3. They are all based on the seminal work of Tracy and
Widom (1993, 1994, 1996), who identified the limit laws related to the GOE, GUE, and GSE.
Let 𝑨 = (A_ij) be a matrix that is an element of one of the ensembles GOE, GUE, or GSE
specified in Section 2.3.1.2 and whose random variables A_ij are centered and normalized to obtain
nontrivial limits. If the following function denotes the distribution function for the largest
eigenvalue,

F_{n,\beta}(t) := \mathbb{P}_{n,\beta}\left(\lambda_{\max} \le t\right), \qquad \beta = 1, 2, 4,   [Eq. 2-34]

then the limiting laws provided by Equation 2.35,

F_{\beta}(x) := \lim_{n \to \infty} F_{n,\beta}\!\left(2\sigma\sqrt{n} + \frac{\sigma x}{n^{1/6}}\right), \qquad \beta = 1, 2, 4,   [Eq. 2-35]

exist and are given explicitly by Equation 2.36:

F_{2}(s) = \mathbb{P}(l \le s) = \exp\!\left(-\int_{s}^{\infty}(x - s)\,q(x)^{2}\,dx\right) \qquad (a)

F_{1}(s) = \mathbb{P}(l \le s) = \left[F_{2}(s)\right]^{1/2}\exp\!\left(-\frac{1}{2}\int_{s}^{\infty} q(x)\,dx\right) \qquad (b)

F_{4}(s) = \mathbb{P}(l \le s) = \left[F_{2}(s)\right]^{1/2}\cosh\!\left(\frac{1}{2}\int_{s}^{\infty} q(x)\,dx\right) \qquad (c)

[Eq. 2-36]
With reference to Section 2.3.1.2, σ is the standard deviation of the Gaussian distribution on the
off-diagonal matrix elements in the above equations, and q denotes the unique (Hastings-McLeod)
solution to the Painlevé II equation

q'' = x\,q + 2\,q^{3}   [Eq. 2-37]

subject to the boundary condition that q(x) behaves like the Airy function Ai(x) as x → +∞:

q(x) \sim \mathrm{Ai}(x) \sim \frac{1}{2\sqrt{\pi}\,x^{1/4}}\exp\!\left(-\frac{2}{3}x^{3/2}\right), \qquad x \to +\infty   [Eq. 2-38]
Meanwhile, it is worth mentioning that the six Painlevé differential equations developed a century
ago have several applications in diverse branches of contemporary physics. The general solutions
to these equations, on the other hand, are transcendental. In other words, they cannot be expressed
in terms of any previously defined function, including any commonly used special functions (Zeng
and Hou, 2012). Nevertheless, as previously stated, the Gaussian ensembles are characterized by
invariant measures. Therefore, the joint density functions f1, f2, and f4 (with f_β = dF_β/dx) of the
largest eigenvalues associated with the TW distributions F1, F2, and F4 exist; they are depicted in
Figure 2.4.
Figure 2.4: Joint Density Functions f1, f2, and f4 of the Largest Eigenvalues Associated with the TW Laws F1, F2, and F4 (Courtesy of Dieng and Tracy 2011)
Note that these are only asymptotic graphs obtained with the approximations of F_β, β = 1, 2, 4,
as x → ±∞. This is referred to as the tail behavior, or the edge scaling limit, of F_β; it will be
revisited in the discussion of the tails below.
Following the introduction of the Tracy-Widom distributions, the following are key theorems
referred to as universality theorems (e.g., see Tracy and Widom 2008). As universality theorems,
they relax the Gaussian and invariance assumptions required for applying the limit laws F_β,
β = 1, 2, 4, thus extending their applicability to a variety of complex processes that are not
necessarily Gaussian in nature.
Let 𝑨 be an element of the Wishart ensemble W_p(n, 𝑰) with its eigenvalues ordered as
l₁ ≥ ⋯ ≥ l_p. For more on this class of matrices, one may refer to Section 2.3.1.3. In addition, let

\mu_{np} = \left(\sqrt{n-1} + \sqrt{p}\right)^{2} \qquad (a)

\sigma_{np} = \left(\sqrt{n-1} + \sqrt{p}\right)\left(\frac{1}{\sqrt{n-1}} + \frac{1}{\sqrt{p}}\right)^{1/3} \qquad (b)

[Eq. 2-39]
The following result establishes, under the null hypothesis H₀: 𝜮 = 𝑰 (versus H₁: 𝜮 ≠ 𝑰) and
the requirements on n and p specified below, that the largest eigenvalue l₁ of A converges in law
to the edge eigenvalue distribution function F₁ for the GOE (see Equation 2.36(b)).
If n, p → ∞ such that n/p → γ, with 0 < γ < ∞, then Equation 2.40 defines the limit law of l₁:

\frac{l_1 - \mu_{np}}{\sigma_{np}} \xrightarrow{\;d\;} F_1(s, 1)   [Eq. 2-40]
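In practice, the theorem is applied by centering and scaling the largest eigenvalue of A = XᵀX with the constants of Equation 2.39, as in the following minimal MATLAB sketch (null case, hypothetical n and p).

n = 400; p = 100;
X = randn(n, p);                           % null case: Sigma = I, mu = 0
l1 = max(eig(X'*X));                       % largest eigenvalue of A = X'X
mu  = (sqrt(n-1) + sqrt(p))^2;             % Eq. 2-39(a)
sig = (sqrt(n-1) + sqrt(p))*(1/sqrt(n-1) + 1/sqrt(p))^(1/3);   % Eq. 2-39(b)
s = (l1 - mu)/sig;                         % approximately TW1-distributed (Eq. 2-40)
% Johnstone (2006)'s second-order correction (Eq. 2-41) replaces n - 1 and p
% above with n - 1/2 and p - 1/2.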
Karoui (2003) extended Johnstone (2001)'s result to γ ∈ [0, ∞] by demonstrating that the result
remains true in the limiting regimes; this extension has significant statistical value since, as
Johnstone (2006) stated, in many situations p ≫ n. Later, Johnstone (2006) published an ad hoc
modification to make a second-order correction to his 2001 result, which surprisingly resulted in
an improvement in the accuracy of the Tracy-Widom approximation. The ad hoc correction merely
altered the formulae for the scaling function constants μ_np and σ_np from Equation 2.39(a) and
Equation 2.39(b) to Equation 2.41(a) and Equation 2.41(b):
\mu_{np} = \left(\sqrt{n-\tfrac{1}{2}} + \sqrt{p-\tfrac{1}{2}}\right)^{2} \qquad (a)

\sigma_{np} = \left(\sqrt{n-\tfrac{1}{2}} + \sqrt{p-\tfrac{1}{2}}\right)\left(\frac{1}{\sqrt{n-\tfrac{1}{2}}} + \frac{1}{\sqrt{p-\tfrac{1}{2}}}\right)^{1/3} \qquad (b)

[Eq. 2-41]
Meanwhile, the following result by Soshnikov (2002) generalizes Johnstone (2001)'s theorem to
the mth largest, also referred to as the next-largest, eigenvalues of A. For the mth largest eigenvalue
distribution, one may refer to Section 2.3.4.4. It should be noted that Johnstone's work followed
that of Johansson (2000), who proved a limit theorem for the largest eigenvalue of a complex
Wishart matrix. However, due to the fundamental difference in constructing real and complex
models, these two distinct models necessitated independent investigations. Soshnikov (2002)
extended Johnstone and Johansson's findings by demonstrating that the same limiting laws apply
to the covariance matrices of subgaussian real (complex) populations in the following mode of
convergence: n − p = 𝒪(p^{1/3}). Soshnikov's theorem establishes Equation 2.42 as the limiting
distribution law of the mth largest eigenvalue, l_m, of the sample covariance matrix 𝑨:

\frac{l_m - \mu_{np}}{\sigma_{np}} \xrightarrow{\;d\;} F_1(s, m), \qquad m = 1, 2, \cdots   [Eq. 2-42]
Thanks to the latest results on the distribution of the mth largest eigenvalues for the GOE and GSE
by Dieng (2005), Dieng and Tracy (2011) remarked that the additional assumption
n − p = 𝒪(p^{1/3}), under which Soshnikov proved his 2002 result (hereafter denoted Theorem 2.2),
could be removed. Note that the distribution of the mth largest eigenvalues for the GUE was
already examined by Tracy and Widom (1994). Consequently, Karoui (2003) extended Theorem
2.2 to 0 ≤ γ ≤ ∞. As Dieng and Tracy (2011) pointed out, the extension is critical for modern
high-dimensional statistics.
Furthermore, Soshnikov (2002) and Péché (2008) lifted the Gaussian assumption, thereby
re-establishing an F₁ universality theorem. In other words, as Tracy and Widom (2009) specified,
they assumed that the data matrix X's matrix elements x_ij are independent random variables with a
common symmetric distribution and moments that grow no faster than Gaussian ones. For a
description of the amended centering and norming constants, similar to μ_np and σ_np, necessary to
generalize Soshnikov (2002)'s theorem, one may refer to Péché (2008). However, it is crucial to
redefine matrix A and state the new assumptions in the case of real sample covariance matrices:
(i) 𝔼(x_ij) = 0 and 𝔼(x_ij²) = 1;
(ii) the x_ij have a common symmetric distribution;
(iii) all even moments of x_ij are finite, and their decay rate at infinity is at least as fast as that of
Gaussian moments;
(iv) n − p = 𝒪(p^{1/3}).
Then, as stated in Section 2.3.4.2, the preceding Theorem 2.2 is restated using the same limit law
as Equation 2.42 but with the previously stated four conditions (i) through (iv).
Moreover, Péché (2009)'s remarkable contribution to Soshnikov's work extended this result to
the scenario where the ratio γ approaches an arbitrary finite number, with another strategy for
when γ tends to infinity.
In the meantime, Deift and Gioev (2007), who had already expanded on the early work of Tracy
and Widom (1998, 1999) by establishing F₁ universality in the bulk for the GOE, also proved
universality at the edge of the GOE. They obtained their result by replacing the Gaussian weight
function exp(−x²) with exp(−V(x)) in the expression of the joint density function ℙ of the largest
eigenvalues of randomly selected matrices from the GOE, where V stands for an even-degree
polynomial with a positive leading coefficient (Deift and Gioev 2007, Dieng and Tracy 2011).
Notably, the authors also established comparable results for the GSE and GUE.
2.3.4.3 Limiting Law in Terms of Baik et al. (1999) and Tracy and Widom (2000)
Let A be a randomly chosen matrix from one of the (finite n) GOE, GUE, or GSE. If the
eigenvalues are ordered as l₁ ≥ l₂ ≥ ⋯, then using Equation 2.43 below, one can compute the
rescaled eigenvalues:

\hat{l}_m = \frac{l_m - \sqrt{2n}}{2^{-1/2}\,n^{-1/6}}, \qquad m = 1, 2, \cdots   [Eq. 2-43]
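For the Gaussian ensembles themselves, the scaling of Equation 2.43 can be checked directly; the following minimal MATLAB sketch does so for a GOE-like matrix whose off-diagonal entries have variance 1/2, the convention under which the spectrum's edge sits at √(2n).

n = 500;
G = randn(n);
H = (G + G')/2;                            % off-diagonal variance 1/2, diagonal 1
l1 = max(eig(H));                          % largest eigenvalue
l1hat = (l1 - sqrt(2*n))*sqrt(2)*n^(1/6);  % Eq. 2-43; approximately TW1 (Eq. 2-44)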
It is worth noting where the scaling formula for l̂_m came from: Baik et al. (1999) and Johansson
(1998). Baik et al. (1999) derived the distribution of the length of the longest increasing
subsequence of a random permutation, thereby connecting combinatorial probability and random
matrix theory distribution functions. Later, Tracy and Widom (2000), Dieng and Tracy (2011),
and other scholars extended their findings to derive the distributions of the next-largest
eigenvalues. For the largest eigenvalue in the β-ensembles (β = 1, 2, 4), it was proven that l̂₁ in
Equation 2.43 is governed by the Tracy-Widom distributions, as stated in Equation 2.44 below
(Tracy and Widom 1996):

\hat{l}_1 \xrightarrow{\;d\;} l^{(\beta)}, \text{ a random variable distributed according to } F_\beta   [Eq. 2-44]
Tracy and Widom (1994 and 1996) showed that the theoretical distribution law of l̂₁ in Equation
2.44 is F₂ in Equation 2.36(a), F₁ in Equation 2.36(b), or F₄ in Equation 2.36(c), respectively.
Dieng and Tracy (2011) examined the distribution law of the mth largest eigenvalue in the GOE,
GUE, and GSE. For the study of the largest eigenvalues, not only the first largest one but also the
next-largest ones are important. For instance, the second-largest eigenvalue of a Ramanujan graph,
which has critical applications in communication network theory, is well modeled by the β = 1
Tracy-Widom distribution (Miller et al. 2008). These graphs are crucial as they enable the
construction of superconcentrator and nonblocking networks in coding theory and cryptography.
The following are the expressions of the distribution laws of the mth largest eigenvalues in terms
of the Tracy-Widom distributions in the cases of the GUE, GOE, and GSE.
Case β = 2, GUE
Let F₂(s, 0) ≡ 0; then Tracy and Widom (1994) derived the expression of the series in Equation
2.45:

F_2(s, m+1) - F_2(s, m) = \frac{1}{m!}\left(-\frac{\partial}{\partial\lambda}\right)^{m}\left. D_2(s, \lambda)\right|_{\lambda=1}, \qquad m \ge 0,   [Eq. 2-45]

where D₂(s, λ), given in Equation 2.46, has the following Painlevé representation, in which
q(x, λ) solves Equation 2.37 with boundary condition q(x, λ) ~ √λ Ai(x) as x → +∞:

D_2(s, \lambda) = \exp\!\left(-\int_{s}^{\infty}(x - s)\,q(x, \lambda)^{2}\,dx\right)   [Eq. 2-46]
Cases β = 1, 4, GOE and GSE
Let F_β(s, 0) ≡ 0; then an analogous combinatorial argument to the one that led to the recurrence
relation in Equation 2.45 yields Equation 2.47:

F_\beta(s, m+1) - F_\beta(s, m) = \frac{1}{m!}\left(-\frac{\partial}{\partial\lambda}\right)^{m}\left. D_\beta(s, \lambda)\right|_{\lambda=1}, \qquad m \ge 0, \quad \beta = 1, 4,   [Eq. 2-47]

where one can obtain the expression of D_β(s, λ) similar to that of D₂(s, λ) in Equation 2.46.
Alternatively, thanks to the interlacing property between the GOE and GSE—"In the appropriate
scaling limit, … More generally, the joint distribution of every second eigenvalue in the GOE
coincides with the joint distribution of all the eigenvalues in the GSE, with an appropriate number
of eigenvalues." (Dieng and Tracy 2011, p. 13)—Dieng (2005) derived, in the edge scaling limit,
the limiting distributions for the mth largest eigenvalues in the GOE and GSE, followed by the
identity in Equation 2.48:

F_4(s, m) = F_1(s, 2m), \qquad m \ge 1   [Eq. 2-48]

For the implementation of this remarkable result to compute F₄(s, m) and its related density, one
may refer to the numerical evaluations discussed later (Section 2.3.6.2). In Figure 2.5 are plots
created for validation purposes. As depicted, the solid curves are, from right to left, the theoretical
limiting densities for the first through the fourth largest eigenvalues.
Figure 2.5: Illustrations of Theoretical Limiting Densities for the 1st (Rightmost Curve) through 4th (Leftmost Curve) Largest Eigenvalues of 10⁴ Realizations of 10³ × 10³ GOE Matrices (Courtesy of Dieng and Tracy 2011, p.15)
The stunning graph above, borrowed from Dieng and Tracy (2011), ends this section on the
limiting distribution of the largest eigenvalues of particular classes of matrices under the discussed
assumptions. As this section has shown, numerous academics have established that the
Tracy-Widom laws are the joint limiting distributions of the mth largest scaled and centered
eigenvalues at the spectrum's edge for a broader class of matrices. Because this work contributed
to the relaxation of numerous constraints, particularly Gaussian assumptions, imposed on earlier
versions of the universality-type theorems, the following section elaborates on the concept of
universality discussed here.
This section aims to expand more on the concept of the universality of the Tracy-Widom laws
outlined previously. In relation to the previous section, significant work has been devoted in the
recent decade to attaining the universality of results on the behavior of random matrices'
eigenvalues. The term "universality" refers to the fact that the limiting behavior of the eigenvalue
statistics is independent of the distribution of the matrix’s entries (Paul and Aue 2014). While the
statement may not hold in all cases, in many cases, the behavior of both bulk and edge eigenvalues
is primarily determined by the first four moments of the distribution of the entries (e.g., see
Soshnikov 2002). The investigation into limiting ESDs conducted by various researchers,
including contemporary ones, demonstrated that the behavior is universal at the level of first-order
convergence. Their finding depended on the assumption that the entries of the sample data matrix
are independent with suitably bounded moments.
With the work of Soshnikov, who proved the Tracy-Widom limit of the normalized largest
eigenvalues on Wigner (Soshnikov, 1999) and Wishart (Soshnikov, 2002) matrices, the more
refined characteristics, such as the limiting distribution of normalized extreme (or edge)
eigenvalues, began to receive increased attention. However, these results still required the
existence of all moments (particularly sub-Gaussian tails), symmetry of the entry distribution, and,
in the Wishart case, an assumption that the dimension to sample size ratio approaches one. As
noted in the previous section, Péché (2009) extended Soshnikov's (2002) results by allowing the
dimension-to-sample-size ratio to approach any nonnegative value. Significant progress has been
made in Péché and Soshnikov's relaxation of the symmetry requirement and Gaussian
assumptions. As expressed in terms of the limiting behavior of the correlation functions of the
eigenvalues, bulk universality has been achieved using various methods (e.g., Johansson 2001,
Ben Arous and Péché 2005). In addition, multiple authors, such as Erdős and Yau (2012) and Tao
and Vu (2010, 2012), have made significant new developments in the universality phenomenon.
They have been using analytical techniques to study bulk and edge universality questions. Through
their remarkable work, they have managed to remove many restrictions on the distribution of the
matrices’ entries.
For instance, Tao and Vu (2010), through a “four moments theorem,” effectively demonstrated the
universality of local eigenvalue statistics at the spectrum's edge for the Wigner and Wishart cases.
In another example, Erdős and Yau (2012) extended previous universality results based on a local
semicircle law for Wigner matrices to so-called generalized Wigner matrices. Feldheim and Sodin
(2010) and Bao et al. (2012) all investigated universality at the extremes of the spectrum of sample
covariance and correlation matrices. Benaych-Georges et al. (2012) investigated large deviations
of the extreme eigenvalues. Bao et al. (2015) established Tracy-Widom universality for a suitable
class of sample covariance matrices (the general, non-null case). Bao et al. (2015) examined their
findings regarding their applications to statistical signal
detection and structure recognition of separable covariance matrices, as have most other
researchers in various fields. For more on this topic, one may refer to the exhaustive review by
Paul and Aue (2014). This section has expanded on the concept of the Tracy-Widom limit laws'
universality, which has dramatically aided researchers in various fields in explaining the behavior
of the eigenvalues of random matrices whose entries are not necessarily Gaussian; the tail behavior
of these laws is examined next.
It is well known that for large n, the PDF of λ_max consists of a central part characterized by the
Tracy-Widom distributions, edged on both sides by a couple of large-deviation tails (Majumdar
and Schehr 2014). As depicted in Figure 2.6 below, these left and right tails correspond to very
different physics when interpreted in terms of the underlying Coulomb gas: the left corresponds
to a pushed Coulomb gas, whereas the right corresponds to a pulled Coulomb gas. The third chapter
will cover both topics, as they explain phase transitions in various physical problems, as shown
in Figure 2.6.
Figure 2.6: Illustration of the Left and Right Tail Behavior of the TW F_β (Courtesy of Majumdar and Schehr 2014)
Below, Equation 2.49 gives the right tail of the β-TW distribution, which Dumaz and Virág (2013)
established:

1 - F_\beta(x) = \mathcal{O}\!\left(x^{-3\beta/4}\right)\exp\!\left(-\frac{2}{3}\beta x^{3/2}\right), \qquad x \to \infty   [Eq. 2-49]
Dieng (2005) provided in his Ph.D. thesis MATLAB codes that are greatly valuable for evaluating
and plotting the TW density and distribution functions. Three arguments are required for each
function: "the first is 'beta,' which is the beta of RMT so it can be 1, 2, or 4; then 'n' which is the
eigenvalue needed; finally, 's' which is the value where you want to evaluate the function."
Equation 2.50 and Equation 2.51 are the asymptotic expansions of the function "q" needed to
compute the Tracy-Widom distributions F1, F2, and F4 provided in Figure 2.4. These codes can
be obtained by contacting the author; the requested codes were received for this research study.
It is worth noting that besides Dieng (2005)’s invaluable contribution to the work of Tracy and
Widom (1992, 1994), other scholars such as Bejan (2005), Bornemann (2009), and Chiani (2014)
have also proposed numerical evaluations of the Tracy-Widom distribution function Fβ using
different approaches.
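To illustrate one such approach, the following minimal MATLAB sketch evaluates F₂ by integrating the Painlevé II equation (Eq. 2-37) backward from the Airy asymptotics (Eq. 2-38), in the spirit of the numerical schemes cited above; it is not Dieng's code, which must be obtained from its author.

s = linspace(-8, 5, 200);                    % evaluation grid
x0 = 5;                                      % start far right, where q ~ Ai
y0 = [airy(x0); airy(1, x0)];                % [q; q'] from Ai and Ai'
opts = odeset('RelTol', 1e-10, 'AbsTol', 1e-12);
sol = ode45(@(x, y) [y(2); x*y(1) + 2*y(1)^3], [x0, -8], y0, opts);  % Eq. 2-37
q2 = @(x) deval(sol, x, 1).^2;               % q(x)^2 along the solution
F2 = arrayfun(@(t) exp(-integral(@(x) (x - t).*q2(x), t, x0)), s);   % Eq. 2-36(a)
plot(s, F2)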
While studying the distribution of the largest eigenvalues of real Wishart and Gaussian random
matrices, Chiani (2014) discovered a simple approximation of the Tracy-Widom limit law. Thanks
to her contributions, the TW law can be approximated accurately by an adequately scaled and
shifted gamma distribution. Moreover, the exact CDF of the largest eigenvalue of quite large
finite-dimensional Wishart and Gaussian (GOE, GUE) matrices can easily be computed without
the need for asymptotic approximations; an algorithm for this CDF implementation is provided in
her work.
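As a sketch of Chiani's moment-matching idea, the shape k, scale θ, and shift α below are derived from the TW1 moments listed in Table 2-3 (they are not quoted from her paper), and gamcdf requires the Statistics and Machine Learning Toolbox.

mu = -1.2065335745; v = 1.6077810345; sk = 0.2934645240;  % TW1 moments (Table 2-3)
k     = 4/sk^2;               % gamma shape: skewness of Gamma(k,theta) is 2/sqrt(k)
theta = sqrt(v/k);            % gamma scale: variance is k*theta^2
alpha = k*theta - mu;         % shift so that the mean k*theta - alpha matches mu
F1_approx = @(s) gamcdf(s + alpha, k, theta);   % approximate TW1 CDF
F1_approx(-1.2065)            % evaluate near the TW1 mean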
Courtesy of Bornemann (2009), Table 2-3 and Table 2-4 provide statistical properties of F_β for the
first six edge-scaled eigenvalues—the first through the sixth largest—in the three β-ensembles,
namely the GOE (β = 1), GUE (β = 2), and GSE (β = 4). Those statistical properties are the first
four moments—mean (µ), variance (σ²), skewness (S), and kurtosis (K)—characterizing the CDF
F_β. Note that because F4(k; s) = F1(2k; s), values for F4 can be derived from Table 2-3, which
provides values for F1, whereas those for F2(k; s) are provided in Table 2-4. In addition, each of
the values is provided with four more digits of precision than the ones provided in Dieng (2005).
For the numerical calculations of these high-precision values, the author developed and used a
MATLAB Toolbox, which he kindly made available for this study.
Table 2-3: First Four Moments of F1(k; s), k = 1, …, 6 (Courtesy of Bornemann 2009)
              µ                σ²               S                K
F1 (1; s)  −1.20653 35745   1.60778 10345   0.29346 45240   0.16524 29384   4.59
F1 (2; s)  −3.26242 79028   1.03544 74415   0.16550 94943   0.04919 51565   12.45
F1 (3; s)  −4.82163 02757   0.82239 01151   0.11762 14761   0.01977 46604   30.04
F1 (4; s)  −6.16203 99636   0.70315 81054   0.09232 83954   0.00816 06305   51.24
F1 (5; s)  −7.37011 47042   0.62425 23679   0.07653 98210   0.00245 40580   77.49
F1 (6; s)  −8.48621 83723   0.56700 71487   0.06567 07705   −0.00073 42515   112

Table 2-4: First Four Moments of F2(k; s), k = 1, …, 6 (Courtesy of Bornemann 2009)
              µ                σ²               S                K
F2 (1; s)  −1.77108 68074   0.81319 47928   0.22408 42036   0.09344 80876   1.84
F2 (2; s)  −3.67543 72971   0.54054 50473   0.12502 70941   0.02173 96385   5.44
F2 (3; s)  −5.17132 31745   0.43348 13326   0.08880 80227   0.00509 66000   10.41
F2 (4; s)  −6.47453 77733   0.37213 08147   0.06970 92726   −0.00114 15160   17.89
F2 (5; s)  −7.65724 22912   0.33101 06544   0.05777 55438   −0.00405 83706   25.56
F2 (6; s)  −8.75452 24419   0.30094 94654   0.04955 14791   −0.00559 98554   34.72
This part of the study combines simulation and inferential approaches in the application of the
Tracy-Widom law to construction network schedules. While the simulation approach is used to
build an artificial environment in which data can be generated, the inferential approach helps infer
either the characteristics or the relationships of the population (Kothari 2004). Both approaches
are worth adopting to achieve the primary goal of this chapter. Patterson et al. (2006) applied the
Tracy-Widom law, well known for its applications in various fields, to infer population structure
from genetic data. The Tracy-Widom law first arises "as the limiting law for normalized largest
eigenvalue in the GUE of Hermitian matrices" (Tracy and Widom 2001). A probability
distribution of activity durations, in turn,
can be used as an input to generate many probabilistic durations. For example, Fente et al. (2000)
used the beta distribution in defining a probability distribution function for construction
simulation. From this perspective, a methodology centered on the following points is adopted: (1)
conceptual analogical discovery of similarities between knowledge areas applying the Tracy-
Widom distributions and construction project network scheduling; (2) identification of appropriate
TW methods for the study of project network schedules' underlying behaviors; (3) data collection
of benchmark schedules; (4) analysis of data for numerical applications; and (5) simulation runs
for the emergence of the TW behaviors in construction schedules and correlation analysis. Setting forth
the objectives of this chapter, the 5-phase methodology is illustrated in Figure 2.7.
This section provides the chapter's objectives, which are primarily based on adapting and adopting
the Tracy-Widom techniques reviewed above.
Research Objective 1: Parse scientific literature in other knowledge areas for Tracy-Widom law
applications to map and match their conceptual analogies with construction scheduling elements.
Research Objective 2: Classify methods to identify those that enable the study of network
schedule behaviors.
Research Objective 3: Acquire several benchmark schedules (e.g., project networks from the
Project Scheduling Problem Library (PSPLIB) database) of varying sizes, from smaller to larger,
to serve as test cases.
Research Objective 4: Translate the benchmark schedules into sample data matrices to allow the
generation of random matrices through a repetitive sampling of activity durations.
Research Objective 5: Establish the existence or absence of correlations between the size,
complexity, and number of simulations required to validate the Tracy-Widom limit laws'
universality.
Figure 2.8 outlines the sequence of procedures crucial to achieving the objectives of this chapter.
Goal: verify whether the universality of the TW limiting law applies to benchmark project network schedules.
Assumption: the triangular distribution governs project activity durations, with known parameters a, b, and c
preselected as 90%, 100%, and 150% of their deterministic durations.
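As a minimal MATLAB sketch of this assumption (reading a, b, and c as the lower limit, mode, and upper limit, respectively, and using hypothetical deterministic durations d), triangular durations can be drawn by inverse-CDF sampling as follows.

d = [4 7 3 9 5];                      % hypothetical deterministic durations
a = 0.90*d; b = 1.00*d; c = 1.50*d;   % lower limit, mode, upper limit per activity
u = rand(size(d));                    % uniform draws
Fb = (b - a)./(c - a);                % triangular CDF evaluated at the mode
dur = a + sqrt(u.*(c - a).*(b - a));  % branch for u <= Fb
ix = u > Fb;                          % branch for u > Fb
dur(ix) = c(ix) - sqrt((1 - u(ix)).*(c(ix) - a(ix)).*(c(ix) - b(ix)));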
2.4.3 Map and Match Conceptual Analogies Between Study Fields of Interest
A methodology of conceptual analogy discovery is frequently used to map and match the elements
and restrictions among the various knowledge fields covered by a research investigation. This
methodology necessitates a thorough literature search and meticulous analysis to find previous
scientific applications in each area of interest and commonalities between them. For this research
study, information gathered from numerous sources, such as academic publications, conference
papers, and books, was synthesized and presented in tabular formats to highlight parallels between
the Tracy-Widom distribution laws and project network schedules. Although the literature on both
themes suggests numerous applications, only seven and eight applications for each topic,
respectively, have been included in this study. While Table 2-5 provides a synopsis of a few
Tracy-Widom distribution applications in various fields and their sources, Table 2-6A and Table
2-6B provide further specifics on these applications. The same applications are expressed in Table
2-7 in terms of the Tracy-Widom laws' universality. The following tables include the identified
applications and concepts of the Tracy-Widom distributions.
Table 2-6A and Table 2-6B: Applications of the Tracy-Widom Distributions. For each area, the entries give the probability space / constraints, the model/parameter/process, the probability distribution, and the model approximation (algorithm).

1 - Permutation/Combinatory Models
• Probability space / constraints: groups S_N of all permutations π of N positive integers; IS_π: set of k (≤ N) ascending numbers having their permutations π in the same order. Also, groups of random words (w) of length N from an alphabet of k letters (groups of permutations σ on N words).
• Model/parameter/process: l_N, the length of the longest increasing subsequence IS_π of a group S_N; for words, the length of the longest weakly increasing subsequence in w, l⁺(w), or the length of the longest decreasing subsequence in w, l⁻(w).
• Probability distribution: random permutations π are uniformly distributed; the l_N are distributed according to an inverse function in N and n, with n being the greatest length. Each w has a probability of k^(−N); l⁺(w) or l⁻(w) is distributed according to a function in which the variables are N, k, and n, where n = l⁺(σ) or n = l⁻(σ), respectively.
• Model approximation: lengths l_N statistically behave like the largest eigenvalues of a random Hermitian matrix with an eigenvalue density having the form of a discrete Coulomb gas on Z; increasing or decreasing lengths statistically behave like the largest or smallest eigenvalues of a random matrix that has a trace of zero.

2 - Oriented Digital Boiling (ODB) Models
• Probability space / constraints: space of occupied sets changing the system state with time in a growing interface in the two-dimensional lattice Z².
• Model/parameter/process: fluctuations of the height function (h) characteristic of one-dimensional space-time variables, defined as a function of space-time variables and independent Bernoulli random variables (IBRV).
• Probability distribution: the height function and probability are related to the IBRV.
• Model approximation: the ODB problem is reduced to an increasing-path problem by studying the eigenvalues of random matrices with IBRV as entries.

3 - Ising Spin Glass Models / Disordered Magnetic Systems
• Probability space / constraints: systems of Sherrington-Kirkpatrick (SK) Ising spin vectors defined by parameters x_ij for their energy configuration along the axis of rotations (S_i = ±1; i = 1…N).
• Model/parameter/process: the total energy of a configuration as a function of spin components, coupling parameters x_ij for the quenched disorder, and parameter J for the strength of the energy between spins.
• Probability distribution: the parameters x_ij are normally distributed, and their eigenvalue density is defined by the susceptibility matrix inverse characteristic of the phase transition at the critical temperature Tc.
• Model approximation: studying the finite-size fluctuations of the Tc of the SK model is reduced to studying the distribution of the largest eigenvalue of a random matrix, which depends on the sample x_ij.

4 - Queueing Theory
• Probability space / constraints: series of n single-server queues, each with unlimited waiting space and first-in, first-out service.
• Model/parameter/process: the quantity of interest is D(k, n) ~ D_k, the departure time of customer k (the last customer to be served) from the previous queue n.
• Probability distribution: service times are i.i.d. and follow the Poisson distribution V.
• Model approximation: with some scaling, D_k is equal in distribution to the largest eigenvalue of a k × k Gaussian random matrix; D_k is independent of V.

5 - Aztec Diamonds of Order/Size n
• Probability space / constraints: sets of Aztec diamonds A_n; each A_n has 2n rows of squares whose centers lie on a vertical line; the number of squares per row is 2k, with k = 1…n from top to base and inversely; squares are tiled with dominoes, or 1 × 2 rectangles.
• Model/parameter/process: a domino height function per tiling; domino corners are connected to obtain paths; each A_n has four regular brick-wall pattern regions and a central region of irregular tiling patterns, or temperate zone.
• Probability distribution: a weight of ω or 1 is assigned to each vertical or horizontal domino; a tile τ has a weight of ω^m, where m is the number of vertical tiles in τ; τ is selected with a probability P as a function of ω given a tile and paths.
• Model approximation: a random tiling of the Aztec diamond of size n can be analyzed using zig-zag paths in the tiling, which solves the longest increasing subsequence problem for random permutations.

6 - Solitaire/Patience Card Games (Floyd's Game)
• Probability space / constraints: groups of shuffled decks of cards σ = {i1, i2, …, iN}.
• Model/parameter/process: reveal the 1st card i1, then the 2nd card i2; if i2 > i1 (in rank), start a new pile to the right of i1; otherwise, place i2 on i1. Reveal i3; if i3 > i1 and i3 > i2, start a new pile to the right of i2; otherwise, place i3 on the higher-ranked of i1 or i2 (if i1 > i3 and i2 > i3, place i3 on the smaller-ranked card). Play until all cards are revealed. The quantity of interest equals the number of piles at the end of the game started with a deck σ.
• Probability distribution: the PDF is defined as a function of N and a certain variable t.
• Model approximation: a shuffled deck can be thought of as a random permutation; patience sorting is closely related to the problem of the longest increasing subsequences for permutations π ∈ S_N = {1, 2, …, N}.

7 - Growth Models: PNG Droplet Model (continuous-time process)
• Probability space / constraints: set of plateaus of a crystal growing layer by layer through the random deposition of particles to form islands that spread laterally with constant speed on a one-dimensional time-space.
• Model/parameter/process: a height (h) function on the time-space defined based on sets of nucleation events inside a rectangle, the nucleation events resulting from the reunion of adjacent islands of the same level.
• Probability distribution: nucleation events occur independently and uniformly in space-time; events are Poisson distributed with density one.
• Model approximation: solving a PNG model problem is equivalent to solving the longest increasing subsequence of a permutation problem, with h as the length.
As mentioned earlier, the same applications presented in Table 2-6A and Table 2-6B are expressed in the following Table 2-7 in terms of the universality of the Tracy-Widom laws.
Like the tables on the use of the Tracy-Widom distribution laws in various fields, the following tables contain several project scheduling applications and theories used in construction and engineering. While Table 2-8 summarizes eight applications from diverse sources, Table 2-9A and Table 2-9B provide further specifics on these applications.
Depending on the nature, breadth, or other variables of the research, researchers use a variety of strategies to collect the data needed to answer their research questions. For example, while one study may collect quantitative data through a series of experiments, another study may collect qualitative data by polling a small group of individuals. In addition, researchers frequently use quantitative data to test hypotheses that explain observations or facts in statistics domains (e.g., collecting, interpreting, and presenting data), whereas qualitative data may serve to better grasp concepts or experiences.
Because of the nature of the current investigation, quantifiable data on construction project networks of various sizes and complexities are required. As a result, the systematically generated PSPLIB developed by Kolisch and Sprecher (1997), from which 2040 project networks of various sizes and structures were collected, represents an adequate, sizeable collection of networks for investigating the underlying behaviors of project network schedules. Therefore, the investigation will be conducted using the applied multivariate statistical techniques developed and introduced in the subsequent sections. Furthermore, the PSPLIB networks are freely available electronically in '.sm' format. Appendix A.1 contains a list of all gathered filenames, while Table 2-10 reports the total number of files collected for each set of project networks consisting of J30, J60, J90, or J120 activities, referred to as jobs or tasks. For instance, a project network J60 consists of 60 activities or jobs.
PSPLIB Networks
After obtaining a benchmark set of project networks from the PSPLIB, it is critical to treat them with care because they include critical project network data for subsequent computations. Each collection filename ends with the extension '.sm', which stands for Single-Mode Resource-Constrained Project Scheduling Problems (Kolisch and Sprecher 1997). Although this file format preserves the project data, converting each file from its original extension to a text file with the extension '.txt' was necessary. In other words, the conversion is required before using any of the files in MATLAB. Given the volume of files, a procedure created in Visual Basic for Applications (VBA) and included in Appendix B.1, on page 348, automates and manages the conversion process while preserving the integrity of the original file contents. Appendix A.2 (p. 346) contains the contents of the filename "j301.1.sm."
To satisfy the technical requirements for establishing precedence links between activities, the so-called AON diagram, as seen in Figure 2.9 or Figure 2.10, shows the structure of a project using nodes and arcs to represent the project's activities and their precedence relationships. Additionally, a couple of dummy activities with zero durations are introduced to the project network, namely "1" and "J." They represent the unique initiating (source) and concluding (sink) activities. As described and illustrated in Figure 2.10, a network is acyclic and numerically labeled (Kolisch and Sprecher 1997).
For this study, the triangular distribution is employed to simulate the probabilistic durations of project schedule activities. The distribution's parameters or boundaries, minimum (a), mode (b), and maximum (c), are defined as {90%, 100%, 150%} of the deterministic duration of any activity on a project network, respectively. Due to the large number of network schedules obtained for this study, it is critical to automate the operations required to determine the triangular distribution parameters for each activity. Automation will help minimize errors and save time. Thus, the flowchart in Appendix C.4 details the approach for computing triangular distribution parameters for each activity on a project schedule network, either manually or automatically. This methodology demands that the project activity information be stored in a tabulated, structured text file.
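As a minimal MATLAB sketch of this parameter computation (the variable names are illustrative, not those of the appendix routines), given a column vector d of deterministic activity durations:

    a = 0.90 * d;          % minimum: 90% of the deterministic duration
    b = 1.00 * d;          % mode: the deterministic duration itself
    c = 1.50 * d;          % maximum: 150% of the deterministic duration
    T = table(a, b, c);    % tabulated for export to a structured text file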
For any other project networks, such as the exemplar network, a table in either a text file or an
Excel Spreadsheet can be created in the same manner as in Appendix D.2 to tabulate the project
activity information. Additionally, for the PSPLIB networks, an Excel Spreadsheet table with three
columns named 'ID,' 'Name ID,' and 'pd,' for identifying the probability distribution required for
probabilistic activity durations, must be created. While several probability distributions can be used to sample activity durations within a schedule, this study employs a homogeneous probability distribution. Due to the large number of PSPLIB network files acquired for this study, a computer program created in MATLAB and included in Appendix B.2 assisted in tabulating project network data and exporting the resulting table to a text file format. More precisely, the MATLAB code calculates the probability distribution parameters for the activities, adds logic constraints and time lags between activities, and sends the information about the activity network to a new text file for later use. To demonstrate the validity of this methodology, the following Table 2-11 contains the output data for the sample network; refer to Appendix D.2 for the corresponding output information of the PSPLIB network j3038-7. Although there are four distinct CPM logic constraints, as seen in Figure 1.6, the conditions between activities are assumed to be finish-to-start (FTS) with zero-time lags.
After calculating the triangular distribution parameters of each activity, the flowchart in Appendix C.5 guides the sampling of probabilistic durations; a MATLAB subroutine implementing the flowchart operations allows for the computation and graphical depiction of probabilistic durations. This implementation calculates and displays the probabilistic durations of any given project activity based on its mode (b), minimum (a), and maximum (c). The MATLAB function "makedist" creates the probability distribution object for the distribution name (triangular) given the parameters a, b, and c. Because each activity's probabilistic durations are sampled from a triangular distribution, their graphical representation should resemble a triangle with its peak at b. The graphs in Figure 2.11A and Figure 2.11B depict probabilistic durations of the exemplar network activities whose data set information is provided in Table 2-11. Each chart has a triangular shape peaking at "b," suggesting that the triangular distribution governs the sampled durations. Knowing the distribution from which they are drawn, it is not surprising that the triangular distribution governs the activities' probabilistic durations. As a result, there is no need to plot activities' probabilistic durations once their sampling distribution is known.
Activity constraints are defined during the initial stages of designing a project schedule to establish precedence relationships between consecutive activities. As seen in Figure 1.6, any of the four distinct types of logic constraints can be represented in an n-by-n square and binary matrix with entries of type "0" or "1," where "n" is the network's activity count. A "0" or "1" indicates whether a precedence link exists between the pair of nodes (i, j) representing the project activities. The matrix offers a concise representation of the information flows between activities, allowing for a methodical mapping of network parts (Uma Maheswari et al. 2006). Table 2-12 illustrates this matrix, representing the exemplar network's dependency matrix. Indeed, the matrix is square and upper triangular. The lower half of this matrix has been purposely omitted because it carries the same information as the upper half. Take note that the three "1"s in the row corresponding to activity F indicate its immediate successors H, I, and K in this table. The two "1"s beneath its column indicate its immediate predecessors, C and E.
Table 2-12: Dependency Matrix of the Exemplar Network (upper triangle; rows and columns ordered Source, Mob, A-M, T/O, Sink)
Source 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mob 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0
A 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
B 0 1 0 0 0 0 0 0 0 0 0 0 0 0
C 0 1 0 1 0 0 0 1 0 0 0 0 0
D 0 0 0 0 0 0 0 0 1 0 0 0
E 0 1 1 0 0 0 0 0 0 0 0
F 0 0 1 1 0 1 0 0 0 0
G 0 1 1 0 1 0 0 0 0
H 0 0 0 0 0 1 0 0
I 0 0 0 1 0 0 0
J 0 0 1 0 0 0
K 0 0 0 1 0
L 0 0 1 0
M 0 1 0
T/O 0 1
Sink 0
The methodology for identifying network paths is based on the CPM introduced in Section 1.4.2
and the procedures proposed in the previous sections. This methodology allows the determination
and identification of critical and non-critical paths of any given number of project networks.
However, it requires multiple simulations of a project schedule to find all project network paths
that are likely to become critical during the execution of the project. Finding these paths
necessitates that a given network be simulated several times to allow random sampling of activity
probabilistic durations to imitate possible scheduling issue occurrences during the construction
phase of the project network. Thus, the total number of runs to simulate network schedules of any
project and the total number of probabilistic durations to generate for any project activity need to
be established before applying this methodology. However, because of the large number of benchmark schedules and the CPU time required to execute the operations of this methodology automatically, simulating (100) schedules per project network and generating (1000) probabilistic durations per network activity should be sufficient and feasible given the time and resources allocated to this study. As proposed, this methodology consists of the following four steps:
Step 1: Formatting of Network Files. Follow the methodology proposed in Section 2.4.5.1 to: (1) calculate the triangular distribution parameters associated with each activity's fixed duration; and (2) organize and store in a text file the data set on the project activities (ID, Name, Fixed Duration, etc.). If necessary, repeat this step for each additional network.
Step 2: Creation of the Network Dependency Matrix. Follow the methodology proposed in Section 2.4.4.3 to create the dependency matrix of the project network "N" using the project data set contained in the network file resulting from Step 1. This step translates precedence relationships between consecutive network activities into a binary matrix.
Step 3: Simulation of Network Schedules. Run (100) simulations to schedule the project network.
Step 3a: Follow the methodology flowchart proposed in Appendix C.5 to generate 1000 probabilistic durations for each activity on the network "N" using its triangular distribution parameters stored in the network file. There is no need to plot the resulting data points; instead, randomize them to obtain a random, probabilistic duration. Repeat this process for all activities to form a vector of random durations.
Step 3b: Utilize the CPM to schedule the project network based on the random activity durations obtained in Step 3a, assuming FTS constraints with zero-time lags (L=0). The ES and EF times are determined during the forward pass using the CPM scheduling method, which is detailed in Appendix C.1 for the forward pass, Appendix C.2 for the backward pass, and Appendix C.3 for float calculations. The LS and LF times, on the other hand, are calculated during the backward pass. Additionally, the entire duration of the project and the activity TF are calculated. The TF of an activity specifies the amount of time the activity can be postponed without influencing the start time(s) of its successor(s). Critical activities have TFs of zero, and their sequential connection defines the network's critical path. A network may contain several critical paths. Last, calculate each project's frequency and average duration for each critical path detected during the total number of simulation runs. A minimal sketch of the forward pass follows.
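The sketch below, with assumed variable names, exploits the fact that the networks are acyclic and numerically labeled (Kolisch and Sprecher 1997), so the activities can be processed in index order:

    % DM: n-by-n binary dependency matrix; d: sampled activity durations
    n = numel(d);  ES = zeros(n,1);  EF = zeros(n,1);
    for j = 1:n
        preds = find(DM(:,j));            % immediate predecessors of activity j
        if ~isempty(preds)
            ES(j) = max(EF(preds));       % FTS constraint with zero lag (L = 0)
        end
        EF(j) = ES(j) + d(j);             % early finish time
    end
    projDur = EF(n);                      % project duration at the sink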
Step 4: Identification of Network Paths. Generate an exhaustive list of all possible paths connecting 'Source/Start' to 'Sink/Target' of the network "N." This may be accomplished with graph theory, given the network's dependency matrix, which by definition is independent of activity durations; MATLAB provides the required graph-theory functions, as sketched below. The generated list contains all network paths, including critical ones. If two or more networks are considered, save the network "N" path results and go to Step 1 for the following network.
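A minimal sketch of this step, assuming the dummy 'Source' is node 1 and 'Sink' is the last node, and that MATLAB's digraph-based allpaths function (introduced in R2021a) is available:

    G = digraph(DM);                       % precedence digraph from the dependency matrix
    paths = allpaths(G, 1, size(DM,1));    % every path from Source to Sink
    numPaths = numel(paths);               % includes critical and non-critical paths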
For this study, automation of these methodology operations, as illustrated in the flowchart provided in Appendix C.6, is necessary to derive the indicators of network morphologies crucial to the investigation; the flowchart was developed for application to the networks considered for this study. To validate this methodology, Table 2-13 and Table 2-14 provide outputs for the exemplar network. The tables indicate that this network has thirteen paths, among which five have the potential to become critical. In addition, from the hundred network schedule simulation runs outputted in Table 2-14, it is most likely that seven out of the seventeen network activities would be critical. Moreover, the critical path made of these activities happened to be one of the two longest paths of this network. In addition, these results suggest a strong likelihood for a project network duration of 88.29 days.
Table 2-13: All Paths from Source to Sink of the Exemplar Network
Table 2-14: All Critical Paths of the Exemplar Network for 100 Simulations
Similarly, outputs for the PSPLIB project network J301-1 are provided in Appendix E.1 for all possible paths connecting activity 'Source' to activity 'Sink' and in Table 2-15 for the probabilistic durations of the project. These results show that the network J301-1 has thirteen possible paths, out of which only two have the potential to become critical. The most likely critical path comprises 34.4% of the network activities, is the network's longest path, and its activities have a 60% chance of being critical. In addition, its duration is indicative of the expected project network duration.
Table 2-15: All Critical Paths of Network j301 for 100 Simulations
Su et al. (2016) made this section's methodology possible by supplying some of the MATLAB subroutines required to display project network schedule diagrams and to mark critical project activities based on their criticality indices determined using the CPM. In addition, global complexity measures such as restrictiveness (RT) and density, which will be covered in the subsequent section, can be calculated and added to the network diagrams. Although criticality indices will be extensively discussed in a later chapter, it is necessary to introduce them briefly. Assume that a network schedule is composed of random variables (Dodin and Elmaghraby 1985). Under this assumption, the criticality index of an activity represents the probability that the activity is critical. The criticality index of a path, on the other hand, represents the probability that the path's duration will be larger than or equal to the duration of any other network path. A criticality index is a number that ranges from zero for a non-critical activity to one for a critical activity.
Additionally, activities having a criticality index greater than 0.5 will be denoted by a darker node
or box on the network diagram. Nevertheless, with few exceptions, the methodology for
representing any network schedule diagram is like the one proposed in Section 2.4.5.4 to find all
critical and non-critical paths of any number of networks. The reason is that they both employ
random durations of project activities during the project scheduling process. With that in mind, the methodology adopted to represent the network diagram is performed in five steps, as follows:
Step 1 and Step 2: Formatting of Network Files and Creation of Network Dependency Matrix
These steps are identical to the ones described in Section 2.4.5.4 on page 203.
Step 3: Simulation of Network Schedules. Run ten (10) simulations of the network schedule. For each simulation, follow Step 3a and Step 3b proposed in Section 2.4.5.4. For the same reason stated in the previous section, ten rather than 100 simulations should be sufficient to determine activity criticality indices. There is no need to calculate the average total project duration and critical path frequencies. Instead, compute each activity's criticality index (CrI) based on its TF amount as given in Equation 2.52(a). Go to Step 4.
Step 4: Computation of Criticality Indices. Compute the criticality index of each project activity on the network using Equation 2.52(b), as sketched below.
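Since Equation 2.52 is not reproduced here, the following sketch assumes the common formulation in which an activity's criticality index is the share of simulation runs in which its total float is zero (variable names are illustrative):

    % TF: nRuns-by-n matrix of total floats from the ten simulation runs
    nRuns = size(TF, 1);
    CrI = sum(TF == 0, 1) / nRuns;    % criticality index per activity
    criticalActs = find(CrI > 0.5);   % activities shaded darker on the diagram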
Step 5: Representing the Project Network Schedule Diagram and Project Critical Activities
Create a network schedule diagram and list all project network activities with criticality index values greater than 0.5. If two or more networks are being studied, save the outputs for the first network and proceed to Step 1 for the second network. Otherwise, the network schedule representation is complete. Following this methodology, a representation of each set of PSPLIB networks has been displayed. As shown in Table 2-10, each
set represents several networks of identical size. Figure 2.12A(a) shows the exemplar network diagram and nine project activities found critical out of ten simulation runs. These activities also happened to be the constituents of the network's most probable critical path, as outlined in Table 2-14. The exemplar network diagram is elongated in the x-axis direction because it requires three rows and five columns to represent the 17 project activities and their precedence relationships, symbolized by links between activities. Accordingly, the network possesses a sequential structure different from the structure of the diagram representing the PSPLIB network j12012-9 shown in Figure 2.12C. This network diagram necessitates 16 rows and 20 columns to describe the project (124) activities, and their associated links possess a distinctive parallel structure. When comparing the four PSPLIB network representations, the diagram in Figure 2.12A(b) depicting PSPLIB network j3038-7 has the most serial shape, requiring a row-column ratio of two to represent the project (34) activities. While the structures of the PSPLIB networks j3038-7 and j12012-9 could be visually defined based on their diagrams, the structures of the other graphs are entirely hybrid.
Given the column-row ratios of 1.23 and 1.38 required to represent the PSPLIB networks j6010-5 in Figure 2.12A(c) and j902-4 in Figure 2.12B, one may conclude that both network topologies are serial. However, simply displaying the networks is insufficient to identify their structures and classify them according to their topological features for comparison.
Based on the approach described in the preceding section, the following methodology is provided
for determining the study's network complexity metrics. The majority of the measures used were
derived from earlier research studies. While the literature demonstrates that a variety of complexity
measures can be used to evaluate the morphology of any given network, only six have been chosen.
Additionally, the methodology progresses from the most fundamental to the most advanced measure in terms of the data required to compute them. The following sections define the six complexity metrics utilized in this study, followed by the technique used to calculate each.
The number of links or arcs (a) connecting (n) nodes or project activities in precedence relationships provides a basic indicator of network complexity. This measure, known as the coefficient of a network (CNC), was developed by Pascoe (1966) as the ratio of "a" over "n" (Nassar and Hegab 2006; Demeulemeester et al. 2003) and was later redefined by various academics to become one of the most well-known complexity measures (Demeulemeester et al. 2003). CM1 will denote CNC in this study, and its calculation will only include non-dummy activities, as shown in Equation 2.53. The number of arcs (a) and nodes (n) will be capitalized to indicate that only non-dummy activities are counted. In the numerical implementation of CM1, the adjacency matrix can serve to calculate the number of arcs A, which is equal to the sum of all its elements. As a result, in the methodology provided in Section 2.4.5.4, a subroutine for summing up the dependency matrix components, minus dummies, can be added between Step 2 and Step 3. To validate this methodology, refer to the AON diagram in Figure 2.9 of the exemplar network and assign A and N values of 24 and 15, respectively; doing so should lead to a CM1 value of 1.6 for this network. Similarly, the network dependency matrix provided in Table 2-12 should be used to verify this result numerically, as sketched below.
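A minimal sketch of this numerical implementation, assuming the dummy Source and Sink occupy the first and last rows/columns of the dependency matrix DM:

    core = DM(2:end-1, 2:end-1);     % drop the dummy activities
    A = sum(core(:));                % number of non-dummy arcs
    N = size(core, 1);               % number of non-dummy nodes
    CM1 = A / N;                     % exemplar network: 24/15 = 1.6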
As with the criticality index established in Section 2.4.5.5, a global CI can be constructed in terms of the network paths connecting activities from 'Source/Start' to 'Sink/End.' As previously established, a network may have critical or non-critical paths, and the overall number of critical network paths is always less than the total number of non-critical network paths. Intuitively, the ratio of the network's critical paths to all feasible routes connecting its first to last activity may serve as a measure of complexity. With the same number of simulation runs, the stated complexity measure can allow the comparison or classification of different project networks. CM2 is the abbreviation for this measure, which is specified in Equation 2.54.
Its numerical computation can be accomplished after completing Step 4 of the methodology presented in Identification of Network Paths. The proposed procedure for the numerical calculation of CM2 can simply be performed using the outputs of the exemplar network provided in Table 2-13 for all possible paths and Table 2-14 for all the critical paths acquired after 100 simulation runs. With the resulting information, a CM2 value of 38.46% (= 5/13) is determined for the exemplar network schedule.
The Johnson measure, denoted by D or CM3 in this study, is another complexity measure based on basic project network information, specifically on project activity dependency relationships. Its formula, given by Equation 2.55, derives from the work of scholars such as Nassar and Hegab (2006, p. 556) and Boushaala (2010, p. 773) and their contributions to the study of project network complexity, where N is the total number of project activities and p_i represents the number of predecessors of activity i.
Numerically, CM3 can be derived from a network dependency matrix based on the binary information it contains, with "1" symbolizing a precedence relationship between activities "i" and "j" and "0" the lack of one. After determining the network dependency matrix in Step 2 of the methodology provided in Section 2.4.5.4, one can computerize the operations of the corresponding flowchart.
The dependency matrix of the exemplar network provided in Table 2-12 can be used to validate this methodology. As structured, the rows and columns of this dependency matrix are labeled with the project network activities. The predecessor(s) of any project activity inscribed in the upper row of the dependency matrix can be found in the extreme left column of the matrix as follows: first, localize the column associated with the activity; second, find all the "1"s in this column; last, starting at each one, figuratively draw a horizontal line to cross the vertical line symbolically passing through the extreme left column. At the intersection of the horizontal and vertical lines, the resulting activity represents one of the predecessors of the activity in consideration, whose total number is the sum of all the "1"s found in the activity's column. Likewise, the successor(s) of any project activity inscribed in the very left column of the dependency matrix can be found in the upper row of the matrix.
As an example, the activity "H," which is shown in Figure 2.13, has activities F and G as its
predecessors. To find its successors, first, find "H" on the far left of the matrix, then find all the
"1s" on the matrix row that corresponds to "H." Finally, draw a vertical line from the only "1"
found in this row to reach the uppermost row of the dependency matrix. Activity "M" located at
the intercession of the vertical line and horizontal line figuratively passing through the upper row
of the matrix represents the only successor of activity "H." In summary, "H" possesses two (2)
predecessors F and G, and one (1) successor, "M," which are also obtained when using the network
AON diagram shown in Figure 2.9. With the resulting information, the difference between the
total numbers of the predecessors of "H" and successors of "H" is equal to one. Similarly, the
predecessors and successors of all project activities may be identified, and the difference between
their predecessor and successor total numbers may be computed as follows. Note that adding or
removing dummy activities from the dependency matrix does not affect the final value of D.
Dependency matrix of the exemplar network with successor counts (last column) and predecessor counts (bottom "Sum" row):
A 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 2
B 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
C 0 1 0 1 0 0 0 1 0 0 0 0 0 3
D 0 0 0 0 0 0 0 0 1 0 0 0 1
E 0 1 1 0 0 0 0 0 0 0 0 2
F 0 0 1 1 0 1 0 0 0 0 3
G 0 1 1 0 1 0 0 0 0 3
H 0 0 0 0 0 1 0 0 1
I 0 0 0 1 0 0 0 1
J 0 0 1 0 0 0 1
K 0 0 0 1 0 1
L 0 0 1 0 1
M 0 1 0 1
T/O 0 1 1
Sink 0 0
Sum 0 1 1 1 1 2 1 2 1 2 2 2 2 3 1 3 1
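A minimal sketch of this predecessor/successor counting, assuming DM is the full square binary dependency matrix with dummies already excluded (Equation 2.55 itself is not reproduced here):

    nPred = sum(DM, 1);           % column sums: predecessors per activity
    nSucc = sum(DM, 2)';          % row sums: successors per activity
    diffs = abs(nPred - nSucc);   % |p_i - s_i| differences entering the D measure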
Another complexity measure considered for the analysis of the networks collected for this study is the one that Nassar and Hegab (2006) developed. The number of activity nodes (n) and the number of links or arcs (a) connecting them serve as its inputs (as with CM1). As depicted in Equation 2.56, the developed measure, denoted by Cn or CM4 in this study, is formulated as a percentage (%), unlike most network complexity measures, which are expressed as unitless coefficients. In addition, this measure may serve to rank and compare project alternatives by indicating the option that may be simplest to manage (Nassar and Hegab 2006). Nassar and Hegab (2006, p. 557) advised that redundant links be removed from the network before analyzing network complexity because existing and indirect links can replace them. Otherwise, incorporating them would be misleading, as it would imply a higher level of complexity than exists in the project network. Whenever possible, the dummy activities "Source" and "Sink" will be omitted from the determination of "n" or "a" in this study to justify the use of "N" or "A" in Equation 2.56.
Cn(%) = 100 × Log[A/(N − 1)] / Log[(N² − 1)/(4(N − 1))] if N is odd [Eq. 2-56]
Cn(%) = 100 × Log[A/(N − 1)] / Log[N²/(4(N − 1))] if N is even
Using the formula in Equation 2.56, Cn can be calculated directly or as a subroutine to any of the previously described methodologies. One may use the following information on the exemplar network to validate the methodology: activity nodes (N = 15) and linkages or arcs (A = 24).
Cn = 100 × Log[24/(15 − 1)] / Log[(15² − 1)/(4(15 − 1))] = 38.88 % [Eq. 2-57]
According to Table 2-16, a Cn value of 38.88% is satisfactory, but the project network may be improved.
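A minimal sketch of Equation 2.56 in MATLAB, using the exemplar network values:

    A = 24;  N = 15;                 % non-dummy arcs and nodes
    if mod(N, 2) == 1                % N is odd
        Cn = 100 * log10(A/(N-1)) / log10((N^2 - 1)/(4*(N-1)));
    else                             % N is even
        Cn = 100 * log10(A/(N-1)) / log10(N^2/(4*(N-1)));
    end                              % yields Cn = 38.88 for the exemplar network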
Another complexity measure considered for this study is density, also known as Order of Strength (OS). The complexity measure OS, also referred to as CM5 in this study, is "defined as the number of precedence relations (including the transitive ones [,] but not including the arcs connecting the dummy start or end activity) divided by the theoretical maximum number of precedence relations [n × (n - 1)/2], where n denotes the number of [non-]dummy activities in the network" (Su et al. 2016). As stated earlier in the paragraph devoted to the complexity measure CM1, n in the expression [n × (n - 1)/2] will be replaced with a capital letter to indicate the use of non-dummy activities in the expression of OS. In addition, the total number of precedence relations can be calculated as the sum of the elements of the network precedence matrix without the dummy activities. Accordingly, OS can be expressed by Equation 2.58 below, which may be used to compute it numerically.
OS = 2 × Σᵢ Σⱼ DependencyMatrix(i, j) / [N(N − 1)] [Eq. 2-58]
In the above equation, N represents the total number of non-dummy nodes in an AON network schedule. By adding a subroutine after 'Creation of Network Dependency Matrix' in Step 2 of the methodology proposed in Section 2.4.4.5, one may calculate the density of any network. The exemplar network, which has 15 non-dummy activities and a total of 24 precedence relations, can serve once again as the illustration of this complexity measure. By substituting both values in Equation 2.58, an OS value of 0.2286 [= 24×2/(15×14)] is found for the exemplar network.
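Reusing the non-dummy matrix 'core' and count N from the CM1 sketch above, a one-line MATLAB implementation of Equation 2.58 would be:

    OS = 2 * sum(core(:)) / (N * (N - 1));   % exemplar: 2*24/(15*14) = 0.2286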
The last complexity measure considered for this study is restrictiveness (RT). This measure was derived from graph theory and first introduced by Thesen (1977). Nassar and Hegab (2006, p. 557) justified its application to construction schedules because a "project network usually falls under a special category in graph theory called directed acyclic graph." Its determination requires not only the network schedule but also a known reachability matrix R of the network in question. Given the network dependency matrix, also known as an adjacency matrix, one can extract from it a new matrix called the reachability matrix R, whose entries r_ij are as follows: r_ij = 1 if the activities in row "i" and column "j" are reachable or connected by a path, and r_ij = 0 otherwise.
The value of RT can be calculated using Equation 2.59 expressed in terms of the elements of the
R matrix and the set of paths V connecting activity “i” to activity “j” of the network. RT values
vary between 0 and 1, with a value of 0 indicating a perfect parallel digraph (directed graph or
AON diagram) and a value of 1 indicating a serial digraph (Su et al. 2016, p.3-4). Su et al. (2016,
p.3) added that “[r]edundant arcs do not affect RT, since it is based on the
reachability matrix (the closure of the connectivity matrix)” (Latva-Koivisto 2001, p. 16).
RT = [2 Σ_{i,j∈V} r_ij − 6(N − 1)] / [N(N − 1)] [Eq. 2-59]
Numerically, RT can be calculated after the dependency matrix determination in Step 2 of the
methodology proposed in Section 2.4.5.4. The validation of the proposed methodology for the
calculation of RT can be performed by applying the exemplar network information from previous
sections. Using its dependency matrix, the reachability matrix of the exemplar network, as
provided in Table 2-17 below, can be derived. Thus, its RT value can be determined using Equation
2.59 as follows:
RT = [2 Σ_{i,j∈V} r_ij − 6(N − 1)] / [N(N − 1)] = [2(116) − 6(15 − 1)] / [15(15 − 1)] = 0.6476
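A minimal sketch of this computation, assuming MATLAB's digraph and transclosure functions are used to obtain the reachability matrix:

    G = digraph(DM);                          % DM: binary dependency matrix
    R = full(adjacency(transclosure(G)));     % reachability matrix r_ij
    N = size(R, 1);
    RT = (2*sum(R(:)) - 6*(N-1)) / (N*(N-1)); % Eq. 2-59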
As indicated in Section 1.3.2, construction project networks are built up of nodes representing
activities that are linked by precedence relationships. Each activity is a project task that must be
completed within the time range specified for the project to be completed in its entirety. Aside
from project activities, the enormous interactions between numerous parties involved in building
a project, especially a large one, cause a construction project to fall under complex systems.
Scientists have utilized random matrices to simulate complex systems and have examined their behaviors through the distributions of their eigenvalues. This section aims to use the ubiquitous pattern known as "universality," which is based on RMT and has been used to explore a variety of complex system characteristics; this is justified by the similarities between the fields of application of the Tracy-Widom laws and project scheduling. As a result, examining the distribution of the extreme eigenvalues of random matrices obtained from probabilistic durations of project network activities could contain the key to identifying their underlying behavior. By repeatedly sampling activity durations of a project network, it is possible to derive reasonable inferences regarding activity durations in the general population. Creating a mathematical model to draw samples is thus the first step toward attaining the purpose of the section. The second stage is to develop a simulation technique for randomly selecting the samples required for a thorough statistical investigation of the sample covariance matrix's eigenvalues. The final step is to perform multivariate analysis, specifically hypothesis testing using the p correlated variables assessed jointly, to establish whether the Tracy-Widom distribution governs the data.
This research is in part inspired by the discovery that the Czech physicist Petr Šeba made after plotting, on a computer, thousands of bus departure times that spies paid by each bus driver collected at a bus stop in Cuernavaca, Mexico. A bus driver would use this information to maximize his profits by either slowing down if the bus ahead had just left, so that passengers would accumulate at the next stop, or speeding up to pass other buses if the bus ahead had left long ago. The discovery confirmed his suspicion that the chaotic interactions between drivers caused the spacing between bus times to display the same behavior pattern as in his earlier experiments involving chaotic systems in quantum physics (Wolchover 2014). This ubiquitous pattern, which scientists refer to as "universality," has often appeared in investigations of the behaviors of complex systems using RMT.
RMT is concerned with the empirical distribution of the eigenvalues of random matrices. In addition, the similarities between construction project networks and the fields of application of the Tracy-Widom distribution laws, as revealed in Section 2.4.3, have led to the belief that RMT may be critical to discovering a natural probability distribution for the extreme eigenvalues of random matrices for project networks. A random matrix is a collection of entries tabulated in columns or rows, each representing a random vector whose independently chosen observations are taken from a multivariate population with a known or unknown distribution law characterizing that population.
In construction management and engineering, a dependency matrix, encoding with "1" or "0," respectively, whether or not a link exists between a pair of activities, serves to describe the topological structure of a project network. Given project scheduling information, Section 2.4.5.3 and Section 2.4.5.2 provide the methodologies for generating dependency matrices of project networks and computing probabilistic durations of project activities. In addition, the derived data can serve to encode the project network into a matrix. Figure 2.14A and Figure 2.14B illustrate thirteen different schemes to encode a network into a matrix. Since the ultimate use of the encoded network matrix is to study the behaviors of project networks using eigenvalues of the derived matrix, the following reasons guided the selection process to discard some of the proposed schemes.
As constructed, the first and foremost reason is that the first twelve matrices are not appropriate for the research investigation. They are either built with fixed binary numbers "0" and "1" only, fixed binary numbers and random activity durations, or a combination of activity durations and their early (ES/EF) and late (LS/LF) times; therefore, they are not random matrices. However, the rectangular matrix in scheme thirteen, with entries representing EF times of project activities computed based on activity probabilistic durations sampled from a triangular distribution with known parameters, is random. The second reason is that all entries below or above the diagonal of an upper or a lower triangular matrix are zeros; with this construction, there is a risk of degeneration in standardizing these matrices, as most multivariate analysis techniques require. The last reason is the redundancy of entries in some of the matrices, for example, keying activity durations or calculated times k+1 times, with k representing the total number of an activity's predecessors or successors. Although the bulk of the initially considered schemes are not random matrices, they are worth mentioning to deter any future attempt to consider them in a similar investigation using multivariate statistical analysis. To sum up, among the thirteen ways of encoding a project network into a matrix, only the last matrix proposed in scheme thirteen will serve to construct the random matrices needed for this study. Therefore, the following section provides a methodology for doing so.
Figure 2.14A: Different Schemes for Encoding a Project Network Schedule into a Sample Data Matrix
Figure 2.14B: Different Schemes for Encoding a Project Network Schedule into a Sample Data Matrix
Table 2-18 shows an example of encoding a project network schedule into a matrix using the selected scheme. The encoded project network schedule is for the exemplar network, which is frequently utilized throughout this chapter to quickly illustrate newly developed concepts and associated approaches. Table 2-11 contains information about the project network, including a list of activities and their defined durations. Finally, refer to Table 2-12 and Figure 2.9 for the project network's dependency matrix and AON diagram.
Run No. Act. 1 Act. 2 Act. 3 Act. 4 Act. 5 Act. 6 Act. 7 Act. 8 Act. 9 Act. 10 Act. 11 Act. 12 Act. 13 Act. 14 Act. 15 Act. 16 Act. 17
1 0.011 8.65 35.88 22.69 28.97 56.79 29.61 46.46 51.28 59.84 64.50 59.48 72.45 91.33 73.58 94.66 94.67
2 0.011 8.45 26.45 17.79 25.47 49.31 22.69 46.99 42.31 55.47 61.13 46.65 61.45 77.99 68.20 81.94 81.96
3 0.012 7.85 25.89 17.27 23.10 49.41 23.32 41.04 37.80 49.06 54.64 49.28 61.22 79.75 60.42 82.69 82.70
4 0.013 9.13 26.91 20.43 27.60 52.16 23.96 47.85 39.05 56.09 60.35 47.39 62.08 77.40 70.65 80.47 80.49
5 0.011 10.34 31.36 19.72 27.33 53.56 32.00 49.35 53.96 60.62 70.28 57.99 73.87 88.80 69.71 91.88 91.89
6 0.010 9.90 37.38 20.90 27.99 53.61 26.19 48.24 43.11 56.10 64.15 60.98 64.12 90.38 65.48 94.21 94.22
7 0.012 10.33 30.53 22.49 31.17 49.69 31.27 47.98 49.87 58.43 64.42 53.94 68.86 81.46 68.87 84.21 84.22
8 0.009 8.62 27.09 21.15 28.79 53.06 24.91 53.44 46.82 59.20 65.19 56.23 67.94 89.92 68.81 94.01 94.02
9 0.014 10.23 29.35 19.35 27.75 50.66 24.71 43.96 40.21 49.97 60.28 49.71 62.85 85.68 60.43 89.69 89.70
10 0.011 9.06 30.62 18.77 27.70 52.42 27.13 46.31 43.42 54.82 57.98 53.87 64.77 84.58 65.44 87.76 87.77
11 0.012 7.90 27.80 20.96 27.30 54.39 25.88 46.96 45.73 54.57 62.19 54.94 61.90 79.35 65.45 83.14 83.16
12 0.015 6.55 34.99 19.56 27.18 52.78 20.32 48.32 35.00 56.28 59.63 55.00 70.44 77.79 70.07 80.57 80.58
13 0.015 7.44 25.93 20.08 27.80 53.45 29.01 44.86 50.70 59.20 63.78 45.76 65.02 82.15 72.87 85.58 85.59
14 0.010 9.39 28.20 23.42 30.78 57.36 23.57 55.61 41.42 61.25 71.88 55.29 72.89 94.52 72.33 97.62 97.63
15 0.010 7.88 34.97 19.53 27.22 55.36 28.91 52.03 44.52 57.76 66.29 56.38 66.50 89.23 70.49 93.57 93.58
16 0.012 8.73 37.03 20.85 27.50 63.85 31.16 55.60 52.68 63.40 70.34 59.10 72.70 88.90 77.69 93.29 93.31
17 0.010 6.55 25.52 18.72 24.37 41.83 20.88 45.87 35.82 53.63 59.57 43.46 61.54 79.56 63.39 83.59 83.60
18 0.009 9.17 36.31 19.01 27.33 55.25 22.89 48.04 46.27 56.21 60.62 61.60 65.88 82.56 70.87 86.30 86.32
Table 2-18: Illustration of a Sample Data Matrix Derived from Early Finish Times of Project Network Activities
As in other applications of probability theory and statistics, prior to making probability statements, the process starts with the formulation of a statistical model 𝒳, also known as an RMM, to describe all random occurrences in the population of interest; for such a population, the true sample covariance matrix is unknown. As defined in Section 2.3.1, an RMM is a probability triple (𝛺, ℙ, ℱ), where 𝛺 represents a set, ℙ a probability measure, and ℱ a family of measurable subsets of 𝛺. Table 2-19 below provides a summary of the model components.
The development of the following procedure takes into account the conditions of application of the well-known "universality" results, mainly the celebrated Johnstone theorem, extended by various authors such as Soshnikov and Péché to relax most of its requirements. This theorem has been extensively used to describe the limiting behaviors of complex systems through the adequately centered and scaled largest eigenvalues of the random matrices that represent them. Thus, doing so will help fully define the model required to investigate the limiting behavior of probabilistic durations of project network activities and therefore find their true probability distribution, which for now is assumed to be the Tracy-Widom distribution 𝐹₁ for the reasons provided in the previous sections.
Accordingly, inspired by the ad hoc construction proposed by Johnstone (2001) and the formulation of the statistic 𝑇², also called Hotelling's 𝑇² (refer to Johnson and Wichern 2019), the following is the procedure used to standardize the random matrix 𝑿 = (𝒙₁, ⋯, 𝒙ₚ), with each column collecting the n sampled EF times of one activity of the project network. As is the practice in the fields of application of probability and statistics, before any descriptive and inferential statistical analysis, practitioners usually standardize raw data [e.g., Saccenti et al. (2011) and Forkman et al. (2019)]; this helps avoid ending up with degenerate matrices.
Step 1: Standardize 𝑿 to create a new matrix 𝑾 = (𝒘₁, ⋯, 𝒘ₚ) such that each of its columns has mean zero and unit Euclidean norm (‖∙‖₂), as follows in Equation 2.60:
𝒘ⱼ = (𝒙ⱼ − 𝒙̄ⱼ) / ‖𝒙ⱼ − 𝒙̄ⱼ‖₂ [Eq. 2-60]
Step 2: Create a new data matrix 𝑿 = 𝑹 ∙ 𝑾 as the entrywise product of a randomizing matrix 𝑹 and the standardized matrix 𝑾, respectively. The elements r_ij of the matrix 𝑹 are randomly selected according to a long-tail distribution like the chi-square (𝜒²) distribution. This is crucial not only because of the interesting features of long-tail distributions but also to randomize the entries x_ij of 𝑿, which the standardization of Step 1 has constrained.
Step 3: Select a significance level α to construct a confidence interval encompassing all plausible random values of the test statistic that would not be rejected by the level-α test of the null hypothesis; otherwise, observed values may lie too far from the hypothesized value. This process, used by practitioners, targets observed values that lie in the 100(1 − α)% interval, also referred to as the acceptance region for the designed test. One may refer to Johnson and Wichern (2019) for more on this topic. Accordingly, let Equation 2.61 define the matrix 𝑹:
𝑹 = [nu ∙ 𝝌²_{α/2}(n, p)]^{1/2} [Eq. 2-61]
nu = (n − 1) 𝜒²ₙ(α/2) / [n(n − p)] [Eq. 2-62]
where 𝝌²_{α/2}(n, p) denotes an n × p matrix of chi-square draws and 𝜒²ₙ(α/2) represents the inverse CDF of 𝜒² with n degrees of freedom evaluated at probability values in (0, 1), here α/2.
Note that there is some leeway in constructing the above expression, with regard not only to the choice of α (here α/2) but also to the long-tail distribution. Appendix G provides a handful of expressions examined empirically before selecting the one provided here, as it gave the best results.
Step 4: Create a real Wishart matrix or sample covariance matrix 𝑺 from the new data matrix 𝑿:
𝑺 = 𝑿ᵀ𝑿 / (n − 1) [Eq. 2-63]
By substituting 𝑿 with its expression from Step 2 and then 𝑹 with its expression from Equation 2.61:
𝑺 = (𝑹 ∙ 𝑾)ᵀ(𝑹 ∙ 𝑾) / (n − 1) = [nu / (n − 1)] [𝝌²_{α/2}(n, p)^{1/2} ∙ 𝑾]ᵀ [𝝌²_{α/2}(n, p)^{1/2} ∙ 𝑾]
Since the matrix products denoted by "∙" in the above expression are obtained by straight multiplications of the corresponding entries of each matrix of interest, Equation 2.64 is the final derived expression of 𝑺. From now on, it will be referred to as 𝑺_NET, with the subscript NET added to avoid any confusion with the generic sample covariance matrix 𝑺 used throughout this manuscript up to this point. Henceforth, the term sample covariance matrix will refer to 𝑺_NET, denoting the sample covariance matrix for probabilistic durations of a project network as defined by Equation 2.64.
𝑺_NET^{n,p,α} = c_{n,p,α} [𝝌²_{α/2}(n, p)^{1/2} ∙ 𝑾]ᵀ [𝝌²_{α/2}(n, p)^{1/2} ∙ 𝑾] [Eq. 2-64]
c_{n,p,α} = 𝜒²ₙ(α/2) / [n(n − p)] [Eq. 2-65]
Step 5: Compute and sort the eigenvalues of 𝑺_NET^{n,p,α} as 𝑙₁ ≥ 𝑙₂ ≥ ⋯ ≥ 𝑙ₚ. Given the expression of 𝑺_NET^{n,p,α} provided in Equation 2.64, the sample size n must be greater than p; otherwise, the constant c_{n,p,α} is undefined, and there will be no sample covariance matrix 𝑺_NET^{n,p,α} associated with the network.
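A minimal MATLAB sketch of Steps 1 through 5 (the chi-square degrees of freedom and the variable names are assumptions for illustration, not the dissertation's exact routines):

    [n, p] = size(X);                        % X: n samples of p EF times
    Wc = X - mean(X);
    W  = Wc ./ vecnorm(Wc);                  % Eq. 2-60: zero-mean, unit-norm columns
    c  = chi2inv(alpha/2, n) / (n*(n-p));    % Eq. 2-65, requires n > p
    Chi = chi2rnd(n, n, p);                  % entrywise chi-square draws
    Xnew = sqrt((n-1) * c .* Chi) .* W;      % R .* W per Eq. 2-61/2-62
    S_NET = (Xnew' * Xnew) / (n - 1);        % Eq. 2-63/2-64
    l = sort(eig(S_NET), 'descend');         % Step 5: ordered eigenvalues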
Step 6: Determine the number of eigenvalues and the universality theorem required to analyze the limiting behavior of the eigenvalues acquired in Step 5. Given the exploratory character of this study, it is prudent to examine the mth largest eigenvalue of 𝑺_NET^{n,p,α} rather than focusing exclusively on the largest one. As a result, investigating the first through fourth largest eigenvalues of 𝑺_NET^{n,p,α} should be adequate to derive additional insight into their limiting behavior.
Recalling the universality of the TW laws (see Section 2.3.5), the sample covariance matrix 𝑺_NET^{n,p,α}, whose entries are not quite Gaussian, can serve to study the limiting behavior of its eigenvalues. Indeed, it belongs to the class of sample covariance matrices drawn from a more general population that is not governed by the normal distribution, for which researchers such as Bao et al. (2015) have recently continued proving universality for the limiting behavior of the normalized largest eigenvalue under relaxed assumptions. Thus, Johnstone's theorem, as extended by Péché (2008, 2009) and other authors through the relaxation of some of its assumptions, is appropriate for this study. This is justified by the design of 𝑺_NET^{n,p,α}, which is based on tried-and-true strategies used by practitioners when dealing with non-Gaussian data. As a result, it is simple to demonstrate that the class of sample covariance matrices constructed from sampled duration data of project networks meets the four requirements for universality.
Step 7: Standardize the first four largest eigenvalues 𝑙₁, 𝑙₂, 𝑙₃, and 𝑙₄ of 𝑺_NET^{n,p,α} to obtain their normalized counterparts given by Equation 2.42 and Equation 2.43, respectively. Let Norm I and Norm II denote these two normalizations, which will be employed separately to gain more insight into the limiting behavior of the largest eigenvalues, as stated earlier. The centering and norming constants 𝜇_np and 𝜎_np are those defined for Equation 2.42.
Table 2-20: Normalization Methods for Scaling the mth Eigenvalue of the Sample Covariance Matrix S_NET
Norm I, Johnstone's (2001) limiting law: 𝑙̃ₘ = (𝑙ₘ − 𝜇_np) / 𝜎_np
Norm II, Tracy and Widom's (2000) limiting law: 𝑙̂ₘ = (𝑙ₘ − √(2n)) / (2^{−1/2} n^{−1/6})
Step 8: Decide on the number of simulations needed to sample data from the population of activity durations and derive the necessary test statistics based on the observations of 𝑙̃ₘ and 𝑙̂ₘ. For the hypothesis test to be completed in Step 9, unless otherwise indicated, a total of 1000 simulations, denoted by N, will be performed to collect the order statistics 𝑙̃ₘ and 𝑙̂ₘ for any given identified network of size p; in various studies like the current one, researchers have used comparable numbers of simulations.
Step 9: Conduct a simulation-based experiment to perform a goodness-of-fit test based on the order statistics collected in Step 8, testing whether the limiting distribution of probabilistic durations of project network activities is governed by the Tracy-Widom limit law 𝐹₁ (TW1) given by Equation 2.36(b). The Kolmogorov-Smirnov (K-S) Goodness-of-Fit Test (see Section 1.6.7.2) is the most appropriate for this investigation because other scholars, such as Saccenti et al. (2011), have used it for validating distributional assumptions about their data. The test hypotheses are formulated as follows:
H0: the Tracy-Widom limit law 𝐹₁ is the limiting probability distribution of project network activities' durations.
H1: the limiting probability distribution of project network activities' durations is not 𝐹₁.
As described in Section 1.6.4.3, a K-S test requires the computation of the quantiles of the observed order statistics and the evaluation of the hypothesized CDF at these quantiles; for these evaluations, the MATLAB routines developed by Dieng (2005) aided in approximating the Tracy-Widom CDF 𝐹₁(𝑥). Depending on the tabulated critical value of the maximum absolute difference between the sample and population CDFs, 𝑐_{α,N}, which is determined based on the significance level α, the test is rejected if 𝐷 > 𝑐_{α,N}. When the K-S test results in the acceptance of the null hypothesis H0, a graphical representation of the data, such as a Q-Q plot, can corroborate the limiting distribution obtained through hypothesis testing.
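A minimal sketch of this test using MATLAB's kstest with a user-supplied hypothesized CDF, where tw1grid and tw1cdf stand for a tabulated TW1 CDF (e.g., produced by Dieng's routines) and are assumptions here:

    % lNorm: N-by-1 vector of normalized largest eigenvalues from the simulations
    [h, pval, Dstat, cv] = kstest(lNorm, 'CDF', [tw1grid, tw1cdf], 'Alpha', 0.05);
    % h = 0 means H0 (TW1) is not rejected at the 5% level; Dstat is the K-S statistic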
Step 10: Graph a Q-Q plot and/or a histogram of the empirical distribution to compare it with the hypothesized Tracy-Widom distribution.
With the ten steps devised in this section, one can create the mathematical model necessary to confirm, through a goodness-of-fit test such as the K-S, whether the Tracy-Widom 𝐹₁ is the limiting distribution of durations of project activities of a given network of size p. The question now is how many samples (n) are required for this investigation. The following section provides a step-by-step answer.
2.4.6.3 Model Development: Finding the Optimum Sample Size n for the Data
This section proposes a procedure for finding the optimal sample size n for the project network activity durations required to build a mathematical model that helps understand their behavior. As discussed in the previous section, this model is necessary to obtain samples from the population needed to make inferences and probability statements about the unknown aspects of the underlying distribution of project networks, suspected to be governed by the Tracy-Widom laws thanks to their universality. The empirical procedure is based on universality theorems such as those formulated from the Tracy-Widom limit laws discussed in Section 2.3.4 and proceeds as follows.
Step 1: Make a distributional assumption about the data. For this study, the assumption is that the Tracy-Widom limit law 𝐹₁ (TW1) governs the limiting probabilistic durations of project network activities, which should be even more true for more extensive project networks.
Step 2: Choose the total number of data points 𝑛ᵢ required for the generation of sample data matrices for project networks 𝑿 = (𝒙₁, ⋯, 𝒙ₚ), with 𝑛ᵢ > 𝑝 and each 𝒙ⱼ representing EF times of the jth activity on a given project network. Each network chosen for this inquiry comprises p activities, as stated in the preceding sections, which also characterize their computational process. Notably, the more points, preferably sequential integers with steady increments, the easier it is to estimate values between points using the interpolation approach; the general rule is that more points yield better interpolation.
Step 3: Follow the procedure described in the previous section to derive the sample covariance matrix 𝑺_NET of the data based on 𝑿 and, using Norm I and Norm II, determine its normalized eigenvalues.
Step 4: Construct the scatter of pairs (𝑛ᵢ, 𝑙̃₁) for 𝑚 = 1 only, instead of all four values of m; refer to Step 6 for the justification. As discussed in Section 1.6.4.1, the scatterplot as a visualization technique can help uncover interesting "patterns" hidden within the data and locate unusual observations.
Step 5: Calculate the deviations ∆ between the empirical and hypothesized parameters using the known mean and variance of the assumed distribution. For every sample size 𝑛ᵢ, the number of simulations required to calculate the empirical mean and variance must be specified; because this must be done for each sample size 𝑛ᵢ, 100 simulations would be adequate to obtain the observed statistics. Now, to formulate the expressions of the deviations of means and variances calculated at each simulation, let 𝜇_TW and 𝑣𝑎𝑟_TW denote the mean and variance of the Tracy-Widom distribution 𝐹₁ (respectively 𝜇̂ and 𝑣𝑎𝑟̂ those of the observed 𝑙̃ₘ), and let ∆_{𝜇,m,𝑛ᵢ} and ∆_{var,m,𝑛ᵢ} be the deviations calculated based on the statistics of the mth observed eigenvalues 𝑙̃ₘ and the theoretical 𝜇_TW and 𝑣𝑎𝑟_TW, respectively. Equation 2.66 below provides their expressions:
∆_{𝜇,m,𝑛ᵢ} = (𝜇̂ − 𝜇_TW) / 𝜇_TW (a) [Eq. 2-66]
∆_{var,m,𝑛ᵢ} = (𝑣𝑎𝑟̂ − 𝑣𝑎𝑟_TW) / 𝑣𝑎𝑟_TW (b)
The values of 𝜇_TW and 𝑣𝑎𝑟_TW are known and can be found in Table 2-3 (for 𝐹₁, 𝜇_TW ≈ −1.2065 and 𝑣𝑎𝑟_TW ≈ 1.6078).
Step 6: Plot the pairs (𝑛ᵢ, ∆_{𝜇,1,𝑛ᵢ}) to obtain the smoothing curve of deviations of means on the one hand and the pairs (𝑛ᵢ, ∆_{var,1,𝑛ᵢ}) on the other hand. Subsequently, find the intersection of the resulting curve with the n-axis at ∆_{𝜇,1,𝑛ᵢ} = 0 (resp. ∆_{var,1,𝑛ᵢ} = 0 for the other curve). The obtained value of n is the optimal sample size 𝑛_optimum to consider for the verification of the distributional assumptions required to build a mathematical model for the project network identified for the investigation. Repeat the process for all networks selected for the study. Unless empirically investigated, there is no guarantee that all the curves will yield the exact same value of n.
Note that there is no need to determine the deviations ∆_{𝜇,m,𝑛ᵢ} and ∆_{var,m,𝑛ᵢ} and plot curves for each of the four mth largest eigenvalues to determine 𝑛_optimum. The reason is that they are eigenvalues of the same sample covariance matrix 𝑺_NET derived from the generated sample data matrix 𝑿. Thus, the value obtained with the normalized largest eigenvalue (𝑚 = 1) alone will be adequate to determine 𝑛_optimum.
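A minimal sketch of the zero-crossing search in Step 6, assuming ni is the vector of candidate sample sizes and dMu the matching mean deviations from Equation 2.66(a):

    nGrid = min(ni):max(ni);                    % fine grid between the chosen points
    dFit  = interp1(ni, dMu, nGrid, 'spline');  % smoothing curve of deviations
    [~, k] = min(abs(dFit));                    % point closest to the n-axis
    nOpt  = nGrid(k);                           % optimal sample size n_optimum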
This section finishes with Step 6, completing the procedure for determining the ideal sample size n. As previously stated, the value of 𝑛_optimum is essential for hypothesis testing to validate any distributional assumption with known parameters of the assumed probability distribution.
Given a complexity measure, whose computational methodology can be found in Section 2.4.5.6, the analysis of the values obtained for the benchmark networks collected for this study is facilitated by classifying the computed values into five categories or groups. This is performed by plotting a histogram of the complexity measure values to group them into five bins of equal length. Using descriptive statistics, the frequency distribution of values derived from the histogram characterizes the complexity measure. This distribution can be defined by its measures of average (mean, median, and mode) and its measures of dispersion (range and standard deviation) (Norman and Streiner 2003). Values of all computed complexity measures are provided in Appendix H. However, due to the large number of collected networks (2040), only partial results could be included in this manuscript to limit its final number of pages. These results are also summarized in the series of tables provided in the subsequent sections, aimed at presenting and analyzing the complexity measures obtained for all 2040 network schedules. Last, to help understand these complexity measures, the five networks selected to validate the methodologies proposed in Section 2.4.5 are revisited and related to the identified groups for each complexity measure.
Coefficients of a network (CNC) were computed for all 2040 PSPLIB network schedules collected for this study using the approach described in Section 2.4.5.6, beginning on page 214. Appendix E.2.1 contains some of the CNC values. To characterize the CNC values, the findings are divided into five groups designated Group 1 through Group 5. All groups have the same length, which equals 0.14, as shown in Table 2-21A or Figure 2.14A(a), which depicts the frequency distribution of the CNC values. According to the histogram, the CNC values of all PSPLIB networks range between 1.5 and 2.19, with a range of 0.69 and a standard deviation (σ) of 0.2286. Furthermore, the mean, mode, and median of the CNC distribution are all 1.85 and belong to Group 3. As a result, this distribution is both symmetric and multimodal. Furthermore, the PSPLIB network CNC values are evenly divided among three categories, denoted by Group 1, Group 3, and Group 5. Each set has one-third of the J30, J60, J90, and J120 networks, with CNC values ranging from 1.50 to 1.64, 1.78 to 1.92, and 2.06 to 2.20, respectively. This implies that the PSPLIB network schedules were prepared in three categories based on CNC values. The CNC values of the PSPLIB networks whose structure representations are shown in Figure 2.12A through Figure 2.12C are as follows: J3038-7 (2.125, Group 5), J6010-5 (1.5, Group 1), J902-4 (1.5, Group 1), and J12012-9 (1.5, Group 1). It is worth noting that the computed CNC values are all of the same order of magnitude.
The methodology proposed on page 215 in Section 2.4.5.6 served to compute the paths ratios of the 2040 PSPLIB network schedules collected for this study. A network's paths ratio represents the proportion of the network's critical paths (obtained through 100 simulation runs) among all possible paths connecting its start and end activities. To better describe the paths ratios of the collected networks, computed values are classified into five categories, Group 1 through Group 5, of the same length equal to 7 %. These values are provided in Appendix E.2.2 and summarized in Table 2-21B, while Figure 2.14A(b) depicts the frequency distribution of these ratios in %. It can be concluded from the histogram that when simulating any PSPLIB network schedule 100 times, the paths ratio will vary between 0.13 % and 31.58 %, with a range of 31.45 % and a standard deviation (σ) of 4.2. Furthermore, the distribution possesses a median of 2.74 %, a mode of 3.5 %, and a mean of 5.26 %, all of which fall in the first group or bin. These measures of average characterizing the distribution of paths ratio values corroborate the histogram plot showing a unimodal and asymmetric distribution. Because of its asymmetric shape, with the data mean located toward the histogram tail, it can be concluded that this distribution is positively skewed, which agrees with the histogram plot in Figure 2.14A(b).
Moreover, the results in Table 2-21B suggest that for 81.3 % of the PSPLIB network schedules, at most 7 % of all possible paths can become critical paths. Each network has at least one critical path out of many non-critical paths. Furthermore, a network with few activities is likely to have a higher proportion of probabilistic critical paths than one with a larger number of activities. The exemplar network illustrates this with a paths ratio of 38.46 %, while the highest paths ratio among the PSPLIB networks is attributed to j306-4, with a ratio of 31.58 %. Besides the J30 networks, whose paths ratios fall under all groups, only the paths ratios of J60 networks reach Group 3, while the J90 and J120 networks are almost entirely absent from Group 3 through Group 5. From these results, 14 % is the maximum frequency of paths that can become critical for a J120 network. These results agree with the PSPLIB networks whose structure representations are provided in Figure 2.12A through Figure 2.12C. Their paths ratios are as follows: J3038-7 (2.78 %, Group 1), J6010-5 (2.44 %, Group 1), J902-4 (3.51 %, Group 1), and J12012-9 (5.13 %, Group 1).
Ratio Limits (%) and CM2 Network Counts per Group and Size

Group   Min   Max    j30   j60   j90   j120   Total      %
1         0     7    250   386   442    581    1659   81.3%
2         7    14    146    82    37     19     284   13.9%
3        14    21     54    12     1      0      67    3.3%
4        21    28     24     0     0      0      24    1.2%
5        28    35      6     0     0      0       6    0.3%
Total                480   480   480    600    2040
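To make the paths-ratio definition concrete, the following Python sketch counts, for a hypothetical toy activity-on-node network, the distinct critical paths observed over 100 simulation runs and divides by the number of all start-to-end paths. It illustrates the measure only and is not the dissertation's MATLAB implementation; the network, its links, and the triangular parameters are made up.

```python
import random

# Toy AON network: successors of each activity; 'S' start, 'E' end (hypothetical).
succ = {'S': ['A', 'B'], 'A': ['C'], 'B': ['C', 'E'], 'C': ['E'], 'E': []}
tri = {'A': (2, 4, 9), 'B': (3, 5, 8), 'C': (1, 2, 4)}   # (low, mode, high)

def all_paths(node, path=('S',)):
    """Enumerate every start-to-end path by depth-first search."""
    if node == 'E':
        yield path
        return
    for nxt in succ[node]:
        yield from all_paths(nxt, path + (nxt,))

def critical_path(dur):
    """Recover the longest (critical) path via the EF forward-pass recursion."""
    best = {}
    def ef(node):
        if node in best:
            return best[node]
        here = dur.get(node, 0.0)
        preds = [p for p in succ if node in succ[p]]
        val = here + (max(ef(p)[0] for p in preds) if preds else 0.0)
        back = max(preds, key=lambda p: ef(p)[0]) if preds else None
        best[node] = (val, back)
        return best[node]
    ef('E')
    node, chain = 'E', []
    while node is not None:              # walk back along the longest path
        chain.append(node)
        node = best[node][1]
    return tuple(reversed(chain))

total_paths = sum(1 for _ in all_paths('S'))
seen = set()
for _ in range(100):                     # 100 simulation runs, as in the study
    dur = {k: random.triangular(lo, hi, mo) for k, (lo, mo, hi) in tri.items()}
    seen.add(critical_path(dur))
print("paths ratio (%):", 100 * len(seen) / total_paths)
```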
The Johnson complexity measure (D) of all 2040 PSPLIB networks was computed using the methods developed in Section 2.4.5.6 (see page 216). Figure 2.15B(a) depicts their frequency distribution, which groups the D values into five categories to assist in their analysis. Appendix E.2.3 contains partial results of the computed D values, whereas Table 2-21C below contains tabular information derived from the distribution of the D values. According to the graph, the distribution of D values is unimodal. This distribution has a median value of 46, a mean value of 48.82, and a mode value of 52.5.
Additionally, the distribution values vary between 13 and 95, with a standard deviation of 22.55. According to Table 2-21C, 97 percent of the J30 networks fit into Group 1, since their D values range between 13 and 26. Except for the J30 networks, the most frequently reported category of D values (27.8 percent) comprises D values of the J60, J90, and J120 networks. Moreover, the J30 and J60 networks may be found on the left side of this category, whereas the J90 and J120 networks are located on the right side. Furthermore, Group 5 holds 45.5 percent of the J120 networks and is entirely composed of such networks, with D values ranging from 78 to 95. As a result, it is possible to conclude that the D values of PSPLIB networks grow proportionally with network size. It is worth noting that the exemplar network, given its total number of activities, follows the same trend.
The methodology proposed in Section 2.4.5.6 (see page 218) assisted in computing the Cn values of the 2040 benchmark network schedules. The histogram of the networks' Cn values is plotted to group values into five distinct categories of equal intervals. Partial results of the Cn values are provided in Appendix E.2.4, together with a summary of all results from the first group through the last group. Table 2-21D below tabulates information derived from the histogram of the Cn values shown in Figure 2.15B(b). The Cn values range from 12.075 to 38.581, with a range value of 26.505 and a standard deviation of 7.037. The median value is 20.709 (Group 2), the mode value is 19.000 (Group 2), and the mean value is 21.350 (Group 2). From these values, it can be derived that the distribution of the Cn values of the PSPLIB networks is asymmetric and positively skewed. The Cn values of the J120 networks are the most frequent, representing 53 % of results in the modal group (Group 2). For the J90 networks, one-third of them are in Group 1, Group 2, or Group 3, with all Cn values ranging from 10 to 28. This observation is similar for the J120 networks, except that their results are not equally split between groups: Group 2, the modal group, gets 60 % of the Cn values.
Regarding the Cn values of the J60 networks, they are not the most frequently reported. One-third of them have Cn values between 10 and 16, whereas the remaining two-thirds of the J60 networks have theirs between 22 and 28. For the J30 networks, one-third of these networks are in Group 2, Group 4, or Group 5. In Group 4 and Group 5, only Cn values of the J30 networks are reported. It can also be noticed that the Cn values of PSPLIB networks are disproportional to the network sizes. This is well expressed in Table 2-15, which provides the Cn values of the four networks j3038-7, j6010-5, j902-4, and j12012-9.
The methodology provided in Section 2.4.5.6 (see page 220) enabled the calculation of the OS values of all the benchmark networks considered for this study. A classification of the computed OS values into five groups was necessary to facilitate the analysis. This is performed through the histogram plot in Figure 2.14C(a) of the calculated OS values, classifying results into five bins of equal length. On page 443 of Appendix E.2.5, a summary of the results per group can be found. Due to the large number of networks, only partial results are made available. However, Table 2-21E provides an overview of all the OS values. As depicted, the groups are not equal. All the PSPLIB J120 networks and 2/3 of the J90 networks, having the largest numbers of activities, fall in Group 1, representing 45.1 % of the 2040 networks considered. The second group is made of 2/3 of the J60 networks and the remaining 1/3 of the J90 PSPLIB networks, representing 23.5 % of the total networks. The third and fourth groups have equal frequencies and contain 1/3 of the J60 and 1/3 of the J30 networks. The last group includes OS values of only J30 networks, comprising 2/3 of them.
The proportions of the networks throughout the different groups suggest that the OS values are disproportional to the network sizes. The J30 networks have the smallest network sizes but the greatest OS values, ranging from 0.092 to 0.14 (Group 4 and Group 5), followed by the J60 networks, whose OS values are between 0.068 and 0.116 (Group 2, Group 3). The J120 networks, with the greatest sizes (122 activities), have the lowest OS values, between 0.02 and 0.044 (Group 1). Overall, the OS values of the PSPLIB networks range from 0.0248 to 0.1371, with a range value of 0.1123 and a standard deviation of 0.0355. The distribution of the OS values has a mode value of 0.032, a median value of 0.0592, and a mean value of 0.0621. As a result, the distribution of the OS values of the PSPLIB networks is unimodal and positively skewed.
The methodology proposed in Section 2.4.5.6 (see page 221) enabled the numerical determination of the RT values for all 2040 PSPLIB networks. The resulting RT values are categorized into five groups to assist in the analysis, and their histogram is presented in Figure 2.15C(b). As shown in Table 2-21F, the RT values of the benchmark networks range from 0.1780 to 0.6875, with a range of 0.5095 and a standard deviation of 0.1274. The average measures of the frequency distribution of the RT values are 0.3804 for the mean, 0.4000 for the mode, and 0.5331 for the median. From these values, it can be concluded that the distribution of RT values is asymmetric and negatively skewed. The repartition of networks throughout the five groups with respect to their sizes is consistent with the previous complexity measures: RT values and network sizes are inversely proportional. Networks with the largest number of activities have a maximum RT value of 0.46 (Group 3), while the networks with the smallest number of activities have a maximum RT value of 0.7 (Group 5). It is worth noting that the modal group contains networks of all the sizes.
Table 2-22 below provides a summary of all the complexity measures for the four PSPLIB networks previously discussed.
To fully understand the underlying behavior of project network schedules, it is critical to choose a diverse sample of networks from the pool of 2040 benchmark project network schedules. Utilizing one of the six complexity measures described in Section 2.4.5.6 should assist in identifying a few networks to consider. Note that these values have already been computed, classified, and analyzed for each network in Section 2.5.1. Thus, the available data can aid in identifying appropriate project networks whose schedules' inherent behavior will be investigated. Given that restrictiveness, denoted by RT, is a widely used network structure analysis tool for project network schedules in construction scheduling and graph theory, it is appropriate for this task. Thus, based on their RT values, which range from 0.1780 to 0.6875, a few networks of varying sizes have been identified to represent each of the five RT value classes [see Figure 2.15C(b)]. As a reminder from Section 2.4.5.6, an RT value near 0 indicates a perfectly parallel AON diagram, whereas a value of 1 indicates a perfectly serial one.
Along with obtaining representatives for each RT value category, other networks were included whenever possible to get pairs of networks with equal sizes and RT values. Additionally, networks of varying sizes but identical RT values have been identified. Indeed, all the resulting project networks and their descriptive characteristics have been identified and arranged in Table 2-23. These carefully chosen networks should aid in gaining insight into the behavior of project network schedules as a function of their sizes and complexity measured in terms of RT values. Yet can one predict the behavior of a network schedule based on its RT value and/or size? The following sections address this question.
Table 2-23: Benchmark Networks of Interest for the Study of their Underlying Behaviors
Table's annotations:
(1) For all 35 networks, the sample covariance matrices S are constructed based on the main formula with α_sim = 0.025, and their eigenvalues are normalized using Norm I.
(2) For the 22 networks denoted by a *, the simulated matrices S are created using the main formula with α_sim = 0.025, and their eigenvalues are normalized according to Norm II.
(3) For the four networks denoted by ᵛ or ᶺ, the simulated matrices S are based on the main formula, but with α_sim = 0.05 or α_sim = 0.1, respectively. Norm I normalizes the eigenvalues.
(4) For network j3032-4 with the suffix ᵘ, rather than 10³ (as for all other networks), 10⁴ simulations are based on the main formula in conjunction with Norm I.
(5) For network j3032-4 with the suffixes ᵐ or ᵚ, simulated networks are constructed using the main formula without the α_sim factor, and the eigenvalues are normalized according to Norm I or Norm II, respectively.
(6) Each simulation of S allows the collection of the first through the fourth largest eigenvalues. However, due to the limited time available for this investigation, only the first largest eigenvalues have been collected for the 24 networks j60 through j120 corresponding to Note (1).
(7) For each of the groups j30, j60, j90, or j120, underlined RT values represent a pair of networks with identical RT values.
(8) Italicized and bolded RT values indicate identical RT values across intergroup networks.
This section aims to treat and analyze the results of a series of 100 simulations run to determine the optimum size n of the sample necessary to create an appropriate sample data matrix serving as the basis for the proposed mathematical model. The implementation of the model and the procedures described in the methodology section (see Section 2.4.6) helped run the required simulations for the study. As in previous sections, MATLAB aided in carrying out all simulations. For a given project network of size p (refer to Table 2-23 for the notes), and for each data point i related to the sample size n_i, Table 2-24 below lists the sample sizes and numbers of data points required.
Table 2-24: Sample Sizes and Numbers of Data Points Required for Project Networks
The random matrix X serves to derive the first largest eigenvalue l_1 of the sample covariance matrix. As specified in Section 2.4.6, l_1 is normalized using the Norm-I and Norm-II methods given in Table 2-20, yielding l_1,I and l_1,II, respectively. The number of rows n of X takes values between n_min and n_max, as listed in Table 2-24. Notably, each row of X contains p EF times, each computed by using the CPM to schedule the project network of interest based on durations sampled independently from a triangular distribution with parameters a, b, and c set for each activity of the network.
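For concreteness, one row of the sample data matrix X can be generated as sketched below in Python (the study used MATLAB): each activity duration is drawn independently from its triangular distribution, and a CPM forward pass converts the durations into the p EF times. The toy precedence structure and parameters are hypothetical placeholders.

```python
import random

# Hypothetical AON network: predecessors of each of p = 4 activities.
preds = {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['B', 'C']}
tri   = {'A': (1, 2, 4), 'B': (2, 3, 6), 'C': (1, 4, 7), 'D': (2, 2, 5)}  # (low, mode, high)

def sample_row():
    """One row of the n x p sample data matrix: p early-finish (EF) times."""
    dur = {k: random.triangular(lo, hi, mo) for k, (lo, mo, hi) in tri.items()}
    ef = {}
    for act in ['A', 'B', 'C', 'D']:                          # topological order
        es = max((ef[p] for p in preds[act]), default=0.0)    # early start
        ef[act] = es + dur[act]                               # CPM forward pass
    return [ef[a] for a in ['A', 'B', 'C', 'D']]

X = [sample_row() for _ in range(10)]                         # n = 10 joint samples
print(X[0])
```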
Nonetheless, for each fixed value of p, the values of n_i were chosen under the requirement that n be greater than p in order to construct the sample covariance matrix S from a data matrix X (steps 4 and 5). Additionally, a value of n too close to p resulted in positive and extremely large eigenvalues for S. The value of n_min was therefore found through trial and error. Given a normalization procedure and a size value p for any of the networks identified in Table 2-23, experimenting initially with just a few points n_i and n_{i+1} picked at greater distances from one another can assist in determining n_min. More precisely, as implemented later, the statistics of the matrix S's normalized first largest eigenvalue, collected at each of the 100 simulations, served to select a suitable n_min. The objective was to determine, after a few trials, a value of n that would produce a negative average value of the normalized first eigenvalues, smaller than the mean of the expected probability distribution (μ_TW1). Finally, n_max was chosen through the same trial-and-error process.
A more detailed examination of the values in Table 2-24 shows that as p grows, the total number of points n_pts decreases. The reason is that this simulation study found that simulating a larger project network schedule takes longer than simulating a smaller one. Refer to step 3b of Section 2.4.5.5 for the numerical execution of the approach required to schedule a project network using Kelley's (1961) CPM. The flowcharts for the methodology's forward and backward passes are included in Appendices C1 and C2 on pages 409 and 410. Consequently, the simulation study produced more numerous, closely and evenly spaced points when the project network was smaller. Finally, the values of n_min, n_max, and n_pts used to calculate n_opt are highly dependent on the method employed to normalize S's largest eigenvalue: all three, particularly n_max, drop significantly when moving from Norm-I to Norm-II.
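The exact Norm-I and Norm-II formulas are specified in Table 2-20 and are not reproduced here. As an assumption-laden illustration of what such a normalization involves, the sketch below applies the centering and scaling constants from Johnstone's (2001) white-Wishart result, one of the normalization methods this chapter compares, to the largest eigenvalue of X'X.

```python
import numpy as np

def johnstone_normalize(l1, n, p):
    """Johnstone (2001): if X is n x p with i.i.d. standard normal entries,
    the largest eigenvalue l1 of X'X, centered by mu and scaled by sigma,
    converges in distribution to Tracy-Widom of type 1."""
    mu    = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
    sigma = (np.sqrt(n - 1) + np.sqrt(p)) * (1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)
    return (l1 - mu) / sigma

# Illustration on a white-noise data matrix (n = 200 samples, p = 30 variables):
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
l1 = np.linalg.eigvalsh(X.T @ X).max()     # largest eigenvalue of X'X
print(johnstone_normalize(l1, 200, 30))
```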
Using the information provided in Table 2-23 and Table 2-24, which was necessary to run the required simulations, assisted in generating the outputs needed to scatterplot the pairs (n_i, l_1,I) and (n_i, l_1,II) according to Norm I and Norm II, respectively, with j varying from 1 to 100 for each n_i. The resulting figures, Figure 2.16A and Figure 2.16B, were created in MATLAB by categorizing project networks according to their sizes and the normalizing method used to determine l_1,I and l_1,II. The scatterplots depicted in Figure 2.16A for the j30 and j60 networks and in Figure 2.16B for the j90 and j120 networks were produced utilizing all 35 networks selected for this simulation investigation.
Figure 2.16A: Scatterplots of Normalized 1st Largest Eigenvalues Versus X’s n Rows
Figure 2.16B: Scatterplots of Normalized 1st Largest Eigenvalues Versus X’s n Rows
In each of the previous figures, each graph's legend presents the networks in ascending order of their complexity as measured by their RT values. Appendix F.1 on page 456 illustrates scatterplots created using a single network in the case of j3024-8 or j3032-4. The following paragraphs look closely at the patterns uncovered.
Constructing the scatterplots in the manner explained in the previous paragraphs enabled the unexpected discovery of a distinct and consistent pattern associated with project networks of various sizes, regardless of the method used to compute the largest eigenvalue of the sample covariance matrices S derived from the population of EF times for project activities. This discovery is a crucial development since it provides insight into the fundamental behavior of project network schedules. The smaller networks (j30 and j60), shown in Figure 2.16A, have a higher data point density than their larger counterparts (j90 and j120), as depicted in Figure 2.16B. By connecting all the scatterplots' markers, the slope of the resulting curve, defined as the ratio of vertical to horizontal change between any two distinct points on a line, may provide crucial information. The curve begins steeper at the left and gradually becomes almost horizontal to the right as the number n of samples increases. This trait indicates a significant degree of collinearity in the data, which implies very low variability in the largest eigenvalues of the sample covariance matrices as the sample size n becomes larger. The following section provides a more detailed analysis of these trends.
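The slope inspection described above amounts to finite differences over consecutive scatter points, as in this short Python illustration with hypothetical arrays:

```python
import numpy as np

ns = np.array([40, 50, 60, 80, 120])            # hypothetical sample sizes
l1 = np.array([-3.2, -2.1, -1.6, -1.1, -0.9])   # hypothetical mean normalized eigenvalues

slopes = np.diff(l1) / np.diff(ns)   # rise over run between consecutive points
print(slopes)                        # steep at the left, flattening to the right
```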
2.5.2.3 Using Patterns in the Data to Infer Structure in the PSPLIB Data Set
The data gathered from the simulation results discussed in the preceding section also aided in computing the deviations Δ_mean,i and Δ_var,i between the means and variances of the empirical and hypothesized distributions. While the statistics of the assumed distribution are well known, the empirical statistics were computed using the observed mean and variance of the normalized first largest eigenvalue l_1,I or l_1,II of the sample covariance matrix S. Given a sample size n_i, 100 replicas of a project network schedule of size p supplied the inputs required to construct the sample data matrix X and then derive the matrix S. As previously indicated, n_i takes values between n_min and n_max (see Table 2-24).
The 100 needed simulations produced the outputs required to draw a total of n_pts pairs of points in a Cartesian coordinate system, in which the horizontal or x-axis represents the sample size n and the vertical or y-axis represents the deviation Δ_mean or Δ_var, respectively.
Once again, MATLAB assisted with the graphical depiction of the data, producing the smoothed curves seen in Figure 2.17(a) and Figure 2.17(b), generated by connecting consecutive points with a straight line. While the curves supplied here only reflect the set of j30 networks, the remaining curves illustrating the sets of project networks j60, j90, and j120 described in Table 2-23 can be found in Appendix F.2 and Appendix F.3. The following graphical analysis of the curves will be conducted case by case, distinguishing between mean and variance deviation plots. For simplification, the curves obtained using mean values Δ_mean (resp. variance values Δ_var) will be referred to as mean deviation curves (resp. variance deviation curves).
As with normal yield curves (finance), learning curves (construction and engineering cost estimating), and stress-strain curves (materials science and engineering), a collection of mean deviation curves provides a graphical representation of combined networks of equal size but varying complexity. For pairs of points obtained using the set of j60 networks, see page 463 of Appendix F.4. Such a collection is supposed to assist in determining the optimum sample size required to verify the limiting distributional assumption of jointly sampling the durations of project network activities using a triangular distribution.
Additionally, as the sample size n_i increases by taking values between n_min and n_max (see Table 2-24), the ideal sample size produced from either of these curves should aid in testing the hypothesis that the Tracy-Widom distribution of type 1 is the true probability distribution of the normalized extreme eigenvalue l_1,I or l_1,II of a sample covariance matrix S. The following paragraphs analyze the curves' trends in depth based on their initial slopes, their yield points, and their asymptotic behavior.
What is remarkable about Figure 2.17 (and the other curves for the sets of j60, j90, and j120 networks in Appendix F.2) is the persistent and unique pattern, similar to that shown in Figure 2.16A and Figure 2.16B. However, unlike the previous pattern, this one is inverted and takes the shape of a monotonic, concave curve with positive slopes decreasing from left to right as the sample size n grows. As can be seen in each picture, regardless of the normalization method used (Norm I or Norm II), each upward-sloped curve begins with somewhat larger deviations and a sharper slope that levels off as the sample size increases.
Figure 2.17: Plots of Deviations between Means of the Assumed PDF and Empirical PDF of a Set of j30 Project Network Schedules' Largest Eigenvalues [panels (a) and (b)]
For example, using Norm I, the highest slope of 8.57 is seen near the beginning of the mean deviation curve for network j3032-4, between the points with abscissas 40 and 50. Using Norm II, the steepest slope of 5.60 is obtained with the same network and points.
Remarkably, the spots with the steepest slopes are located at the beginning of each curve and are characterized by negative and highly significant deviations, which correspond to smaller sample size values. After each curve reaches a certain point, which is surprisingly like the "yield point" on a stress-strain curve, the vertical-to-horizontal change rate decreases and eventually vanishes as the sample size increases. Because of the striking similarity between the curves, it is worth providing some facts about yield points, which may help with the current study. A yield point indicates the end of elastic behavior and the start of plastic behavior. Below the yield point, a material deforms elastically and returns to its original shape. Once the yield point is passed, a portion of the deformation, termed plastic deformation, is permanent and irreversible. Analogously, the concave curve of deviations of means eventually crosses the horizontal axis at zero deviation. This unique point, which exists for any of the curves, corresponds to the optimal sample size, denoted by n_opt, most likely to result in the observed and postulated means being equal.
For example, network j3011-1 has an n_opt value of 105 for Norm I, which happens to be the smallest value achieved with the set of j30 networks, whereas with the same set of networks but with Norm II, the minimum n_opt value is 49, corresponding to network j3024-8. Given a set of networks and a normalization procedure, the closer the mean deviation curve is to the origin, the lower the optimum sample size n_opt. Conversely, the larger the value of n_opt, the farther the curve is from the origin. For Norm I, network j3032-4, depicted in red, has the highest n_opt value of 239 and is the farthest from the origin. Likewise, for Norm II, the greatest n_opt value of 65 corresponds to the same network j3032-4, whose red curve again appears at the bottom of all curves. However, because the networks are ordered from smallest to largest RT in the legend of each figure, it is worth noting that the values of n_opt are unrelated to the RT values. Furthermore, this pattern is constant across both normalization methods.
Finally, the mean deviation curves continue to rise after reaching the yield point n_opt, but at a slower and more gradual rate as the sample size n grows. The inspection of any collection of curves reveals the steepest slopes at the beginning of the curves, gradually diminishing in steepness as the sample size increases. For example, regarding slopes, the curve associated with network j3032-4 begins at 8.57 and finishes at 0.074 with Norm I, and runs from 5.60 to 0.168 with Norm II. When combining the graphical analysis and the computed slopes of individual curves, it becomes clear that the asymptotic rate of convergence to the x-axis is faster with the curves obtained using Norm I than with those obtained using Norm II for the same sample size. Additionally, one can project that all the curves, representing a collection of equal-size networks of varying complexity, will eventually merge into a straight line. Appendix F.5 contains illustrations of the slopes computed using Norm I and Norm II for the j3032-4 network. Finally, it is worth mentioning that the greater the absolute value of Δ_mean, the farther away the statistic mean of the observed largest eigenvalues lies from the hypothetical mean.
A collection of variance deviation curves resembles the family of curves used in economics to investigate customer choice (demand) and budget restrictions. It is a graph
reflecting combined networks of equal size but varying complexity. Each curve, in Figure 2.18 for
the j30 networks or Appendix F.3 for the j60, 90, and 120 networks, plots pairs of points
(n_i, Δ_var,i) in the Cartesian coordinate system, with the x-axis representing the number of rows n_i of the sample data matrix X and the y-axis representing the deviations Δ_var,i of the observed largest eigenvalues' variances from the anticipated distribution variance. For pairs of points obtained using the set of j60 networks, see page 464 of Appendix F.4. The largest eigenvalues l_1,I or l_1,II are determined from the sample covariance matrix S, with n_i taking values between n_min and n_max for a p-dimensional project network (see Table 2-24). These curves serve a dual role. The first is to identify the optimal sample size n necessary for each network to validate the limiting distributional assumption on the joint sampling distribution of project network activities' durations, which are individually randomized using a triangular distribution. The second is addressed in the analysis that follows.
Given the comprehensive graphical analysis of the mean deviation curves (refer to case 1 on page 262), one may use the same approach to graphically examine the current curve patterns according to their characteristics. As a result, the subsequent analysis will concentrate on the apexes of the previously referenced figures. As with the previous pictures, an intriguing and recurrent pattern can be observed across multiple networks of varying sizes and complexity, regardless of the approach used to normalize the greatest eigenvalues of the sample covariance matrix S.
Figure 2.18: Plots of Deviations between Variances of the Assumed PDF and Empirical PDF of a Set of j30 Project Network Schedules' Largest Eigenvalues [panels (a) and (b)]
However, the present tendency is in the opposite direction of the previous one. Overall, the trend is downward and convex toward the origin. In addition, the slopes of the curves decrease from left to right. However, the slopes do not diminish in a continuous or monotonic manner. There are irregularities at the point where the curvature of each curve peaks, just before its slopes begin to drop as the sample size increases rapidly. The figures represent these anomalies by zig-zag lines linking the pairs of points that generate any of the curves displayed. After computing and inspecting the slopes of each curve as their values change from positive to negative, these irregularities become readily apparent. Appendix F.6 contains illustrations of the slopes computed using Norm I and Norm II for the j3032-4 network. Consequently, due to the absence of continuity between the points that comprise these curves, this study determined that the variance deviation curves are unreliable for determining the optimal sample size.
Regardless of their shortcomings, some observations can still be made from the variance deviation curves' figures. Since networks are ranked in increasing order of complexity, irrespective of the normalizing method or network size, each figure illustrates that complexity is independent of the sample size associated with a zero deviation. Because this sample size is different from n_opt, let n_var refer to it from here on. The curve with the lowest n_var appears at the bottom of the variance deviation map in any of the figures. Conversely, the one with the highest n_var is the farthest from the origin. This arrangement is identical to the one seen with the mean deviation curves. As n grows, the rates of change steadily drop until they reach an asymptotic value of zero near the horizontal axis for Norm I and −1 for Norm II. With sufficiently large n values, all curves converge to a single straight line asymptotically parallel to the x-axis, regardless of the normalization method.
Comparing the values of n_opt and n_var derived using the mean and variance deviation curves for both normalization methods, Norm I and Norm II, makes it clear that the two values differ for any given network, regardless of its size or the normalization method. The discrepancy is magnified when networks of a larger size are compared to networks of a smaller size. The value of n_var appears to be higher than the value of n_opt for the smaller networks j30 and j60, except for network j6010-5, for which the n_var value of 137 is less than the n_opt value of 176. Except for j901-3, which has 177 and 276 for its n_opt and n_var, respectively, the same observation established for smaller networks persists for larger ones under Norm II. For Norm I, the observation no longer applies to the j90 networks, where all n_var values are significantly greater than the corresponding n_opt values.
The following result is reached by comparing the n_opt values obtained using the two normalization approaches: n_opt values calculated using Norm I are larger than those computed using Norm II. Additionally, the disparities become more pronounced as network sizes increase. For instance, the ratio of n_opt values obtained using Norm I to those obtained using Norm II varies between 2 and 4 for j30 networks but between 5 and 7 for j120 networks. As a result, when the first normalization approach (Norm I) is used, the resulting sample data matrices are considerably larger than when the second normalization method (Norm II) is used. Concurrent observation of the mean and variance deviation curves for each given network size and normalization method demonstrates that choosing a sample size greater than n_var results in variances of the largest eigenvalue of the sample covariance matrices that shrink toward zero. With reference to the first chapter, the sample variance measures the variability or dispersion of data values around the mean: the greater the variance, the wider the spread.
Simulation Outputs
Table 2-25 below summarizes the outputs of the series of 100 simulations necessary to plot all the figures analyzed throughout the previous section, in terms of the mean and variance deviation curves provided in Figure 2.17 and Figure 2.18 and the ones in Appendix F.2 and Appendix F.3. This table focuses only on the simulation outputs obtained with the mean deviation curves, because this study found the variance deviation curves unsuitable for the network behavior investigation. The single table categorizes simulation outputs by sets of networks of equal size, sorted in increasing order of their complexity in terms of RT values. In addition, this table helps verify the lack of correlation between the optimal sample size n_opt and complexity across the sets of networks. Moreover, this table is also valuable for learning more about the limiting behavior of durations of project network activities by providing answers to the research's crucial questions.
Table 2-25: Optimal Sample Size Predictions for All Networks of Interest
In conjunction with the various figures used to graphically analyze the mean deviation curves and the summary Table 2-26 below, this study found the following. Regardless of the normalization method, networks of equal size and RT value do not necessarily have similar n_opt values, except for smaller project networks. With Norm I, the difference increases in magnitude as p increases. Since n_opt is network-specific, networks (e.g., the j120 pair including j12012-1) having equal or approximate RT values will not necessarily end up with identical n_opt values.
n_opt values
        j30 (RT = 0.4597)       j60 (RT = 0.4030)       j90 (RT = 0.217)        j120 (RT = 0.1901)
Norm    j3011-1   j3024-8       j6020-7   j6028-9       j9010-5   j901-3        j1205-5   j1209-10
For the four networks j3032-4, j6040-5, j901-3, and j12012-9 selected for additional experiments as per Table 2-23, the results in Table 2-25 indicate that an increase in the significance level α from 0.05 to 0.1 or 0.2 in the expression of the sample covariance matrix S increased the value of the sample size n_opt. Unfortunately, the experiment was limited to the normalization approach Norm I due to time constraints. For example, for network j3032-4, the value of n_opt increased from 239 to 266 (α = 0.1) or 298 (α = 0.2). These values are also recorded in Table 2-27, since additional simulations were run for the same network. As a result, it can be inferred that n_opt is an increasing function of the significance level α.
Effect of the Significance Level α on the Sample Covariance Matrix S
According to Table 2-23, another experiment was performed utilizing the small network j3032-4. Both normalization methods were used for this experiment, but α was left out of the sample covariance matrix's expression (see Equation 2.64) by setting it to one. In each normalization scenario, the patterns of the three plotted graphs required to obtain the optimum sample size were consistent with the preceding ones. The resulting values of n_opt, as given in Table 2-27 below, illustrate that the absence of α resulted in substantially greater values of n_opt.
n_opt of network j3032-4 as a function of α
Norm       No α    α = 0.05    α = 0.1    α = 0.2
Norm I      449       239         266        298
Norm II      83        65           -          -
This finding is consistent with those obtained for the larger networks j60, j90, and j120, not included in this analysis. Experimenting with a representative of each of the four sets of networks was necessary to determine the optimal formula for the sample covariance matrix among the others available in the literature, some of which are given in Appendix G. In the absence of a supercomputer, and given the time constraints associated with completing this simulation task, choosing a sample covariance matrix whose expression incorporates α was helpful for designing confidence intervals; otherwise, running all the simulations required for this study would have been computationally prohibitive.
The empirical investigation, which involved 100 simulation runs for each network chosen for this study, has found a universal pattern in project network schedules. This trend was unexpectedly discovered after plotting the scatterplots of the normalized largest eigenvalue of sample covariance matrices against the sample sizes required to produce the matrices from EF times of project network activities. While each activity's durations were independently randomized from a triangular distribution with known parameters, the sampling distribution of the activities' joint durations required to generate a sample data matrix was unknown but assumed, until proven otherwise, to be the Tracy-Widom limit law of type 1. The same sampling distribution governs the selection of
each row of the sample data matrix, which constructs the sample covariance matrix. Regardless of
each row of the sample data matrix, which constructs the sample covariance matrix. Regardless of
the approach used to standardize the sample covariance matrix's first eigenvalue, the distinctive
uncovered pattern was consistent across networks of similar size and complexity. Remarkably,
regardless of the normalization approach used, the same trend was persistent with networks of
varying sizes and complexities. Subsequently, graphs of deviations of means or variances of the
observed first largest eigenvalue of sample covariance matrices from the hypothesized distribution
mean or variance versus the data matrix sample size revealed two distinct patterns consistent with
the previous one. While the unveiled pattern is concave upward for the deviation of means,
comparable to a stress-strain curve used in engineering to calculate the yield point of any given
material, the pattern is convex downward for the deviation of variances. The visual examination
of several curves illustrating both trends revealed that the mean deviation curves were more reliable than the variance deviation curves in determining the optimal sample sizes.
Additionally, the analysis of simulation data showed that the sample size increases as the network
size increases. Further, when the sample covariance matrix was a function of α, the study found
that the optimum sample size necessary to derive a sample covariance matrix whose largest
eigenvalue’s statistic mean coincides with the assumed distribution’s mean, is a rising function of
the significance level α required to test the distributional assumption set forth. Moreover, the analysis found that the optimum sample sizes obtained using the normalization approach Norm I were considerably larger than those obtained using the normalization method Norm II, particularly for larger networks. Finally, when the sample covariance matrix was no longer stated in terms of α, the optimum sample size values grew considerably larger in the case of Norm I.
Preliminary
After the graphical data analysis conducted in the preceding section to determine the optimal sample size n, this section uses n to calculate the test statistics necessary to verify the distributional assumption made for project network activities' durations. In other words, this section is concerned with determining whether the limiting probability distribution of project network durations is governed by the Tracy-Widom limit law of type 1. Step 8 of the procedure devised in Section 2.4.6.1 explains how to simulate a project network systematically. Considering that a normalization method and a network of size p have been selected, each simulation run allowed the creation of a sample data matrix, which served to compute the sample covariance matrix and then derive its four normalized first largest eigenvalues. As previously stated, the two identified normalization methods, based on the universality of the TW1 limit law, served to standardize the eigenvalues of interest. A significant number (1000 or 10000) of empirical order statistics produced from sample covariance matrices S helped validate or invalidate the distributional assumption in a multivariate statistical analysis based on hypothesis testing. This assumption is stated in terms of a null hypothesis test in step 9 of the procedure provided in Section 2.4.6.1 and restated below:

H0: F(x) = F_TW1(x) for all x,  versus  H1: F(x) ≠ F_TW1(x),

where F is the limiting true probability distribution of project network schedules in terms of normalized largest eigenvalues, and F_TW1 is the cumulative distribution function of the Tracy-Widom distribution of type 1.
Based on four carefully selected significance levels α of 0.01, 0.05, 0.10, and 0.20, the well-known Kolmogorov-Smirnov (K-S) goodness-of-fit test enabled the testing of the above hypotheses. Regarding the four selected significance levels, their corresponding hypothesis tests have been denoted by KS-test 1, KS-test 2, KS-test 3, and KS-test 4 in Table 2-28 below, which provides the corresponding critical values.
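As a hedged sketch of this testing step, the Python fragment below computes the two-sided K-S statistic and compares it with the large-sample critical value c(α)/√N, where c(α) = √(−ln(α/2)/2) (about 1.36 for α = 0.05). The callable tw1_cdf for the TW1 cumulative distribution function is an assumed helper, since TW1 is not part of the standard scientific Python stack, and the asymptotic critical values computed here may differ from those tabulated in Table 2-28.

```python
import numpy as np

def ks_statistic(sample, cdf):
    """Two-sided K-S distance between the empirical CDF and a hypothesized CDF."""
    x = np.sort(np.asarray(sample))
    N = len(x)
    F = cdf(x)
    ecdf_hi = np.arange(1, N + 1) / N
    ecdf_lo = np.arange(0, N) / N
    return max(np.max(ecdf_hi - F), np.max(F - ecdf_lo))

def ks_critical(alpha, N):
    """Large-sample two-sided critical value c(alpha)/sqrt(N)."""
    return np.sqrt(-0.5 * np.log(alpha / 2)) / np.sqrt(N)

# eigs: N = 1000 normalized largest eigenvalues; tw1_cdf: assumed TW1 CDF callable.
# D = ks_statistic(eigs, tw1_cdf)
# for alpha in (0.01, 0.05, 0.10, 0.20):
#     print(alpha, "accept H0" if D <= ks_critical(alpha, len(eigs)) else "reject H0")
```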
Because this is an exploratory study using random samples, conducting hypothesis testing at multiple α values diversifies the acceptance region of the test. Broadening the acceptance region figuratively increases the likelihood of tuning into the proper or "universal" radio station, at which one can clearly listen to wonderful and never-ending music. The same logic applies to determining the optimal sample size. Instead of conducting simulations with only the optimal sample size, the study included two or four additional sample sizes to create a sequence of numbers with the optimal sample size at the center. Adding values in one-unit increments before and after the center determined the sequence's remaining terms. As a result, experiments were conducted using five or three sample sizes, depending on the size and complexity of each network. The preceding section's empirical investigation established that the computational time grows with network size.
The next paragraphs clarify the test findings presented in the various tables in the context of the preceding development. Due to the variety of parameters evaluated during the series of simulations, the simulation outputs supplied in the tables have been categorized according to the normalization approach used to standardize the sample covariance matrix's four largest eigenvalues. Additionally, the findings have been grouped according to the matrix's first, second, and subsequent largest eigenvalues. Whereas the preceding section classified networks according to their sizes, this section organizes simulation outputs into tables according to the normalization technique and the rank of the largest eigenvalue of interest, regardless of the network's size or complexity. As a result, a table with missing values for any given group of networks indicates that the K-S test failed for at least one of those networks. In other words, each table contains only successful test results.
For example, no satisfactory test results were seen with any of the j120 networks chosen for the study in the accompanying Table 2-29 below, which provides test results for Norm I regarding the first largest eigenvalue of the sample covariance matrices of project networks. Additionally, the absence of a table intended to contain the findings of a specific normalization method for the first, second, or subsequent largest eigenvalue shows that the trials failed to produce valid test results. The same argument applies to any networks identified for this investigation that are not included in the tables.
Table 2-29: Kolmogorov-Smirnov Test of Goodness of Fit for the 1st Largest Eigenvalue of
Project Networks
The first column, 'Network,' lists all networks that passed the K-S test. Blue cells show networks for which more than one sample size out of five or three passed the K-S test. In this scenario, the value of n in the second column of the table refers to the sample size that resulted in the greater number of positive tests out of the four tests labeled KS-test 1 through KS-test 4.
1000 Simulations; Normalized 1st Eigenvalues (Norm II, α_sim = 0.025)
Significance level α / Probability P: KS-test 1 = 0.01/0.99; KS-test 2 = 0.05/0.95; KS-test 3 = 0.10/0.90; KS-test 4 = 0.20/0.80

Network     n    RT      D2      KS-test 1  KS-test 2  KS-test 3  KS-test 4
j3011-1     48   0.4597  0.0325      1          1          1          1
j3011-1     49   0.4597  0.0162      1          1          1          1
j3011-1     50   0.4597  0.0308      1          1          1          1
j3024-8     47   0.4597  0.0249      1          1          1          1
j3024-8     48   0.4597  0.0168      1          1          1          1
j3024-8     49   0.4597  0.0211      1          1          1          1
j3034-10    50   0.5907  0.0444      1          0          0          0
j3034-10    51   0.5907  0.0423      1          1          0          0
j3034-10    52   0.5907  0.0256      1          1          1          1
j3034-10    53   0.5907  0.0218      1          1          1          1
j3037-6     49   0.6875  0.0269      1          1          1          1
j3037-6     50   0.6875  0.0494      1          0          0          0
j3037-6     51   0.6875  0.0471      1          0          0          0
j3038-5     53   0.6169  0.0219      1          1          1          1
j3038-5     54   0.6169  0.0285      1          1          1          1
j3038-5     55   0.6169  0.0363      1          1          1          0
j3038-5     56   0.6169  0.0367      1          1          1          0
j3038-5     57   0.6169  0.0454      1          0          0          0
j3041-8     50   0.5786  0.0490      1          0          0          0
j3041-8     52   0.5786  0.0479      1          0          0          0
j3041-8     53   0.5786  0.0306      1          1          1          1
j3041-8     54   0.5786  0.0321      1          1          1          1
j305-7      49   0.3387  0.0200      1          1          1          1
j305-7      50   0.3387  0.0384      1          1          1          0
j305-7      51   0.3387  0.0493      1          0          0          0
j9010-5    174   0.2174  0.0372      1          1          1          0
j9010-5    175   0.2174  0.0441      1          0          0          0
j9010-5    176   0.2174  0.0392      1          1          0          0
j905-3     178   0.2000  0.0403      1          1          0          0
j905-3     179   0.2000  0.0381      1          1          1          0
j905-3     180   0.2000  0.0419      1          1          0          0
j905-3     176   0.2000  0.0384      1          1          1          0
j905-3     178   0.2000  0.0353      1          1          1          0
j12024-2   279   0.3397  0.0348      1          1          1          0
j12024-2   280   0.3397  0.0316      1          1          1          1
j12024-2   281   0.3397  0.0371      1          1          1          0
Table 2-30: K-S Test of Goodness of Fit for the 4th Largest …
An orange, gold, blue, or green left brace, such as those provided in Table 2-30 above, has been added to group rows belonging to a j30, j60, j90, or j120 network with more than one successful sample size. A brace with dashes denotes an unintentional repetition of the trials for the specified network. The study sought to conduct only one experiment per network for each chosen normalization approach, with a total of 1000 or 10000 simulation runs. For example, only two of the five sample sizes considered for network j9014-5 produced successful findings; see Appendix E.9 for all five outputs for network j9014-5. The same appendix contains a companion table that summarizes each sample's median, mode, mean, variance, skewness, and kurtosis.
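A companion summary of that kind can be reproduced for any sample of normalized eigenvalues with a few NumPy/SciPy calls; a minimal sketch with a placeholder eigs array follows (for a continuous sample, the mode is only meaningful after rounding to a modal bin).

```python
import numpy as np
from scipy import stats

eigs = np.random.default_rng(1).standard_normal(1000)   # placeholder sample

print("mean    :", np.mean(eigs))
print("median  :", np.median(eigs))
print("mode    :", stats.mode(np.round(eigs, 1), keepdims=False).mode)  # modal bin
print("variance:", np.var(eigs, ddof=1))
print("skewness:", stats.skew(eigs))
print("kurtosis:", stats.kurtosis(eigs))   # excess kurtosis by SciPy's default
```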
The third column, 'RT,' gives additional information about the complexity of each network in terms of its RT value, which will serve to analyze the K-S test findings. Finally, the fourth column contains the values of the D2 test statistic. The information in the table's final four columns was generated by comparing each observed test statistic to the critical values in Table 2-28. In these four columns, a green cell carrying a "1" indicates a successful outcome for the two-tailed K-S test. An uncolored cell with a "0," on the other hand, means rejection of the null hypothesis H0. Referring back to Section 1.9 on the NHST, it is worth noting that rejecting or failing to reject H0 does not prove that H0 is false or that H1 is true; it only indicates whether the data provide sufficient evidence against H0 at the chosen significance level. For the complete test results in condensed form, refer to Table 2-31, Table 2-32, and the tables provided in Appendix F.7 and
Appendix E.8. Each table shows the results of the first and second normalization methods (Norm I and Norm II). These compact tables were made by encoding all results with the letters "A," "B," "C," and "D," such that they fit into a single table for each normalization method's results. The letter "A," "B," "C," or "D" in the row of a given network under any of the columns labeled KS-test 1 through KS-test 4 signifies that the experimentation conducted with, respectively, the first, second, third, or fourth largest eigenvalue of the sample covariance matrix resulted in the acceptance of the null hypothesis H0. The acceptance of the test at the 100(1 − α) % confidence level pertains only to the network in question. It is important to note that the largest eigenvalues were normalized using Norm I or Norm II. For each successful K-S test, the parenthesized value following any of the four letters indicates the sample size that resulted in the acceptance of H0 for the given network.
Each table categorizes results by network size, regardless of the eigenvalue rank; the RT values are also included to facilitate the analysis. At first glance at both tables arranged side by side, Norm II resulted in more successful results than its counterpart: Norm I yielded mainly As and Bs, while Norm II mainly yielded Cs and Ds. The Cs recorded in the rows of network j3041-8 for Norm I and the Bs reported in the row of network j3038-5 are exceptions to this pattern.
Table 2-31: K-S Test of Goodness of Fit – All Test Results with Norm I
Table 2-32: K-S Test of Goodness of Fit – All Test Results with Norm II
Except for the j30 networks, only the first largest eigenvalues were assessed under the first normalization approach (Norm I), which is why there are no letters other than "A" in the rows of the three sets of networks j60, j90, and j120. Had the second through fourth largest eigenvalues been collected and tested under Norm I, Table 2-31 would probably have been more fully populated. Since all four largest eigenvalues of the j30 networks were evaluated, it was shown that the results obtained with these networks under Norm II still exceeded those obtained with Norm I. As a result, the K-S test yielded better results for Norm II than for Norm I. The graphs in Figure 2.19 depict all results.
[Bar charts, panels "Norm/Scaling I" and "Norm/Scaling II," showing counts of successful K-S test results per set of networks (j30, j60, j90, j120).]
Figure 2.19: Illustrations of the K-S test Results for all Project Networks
Regarding the sets of networks whose tests resulted in more acceptances of the null hypothesis H0, the j30 networks outperformed the larger networks for both normalization methods, with more successful results obtained with Norm II. The previous statement is more accurate for Norm I than for Norm II, since only a few networks were tested using Norm II; these networks' names carry an asterisk suffix. Note that for a given network, the sample size is not always equal to n_opt and may differ from one eigenvalue rank to another. For example, the following illustration regards Norm II (see Table 2-32). While a sample size of 53 produced successful test results for network j3038-5 with the second (B) and fourth (D) largest eigenvalues at all significance levels of α, the sample size that produced successful test results for network j3037-6 fluctuated around its optimal value.
The analysis of the results, with emphasis on the α level under which any given K-S test was performed, suggests that including α in the expression of the sample covariance matrix to build an acceptance region or confidence interval has a significant effect on the results. Linking the design of the sample covariance matrix to the significance level α of the two-tailed hypothesis test suggests that the results obtained under the null hypothesis H0 would be correct and significant at the confidence level of 100(1 − α) %. As a result, having more successful outcomes with tests performed at a significance level smaller than or equal to α is reasonable, which explains the direction of the test results in Table 2-31 and Table 2-32. For instance, concerning Table 2-32 (Norm II), the K-S tests for network j12012-1 produced significant results when performed at the significance levels of 0.01 and 0.05, both less than or equal to the level α of the test. Given the construction of the sample covariance matrix, only KS-test 2 should have been conducted.
Concerning networks of equal size and complexity, the results suggest that these equalities cannot predict the performance of a network's K-S test results, regardless of the normalization method and the rank of the largest eigenvalue. For instance, the results obtained with the networks of equal sizes and complexities j3011-1 and j3024-8 in either table help illustrate this conclusion. Surprisingly, the test results suggest similar K-S test performances for Norm I but not for Norm II for networks of similar complexity but different sizes. For instance, for the second normalization method (Norm II), in the absence of test results with network j6040-5 for comparison with network j3032-4, comparing the K-S test performances of network j901-3, which had no results, against network j12012-9 supports this observation.
Lastly, to verify the effect of removing α from the expression of the sample covariance matrix, changing α from 0.05 to 0.1 or 0.2 in that formula, or increasing the simulation runs from 1000 to 10000, additional experiments were performed independently using network j3032-4. Before providing their outcomes, it is worth stating the reasons for choosing network j3032-4, among others, as a candidate for these experiments. First, no successful results were obtained with j3032-4 under either normalization method. Second, due to its small size, which requires less computational time than the larger networks, the necessary simulations could be completed quickly. Nevertheless, none of the three changes stated earlier resulted in any improvement in the simulation results at any level. The test performances of the experiments performed under Norm I are consistent with the observations made earlier, that is, mainly of types A and B. The same applies to the results obtained with the second normalization method (Norm II), mainly of types B, C, and D.
Q-Q plots and histograms served to validate the successful results obtained with the K-S test. Figure 2.20 and Figure 2.22 are samples of Q-Q plots obtained with Norm I and Norm II for a few networks of different sizes and complexities. In addition, Figure 2.21 and Figure 2.23 depict their associated histograms. More Q-Q plots will be provided in the following chapter to conduct PCA. All the plots corroborate the hypothesis testing results; in other words, with proper scaling of the largest eigenvalues, the Tracy-Widom distribution of type 1 is the true limiting distribution of the normalized largest eigenvalues.
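A Q-Q plot of this kind can be sketched as follows, assuming a tw1_ppf quantile function for TW1 (again not part of NumPy/SciPy); observed order statistics are plotted against theoretical quantiles at the plotting positions (i − 0.5)/N.

```python
import numpy as np
import matplotlib.pyplot as plt

def qq_plot(sample, ppf):
    """Quantile-quantile plot of a sample against a hypothesized distribution."""
    x = np.sort(np.asarray(sample))
    N = len(x)
    probs = (np.arange(1, N + 1) - 0.5) / N   # plotting positions
    theo = ppf(probs)                         # theoretical quantiles
    plt.scatter(theo, x, s=8)
    lims = [min(theo.min(), x.min()), max(theo.max(), x.max())]
    plt.plot(lims, lims)                      # 45-degree reference line
    plt.xlabel("TW1 theoretical quantiles")
    plt.ylabel("observed normalized eigenvalues")
    plt.show()

# qq_plot(eigs, tw1_ppf)   # eigs and tw1_ppf assumed available
```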
For network j3032-4, which failed to accept H0 regardless of the normalization method used, the Q-Q plots and histograms of the test statistics generated with each experiment were constructed to understand the results better. The observed data frequencies were low compared to the hypothesized probability distribution frequencies. Therefore, none of the stated changes could improve the fit.
Figure 2.20: Q-Q Plots of Networks j3011-1 (Norm I) and j3038-5 (Norm II)
Figure 2.21: Histograms of Networks j3011-1 (Norm I) and j3038-5 (Norm II)
Figure 2.22: Q-Q Plots of Networks j6015-1 (Norm I) and j12024-2 (Norm II)
Figure 2.23: Histograms of Networks j6015-1 (Norm I) and j12024-2 (Norm II)
The simulation study conducted with the objective of elucidating the underlying behavior of
project network schedules produced three noteworthy discoveries. The first finding concerned the
uncovering of a universal pattern for project network schedules involving 100 simulation runs for
each network chosen for this study. This trend was unexpectedly discovered after plotting the
scatterplot of the largest eigenvalue of sample covariance matrices against the sample sizes
required to produce the matrices from EF times of project network activities. While each activity's
durations were independently randomized from a triangular distribution with known parameters, the sampling distribution of the activities' joint durations required to generate a sample data matrix was unknown but assumed, until proven otherwise, to be the Tracy-Widom limit law of type 1. The same sampling distribution governs the selection of each row of the sample data matrix, which constructs the sample covariance matrix. Regardless of the normalizing approach used to standardize the sample covariance matrix's first eigenvalue, the distinctive uncovered pattern was consistent across networks of similar size and complexity. Remarkably, regardless of the normalization approach used, the same trend persisted with networks of varying sizes and complexities. Subsequently, graphs of deviations of the means or variances of the observed first largest eigenvalue of sample covariance matrices from the hypothesized distribution mean or variance, versus the data matrix sample size, revealed two distinct patterns.
Both patterns were derived from plotting sample sizes versus deviations of the empirical
distribution's mean of the sample covariance matrices' largest eigenvalues from the hypothesized
distribution's mean, and sample sizes versus deviations of the observed distribution's variance of
the sample covariance matrices' first largest eigenvalue from the hypothesized distribution's
variance. The study discovered that the pattern derived using the means of the empirical and
assumed distributions was more stable and adequate for determining the optimal sample size
required to validate the distributional assumption for any identified network. Due to the striking
similarities between a stress-strain curve and the newly discovered pattern, the point of intersection
between the mean deviation curve and the horizontal axis at zero deviation, whose abscissa is
referred to as the optimum sample size for a given network, may be regarded as the yield point on
a stress-strain curve. Given that the yield point, which is commonly used in materials science and
engineering, denotes the boundary between elastic and plastic behavior, the optimum sample size
appears to be acceptable for verifying the limiting distributional assumption stated for project
network scheduling.
The second conclusion addressed the normalization procedure required to standardize the four largest eigenvalues of the sample covariance matrix utilized as test statistics and related to each of the study's project networks. A comparative analysis of the normalization methods based on the universality of the Tracy-Widom limit law of type 1 revealed that the scaling formulas developed by Baik et al. (1999) and Johansson (1998) are more appropriate for studying the behavior of project network schedules than the one devised by Johnstone (2001). This discovery is significant because the chosen approach was obtained from studying the length of the longest increasing subsequence of random permutations, which is analogous to the critical path sequentially connecting project activities from start to finish.
The third significant finding of the study is that the sampling probability distribution of project network schedules corresponds to the probability distribution of EF times of project network activities. The Kolmogorov-Smirnov goodness-of-fit test was used to validate the distributional assumption made about the sampling distribution. The test was designed to determine whether the Tracy-Widom limit law of type 1 is the natural sampling distribution of project network schedules as the sample size approaches a limit. This limiting size corresponds to the optimal sample size determined from a curve with a universal pattern. When the third and fourth largest eigenvalues of the sample covariance matrices were considered, the null hypothesis was accepted for 19 of the 21 networks investigated. The test performed considerably worse with the first largest eigenvalue than with the second largest eigenvalue. To corroborate these findings, the Q-Q plots and histograms used to visualize the simulation data demonstrated that the TW distribution of order 1 is a good fit for the sampling distribution of project network schedules. For the networks whose null hypothesis was rejected, a graphical examination of their Q-Q plots revealed that suitable rescaling and recentering of the mth greatest eigenvalue would likely improve their test performance. This conclusion also applies to all 35 project networks identified for this investigation, of which only 18 resulted in the null hypothesis being accepted in the case of Johnstone's (2001) normalization method.
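As a hedged illustration of this goodness-of-fit step, the Python sketch below builds an empirical TW1 reference sample from Gaussian Orthogonal Ensemble matrices, using the classical scaling N^(1/6)(lambda_max - 2*sqrt(N)) toward TW1, and compares a vector of normalized eigenvalue statistics against it with a two-sample Kolmogorov-Smirnov test. The matrix size and sample counts are arbitrary choices, and the dissertation's own test used tabulated TW1 quantiles rather than a simulated reference.

import numpy as np
from scipy.stats import ks_2samp

def tw1_reference(num_samples=1000, N=200, seed=0):
    # Approximate TW1 draws from GOE matrices: N**(1/6) * (lambda_max - 2*sqrt(N)).
    rng = np.random.default_rng(seed)
    out = np.empty(num_samples)
    for i in range(num_samples):
        M = rng.standard_normal((N, N))
        A = (M + M.T) / np.sqrt(2)          # GOE: off-diagonal variance 1
        lam_max = np.linalg.eigvalsh(A)[-1]
        out[i] = N ** (1 / 6) * (lam_max - 2 * np.sqrt(N))
    return out

def ks_against_tw1(normalized_stats, alpha=0.05):
    # Two-sample K-S test of observed normalized eigenvalues against the TW1 reference.
    stat, p_value = ks_2samp(normalized_stats, tw1_reference())
    return p_value, p_value >= alpha        # True: no reason to reject the TW1 assumption

# Sanity check: a second, independently simulated GOE sample should not be rejected.
print(ks_against_tw1(tw1_reference(seed=1)))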
The following sections summarize the chapter's primary contributions to the body of knowledge. The chapter's objective was to evaluate the evidence for a population covariance structure in schedules. Due to the network dependency structure created by the relationships between project activities, construction management and engineering practitioners have long suspected this covariance structure. The aims of this study have been met as a result of well-known universality results from RMT. The primary contributions are the following:
Append construction project management and engineering to the list of fields where the Tracy-Widom limit laws, based on Random Matrix Theory (RMT), have effectively analyzed high-dimensional data.
Propose a mathematical model for project network schedules based on well-established results in probability and statistics and on project scheduling techniques that may be used to study their underlying behavior.
Facilitate testing hypotheses about the durations of project network activities and of the whole project.
Devise a methodology based on multivariate statistical analysis and graphical methods for data analysis that can be used to determine the limiting duration of a project and of each activity comprising the project network schedule, beyond which any delay will be irreversible. The Tracy-Widom distribution has been established for this study as the limiting distribution of project network schedules.
Initiate a research study of the connections between a measure of project network complexity and the sample size required to draw an appropriate number of samples from a population of identically distributed activity durations as a requirement for studying project networks using RMT.
Practitioners have long sought to resolve the ubiquitous problem of project delays. To aid in their efforts, this research study, which opened new avenues for studying the behavior of project network schedules, began with numerous unknowns and ventured into unexplored ground. Aware of the limitations of deterministic scheduling techniques, which have been demonstrated to be ineffective at resolving the problem of delays, this research study provided an opportunity to explore other related fields. These fields have been utilizing modern mathematics to investigate the underlying behavior of complex systems. In modern mathematics, the study of the underlying behavior of any complex system begins with the construction of a mathematical model that defines the sample space representing the collection of all outcomes for any given measurable attribute (variable) of the population under study.
The probability distribution for drawing samples (events) from this sample space is then established. In most cases, the joint sampling distribution of the population data is unknown, but the distribution of the individual population variables is always known (a triangular distribution in the case of this study). The majority of well-known RMT theorems necessitate the creation of sample data matrices to compute the sample covariance matrices crucial for determining the intrinsic covariance matrix of the population representing the complex system under investigation. This true covariance structure can only be inferred for a system exhibiting sufficient complexity and sufficient correlation in its structure; hypotheses about it are then tested through the eigenvalues of the sample covariance matrices.
When employing RMT, not all sample covariance matrices may be used to investigate a given system's underlying behavior. Hence, its application requires proper standardization of the sample data matrices and careful formulation of the sample covariance matrices. The resulting matrix then belongs to one of the matrix ensembles governed by well-known limit laws in probability and statistics (e.g., the Wigner semi-circle law). Given the study's primary objective, utilizing the universality of the Tracy-Widom limit laws provides a potential method for finding solutions to the delay problem. These laws specify the limiting probability distribution of the first, second, third, or subsequent largest eigenvalue of a correctly normalized sample covariance matrix.
Given a project network schedule, the universality of the TW distribution laws is likely to aid in defining a duration threshold that should not be exceeded for a project to be completed on time and under its assigned budget. However, as with any other law, the TW's universality comes with restrictions on its applicability (e.g., Gaussian entries for sample data matrices). Owing to the contributions of numerous authors (e.g., Soshnikov 2002, Bao et al. 2015), most of the restrictions have been lifted. Relaxing these assumptions enables the law to be applied to a broader class of matrices whose entries are not necessarily Gaussian or whose ratio of the total number of rows to the total number of columns need not be near unity.
When combined with a suitable standardization of the largest eigenvalue of the sample covariance matrix for project networks, the proposed model demonstrated that the universality of the TW limit law of type 1 holds when the eigenvalues are appropriately normalized. As such, the following recommendations for future research are offered.
While the current study used only project network schedules from the Project Scheduling Problem Library (PSPLIB), whose maximum size is 120 activities, future research should extend the analysis to larger networks.
Because this analysis revealed no correlation between restrictiveness (RT) and the number of samples required to satisfy the conditions of applying TW-based universal theorems, future research should investigate alternative complexity metrics, such as the other five identified by this study.
While the normalization approach utilized in this study was specified as a function of n and p using Johnstone's (2001) celebrated theorem, especially its ad hoc version, future studies might explore employing a more extended formulation of the centering and scaling functions, such as that of Péché (2008), to validate distributional assumptions in circumstances where the sample covariance matrix's first or second largest eigenvalue and a supercomputer are available for speedy simulations.
Future research may also extend the study to at least the fifth and sixth largest eigenvalues when employing the normalization method derived from the work of Johansson (1998) and Baik et al. (1999). Bornemann (2009) determined numerical approximations to the Tracy-Widom distribution functions (CDFs) up to the sixth largest eigenvalue, allowing for this expansion.
2.8 Conclusion
This chapter concludes with suggestions for future research studies and a formulation of the current study's contributions to the corpus of knowledge. The comprehensive empirical study, formulated by adopting and adapting proven methodologies from other areas of application of the TW limit laws, contributed to achieving the chapter's objectives, which were defined based on the study's overarching research goal.
Abstract
Significant findings have been achieved by employing the proposed methodology for analyzing
the principal components of a sample covariance matrix or correlation matrix generated from
project network schedules. Project network schedules are comprised of precedence relationships
between activities that can become complex as the number of pairwise links (thousands in large
projects) between activities grows, making them challenging to design and maintain. This study
demonstrated that applying well-known concepts from random matrix theory can help develop
better project schedules. The proposed methodology is based on the largest eigenvalue of sample covariance and population correlation matrices derived from sampled durations of a project network's activities. The methodology's assumptions limit the sample size to an optimum sample size determined at a significance level α and require an appropriate normalization procedure for standardizing the matrices' eigenvalues. Under these conditions, it has been established that the Tracy-Widom distribution of order 1 (TW1) is the joint sampling distribution of the durations of a project network's activities at the significance level α. Moreover, the proposed methodology relies on three identified rules that assisted in selecting the principal components (PCs) to retain.
The simulations performed on a small number of networks of varied sizes yielded the following results. First, applying the scree plot rule, the study found that there seems to be a direct correlation between the optimum sample size and the number of PCs to retain for a given network. Additionally, the analysis indicated that Johnstone's (2001) spiked covariance model is a viable candidate for predicting the limiting durations of project network activities using a PCA-based linear regression. The spiked model, an empirical derivation of Johnstone (2001), is a covariance matrix with a specific structure used to characterize the behavior of a system having one or more prominent eigenvalues that are easily differentiated from the rest of the data. Second, applying the second rule based on hypothesis testing with TW p-values, the study found some limitations, essentially tied to the testing conditions, that prevented the specified null hypothesis from being adequately evaluated.
Finally, the calculated threshold value indicates a phase transition in project network schedules. This phase transition separates the TW distribution's left and right tails and identifies a critical zone over which the system transitions from the weak (stable) to the strong (unstable) coupling phase. This discovery is critical because it is likely to assist practitioners in identifying the point at which a construction project schedule may become unstable. The empirical investigation, conducted on only a few project networks, has yielded exciting results. Still, more research is needed on project networks of various sizes and complexities to develop guidelines for constructing PCA-based models and locating phase transitions in project network schedules.
Various scientific and engineering disciplines (e.g., genetics, meteorology, agriculture, and
econometrics) rely on exploratory data analysis and visualization. The requirement to evaluate vast
amounts of multivariate data raises the fundamental dimensionality reduction problem as Roweis
and Saul (2000) inquired about: how to develop compact representations of high-dimensional data.
Due to digitization, a large volume of records is being generated across numerous sectors; reducing high-dimensional data is therefore a critical problem in modern life. While the literature on the issue recommends various strategies for data reduction, this study focuses on Principal Component Analysis (PCA). As Naik (2017) wrote, PCA is a frequently used matrix factorization approach for
reducing the dimension of sets of random variables or measurements and identifying hidden
features beneath them. A review of dimensionality reduction strategies, such as the one undertaken
by Sorzano et al. (2014), classified the diversity of available techniques by providing the
mathematical foundations for each, which is an excellent source of well-known techniques. PCA
is concerned with reducing the dimensionality of a data set while keeping as much variance as
possible (Jolliffe 2002). This data set comprises many interrelated variables. Scholars generally accomplish such a reduction by transforming to a new collection of variables known as the principal components (PCs), which are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables.
Nevertheless, after investigating the presence of a covariance structure in the sampled durations of
project network activities in the previous chapter using the universality of the Tracy-Widom laws,
the goal of this chapter is twofold. The initial goal will be to run a sphericity test to see if a specific
sample variance-covariance matrix (its realization) matches a population with a particular matrix
defining the population covariance structure. Following the determination of the population
covariance matrix (Σ), the second objective of this chapter will be to employ Σ to perform data
reduction and interpretation via PCA. A comprehensive literature study on the sphericity test and
PCA is necessary to fulfill the chapter's primary goal. The methodology that will help achieve the
chapter's primary goals will follow. The proposed methodology will then be applied to a small
number of identified project networks, followed by an analysis of the outcomes. This chapter will
end with the contributions to the body of knowledge and recommendations for further research.
Principal component analysis (PCA) is a popular matrix factorization technique for reducing the dimension of sets of variables or measurements. Its power comes from the basic assumption that distinct physical processes generate independent variables. The reduction, although there are numerous other applications of PCA, is achieved by identifying the variance-maximizing directions in the space containing the data (Bejan 2005). PCA is a multivariate statistics technique based on correlations or covariances that dates back to Pearson (1901) and Hotelling (1933). A large body of literature exists, from classic treatments such as Jolliffe (2002) to fascinating modern variations covering the most recent cutting-edge subjects in PCA (e.g., Naik 2017). In recent decades, the scale of data collection has increased. It
is no longer uncommon for the number of variables or features collected, p, to be on the order of,
or greater than, the number of instances (or sample size), n. Under specific assumptions on the
covariance structure of the data in this "high dimensional" scenario, the statistical features of PCA
display phenomena that are possibly unexpected when seen from the historically common
perspective of numerous samples and a fixed number of variables (Johnstone and Paul 2018). The
PCA is carried out for two main reasons: selecting the major components that characterize a mathematical model, and checking for a covariance structure in the data. Both are examined in turn.
Although the full set of p components is required to reproduce the total variance-covariance structure of a set of variables, much of the overall system variability can frequently be accounted for by a small number k of the PCs. In this case, the k components contain (almost) as much information as the original p variables. The p original variables can therefore be replaced with k principal components, and the original data set including n measurements on p variables can likewise be replaced with the reduced data set containing n measurements on k principal components. An analysis of principal components often reveals relationships and interpretations that would not have occurred otherwise (Johnson and Wichern 2019). Although it
is preferable to deal with a smaller number of linear combinations of the p variables containing
most of the information, reducing a data set to fewer components is tricky. According to scholars
such as Forkman et al. (2019), the first few PCs usually indicate fascinating systematic patterns.
In contrast, the last few may reflect random noise rather than a recurring pattern. As a result, the last
ones are often eliminated. For investigators, the critical question is how many PCs are statistically
significant. Practitioners retain only a few, depending on the fraction of variation explained by the
first few PCs. Because each PC is almost certainly a function of all p variables, the values of all p
variables are still necessary to calculate the PCs. The literature on the issue provides numerous
rules for determining an appropriate value of k, the majority of which are ad hoc. In practice, these
decision-making criteria are mainly based on the behavior of the sample covariance or correlation
matrix's largest eigenvalues rather than dealing with natural objects (Bejan 2005). The first three
rules for determining k are ad hoc rules of thumb. However, despite various attempts to formalize
them, as Jolliffe (2002) stated, they are intuitively logical and work in practice. These criteria are
(1) cumulative percentage of total variation—see Cangelosi and Goriely (2007) for application to
biology; (2) size of variances of principal components, also known as the Kaiser's rule (Kaiser
1960)—see Shin et al. 2012, and Kevric and Subasi 2014 for applications; (3) the scree graph and
its alternative the log-eigenvalue (or LEV) diagram introduced by Cattell (1966) and Craddock
and Flood (1969), respectively—e.g., see García-Alvarez (2009) for application in engineering
and Franklin et al. (1995) for references with regards to parallel analysis developed by Horn (1965)
as a modification of Cattell's scree diagram. The second set of criteria consists of formal
hypothesis tests.
However, according to Jolliffe (2002), they use distributional assumptions that are frequently
unrealistic and often retain more variables than are necessary. The Bartlett (1950) test is an example of this type. It contrasts the null hypothesis that the last p − k ranked eigenvalues (from k + 1 to p) are equal with the alternative hypothesis that at least two of them are unequal. Because of the issues with these rules, a further list of statistical rules is found in the literature.
For example, for covariance matrices, a maximum likelihood ratio test (MLRT) using the null
distribution of the test statistic can be employed directly (e.g., Johnstone 2001; Choi et al. 2017,
Saccenti and Timmerman 2017). The latter are statistical rules, most of which do not require distributional assumptions. The concept underlying these methods is quite similar to the cumulative percentage of the total variation, except that each entry x_ij of X is now predicted from an equation similar to the SVD but based on a submatrix of X that does not include x_ij. The PCs are selected based on PRESS (PREdiction Sum of Squares), derived from Allen's work; one approach relies on a multidimensional model (e.g., Forkman et al. 2019 for applications in biology and environmental sciences) and another on bootstrapping the data itself. Another variant is related to the jackknife estimation process. While these criteria for determining the number of variables to keep have been supplied as background information, the mathematics underlying most of them and illustrative examples may be found in Jolliffe's (2002) famous work. However, several of these rules remain largely ad hoc.
The second reason may be illustrated by referring to a general example supplied by Bejan (2005).
Assume that one is interested in determining whether a particular data matrix has a particular
covariance structure. Then, according to standard statistical methodology, one could consider the
following. First, for a specific type of population distribution, find the distribution of some
statistics that is a function of the sample covariance matrix; then, construct a test based on this
distribution and use it whenever the data satisfy the test's conditions. Some tests employ this
methodology, for example, a test of sphericity in examinations of the data's covariance structure;
for references, see Kendall and Stuart (1968), Korin (1968), and Mauchly (1940). Naturally, this
is a pretty broad overview. However, given recent results in our field of interest, which will be
explored in greater detail in later sections, one may hope that such a statistic for assessing a
covariance structure is based on the largest sample eigenvalue. Indeed, such tests have evolved
and are now referred to as the largest root tests of Σ = I in the literature; see Roy (1953).
Principal components analysis (PCA) is extensively utilized in various domains, including data
analysis, model compression, and multivariate process monitoring, with the primary goal of data
reduction. Its end-users include academic researchers and professionals from a variety of fields. In
addition, there are several applications in domains other than construction and engineering, some
of which can be found in various journal articles and theses. For example, Shubham (2021) recently
employed PCA in Computer and Mathematics Education to establish nine parameters for a
government control approach to preventing COVID-19 proliferation in India. The Indian second
wave SEIR model based on the nine parameters helped with the analysis. The PCA results ranked
wearing a mask first (90%), sneezing into a tissue second (65%), and using a sanitizer dispenser third.
Nonetheless, because the current research project is being undertaken in CE, a literature review on the applications of PCA in this field is required. Doing so will aid in presenting the current state of PCA practice in CE. According to the conducted literature survey, PCA is not new to the construction and
engineering communities. Various practitioners have used PCA to tackle engineering challenges
and problems faced during project construction in many parts of the world. For example, el-Kholy
(2021) evaluated the best models for Predicting Delay and Cost Overrun Percentages (PDCOP)
for highway projects using four accurate Artificial Neural Networks (ANN) paradigms in the field
of transportation engineering. The research methodology included applying various models to each
paradigm based on the Input Projection Algorithm, including rule and function. In addition, the
methodology included a sensitivity analysis to ensure the consistency of the results for the superior
models. According to the PCA paradigm, his best-proposed model outperformed previously
published models by having a Mean Absolute Percentage Error (MAPE) of 25.4 percent for
forecasting percent cost overrun, compared to 30.42 and 40.37 % for models in the literature. Other
applications include Ghosh and Jintanapakanont's identification and assessment of significant risk factors.
Chai et al. (2015) used a Structural Equation Modeling (SEM) approach to examine the current
delay reduction measures in construction to help alleviate housing supply delays caused by
Malaysia's rapid growth and urbanization. They conducted research in Malaysia's 13 states and
three Federal Territories. As a result, the PCA found 17 mitigating solutions, with the essential
mitigating approach being the prevention of delays in house supply. In a similar vein, Tahir et al.
(2017) analyzed 69 responses to discover the fundamental causes of delay and cost overruns in the
Malaysian construction industry using PCA and factor analysis. Their analysis indicated that the
primary causes of delays and cost overruns were delays in design document creation, poor time
management, material delivery delays, a lack of awareness of different execution methods, labor
and material shortages, and changes in the scope of work. Moreover, Karji et al. (2020) used PCA
to identify the primary challenges to promoting sustainable construction in the United States.
In Computing in Civil Engineering, Dao et al. (2017) used PCA to construct input for a project
complexity model that researchers and practitioners may use to examine a project's complexity
levels based on well-established complexity indicators. They used findings from the study
conducted by the Construction Industry Institute (CII) in 2016, which identified 37 project complexity indicators.
However, because having too many predictive variables can be detrimental to a regression model
(the number of parameters being greater than the number of observations), PCA was an appropriate
technique for reducing the number of original variables for the model. Consequently, out of the 37
explanatory variables, the PCA technique, based on the Pearson Chi-square test at a given significance level, retained a reduced subset.
In hydraulic engineering, Nam et al. (2019) proposed an effective method for burst monitoring,
isolation, and sensor placement in water distribution networks (WDN) using PCA and other data-driven techniques. Their approach used Bernoulli's equation to define pressure and flow rate patterns, which necessitated a pretreatment
procedure to normalize the data of varying size and variation and reduce the dimensionality of the
input data. As a result, a PCA was used to modify the input data to the k-means algorithm, which
was inefficient without the modification. The supervised k-means clustering methodology
uncovered natural features of the data based on similarities among them. Their proposed
monitoring method improved the isolation ratio by 10% compared to conventional systems, and
the sensor combination was 40% less expensive. Because their system was not designed to handle
complex real-world WDNs (for example, industrial use), they anticipated that deep learning could address this limitation in future work.
In hydrology, Arabzadeh et al. (2016) proposed a novel drought index (SDI) based on PCA as a
valuable tool for monitoring hydrological drought that is streamflow dependent. First, using
Kolmogorov-Smirnov and chi-square tests, they demonstrated that the streamflow time series does not
follow a normal distribution for the ability to employ well-known distributions for fitting purposes.
Next, they used Bartlett's sphericity test (BST) to validate the PCA requirements (the presence of
strong correlations between variables known as hyper cloud's correlation). The test established the
sufficiency of the hyper cloud's correlation at each time scale at the 1% significance level using a
test statistic generated from the eigenvalues of correlation matrices. They then implemented PCA
based on scree plots to display the eigenvalues and cumulative variability against PCs' number and
other graphical methods. Their results revealed significant correlations between the SDI series of
the stations for specific time scales. Furthermore, the first principal component (PC1) explains 58–
85 % of the regional variability in the SDI series at the time scales specified. Additional
applications may be added, but the ones listed should illustrate how PCA is employed in CE. Algebraically, principal components are particular linear combinations of the p random variables X_1, ⋯, X_p. Geometrically, these combinations indicate the selection of a new coordinate system created by rotating the original system with X_1, ⋯, X_p as the coordinate axes. The new axes denote the directions with the
greatest variability and provide a more concise and straightforward representation of the
covariance structure. The covariance matrix or correlation matrix of the random variables
X_1, ⋯, X_p determines the PCs, as will be discussed. According to a few researchers (Jolliffe 2002,
Johnson and Wichern 2019), their development does not necessitate a multivariate normal
assumption. However, it is worth mentioning that principal components derived for multivariate
normal populations can be usefully interpreted in terms of constant-density ellipsoids, allowing for inferences based on normal theory.
To introduce PCA, let the random vector X' = [X_1, ⋯, X_p] have the covariance matrix Σ with eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_p ≥ 0, and consider the linear combinations:
Y_1 = a_1'X = a_11 X_1 + a_12 X_2 + ⋯ + a_1p X_p
Y_2 = a_2'X = a_21 X_1 + a_22 X_2 + ⋯ + a_2p X_p    [Eq. 3-1]
⋮
Y_p = a_p'X = a_p1 X_1 + a_p2 X_2 + ⋯ + a_pp X_p
Thus, using well-established variance (var) and covariance (cov) properties resulting from linear
combinations of random variables, one can construct Equation 3.2(a) and Equation 3.2(b):
Var(Y_i) = a_i'Σa_i,    i = 1, 2, ⋯, p    (a)
    [Eq. 3-2]
Cov(Y_i, Y_k) = a_i'Σa_k,    i, k = 1, 2, ⋯, p with i ≠ k    (b)
The principal components are those uncorrelated linear combinations Y_1, Y_2, ⋯, Y_p with the largest variances in Equation 3.2(a). The first PC Y_1 represents the linear combination with the greatest variance. Note that the variance Var(Y_i) can be increased without bound by multiplying any vector a_i by some constant. To eliminate this indeterminacy, practitioners restrict attention to coefficient vectors of unit length, that is, a_i'a_i = 1. With these settings, the following significant results are stated to aid in understanding the methodology.
Result 1: Assume that Σ is the covariance matrix associated with the random vector X' = [X_1, ⋯, X_p], and let (λ_1, e_1), (λ_2, e_2), ⋯, (λ_p, e_p) be the eigenvalue-eigenvector pairs of Σ such that λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_p ≥ 0. Hence, Equation 3.3 below provides the ith PC.
Y_i = e_i'X = e_i1 X_1 + e_i2 X_2 + ⋯ + e_ip X_p,    i = 1, 2, ⋯, p    [Eq. 3-3]
As a result, Equation 3.4(a) and Equation 3.4(b) can be derived as follows:
Var(Y_i) = e_i'Σe_i = λ_i,    i = 1, 2, ⋯, p    (a)
    [Eq. 3-4]
Cov(Y_i, Y_k) = e_i'Σe_k = 0,    i, k = 1, 2, ⋯, p with i ≠ k    (b)
If some of the λ_i are equal, the choices of the corresponding coefficient vectors e_i, and hence of the Y_i, are not unique. From this result, one may conclude that the principal components are uncorrelated and have variances equal to the eigenvalues of Σ.
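A small numerical check of this result is immediate; in the Python sketch below, the 3 × 3 covariance matrix is an assumption chosen only for illustration.

import numpy as np

# Illustrative population covariance matrix (values assumed for the example).
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

lam, E = np.linalg.eigh(Sigma)        # ascending eigenvalues, orthonormal eigenvectors
lam, E = lam[::-1], E[:, ::-1]        # reorder so that lambda_1 >= ... >= lambda_p

for i in range(3):
    e_i = E[:, i]
    # Var(Y_i) = e_i' Sigma e_i = lambda_i  (Eq. 3-4a)
    assert np.isclose(e_i @ Sigma @ e_i, lam[i])
    for k in range(i + 1, 3):
        # Cov(Y_i, Y_k) = e_i' Sigma e_k = 0  (Eq. 3-4b)
        assert np.isclose(e_i @ Sigma @ E[:, k], 0.0)

print(lam / lam.sum())                # proportions of total variance (Eq. 3-6)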
Result 2: Let Σ be the covariance matrix associated with the random vector X' = [X_1, ⋯, X_p], and let (λ_1, e_1), ⋯, (λ_p, e_p) be its eigenvalue-eigenvector pairs. Equation 3.5 establishes the link between the population covariance matrix and the PCs:
σ_11 + σ_22 + ⋯ + σ_pp = Σ_{i=1}^p Var(X_i) = λ_1 + λ_2 + ⋯ + λ_p = Σ_{i=1}^p Var(Y_i)    [Eq. 3-5]
Using Equation 3.5, Result 2 shows that the total population variance equals the sum of the variances of the variables X_i, which coincides with the sum of the eigenvalues of the population covariance matrix Σ. As a result, Equation 3.6 indicates the fraction or proportion of total variance explained by the kth principal component:
(Proportion of total population variance due to the kth PC) = λ_k / (λ_1 + λ_2 + ⋯ + λ_p),    k = 1, 2, ⋯, p    [Eq. 3-6]
The general rule is that if the first one, two, or three components account for the majority (80 to 90%) of the total population variance, these components can "replace" the original p variables with little loss of information.
The magnitude of e_ik, with k = 1, 2, ⋯, p, indicates the relative contribution of the kth variable to the ith principal component, regardless of the other variables. The coefficient e_ik is specifically proportional to the correlation coefficient between Y_i and X_k. The following result specifies the correlation coefficients between the components Y_i and the variables X_k:
ρ_{Y_i, X_k} = e_ik √λ_i / √σ_kk,    i, k = 1, 2, ⋯, p    [Eq. 3-7]
While correlations between variables and PCs aid in the interpretation of components, they only quantify the univariate contribution of an individual variable X_k to a component Y_i. For that reason, some statisticians "recommend that only the coefficients e_ik, not the correlations, be used to interpret the components."
While the preceding section discussed the principal components derived from the population variables, this section discusses the PCs produced from standardized population variables such as those in Equation 3.8.
Z_1 = (X_1 − μ_1)/√σ_11
Z_2 = (X_2 − μ_2)/√σ_22    [Eq. 3-8]
⋮
Z_p = (X_p − μ_p)/√σ_pp
In matrix notation, Equation 3.8 may be written as in Equation 3.9, with V^(1/2) the diagonal standard deviation matrix:
Z = (V^(1/2))^(−1) (X − μ),  where  V^(1/2) = diag(√σ_11, √σ_22, ⋯, √σ_pp)    [Eq. 3-9]
Without a doubt, the expectation E(Z) = 0, and Equation 3.10 represents the covariance of Z:
Cov(Z) = (V^(1/2))^(−1) Σ (V^(1/2))^(−1) = ρ    [Eq. 3-10]
From Equation 3.10, one can obtain the principal components of Z from the eigenvectors of the
correlation matrix ρ of X. All prior results apply from here, as the variance of each Z_i is equal to one. To simplify, the notation Y_i will refer to the ith PC of either ρ or Σ. Nonetheless, the pairs (λ_i, e_i) derived from Σ are, in general, not identical to those derived from ρ. The literature contains numerous examples demonstrating how standardization significantly impacts PCs (e.g., Jolliffe 2002).
Result 4: Equation 3.11 defines the ith principal component of the standardized population Z:
Y_i = e_i'Z = e_i'(V^(1/2))^(−1)(X − μ),    i = 1, 2, ⋯, p    [Eq. 3-11]
Additionally, Equation 3.12 provides a relationship between the variances of the standardized variables and those of their PCs:
Σ_{i=1}^p Var(Y_i) = Σ_{i=1}^p Var(Z_i) = p    [Eq. 3-12]
Furthermore, Equation 3.13 provides the correlation coefficients between the PCs of the standardized variables and the variables themselves:
ρ_{Y_i, Z_k} = e_ik √λ_i,    i, k = 1, 2, ⋯, p    [Eq. 3-13]
Here, the (λ_i, e_i) are the eigenvalue-eigenvector pairs of ρ. Like Equation 3.6, a similar expression gives the fraction of total variance explained by the kth PC of Z:
(Proportion of standardized population variance due to the kth PC) = λ_k / p,    k = 1, 2, ⋯, p    [Eq. 3-14]
There are specific patterned covariance and correlation matrices whose principal components can be expressed in well-known and straightforward forms. Such structured matrices are sought after in PCA because, for the most part, they allow inferences through hypothesis testing by relying on well-established results, which is especially useful when dealing with big data matrices. Diagonal matrices are an example. To illustrate one of these structures, owing to its relevance, consider the diagonal covariance matrix in Equation 3.15(a):
Σ = diag(σ_11, σ_22, ⋯, σ_pp)    (a)    [Eq. 3-15]
Σ e_i = σ_ii e_i    (b)
Setting e_i' = [0, ⋯, 0, 1, 0, ⋯, 0], with 1 in the ith position, one can verify Equation 3.15(b) and conclude that (σ_ii, e_i) is the ith eigenvalue-eigenvector pair. Because the linear combination e_i'X = X_i, the set of principal components is just the original set of uncorrelated random variables. As a result, extracting the PCs for a covariance matrix with the pattern provided by Equation 3.15(a) yields no benefit. From another perspective,
if X is distributed as N_p(μ, Σ), the contours of constant density are ellipsoids with axes already
pointing in the direction of maximum variation. As a result, the coordinate system does not need
to be rotated. Finally, it is worth noting that standardization does not affect the situation in Equation
3.15(a). In such an instance, ρ = I, the p × p identity matrix, and the constant-density contours of a multivariate normal population are spheroids.
Another patterned covariance matrix, which is frequently used to describe the correspondence
between certain biological variables, such as the sizes of living organisms, has the general form of
Equation 3.16(a) for its covariance matrix, with the resulting correlation matrix being the same as
in Equation 3.16(b).
Σ =
⎡ σ²    ρσ²   ⋯   ρσ² ⎤
⎢ ρσ²   σ²    ⋯   ρσ² ⎥    (a)
⎢  ⋮      ⋮     ⋱    ⋮  ⎥
⎣ ρσ²   ρσ²   ⋯   σ²  ⎦
    [Eq. 3-16]
ρ =
⎡ 1   ρ   ⋯   ρ ⎤
⎢ ρ   1   ⋯   ρ ⎥    (b)
⎢ ⋮   ⋮   ⋱   ⋮ ⎥
⎣ ρ   ρ   ⋯   1 ⎦
In this situation, 𝝆 additionally represents the standardized variables' covariance matrix. Moreover,
Equation 3.16(b) implies that the variables X_1, ⋯, X_p are equally correlated. One can demonstrate that the p eigenvalues of the correlation matrix in Equation 3.16(b) fall into two categories, whose expressions are provided in Equation 3.17(a). As for the corresponding eigenvectors, Equation 3.17(b) gives the expression of e_1 associated with the largest eigenvalue λ_1; the remaining ones can be found in the literature (e.g., see Johnson and Wichern 2019). Equation 3.17(c) and Equation 3.17(d) provide the first principal component of ρ and the proportion of the total variance explained by this component:
λ_1 = 1 + (p − 1)ρ,    λ_2 = λ_3 = ⋯ = λ_p = 1 − ρ    (a)
e_1' = [1/√p, 1/√p, ⋯, 1/√p]    (b)    [Eq. 3-17]
Y_1 = e_1'Z = (1/√p)(Z_1 + Z_2 + ⋯ + Z_p)    (c)
λ_1/p = [1 + (p − 1)ρ]/p    (d)
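These closed forms are easy to verify numerically; in the Python sketch below, the size p = 5 and the common correlation ρ = 0.6 are arbitrary illustrative choices.

import numpy as np

p, rho = 5, 0.6                                       # illustrative values only
R = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)     # equicorrelation matrix (Eq. 3-16b)

lam = np.linalg.eigvalsh(R)[::-1]                     # eigenvalues, largest first
assert np.isclose(lam[0], 1 + (p - 1) * rho)          # Eq. 3-17(a), largest eigenvalue
assert np.allclose(lam[1:], 1 - rho)                  # remaining p - 1 eigenvalues

print(lam[0] / p)                                     # share of variance from PC1 (Eq. 3-17d)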
After discussing the principles of principal component analysis based on the population covariance or correlation matrix, this section discusses how to summarize the variation in n measurements on p variables drawn from a population with mean vector μ and covariance matrix Σ. In the previous chapter, it was explained that the sample mean vector x̄, sample covariance matrix S, and sample correlation matrix R can be derived from these data. The goal of this part is to build uncorrelated linear combinations of the measured variables that account for a large proportion of the variation in the sample. Hence, the sample PCs will be the uncorrelated combinations with the greatest variances.
Through straightforward transformations, one may demonstrate that the n values of any linear combination
a_1'x_j = a_11 x_j1 + a_12 x_j2 + ⋯ + a_1p x_jp,    j = 1, 2, ⋯, n
have sample mean a_1'x̄ and sample variance a_1'S a_1. In addition, the pairs of values (a_1'x_j, a_2'x_j) for two linear combinations have sample covariance a_1'S a_2.
With these settings, the sample principal components are defined as those linear combinations with maximum sample variance. As with the population quantities, the coefficient vectors a_i are restricted to unit length, a_i'a_i = 1. More specifically, they are provided in Figure 3.2.
Figure 3.2: Illustration of Coefficient Vectors Maximizing the 1st and ith Sample PCs
Hence, the first sample principal component maximizes a_1'S a_1, which translates to Equation 3.18:
l̂_1 = max_{a_1 ≠ 0} (a_1'S a_1)/(a_1'a_1)    [Eq. 3-18]
The maximum is attained by appealing to the result derived for the maximization of quadratic forms for points on the unit sphere (e.g., see Johnson and Wichern 2019). As a result, identical to the initial Results 1–3, the following result about sample principal components is obtained.
Result 5: If S = {s_ik} is the p × p sample covariance matrix with eigenvalue-eigenvector pairs (l̂_1, û_1), ⋯, (l̂_p, û_p), the ith sample principal component is given by
ŷ_i = û_i'x = û_i1 x_1 + û_i2 x_2 + ⋯ + û_ip x_p,    i = 1, 2, ⋯, p    (a)
and the correlation coefficients between components and variables by
r_{ŷ_i, x_k} = û_ik √l̂_i / √s_kk,    i, k = 1, 2, ⋯, p    (e)
These quantities are obtained from S, S_n, or R when there is no ambiguity. Please note that the components derived from each are not identical. Furthermore, the observations x_j are frequently centered by subtracting x̄, which does not influence the sample covariance matrix S. For the centered observations, Equation 3.20(a) defines the ith PC, and it can be demonstrated that each sample PC has mean zero, as in Equation 3.20(c):
ŷ_ji = û_i'(x_j − x̄),    i = 1, 2, ⋯, p,  j = 1, 2, ⋯, n    (a)    [Eq. 3-20]
ȳ_i = 0    (c)
In general, sample PCs are not invariant with regard to scale changes. As indicated previously in the discussion of population components, variables measured on different scales, or on a single scale with widely varying ranges, are frequently standardized (e.g., see Saccenti et al. 2011, Forkman et al. 2019). The literature on the subject reveals a plethora of standardization strategies (dividing each centered observation by its sample standard deviation being the most common such strategy). For illustration, standardization can be performed for the sample by creating a new sample z_j, as in Equation 3.21(a), from the observations x_j of the random variable X. Equation 3.21(b) provides the expression of the new n × p sample data matrix Z.
z_j = D^(−1/2)(x_j − x̄), with kth element z_jk = (x_jk − x̄_k)/√s_kk,    j = 1, 2, ⋯, n    (a)
    [Eq. 3-21]
Z = [z_1, z_2, ⋯, z_n]' = {z_jk},  an n × p matrix with entries z_jk = (x_jk − x̄_k)/√s_kk    (b)
Equation 3.21 produces the sample mean vector z̄ and covariance matrix S_z in Equation 3.22:
z̄ = (1/n) Z'1 = 0    (a)
    [Eq. 3-22]
S_z = (1/(n − 1)) (Z − 1z̄')'(Z − 1z̄') = (1/(n − 1)) Z'Z = R    (b)
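The identity S_z = R behind Equation 3.22 can be confirmed in a few lines of Python; the column scales applied below are arbitrary assumptions meant to mimic variables measured in very different units.

import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 5.0, 0.2, 10.0])   # very different scales

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Eq. 3-21: center and scale columns
S_z = Z.T @ Z / (n - 1)                            # Eq. 3-22(b)

# The covariance matrix of the standardized data equals the correlation matrix R of X.
assert np.allclose(S_z, np.corrcoef(X, rowvar=False))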
Equation 3.19 gives the sample principal components of the standardized data, with R replacing S. Because the observations are already centered, there is no need to rewrite them in the form of Equation 3.20(a). The results for the standardized observations Z are stated as Result 6.
Result 6: The ith sample principal component is given by Equation 3.23 if z_1, z_2, ⋯, z_n are standardized observations with sample covariance (correlation) matrix R:
ŷ_i = û_i'z = û_i1 z_1 + û_i2 z_2 + ⋯ + û_ip z_p,    i = 1, 2, ⋯, p    (a)    [Eq. 3-23]
The proportion of the total sample variance explained by the kth sample principal component is then
(Proportion of standardized sample variance due to the kth PC) = l̂_k / p,    k = 1, 2, ⋯, p    [Eq. 3-24]
As a basic rule, one should preserve only those components with variances greater than unity or, more precisely, only those components that individually explain at least a proportion 1/p of the total variance. However, this rule lacks a theoretical foundation and should not be applied carelessly.
When performing a PCA, one searches for the successively orthogonal directions (eigenvalues l̂_i / eigenvectors û_i) that maximally describe the variation in the data consisting of n observations on p variables, as formalized in Equation 3.25:
l̂_i = max { u'S u : u'u = 1, u ⊥ û_1, ⋯, û_(i−1) },    i = 1, 2, ⋯, p    [Eq. 3-25]
Note that the eigenvalue-eigenvector pairs (l̂_i, û_i) may also be derived from the singular value decomposition of the data matrix X. The sampled variables then take the form of Equation 3.23(a).
There are various possible interpretations of the sample principal components (e.g., see Jolliffe 2002, Anderson 2003, Greenacre and Hastie 1987). To make this concept easier to grasp, assume the underlying distribution of X is nearly normal. The sample PCs in Equation 3.26(a) are then realizations of the population PCs specified in Equation 3.26(b), whose coefficient vectors v_i are the eigenvectors of Σ:
ŷ_i = û_i'(x − x̄),    i = 1, 2, ⋯, p    (a)
    [Eq. 3-26]
Y_i = v_i'(X − μ),    i = 1, 2, ⋯, p    (b)
Also, from the sample values x_j, one can approximate μ by x̄ and Σ by S. If S is positive definite, the contour defined by all p × 1 vectors x satisfying Equation 3.27(a), referred to as a constant Mahalanobis distance from the sample mean, estimates the constant density contour of the underlying normal population in Equation 3.27(b):
(x − x̄)'S^(−1)(x − x̄) = c²    (a)    [Eq. 3-27]
(x − μ)'Σ^(−1)(x − μ) = c²    (b)
The approximate contours can be drawn to illustrate the normal distribution that generated the data
(e.g., see Jolliffe 2002, p. 23). While the normality assumption is favorable for inference
approaches, it is unnecessary to derive the characteristics of the sample PCs given in Equation
3.19. Even when the normal assumption is questioned and the scatter plot deviates from an
elliptical shape, the eigenvalues of S can still be extracted to obtain the sample PCs. Geometrically,
the data can be plotted as n points in p-space. The data can then be expressed in new coordinates
corresponding to the contour axes of Equation 3.27(a). Hence, this equation defines a
hyperellipsoid centered on the sample mean x̄ and with axes defined by the eigenvectors of S.
Since û_i has a length of 1, the absolute value of the ith principal component (|ŷ_i|) corresponds to the length of the projection of the vector (x − x̄) onto the unit vector û_i. Thus, as defined in Equation 3.26(a), the sample principal components ŷ lie along the hyperellipsoid's axes, and their absolute values are the lengths of the projections of (x − x̄) in the directions of the axes û_i. As a result, the
sample PCs can be considered as the result of translating the origin of the original coordinate
system to x̄ and then rotating the coordinate axes until they intersect the scatter in the directions
of maximum variation.
Figure 3.3 depicts the geometry of the sample PCs in two-dimensional space (p = 2). Figure 3.3(a) illustrates a constant-distance ellipse centered on x̄, with l̂_1 > l̂_2. The sample PCs are well defined: they lie along the ellipse's axes, in the directions of maximum sample variance. Figure 3.3(b) depicts a constant-distance ellipse centered on x̄ with l̂_1 ≈ l̂_2. In this situation, the constant-distance contours are almost circular, the eigenvalues of S are approximately equal, and the sample variation is homogeneous in all directions. It is then not possible to single out well-defined directions of maximum variation.
Figure 3.3: Constant-Distance Ellipses with (a) l̂_1 > l̂_2 and (b) l̂_1 ≈ l̂_2
The final few sample principal components can often be disregarded if the last few eigenvalues l̂_i are small enough that the variance in the related û_i directions is negligible. The data can then be effectively modeled by their representations in the space of the preserved components. For further geometric interpretations, including the p-dimensional and n-dimensional views, one may consult PCA-related works such as Jolliffe (2002).
The number of components to keep was discussed earlier in the sections related to the literature review. Furthermore, as previously stated, there is no definitive answer to this question. Typically, practitioners supplement the proportion-of-total-sample-variation-explained rule with another rule worth revisiting: the scree plot. A scree plot (Cattell 1966) is an effective visual aid for determining the number of significant components. It depicts l̂_i versus i, the magnitude of an eigenvalue versus its rank, with the eigenvalues arranged from largest to smallest. To identify the appropriate number of components, one examines the scree plot for an elbow (bend); the number of components to retain is the value at which the remaining eigenvalues are all approximately equal in size. For instance, Figure 3.4, courtesy of Donald et al. (2009), reveals that the first 20 eigenvalues account for most of the variance. As a result, the dimensionality can be reduced from (50 × 1000) to (50 × 20) while retaining much of the original variation.
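The elbow is normally judged by eye from the plot itself; the following Python stand-in, whose tolerance parameter tol is an assumption introduced here for illustration rather than a rule from this study, mimics that judgment by keeping components up to the point where successive eigenvalue drops level off.

import numpy as np

def scree_elbow(eigenvalues, tol=0.05):
    # Sort eigenvalues from largest to smallest, as on a scree plot.
    l = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    drops = -np.diff(l)                      # decreases between consecutive eigenvalues
    # The elbow: first place where the drop becomes negligible relative to the biggest drop.
    small = np.where(drops < tol * drops.max())[0]
    return int(small[0]) + 1 if small.size else len(l)

# Example: one dominant direction followed by near-equal noise eigenvalues.
print(scree_elbow([9.1, 1.2, 1.0, 0.9, 0.85]))   # prints 2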
Principal component plots can identify questionable observations and verify distributional assumptions (e.g., normality). Because PCs are linear combinations of the original variables, it is logical to anticipate them being close to the expected distribution. When the first few PCs are to be utilized as input data for future analyses, it is frequently required to check that they are approximately distributed as the theoretical distribution. The last principal components can aid in detecting suspect observations, since each observation can be expressed in terms of the complete set of eigenvectors, as in Equation 3.28.
x_j − x̄ = ((x_j − x̄)'û_1)û_1 + ((x_j − x̄)'û_2)û_2 + ⋯ + ((x_j − x̄)'û_p)û_p
        = ŷ_j1 û_1 + ŷ_j2 û_2 + ⋯ + ŷ_jp û_p    [Eq. 3-28]
As a result, the magnitudes of the last principal components affect how well the first few principal components fit the observations. The approximation in Equation 3.29(a) departs from x_j − x̄ by the remainder in Equation 3.29(b), and Equation 3.29(c) gives the square of the length of that remainder. Suspicious observations will frequently have at least one large coordinate among the last components contributing to this squared length.
ŷ_j1 û_1 + ŷ_j2 û_2 + ⋯ + ŷ_jq û_q    (a)
ŷ_j,q+1 û_(q+1) + ⋯ + ŷ_jp û_p    (b)    [Eq. 3-29]
ŷ_j,q+1² + ⋯ + ŷ_jp²    (c)
One can check the distributional assumption by creating scatter diagrams for pairs of the first few principal components. Additionally, Q-Q plots can be created using the sample values generated by each principal component.
An examination of PCs is more a means to an end than an end in itself. They are commonly used as
interim steps in much larger investigations. PCs, for example, might be used as inputs to a multiple
regression procedure. As a result, this section gives context for a multiple regression procedure
based on PCs.
Consider a response variable Y thought to be related to p predictor variables x_1, ⋯, x_p, and let r be the number of observations drawn from the population in question. The basic linear regression model implies that the variable Y consists not only of an expression that is linear in the predictors but also of an additive random error, as stated in Equation 3.30. The term "linear" refers to the fact that the expression for Y is a linear function of the p + 1 unknown parameters β_0, β_1, ⋯, β_p. The behavior of the error ε is characterized by the assumptions grouped in Equation 3.31.
Y_1 = β_0 + β_1 x_11 + ⋯ + β_p x_1p + ε_1
Y_2 = β_0 + β_1 x_21 + ⋯ + β_p x_2p + ε_2    [Eq. 3-30]
⋮
Y_r = β_0 + β_1 x_r1 + ⋯ + β_p x_rp + ε_r
where the error values are presumed to have the following characteristics, grouped in Equation 3.31:
E(ε_j) = 0, ∀ j = 1, ⋯, r    (a)
Var(ε_j) = σ² (constant), ∀ j = 1, ⋯, r    (b)    [Eq. 3-31]
Cov(ε_j, ε_k) = 0, ∀ j ≠ k    (c)
Expressed in matrix notation, Equation 3.30 becomes Equation 3.32:
⎡Y_1⎤   ⎡1  x_11  x_12  ⋯  x_1p⎤ ⎡β_0⎤   ⎡ε_1⎤
⎢Y_2⎥ = ⎢1  x_21  x_22  ⋯  x_2p⎥ ⎢β_1⎥ + ⎢ε_2⎥    [Eq. 3-32]
⎢ ⋮ ⎥   ⎢⋮    ⋮     ⋮    ⋱   ⋮  ⎥ ⎢ ⋮ ⎥   ⎢ ⋮ ⎥
⎣Y_r⎦   ⎣1  x_r1  x_r2  ⋯  x_rp⎦ ⎣β_p⎦   ⎣ε_r⎦
where β is the vector of unknown parameters and the multiplier of the constant term β_0 is the column of 1s in the first column of the design matrix X. It should be noted that the assumptions on the error term specified in Equation 3.31 are not by themselves sufficient for confidence statements and hypothesis testing.
Many research goals can be achieved with regression analysis. Developing an equation to help
determine the projected response given the values of the predictor variables is one of them. As a
result, it is critical to fit the model in Equation 3.30 to the observed values 𝐲 corresponding with
the known measurements 1, x_j1, x_j2, ⋯, x_jp. For example, estimating the values of the regression coefficients β and the error variance σ² consistent with the available data will aid in achieving the regression analysis target. The least-squares approach entails choosing trial values b for the regression coefficients β in such a way that they minimize the sum of squared differences S(b) in Equation 3.33:
S(b) = Σ_{j=1}^r (y_j − b_0 − b_1 x_j1 − ⋯ − b_p x_jp)² = (y − Xb)'(y − Xb)    [Eq. 3-33]
where y is the vector of observed responses and X the design matrix.
Since the least squares criterion selects the coefficients b, they are denoted as the least squares estimates of the regression coefficients β. To highlight their role as estimates, they are often written as β̂. As defined, they are consistent with the data in that they produce the estimated (fitted) mean responses. With the least squares estimates β̂, the corresponding deviations are called residuals:
ε̂_j = y_j − β̂_0 − β̂_1 x_j1 − ⋯ − β̂_p x_jp,    ∀ j = 1, 2, ⋯, r
or    ε̂ = y − Xβ̂    [Eq. 3-34]
While this section aims to provide the background knowledge necessary to connect linear
regression to principal components, the literature contains much information. The elements
required to fit a model to data using least squares estimation are illustrated below.
Figure 3.5: Elements of Fitting a Model to Data Using Least Square Estimates
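A minimal Python sketch of this least squares machinery; the design, coefficients, and noise level below are arbitrary illustrative assumptions, not data from this study.

import numpy as np

rng = np.random.default_rng(2)
r, p = 30, 3
X = np.column_stack([np.ones(r), rng.normal(size=(r, p))])   # design matrix, 1s column first
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                  # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=r)            # Eq. 3-30 with random errors

# Least squares estimates b minimize (y - Xb)'(y - Xb)  (Eq. 3-33)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat                                 # Eq. 3-34
print(beta_hat, residuals @ residuals)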
As previously stated, linear regression analysis may represent the ultimate application of PCA. This section provides some context and significant, widely utilized results for regression analysis using PCs. The following result is worth including because it establishes the process for determining the linear regression model's unknowns. Jolliffe (2002) assumes that X consists of n observations on p predictor variables measured about their means and that the associated regression equation is given by Equation 3.35,
y = Xβ + ε    [Eq. 3-35]
where y is the vector of n observations on the dependent variable, measured about the mean. Transforming the predictors to principal components yields coefficient estimators that have, sequentially, the smallest possible variances if B = A, the matrix whose kth column is the kth eigenvector of X'X, and hence the kth eigenvector of the sample covariance matrix S ∝ X'X.
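The following hedged Python sketch shows one way to operationalize this result: center the predictors, take the leading eigenvectors of the sample covariance matrix as the matrix A, regress on the resulting PC scores, and map the coefficients back. The function name and the choice of k are assumptions introduced for illustration.

import numpy as np

def pcr_coefficients(X, y, k):
    # Principal component regression sketch: regress y on the first k PCs of X.
    # X is assumed column-centered, as in Eq. 3-35.
    n = X.shape[0]
    S = X.T @ X / (n - 1)                   # sample covariance matrix
    lam, A = np.linalg.eigh(S)
    A_k = A[:, ::-1][:, :k]                 # eigenvectors of the k largest eigenvalues
    W = X @ A_k                             # PC scores (mutually uncorrelated columns)
    gamma, *_ = np.linalg.lstsq(W, y, rcond=None)
    return A_k @ gamma                      # back-transform to coefficients on X

# Usage: beta_k = pcr_coefficients(X - X.mean(axis=0), y - y.mean(), k=2)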
The residual vector can be estimated as part of the diagnostics established when verifying the distributional assumptions for a multivariate multiple regression model. Indeed, after fitting any model using any estimation approach, it is prudent to consider the following equality, depicted by Figure 3.6:
(observation vector) = (vector of predicted, i.e., estimated, values) + (residual vector)
For multivariate linear models, one can analyze the principal components derived from the residuals' covariance matrix, as provided in Equation 3.36, in the same way as those determined from the original observations:
Σ̂_ε = (1/(n − p)) (ε̂ − ε̄)'(ε̂ − ε̄)    [Eq. 3-36]
One should keep in mind that, because the residuals from a linear regression analysis are linearly constrained, their covariance matrix is not of full rank.
Principal component analysis is defined as a method of determining the principal components of a covariance matrix (S) or correlation matrix (R). A set of eigenvectors and eigenvalues defines the maximum-variance directions. The p-dimensionality is reduced when a few eigenvalues are substantially greater than the rest. In practice, the sample principal components will differ from their underlying population counterparts due to sampling variation, and their distributions are difficult to derive (Johnson and Wichern 2019, Johnstone 2001). While this manuscript cannot include all large-sample inference findings, the following are relevant to this investigation and are based on the concepts already introduced in the previous section.
The matrix describing the population covariance structure (variabilities or distances of data points from the origin) in the observations of a given data set is frequently unknown in the field of multivariate statistical analysis. To find this matrix, investigators use sphericity tests as inferential tools. First, they look for evidence of such a matrix by sampling a specific covariance matrix. Then, they test the matrix to ascertain whether the population matrix is proportional to the identity matrix. To put it another way, one would like to know whether there is any link between the population variables. Answering this question entails testing the null hypothesis of an identity covariance matrix under Gaussian assumptions. In this case, the null hypothesis is defined in Equation 3.37:
H_0: Σ = I
H_1: Σ ≠ I    [Eq. 3-37]
The hypothesis H_0 implies that Σ has a specific value in general. Another possibility is to test the following hypothesis: does a particular observed sample covariance matrix S correspond to the population matrix? Kendall and Stuart (1968) developed a sphericity hypothesis to address this challenge. The new hypothesis is formed by linearly transforming Σ into a unit matrix, which is always possible for a positive definite Σ. The question after this transformation is whether the new matrix C = Σ^(−1)S corresponds to the hypothetical matrix σ²I_p, where the scale σ² is unknown. The resulting test is referred to as the sphericity test (Bejan 2005). Equation 3.38 offers the test statistic under the multinormality assumption, following Mauchly's (1940) work, where n and p are the data dimensions:
l = det(C) / (tr(C)/p)^p    [Eq. 3-38]
The quantity −n log l has a chi-square (χ²) distribution with f = p(p + 1)/2 − 1 degrees of freedom.
Apart from the previous sphericity test, another critical test is derived from Johnstone's (2001) celebrated theorem. This section aims to establish this sphericity test for white Wishart matrices (Σ = σ²I_p). It is helpful to refer back to chapter 2 about the Wishart model W_p(n, Σ) and its assumptions. That model approximates a random process using p random variables observed n times; the sample's observations are indicated by X = (x_ij), where the x_ij are i.i.d. from N(0,1). For this setting, let A = X'X, and let l_1 ≥ ⋯ ≥ l_p denote the eigenvalues of A (a biased estimate of Σ, up to scale). Notice that, through the equality l_i = n ℓ_i, the eigenvalues l_i of A are associated with the eigenvalues ℓ_i of the unbiased estimator of Σ, that is, S = A/n.
Under these conditions, the null hypothesis H_0 asserts that there are no associations between the p variables, that is, Σ = σ²I_p. Under H_0, all population eigenvalues are equal to one. However, there is a spread in the sample eigenvalues l_i, as has been known for some time; this subject was covered in chapter 2. Nonetheless, following Tracy and Widom (1993, 1994), the null probability in Equation 3.39 can be estimated to determine whether "large" observed eigenvalues support rejecting the null hypothesis:
ℙ{ l_1 > t | H_0: A ~ W_p(n, σ²I_p) }    [Eq. 3-39]
Theorem 1:
ℙ{ (n ℓ_1 − μ_np)/σ_np ≤ x | H_0 } → F_1(x)    [Eq. 3-40]
where the limit is taken as n → ∞ and p → ∞ such that p/n → γ ∈ (0, ∞), and F_1 is the largest-eigenvalue distribution known as the Tracy-Widom limit law (see chapter 2). Under relaxed assumptions on n, p, and the entries of the data matrix X, this theorem has been extended to the mth greatest eigenvalue of the sample covariance matrix S, corresponding to a broader class of matrices. This means that, under H_0, the appropriately normalized largest eigenvalue follows approximately the Tracy-Widom TW_1(n, p) distribution. From this result, one may construct a sphericity test for covariance matrices in terms of their largest eigenvalues. Specifically, under the assumption of multinormality, one may consider testing the null hypothesis H_0 in the form Σ = σ²I_p.
Leaning on Saccenti and Timmerman (2017) will provide greater insight into this subject. Johnstone's theorem, which addresses the asymptotic distribution of the largest eigenvalue of random covariance matrices, is the main result illustrated here. Because it is a realistic approximation even for small sample sizes n and a small number of variables p, it can be used as a statistical test to determine the number of principal components of empirical data. As yet, there is no similar approach for standardized data (i.e., principal components based on correlations). Moreover, while Johnstone's theorem has recently been extended to the greatest eigenvalue of random correlation matrices, this asymptotic solution requires very large n and p to arrive at an acceptable approximation. An approximate solution for the first largest eigenvalue has been proposed, which appears to apply to smaller n and p. Still, no reasonable approach for verifying the number of principal components in the case of correlation matrices is available. To conclude this topic, Johnstone (2001) made the ad hoc proposal of also running PCA on standardized data under his theorem.
Nevertheless, given a significance level α, that is, the threshold of probability by which to reject the null hypothesis in a two-tailed test, approximate α · 100% significance values for the sphericity test can be read off the quantiles of F_1. The following is a quick illustration of the sphericity test based on the TW1 distribution.
To show the sphericity test under the Gaussian assumption, let Σ be a positive definite square matrix of size 10, constructed as a diagonal matrix whose elements are ten positive real values. The inverse Σ^(−1) of Σ is also a diagonal matrix, its elements created by inverting each element of Σ. Then, given Σ and the mean vector μ = 0, the MATLAB function "mvnrnd" can be used to create n = 40 observations of the p = 10 random variables selected from N(μ, Σ). The resulting sample yields S, which may then be used to calculate C = Σ^(−1)S, as well as the test statistics required to assess the null hypothesis H_0: C = σ²I_p. Under the Gaussian assumption, any of the sphericity tests performed on C, using the p-value computed from l_1, may be used to assess H_0's validity. The content for each test is provided in the subsections that follow.
One may verify H0 through the application of Equation 3.38. Using this equation, one can compute l (= 2.1869e) and then deduce the value of the test statistic −n log l = 58.3022. Independently, the degrees of freedom f can be calculated as f = p(p + 1)/2 − 1 = 54 and then used to find the 99% quantile of the χ² distribution with 54 degrees of freedom. That is, χ²₀.₉₉,₅₄ = 81.069, or simply written as P(χ² ≤ 81.069) = 0.99. Note that, given α and a degree of freedom (d.f.) f, one can either use a lookup table or the MATLAB function chi2inv to derive the corresponding quantile of the χ² distribution with mean f = 54. It is helpful to depict all the test results graphically, as in Figure 3.7. This graph helps to make conjectures about the test in question, which by design is a one-tailed test for which an extreme test statistic t such that P(χ² > t) < 0.01 would lead to rejection. From Figure 3.7, it is evident that the test statistic 58.3022 is not significant since it is less than the value of χ²₀.₉₉,₅₄. Therefore, with a confidence level of 99%, there is no reason for rejecting the hypothesis H0 that S is a sample covariance matrix drawn from the population with (known) true covariance matrix Σ.
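A short MATLAB sketch of this check, using the values reported above:

% Chi-square sphericity check with the values reported in the text.
n = 40; p = 10;
f = p*(p + 1)/2 - 1;        % degrees of freedom: f = 54
t = 58.3022;                % test statistic -n*log(l)
crit = chi2inv(0.99, f);    % 99% quantile, approximately 81.069
rejectH0 = t > crit;        % false here, so H0 is not rejected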
A similar verification of H0 may be carried out using the largest eigenvalue of C in a two-tailed test with the significance level α = 0.01. Let an eigenvalue decomposition of C be performed to calculate its eigenvalue vector eig(C), provided below, and to derive the largest eigenvalue l₁ = 2.0704:
eig(C) = (2.0704, 1.7830, 1.4632, 1.4036, 1.0915, 0.3626, 0.8385, 0.7326, 0.4706, 0.5742)
Then, substitute n and p by their respective values into Equation 2.41 to compute the centering and scaling constants μ and σ; replacing each expression by its value in (n·l₁ − μ)/σ, one should find a p-value of 0.6699. Since the test statistic (n·l₁ − μ)/σ is known to have a Tracy-Widom F1 distribution, one may construct a graph similar to Figure 3.7 to show all the results graphically. To determine the corresponding quantiles of the Tracy-Widom F1 at 0.5% and 99.5%, one may either derive them from a lookup table such as Table C.1 proposed by Bejan (2005) or compute them using Dieng's (2005) MATLAB codes.
Figure 3.8 provides all the information necessary to make inferences about H0. This figure shows that the value of 0.6699 does not fall in either rejection region located at the tails for the 1% significance level. Therefore, the null hypothesis H0 can be accepted at the confidence level of 99%.
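As a sketch, and assuming Johnstone's (2001) standard centering and scaling constants in place of Equation 2.41 (which is not reproduced here), the computation is:

% Tracy-Widom check for the largest eigenvalue (Johnstone-2001 constants
% assumed for the centering mu and scaling sg of Equation 2.41).
n = 40; p = 10;
l1 = 2.0704;                                   % largest eigenvalue of C
mu = (sqrt(n-1) + sqrt(p))^2;                  % centering constant
sg = (sqrt(n-1) + sqrt(p)) * (1/sqrt(n-1) + 1/sqrt(p))^(1/3);   % scaling constant
s  = (n*l1 - mu) / sg;                         % approximately TW1-distributed
% Compare s with the TW1 quantiles at 0.5% and 99.5% (Bejan 2005, Table C.1).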
The following describes the spiked model, an empirical derivation by Johnstone (2001). Similar to the covariance matrices with unique structures presented in Section 3.3.3, this model is much sought after in PCA. In practice, there are frequently one or a few significant eigenvalues that are clearly distinguished from the rest of the data. This begs the question of whether, if there were only one or a small number of non-unit eigenvalues in the population, they would pull up the other values. Consider, for example, the "spiked" covariance model in Equation 3.41, with a fixed number r of eigenvalues greater than one.
Σ = diag(λ₁, ⋯, λᵣ, 1, ⋯, 1)   [Eq. 3-41]
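For concreteness, such a matrix is built in one MATLAB line; r = 2 and the λ values below are illustrative assumptions:

% Spiked covariance model of Eq. 3-41 with illustrative values.
p = 10; r = 2; lambda = [4, 2.5];               % r spikes above the unit bulk
Sigma_spiked = diag([lambda, ones(1, p - r)]);  % diag(lambda_1,...,lambda_r,1,...,1)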
For this model, the author introduced the notation ℒ(lₖ | n, p, Σ) for the distribution of the kth largest eigenvalue of the sample covariance matrix A = XᵀX, where the n × p matrix X is derived from the model. When examining the Tracy-Widom test for the first component, one must consider that Baik and Silverstein (2006) established a detection limit. Equation 3.42 below gives the λ threshold,
λ = 1 + √(p/n)   [Eq. 3-42]
where n is the sample size and p is the number of variables. If the first population eigenvalue is less than this threshold, the sample eigenvalues are Tracy-Widom distributed, and hence the first component cannot be distinguished from noise (Baik et al. 2005). As seen in Figure 3.9 below, a so-called phase transition occurs when the sample eigenvalues are above the threshold (Saccenti and Timmerman 2017).
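Evaluating the threshold is immediate; for instance, for a 49 × 32 data matrix such as that of Table 3-2:

% Detection threshold of Eq. 3-42 for a 49 x 32 data matrix (cf. Table 3-2).
n = 49; p = 32;
lambda_crit = 1 + sqrt(p/n);   % approx. 1.808; smaller spikes blend into the noise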
Figure 3.9: Illustrations of Transition between Two Distinct Phases–Strong and Weak
Courtesy of Majumdar and Schehr (2014)
As shown above, a phase transition separates the TW distribution's left and right tails and identifies a key zone of crossover. Over this critical zone, the system transitions from the weak-coupling (stable) to the strong-coupling (unstable) phase. It is analogous to the third-order stable-unstable transition in May's model. At finite N, the Tracy-Widom distribution precisely reflects the crossover behavior of the free energy from one phase to the other (Majumdar and Schehr 2014). The concept of phase transition is linked to the work of May (1972) on probing the stability of large complex ecosystems, which is the first known direct application of the statistics of the largest eigenvalue of covariance matrices. Through his work, May (1972) found that these systems, connected at random, are stable until they reach some critical level of connectance (the proportion of realized links in the network). Project network schedules are like complex systems. The intricacy and pairwise associations (thousands in large projects) between activities making up a project network schedule may help explain this similarity.
The objectives of future studies are as follows. Examine the PCA literature for the methods used to select the number of principal components, since PCA is a means to an end, and review the PCA literature for regression analysis. Select a few networks varying in size and complexity from the benchmark schedules to apply the suitable PCA algorithms discovered during the literature study and essential for data reduction; this objective uses the available algorithms developed to transform a project network into a matrix. Summarize and interpret findings, and infer the number of components required for each project network based on their sizes, complexity, and other characteristics (e.g., statistical features of the data). Conduct extensive simulations of test schedules to develop a regression model to predict project activity durations and delays by selecting features that capture the maximum variance in the data. Relate the phase transition phenomenon in project schedules to the one observed in the minuscule margins of the Tracy-Widom distribution to identify similarities, derive the phase transition zone formula or location, and devise a method for constructing resilient project schedules.
This part outlines the methods used to accomplish the Chapter's objectives. The methodology is devised on the preceding Chapter's findings, which examined the fundamental behavior of construction project network schedules; in other words, it uses the previous Chapter's discoveries to fulfill the current Chapter's objective. Additionally, the results of the substantial literature research undertaken as part of this Chapter's objectives are critical in developing the approach. For clarity, the methodology is divided into three distinct sections. The first establishes a set of prerequisites for its application. The second analyzes the data acquired from the literature research to identify the PCA approaches used in this Chapter. The third utilizes the approaches described in the literature review to construct the procedures for acquiring all the results required for project network schedule analysis.
3.7.1 Assumptions
The following conditions must be met to apply the methods presented. First, the mathematical model presented in Table 2-19 for project network schedules applies to the prospective project network schedule considered for PCA. Second, given the data matrix derived from the population of i.i.d. durations of the project network activities in question, there exists an optimum sample size n_opt at significance level α (see Section 2.4.6). Third, this optimal sample size ensures sufficient correlation in the data to apply the universal results required for hypothesis testing. Finally, according to the conclusions of Chapter 2, when the sample covariance matrix obtained from the project network is normalized, at least one of the largest eigenvalues follows the Tracy-Widom limit law of order 1.
The study of the information gained by reviewing the literature on principal component analysis for data reduction has yielded critical information required for developing the procedures of this methodology. First, a literature survey on selecting the number of statistically significant principal components provided criteria that can serve this purpose. Following a thorough examination of PCA applications in the construction and engineering fields, the rules based on the cumulative percentage of the total variance, scree plots, and hypothesis testing are appropriate for this study. Second, PCA based on correlation matrices instead of covariance matrices will be applicable for project network scheduling data to determine which principal components to keep. The reason is that the proposed methodology in Section 2.4.6 requires data to be standardized before use; it suggests two normalization procedures, Norm I and Norm II, and according to Chapter 2, Norm II produced more substantial results than Norm I. Third, because project network schedules are not quite normally distributed, and because Chapter 2 revealed the presence of a covariance structure in project network schedules when the optimum sample size n_opt was used, the correlation-based approach is appropriate. Fourth, using the discovered significant principal components, a linear regression model can be developed using the methods described in Section 3.4.3. Finally, the phase transition location in project network schedules can be established. These observations summarize the entire section.
3.7.3 Procedure for Conducting a PCA for Construction Project Network Schedules
The following step-by-step procedure is based on the preceding section's analysis of the literature review.
First step: choose a normalization method for the largest eigenvalues of the sample covariance matrices produced from project network scheduling data. Although correlation matrices (R) will be employed, this phase is critical for obtaining further data from the simulation runs. As a result, Norm II is selected, in line with the findings of Chapter 2.
Second step: identify the prospective networks for PCA. Since Chapter 2 already identified a handful of networks whose analysis yielded significant results, selecting a set from those networks will be helpful for the analysis conducted in Chapter 3. Because of the tremendous time and effort needed to run and collect data from simulations, the following networks will serve for the analysis: j3037-6, j6028-9, j902-4, and j12014-1 (see Table 2-32 for details).
Third step: simulate a project network schedule to create a data matrix X with n = n_opt. Then, use the created sample data matrix to derive the standardized matrix W, from which the correlation matrix R is determined. Finally, based on the significance level α used to find n_opt, determine the sample covariance matrix S. Chapter 2 provides all the necessary formulas.
Fourth step: calculate the necessary eigenvalues for the population correlation and sample
covariance matrices. R represents the population correlation matrix (see Johnstone 2001, p. 304).
Fifth step: construct the scree plots using the eigenvalues of the correlation and sample covariance matrices. The eigenvalues are ordered decreasingly and plotted against their rank numbers.
Sixth step: analyze the results and make recommendations for further applications.
Seventh step: devise a linear regression model using the method proposed in Section 3.4, based on the retained principal components.
Eighth step: localize the phase transition based on the expression of Equation 3.42.
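A condensed MATLAB sketch of Steps 3 through 5 follows; simulate_network is a hypothetical helper standing in for the Chapter 2 simulation machinery, and n_opt is the optimum sample size of Section 2.4.6:

% Sketch of Steps 3-5 (simulate_network is a hypothetical stand-in that
% returns the n x p matrix X of early-finish times for a given network).
X = simulate_network('j3037-6', n_opt);   % Step 3: sample data matrix
W = zscore(X);                            % standardized matrix W
R = corr(X);                              % correlation matrix R
S = cov(X);                               % sample covariance matrix S
eR = sort(eig(R), 'descend');             % Step 4: eigenvalues of R
eS = sort(eig(S), 'descend');             %         and of S
plot(eR, '-o'); hold on; plot(eS, '-s');  % Step 5: scree plots by rank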
The final step of the approach provided in the preceding section completes the methodology. Following the proposed technique to satisfy the objectives of Chapter 3, the results of the simulations of project network schedules required to undertake the analysis of principal components are as follows. The networks described in Table 3-1 aided in the investigation. The greyed cells represent the level of significance of the Kolmogorov-Smirnov test; the significance level of 0.05 was found to be the most appropriate for validating the distributional assumption on the probabilistic durations.
This section examines the outcomes of the project network schedule replicas. The simulations aided in calculating the eigenvalues of the sample covariance matrix S and the population correlation matrix R derived from W. Both matrices helped decide the number of principal components to retain for each project network identified in Table 3-1. Figure 3.10 through Figure 3.13 show two scree plots derived with the eigenvalues of a correlation matrix (left panel) or a covariance matrix (right panel) for each network. Some networks feature four plots, whereas others have only two; this number was determined from Table 3-1. For example, the K-S test accepted the distributional assumption that the limiting distribution of the third (resp. fourth) greatest eigenvalue of the matrix R or S, computed from the sample data matrix X of size 52 (resp. 49), is TW of order 1. Each figure shows that a small number of the eigenvalues stand out more than the others; in other words, two or a few large sample eigenvalues clearly distinguish themselves from the rest. This is consistent with the spiked covariance model described earlier.
Figure 3.10 through Figure 3.13: Scree Plots of the Eigenvalues (by Rank) for Each Network; Correlation Matrix (Left Panel) and Covariance Matrix (Right Panel)
Table 3-2 through Table 3-6 present information obtained from the eigenvalues of the population correlation (on the left) and covariance (on the right) matrices and the scree plots. Each table provides the ranks of the eigenvalues in the Number column, the eigenvalues in the Eigenv. column, the differences between consecutive eigenvalues in the Differ column, the proportions of the eigenvalues in the Prop column, and the cumulative sum of the proportions in the Cumm column. The cutoff criterion for selection is 80%, and partial tables demonstrate the eigenvalues chosen as principal components for each network (e.g., 6 to 8 for the j30 and j60 networks).
Table 3-2: Principal Components of the Project Network j3037-6 (Size 49 x 32)
Table 3-3: Principal Components of the Project Network j3037-6 (Size 52 x 32)
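The 80% cutoff behind the Prop and Cumm columns reduces to a short computation; a sketch, reusing the correlation matrix R from the procedure above:

% Sketch of the 80% cumulative-variance rule behind the Prop/Cumm columns.
ev   = sort(eig(R), 'descend');   % ranked eigenvalues (Number/Eigenv. columns)
prop = ev / sum(ev);              % Prop column
cumm = cumsum(prop);              % Cumm column
nPC  = find(cumm >= 0.80, 1);     % number of principal components retained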
The analysis based on the TW p-values was conducted to supplement the complementary analyses based on the scree plots and the summary of the variability in the eigenvalues. As expressed by Equation 3.40, the double-sided hypothesis test allows each value in Table 3-2 through Table 3-6 to be tested at the significance level of 0.05 based on the Tracy-Widom p-values.
Quantiles of the TW distribution for the first four eigenvalues:

Probability   1st Eig.   2nd Eig.   3rd Eig.   4th Eig.
0.005         -4.1490    -5.7302    -7.0585    -8.2483
0.025         -3.5165    -5.1769    -6.5488    -7.7685
0.05          -3.1816    -4.8876    -6.2822    -7.5199
0.1           -2.7830    -4.5467    -5.9727    -7.2283
0.3           -1.9116    -3.8156    -5.3108    -6.6121
0.5           -1.2694    -3.2911    -4.8401    -6.1753
0.7           -0.5924    -2.7509    -4.3602    -5.7318
0.8           -0.1662    -2.4160    -4.0650    -5.4600
0.9            0.4495    -1.9431    -3.6489    -5.0793
0.95           0.9789    -1.5422    -3.3004    -4.7613
0.975          1.4530    -1.1893    -2.9948    -4.4829
0.99           2.0232    -0.7703    -2.6346    -4.1561
0.995          2.4217    -0.4810    -2.3875    -3.9317
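Using the tabulated quantiles, the double-sided check of Equation 3.40 for the first eigenvalue at α = 0.05 can be sketched as follows (s is the normalized statistic from the earlier Tracy-Widom sketch):

% Two-sided test at alpha = 0.05 for the 1st eigenvalue using the table above.
q_lo = -3.5165;  q_hi = 1.4530;       % 2.5% and 97.5% quantiles, 1st eigenvalue
rejectH0 = (s < q_lo) || (s > q_hi);  % reject when s falls in either tail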
It is worth noting that Equation 3.40 only describes the test for the first largest eigenvalue under the Norm I normalization approach. Adjusting the hypothesis test statement for the successive eigenvalues and for the normalization procedure Norm II is necessary. Saccenti and Timmerman's (2017) publication contains valuable information for redefining Equation 3.40 for both normalization approaches and for the subsequent eigenvalues. For TW p-values beyond the fourth eigenvalue, the numerical machinery of Bornemann (2009, 2010) is an excellent tool for those familiar with MATLAB.
As discussed above, a first population eigenvalue below the threshold of Equation 3.42 cannot be distinguished from noise, and a phase transition occurs above it; the computed threshold values appear in Table 3-8.
Table 3-8: Threshold Value for a Phase Transition (Baik et al. 2005)
The eigenvalues of correlation and covariance matrices obtained from a few benchmark project network schedules provided greater insight into the limiting behaviors of the probabilistic durations of project activities begun in Chapter 2. The scree plots and the proportions of the total population (sample) variance attributed to each variable (component) aided in deciding the number of principal components to retain for each network considered for the experiment. Unless further research indicates otherwise, the number of retained principal components appears to be related to the size of project networks. Furthermore, the analysis indicated that Johnstone's (2001) spiked covariance model is a viable candidate for identifying the limiting durations of project network activities. The spiked model, an empirical derivation of Johnstone (2001), is a covariance matrix with a specific structure used to characterize the behavior of a system having one or more prominent eigenvalues that are easily differentiated from the rest of the data. Furthermore, the principal component analysis based on hypothesis testing with TW p-values had limitations that prevented the specified null hypothesis from being adequately evaluated. Finally, the calculated threshold value indicated the occurrence of phase transitions in the project network schedules.
The contributions are as follows. Develop a linear regression model using Johnstone's (2001) spiked covariance model; this model is expected to help predict the total project duration by predicting the limiting duration of each activity on the project network and should be a powerful tool for project managers, especially for budgeting and tracking projects. Propose a method for determining the principal components based on the correlation or covariance matrices derived from project network schedules governed by the universal law of the Tracy-Widom distribution. Propose a process for determining the number of principal components to include in a predictive model that can be used as a prelude to PCA-based models in construction project schedules. Provide guidance on including the phase transition in construction project networks under the TW phase-transition framework; the location of this phase transition is expected to assist practitioners in identifying the moment at which a construction project schedule may reach a tipping point.
The findings of the experiments with a few project networks show that more research is needed to analyze the principal components of project networks of various sizes and complexities. This will aid in the development of guidelines for constructing PCA-based models for project networks. In addition, the spiked model and the phase transition are subjects that need further exploration.
3.10 Conclusion
The proposed methodology, which necessitated lengthy simulations, resulted in discovering a new model based on Johnstone's (2001) spiked covariance matrix. This model can be used to forecast the limiting durations of project activities through a PCA-based linear regression approach. Additionally, this study established the position of a phase transition in project network schedules using the Tracy-Widom distribution's universality. A phase transition identifies a key zone of crossover: over this critical zone, the system transitions from the weak-coupling (stable) to the strong-coupling (unstable) phase. This discovery is critical because it is likely to assist practitioners in identifying the moment at which a construction project schedule may become unstable.
Allen, D. M. (1974). "The relationship between variable selection and data augmentation and a
method for prediction." Technometrics, 16(1), 125-127.
Al-Sabah, R., Menassa, C. C., and Hanna, A. (2014). "Evaluating impact of construction risks in
the Arabian Gulf Region from perspective of multinational architecture, engineering and
construction firms." Constr.Manage.Econ., 32(4), 382-402.
Anderson, T. W. (2003). An introduction to multivariate statistical analysis. Wiley-Interscience, Hoboken, N.J.
Arabzadeh, R., Kholoosi, M. M., and Bazrafshan, J. (2016). "Regional hydrological drought
monitoring using principal components analysis." J.Irrig.Drain.Eng., 142(1), 04015029.
Baik, J., Arous, G. B., and Péché, S. (2005). "Phase transition of the largest eigenvalue for
nonnull complex sample covariance matrices." The Annals of Probability, 33(5), 1643-1697.
Baik, J., Deift, P., and Johansson, K. (1999). "On the distribution of the length of the longest
increasing subsequence of random permutations." Journal of the American Mathematical
Society, 12(4), 1119-1178.
Baik, J., and Silverstein, J. W. (2006). "Eigenvalues of large sample covariance matrices of
spiked population models." Journal of Multivariate Analysis, 97(6), 1382-1408.
Bartlett, M. S. (1950). "Tests of significance in factor analysis." Br. J. Psychol.
Bejan, A. (2005). "Largest eigenvalues and sample covariance matrices. Tracy-Widom and Painlevé II: computational aspects and realization in S-PLUS with applications." Preprint: http://www.vitrum.md/andrew/MScWrwck/TWinSplus.pdf.
Bianchini, A. (2014). "Pavement Maintenance Planning at the Network Level with Principal
Component Analysis." J Infrastruct Syst, 20(2), 4013013.
Bornemann, F. (2010). "On the numerical evaluation of Fredholm determinants." Mathematics of
Computation, 79(270), 871-915.
Cangelosi, R., and Goriely, A. (2007). "Component retention in principal component analysis
with application to cDNA microarray data." Biology Direct, 2(1), 1-21.
Cattell, R. B. (1966). "The scree test for the number of factors." Multivariate Behavioral Research, 1(2), 245-276.
Chapter Summary
The introductory chapter of this study provided the framework for the subsequent work. In addition, its framework may serve as a model for others, particularly for subjects being applied for the first time in a particular discipline. The key is to select and present pertinent background material that adequately prepares readers and researchers for the task ahead.
Schedules (Chapter 2)
The extensive empirical analysis, developed by adopting and adapting proven procedures from other fields of application of the TW limiting laws, led to achieving the chapter's objectives, which were defined based on the scope of its examination. The conclusions and recommendations for this chapter are as follows. (1) Add construction project management and engineering to the list of domains where the Tracy-Widom limit laws based on Random Matrix Theory (RMT) have successfully examined large-dimensionality complex systems; (2) Propose a mathematical model for project network schedules based on well-established results in probability and statistics and project-scheduling approaches, which can serve to investigate their behavior and improve existing scheduling techniques; (3) Based on the newly established pattern for project network schedules, it is possible to evaluate predictions regarding the durations of project network activities and the entire project; (4) Create a methodology based on multivariate statistical and graphical data analysis methodologies, which assisted in demonstrating that the Tracy-Widom limit law of order 1 governs the joint sampling distribution of project activity durations, and which may be used to forecast the project's limiting duration and the limiting duration of each activity comprising the project network schedule, beyond which any delay will be irreversible; (5) Initiate a research study to investigate the relationships between a measure of project network complexity and the sample size required to draw statistically valid conclusions.
4.1.3 Application of PCA for Data Reduction in Modeling Project Network Schedules
To reduce data using PCA, eigenvalues and eigenvectors serve this purpose. The eigenvalues of correlation and covariance matrices derived from a few benchmark project network schedules shed additional light on the limiting behaviors of the probabilistic durations of project activities discussed in Chapter 2. The scree plots and proportions of the total variance were used to decide how many PCs to keep for each network in the experiment. Unless further research is done, the number of preserved PCs appears to be related to project network size. The investigation also revealed that Johnstone's (2001) spiked covariance model could be used to detect project network activity time limits. It is a covariance matrix with a specific structure that is used to characterize the behavior of a system with one or more significant eigenvalues that stand out from the rest of the data. Moreover, the principal component analysis using TW p-values had a few restrictions that hindered evaluating the null hypothesis. Finally, the determined threshold value reflects the occurrence of phase transitions in project network schedules.
The contributions are as follows: (1) Develop a linear regression model using Johnstone's (2001) spiked covariance model. This model is expected to help predict the total project duration by predicting the limiting duration of each activity on the project network. This model should be a powerful tool for project managers, especially for budgeting and tracking projects; (2) Propose a method for determining the principal components based on the correlation or covariance matrices derived from project network schedules governed by the universal law of the Tracy-Widom distribution; (3) Propose a process for determining the number of principal components to include in a predictive model that can be used as a prelude to PCA-based models in construction project schedules; (4) Provide guidance on how to include the phase transition into construction project networks under the TW phase-transition framework.
The recommendations resulting from Chapter 2 for future research are as follows: (1) While the current study used only project network schedules from the Project Scheduling Problem Library (PSPLIB), whose maximum size is 120, future research should extend the analysis to include larger network schedules from fictitious and real-life projects; (2) Because this analysis revealed no correlation between restrictiveness (RT) and the number of samples required to satisfy the distributional assumption, future studies should consider alternative complexity metrics, such as the other five identified by this study; (3) While the normalization approach utilized in this study was specified as a function of n and p using Johnstone's (2001) celebrated theorem, especially its ad hoc version, future studies might explore employing a more extended formulation of the centering and scaling functions, such as that of Péché (2008), to validate distributional assumptions in circumstances where the sample covariance matrix's first or second largest eigenvalue and a supercomputer are available for speedy simulations; (4) Future research may include extending the study to at least the fifth and sixth greatest eigenvalues when employing the normalization method derived from the work of Johansson (1998) and Baik et al. (1999); published tables of cumulative distribution (CDF) statistics up to the sixth greatest eigenvalue allow for this expansion.
4.2.2 Application of PCA for Data Reduction in Modeling Project Network Schedules
The findings of the experiments with a few project networks show that more research is needed to
analyze the principal components of project networks of various sizes and complexities. This will
aid in the development of guidelines for constructing PCA-based models for project networks. In
addition, the spiked model and the phase transition are additional subjects that need further exploration.
4.2.3 Conclusion
This final chapter concludes the inquiry undertaken in this research study. The work yielded fascinating discoveries and insights that will hopefully assist in enhancing project scheduling procedures and equipping project managers with the modern techniques necessary to sustain and improve project performance.
J30 Networks
No. File Name No. File Name No. File Name No. File Name
1 j3010_1.sm 44 j3014_3.sm 87 j3018_6.sm 130 j3021_9.sm
2 j3010_10.sm 45 j3014_4.sm 88 j3018_7.sm 131 j3022_1.sm
3 j3010_2.sm 46 j3014_5.sm 89 j3018_8.sm 132 j3022_10.sm
4 j3010_3.sm 47 j3014_6.sm 90 j3018_9.sm 133 j3022_2.sm
5 j3010_4.sm 48 j3014_7.sm 91 j3019_1.sm 134 j3022_3.sm
6 j3010_5.sm 49 j3014_8.sm 92 j3019_10.sm 135 j3022_4.sm
7 j3010_6.sm 50 j3014_9.sm 93 j3019_2.sm 136 j3022_5.sm
8 j3010_7.sm 51 j3015_1.sm 94 j3019_3.sm 137 j3022_6.sm
9 j3010_8.sm 52 j3015_10.sm 95 j3019_4.sm 138 j3022_7.sm
10 j3010_9.sm 53 j3015_2.sm 96 j3019_5.sm 139 j3022_8.sm
11 j3011_1.sm 54 j3015_3.sm 97 j3019_6.sm 140 j3022_9.sm
12 j3011_10.sm 55 j3015_4.sm 98 j3019_7.sm 141 j3023_1.sm
13 j3011_2.sm 56 j3015_5.sm 99 j3019_8.sm 142 j3023_10.sm
14 j3011_3.sm 57 j3015_6.sm 100 j3019_9.sm 143 j3023_2.sm
15 j3011_4.sm 58 j3015_7.sm 101 j301_1.sm 144 j3023_3.sm
16 j3011_5.sm 59 j3015_8.sm 102 j301_10.sm 145 j3023_4.sm
17 j3011_6.sm 60 j3015_9.sm 103 j301_2.sm 146 j3023_5.sm
18 j3011_7.sm 61 j3016_1.sm 104 j301_3.sm 147 j3023_6.sm
19 j3011_8.sm 62 j3016_10.sm 105 j301_4.sm 148 j3023_7.sm
20 j3011_9.sm 63 j3016_2.sm 106 j301_5.sm 149 j3023_8.sm
21 j3012_1.sm 64 j3016_3.sm 107 j301_6.sm 150 j3023_9.sm
22 j3012_10.sm 65 j3016_4.sm 108 j301_7.sm 151 j3024_1.sm
23 j3012_2.sm 66 j3016_5.sm 109 j301_8.sm 152 j3024_10.sm
24 j3012_3.sm 67 j3016_6.sm 110 j301_9.sm 153 j3024_2.sm
25 j3012_4.sm 68 j3016_7.sm 111 j3020_1.sm 154 j3024_3.sm
26 j3012_5.sm 69 j3016_8.sm 112 j3020_10.sm 155 j3024_4.sm
27 j3012_6.sm 70 j3016_9.sm 113 j3020_2.sm 156 j3024_5.sm
28 j3012_7.sm 71 j3017_1.sm 114 j3020_3.sm 157 j3024_6.sm
29 j3012_8.sm 72 j3017_10.sm 115 j3020_4.sm 158 j3024_7.sm
30 j3012_9.sm 73 j3017_2.sm 116 j3020_5.sm 159 j3024_8.sm
31 j3013_1.sm 74 j3017_3.sm 117 j3020_6.sm 160 j3024_9.sm
32 j3013_10.sm 75 j3017_4.sm 118 j3020_7.sm 161 j3025_1.sm
33 j3013_2.sm 76 j3017_5.sm 119 j3020_8.sm 162 j3025_10.sm
34 j3013_3.sm 77 j3017_6.sm 120 j3020_9.sm 163 j3025_2.sm
35 j3013_4.sm 78 j3017_7.sm 121 j3021_1.sm 164 j3025_3.sm
36 j3013_5.sm 79 j3017_8.sm 122 j3021_10.sm 165 j3025_4.sm
37 j3013_6.sm 80 j3017_9.sm 123 j3021_2.sm 166 j3025_5.sm
38 j3013_7.sm 81 j3018_1.sm 124 j3021_3.sm 167 j3025_6.sm
39 j3013_8.sm 82 j3018_10.sm 125 j3021_4.sm 168 j3025_7.sm
J60 Networks
No. File Name No. File Name No. File Name No. File Name
1 j6010_1.sm 46 j6014_5.sm 91 j6019_1.sm 136 j6022_5.sm
2 j6010_10.sm 47 j6014_6.sm 92 j6019_10.sm 137 j6022_6.sm
3 j6010_2.sm 48 j6014_7.sm 93 j6019_2.sm 138 j6022_7.sm
4 j6010_3.sm 49 j6014_8.sm 94 j6019_3.sm 139 j6022_8.sm
5 j6010_4.sm 50 j6014_9.sm 95 j6019_4.sm 140 j6022_9.sm
6 j6010_5.sm 51 j6015_1.sm 96 j6019_5.sm 141 j6023_1.sm
7 j6010_6.sm 52 j6015_10.sm 97 j6019_6.sm 142 j6023_10.sm
8 j6010_7.sm 53 j6015_2.sm 98 j6019_7.sm 143 j6023_2.sm
9 j6010_8.sm 54 j6015_3.sm 99 j6019_8.sm 144 j6023_3.sm
10 j6010_9.sm 55 j6015_4.sm 100 j6019_9.sm 145 j6023_4.sm
11 j6011_1.sm 56 j6015_5.sm 101 j601_1.sm 146 j6023_5.sm
12 j6011_10.sm 57 j6015_6.sm 102 j601_10.sm 147 j6023_6.sm
13 j6011_2.sm 58 j6015_7.sm 103 j601_2.sm 148 j6023_7.sm
14 j6011_3.sm 59 j6015_8.sm 104 j601_3.sm 149 j6023_8.sm
15 j6011_4.sm 60 j6015_9.sm 105 j601_4.sm 150 j6023_9.sm
16 j6011_5.sm 61 j6016_1.sm 106 j601_5.sm 151 j6024_1.sm
17 j6011_6.sm 62 j6016_10.sm 107 j601_6.sm 152 j6024_10.sm
18 j6011_7.sm 63 j6016_2.sm 108 j601_7.sm 153 j6024_2.sm
19 j6011_8.sm 64 j6016_3.sm 109 j601_8.sm 154 j6024_3.sm
20 j6011_9.sm 65 j6016_4.sm 110 j601_9.sm 155 j6024_4.sm
21 j6012_1.sm 66 j6016_5.sm 111 j6020_1.sm 156 j6024_5.sm
22 j6012_10.sm 67 j6016_6.sm 112 j6020_10.sm 157 j6024_6.sm
23 j6012_2.sm 68 j6016_7.sm 113 j6020_2.sm 158 j6024_7.sm
24 j6012_3.sm 69 j6016_8.sm 114 j6020_3.sm 159 j6024_8.sm
25 j6012_4.sm 70 j6016_9.sm 115 j6020_4.sm 160 j6024_9.sm
26 j6012_5.sm 71 j6017_1.sm 116 j6020_5.sm 161 j6025_1.sm
27 j6012_6.sm 72 j6017_10.sm 117 j6020_6.sm 162 j6025_10.sm
28 j6012_7.sm 73 j6017_2.sm 118 j6020_7.sm 163 j6025_2.sm
29 j6012_8.sm 74 j6017_3.sm 119 j6020_8.sm 164 j6025_3.sm
30 j6012_9.sm 75 j6017_4.sm 120 j6020_9.sm 165 j6025_4.sm
31 j6013_1.sm 76 j6017_5.sm 121 j6021_1.sm 166 j6025_5.sm
32 j6013_10.sm 77 j6017_6.sm 122 j6021_10.sm 167 j6025_6.sm
33 j6013_2.sm 78 j6017_7.sm 123 j6021_2.sm 168 j6025_7.sm
34 j6013_3.sm 79 j6017_8.sm 124 j6021_3.sm 169 j6025_8.sm
35 j6013_4.sm 80 j6017_9.sm 125 j6021_4.sm 170 j6025_9.sm
36 j6013_5.sm 81 j6018_1.sm 126 j6021_5.sm 171 j6026_1.sm
37 j6013_6.sm 82 j6018_10.sm 127 j6021_6.sm 172 j6026_10.sm
38 j6013_7.sm 83 j6018_2.sm 128 j6021_7.sm 173 j6026_2.sm
39 j6013_8.sm 84 j6018_3.sm 129 j6021_8.sm 174 j6026_3.sm
40 j6013_9.sm 85 j6018_4.sm 130 j6021_9.sm 175 j6026_4.sm
41 j6014_1.sm 86 j6018_5.sm 131 j6022_1.sm 176 j6026_5.sm
42 j6014_10.sm 87 j6018_6.sm 132 j6022_10.sm 177 j6026_6.sm
43 j6014_2.sm 88 j6018_7.sm 133 j6022_2.sm 178 j6026_7.sm
44 j6014_3.sm 89 j6018_8.sm 134 j6022_3.sm 179 j6026_8.sm
45 j6014_4.sm 90 j6018_9.sm 135 j6022_4.sm 180 j6026_9.sm
No. File Name No. File Name No. File Name No. File Name
181 j6027_1.sm 226 j6030_5.sm 271 j6035_1.sm 316 j6039_5.sm
182 j6027_10.sm 227 j6030_6.sm 272 j6035_10.sm 317 j6039_6.sm
183 j6027_2.sm 228 j6030_7.sm 273 j6035_2.sm 318 j6039_7.sm
184 j6027_3.sm 229 j6030_8.sm 274 j6035_3.sm 319 j6039_8.sm
185 j6027_4.sm 230 j6030_9.sm 275 j6035_4.sm 320 j6039_9.sm
186 j6027_5.sm 231 j6031_1.sm 276 j6035_5.sm 321 j603_1.sm
187 j6027_6.sm 232 j6031_10.sm 277 j6035_6.sm 322 j603_10.sm
188 j6027_7.sm 233 j6031_2.sm 278 j6035_7.sm 323 j603_2.sm
189 j6027_8.sm 234 j6031_3.sm 279 j6035_8.sm 324 j603_3.sm
190 j6027_9.sm 235 j6031_4.sm 280 j6035_9.sm 325 j603_4.sm
191 j6028_1.sm 236 j6031_5.sm 281 j6036_1.sm 326 j603_5.sm
192 j6028_10.sm 237 j6031_6.sm 282 j6036_10.sm 327 j603_6.sm
193 j6028_2.sm 238 j6031_7.sm 283 j6036_2.sm 328 j603_7.sm
194 j6028_3.sm 239 j6031_8.sm 284 j6036_3.sm 329 j603_8.sm
195 j6028_4.sm 240 j6031_9.sm 285 j6036_4.sm 330 j603_9.sm
196 j6028_5.sm 241 j6032_1.sm 286 j6036_5.sm 331 j6040_1.sm
197 j6028_6.sm 242 j6032_10.sm 287 j6036_6.sm 332 j6040_10.sm
198 j6028_7.sm 243 j6032_2.sm 288 j6036_7.sm 333 j6040_2.sm
199 j6028_8.sm 244 j6032_3.sm 289 j6036_8.sm 334 j6040_3.sm
200 j6028_9.sm 245 j6032_4.sm 290 j6036_9.sm 335 j6040_4.sm
201 j6029_1.sm 246 j6032_5.sm 291 j6037_1.sm 336 j6040_5.sm
202 j6029_10.sm 247 j6032_6.sm 292 j6037_10.sm 337 j6040_6.sm
203 j6029_2.sm 248 j6032_7.sm 293 j6037_2.sm 338 j6040_7.sm
204 j6029_3.sm 249 j6032_8.sm 294 j6037_3.sm 339 j6040_8.sm
205 j6029_4.sm 250 j6032_9.sm 295 j6037_4.sm 340 j6040_9.sm
206 j6029_5.sm 251 j6033_1.sm 296 j6037_5.sm 341 j6041_1.sm
207 j6029_6.sm 252 j6033_10.sm 297 j6037_6.sm 342 j6041_10.sm
208 j6029_7.sm 253 j6033_2.sm 298 j6037_7.sm 343 j6041_2.sm
209 j6029_8.sm 254 j6033_3.sm 299 j6037_8.sm 344 j6041_3.sm
210 j6029_9.sm 255 j6033_4.sm 300 j6037_9.sm 345 j6041_4.sm
211 j602_1.sm 256 j6033_5.sm 301 j6038_1.sm 346 j6041_5.sm
212 j602_10.sm 257 j6033_6.sm 302 j6038_10.sm 347 j6041_6.sm
213 j602_2.sm 258 j6033_7.sm 303 j6038_2.sm 348 j6041_7.sm
214 j602_3.sm 259 j6033_8.sm 304 j6038_3.sm 349 j6041_8.sm
215 j602_4.sm 260 j6033_9.sm 305 j6038_4.sm 350 j6041_9.sm
216 j602_5.sm 261 j6034_1.sm 306 j6038_5.sm 351 j6042_1.sm
217 j602_6.sm 262 j6034_10.sm 307 j6038_6.sm 352 j6042_10.sm
218 j602_7.sm 263 j6034_2.sm 308 j6038_7.sm 353 j6042_2.sm
219 j602_8.sm 264 j6034_3.sm 309 j6038_8.sm 354 j6042_3.sm
220 j602_9.sm 265 j6034_4.sm 310 j6038_9.sm 355 j6042_4.sm
221 j6030_1.sm 266 j6034_5.sm 311 j6039_1.sm 356 j6042_5.sm
222 j6030_10.sm 267 j6034_6.sm 312 j6039_10.sm 357 j6042_6.sm
223 j6030_2.sm 268 j6034_7.sm 313 j6039_2.sm 358 j6042_7.sm
224 j6030_3.sm 269 j6034_8.sm 314 j6039_3.sm 359 j6042_8.sm
225 j6030_4.sm 270 j6034_9.sm 315 j6039_4.sm 360 j6042_9.sm
No. File Name No. File Name No. File Name No. File Name
361 j6043_1.sm 406 j6047_5.sm 451 j607_1.sm
362 j6043_10.sm 407 j6047_6.sm 452 j607_10.sm
363 j6043_2.sm 408 j6047_7.sm 453 j607_2.sm
364 j6043_3.sm 409 j6047_8.sm 454 j607_3.sm
365 j6043_4.sm 410 j6047_9.sm 455 j607_4.sm
366 j6043_5.sm 411 j6048_1.sm 456 j607_5.sm
367 j6043_6.sm 412 j6048_10.sm 457 j607_6.sm
368 j6043_7.sm 413 j6048_2.sm 458 j607_7.sm
369 j6043_8.sm 414 j6048_3.sm 459 j607_8.sm
370 j6043_9.sm 415 j6048_4.sm 460 j607_9.sm
371 j6044_1.sm 416 j6048_5.sm 461 j608_1.sm
372 j6044_10.sm 417 j6048_6.sm 462 j608_10.sm
373 j6044_2.sm 418 j6048_7.sm 463 j608_2.sm
374 j6044_3.sm 419 j6048_8.sm 464 j608_3.sm
375 j6044_4.sm 420 j6048_9.sm 465 j608_4.sm
376 j6044_5.sm 421 j604_1.sm 466 j608_5.sm
377 j6044_6.sm 422 j604_10.sm 467 j608_6.sm
378 j6044_7.sm 423 j604_2.sm 468 j608_7.sm
379 j6044_8.sm 424 j604_3.sm 469 j608_8.sm
380 j6044_9.sm 425 j604_4.sm 470 j608_9.sm
381 j6045_1.sm 426 j604_5.sm 471 j609_1.sm
382 j6045_10.sm 427 j604_6.sm 472 j609_10.sm
383 j6045_2.sm 428 j604_7.sm 473 j609_2.sm
384 j6045_3.sm 429 j604_8.sm 474 j609_3.sm
385 j6045_4.sm 430 j604_9.sm 475 j609_4.sm
386 j6045_5.sm 431 j605_1.sm 476 j609_5.sm
387 j6045_6.sm 432 j605_10.sm 477 j609_6.sm
388 j6045_7.sm 433 j605_2.sm 478 j609_7.sm
389 j6045_8.sm 434 j605_3.sm 479 j609_8.sm
390 j6045_9.sm 435 j605_4.sm 480 j609_9.sm
391 j6046_1.sm 436 j605_5.sm
392 j6046_10.sm 437 j605_6.sm
393 j6046_2.sm 438 j605_7.sm
394 j6046_3.sm 439 j605_8.sm
395 j6046_4.sm 440 j605_9.sm
396 j6046_5.sm 441 j606_1.sm
397 j6046_6.sm 442 j606_10.sm
398 j6046_7.sm 443 j606_2.sm
399 j6046_8.sm 444 j606_3.sm
400 j6046_9.sm 445 j606_4.sm
401 j6047_1.sm 446 j606_5.sm
402 j6047_10.sm 447 j606_6.sm
403 j6047_2.sm 448 j606_7.sm
404 j6047_3.sm 449 j606_8.sm
405 j6047_4.sm 450 j606_9.sm
J90 Networks
No. File Name No. File Name No. File Name No. File Name
1 j9010_1.sm 46 j9014_5.sm 91 j9019_1.sm 136 j9022_5.sm
2 j9010_10.sm 47 j9014_6.sm 92 j9019_10.sm 137 j9022_6.sm
3 j9010_2.sm 48 j9014_7.sm 93 j9019_2.sm 138 j9022_7.sm
4 j9010_3.sm 49 j9014_8.sm 94 j9019_3.sm 139 j9022_8.sm
5 j9010_4.sm 50 j9014_9.sm 95 j9019_4.sm 140 j9022_9.sm
6 j9010_5.sm 51 j9015_1.sm 96 j9019_5.sm 141 j9023_1.sm
7 j9010_6.sm 52 j9015_10.sm 97 j9019_6.sm 142 j9023_10.sm
8 j9010_7.sm 53 j9015_2.sm 98 j9019_7.sm 143 j9023_2.sm
9 j9010_8.sm 54 j9015_3.sm 99 j9019_8.sm 144 j9023_3.sm
10 j9010_9.sm 55 j9015_4.sm 100 j9019_9.sm 145 j9023_4.sm
11 j9011_1.sm 56 j9015_5.sm 101 j901_1.sm 146 j9023_5.sm
12 j9011_10.sm 57 j9015_6.sm 102 j901_10.sm 147 j9023_6.sm
13 j9011_2.sm 58 j9015_7.sm 103 j901_2.sm 148 j9023_7.sm
14 j9011_3.sm 59 j9015_8.sm 104 j901_3.sm 149 j9023_8.sm
15 j9011_4.sm 60 j9015_9.sm 105 j901_4.sm 150 j9023_9.sm
16 j9011_5.sm 61 j9016_1.sm 106 j901_5.sm 151 j9024_1.sm
17 j9011_6.sm 62 j9016_10.sm 107 j901_6.sm 152 j9024_10.sm
18 j9011_7.sm 63 j9016_2.sm 108 j901_7.sm 153 j9024_2.sm
19 j9011_8.sm 64 j9016_3.sm 109 j901_8.sm 154 j9024_3.sm
20 j9011_9.sm 65 j9016_4.sm 110 j901_9.sm 155 j9024_4.sm
21 j9012_1.sm 66 j9016_5.sm 111 j9020_1.sm 156 j9024_5.sm
22 j9012_10.sm 67 j9016_6.sm 112 j9020_10.sm 157 j9024_6.sm
23 j9012_2.sm 68 j9016_7.sm 113 j9020_2.sm 158 j9024_7.sm
24 j9012_3.sm 69 j9016_8.sm 114 j9020_3.sm 159 j9024_8.sm
25 j9012_4.sm 70 j9016_9.sm 115 j9020_4.sm 160 j9024_9.sm
26 j9012_5.sm 71 j9017_1.sm 116 j9020_5.sm 161 j9025_1.sm
27 j9012_6.sm 72 j9017_10.sm 117 j9020_6.sm 162 j9025_10.sm
28 j9012_7.sm 73 j9017_2.sm 118 j9020_7.sm 163 j9025_2.sm
29 j9012_8.sm 74 j9017_3.sm 119 j9020_8.sm 164 j9025_3.sm
30 j9012_9.sm 75 j9017_4.sm 120 j9020_9.sm 165 j9025_4.sm
31 j9013_1.sm 76 j9017_5.sm 121 j9021_1.sm 166 j9025_5.sm
32 j9013_10.sm 77 j9017_6.sm 122 j9021_10.sm 167 j9025_6.sm
33 j9013_2.sm 78 j9017_7.sm 123 j9021_2.sm 168 j9025_7.sm
34 j9013_3.sm 79 j9017_8.sm 124 j9021_3.sm 169 j9025_8.sm
35 j9013_4.sm 80 j9017_9.sm 125 j9021_4.sm 170 j9025_9.sm
36 j9013_5.sm 81 j9018_1.sm 126 j9021_5.sm 171 j9026_1.sm
37 j9013_6.sm 82 j9018_10.sm 127 j9021_6.sm 172 j9026_10.sm
38 j9013_7.sm 83 j9018_2.sm 128 j9021_7.sm 173 j9026_2.sm
39 j9013_8.sm 84 j9018_3.sm 129 j9021_8.sm 174 j9026_3.sm
40 j9013_9.sm 85 j9018_4.sm 130 j9021_9.sm 175 j9026_4.sm
41 j9014_1.sm 86 j9018_5.sm 131 j9022_1.sm 176 j9026_5.sm
42 j9014_10.sm 87 j9018_6.sm 132 j9022_10.sm 177 j9026_6.sm
43 j9014_2.sm 88 j9018_7.sm 133 j9022_2.sm 178 j9026_7.sm
44 j9014_3.sm 89 j9018_8.sm 134 j9022_3.sm 179 j9026_8.sm
45 j9014_4.sm 90 j9018_9.sm 135 j9022_4.sm 180 j9026_9.sm
No. File Name No. File Name No. File Name No. File Name
181 j9027_1.sm 226 j9030_5.sm 271 j9035_1.sm 316 j9039_5.sm
182 j9027_10.sm 227 j9030_6.sm 272 j9035_10.sm 317 j9039_6.sm
183 j9027_2.sm 228 j9030_7.sm 273 j9035_2.sm 318 j9039_7.sm
184 j9027_3.sm 229 j9030_8.sm 274 j9035_3.sm 319 j9039_8.sm
185 j9027_4.sm 230 j9030_9.sm 275 j9035_4.sm 320 j9039_9.sm
186 j9027_5.sm 231 j9031_1.sm 276 j9035_5.sm 321 j903_1.sm
187 j9027_6.sm 232 j9031_10.sm 277 j9035_6.sm 322 j903_10.sm
188 j9027_7.sm 233 j9031_2.sm 278 j9035_7.sm 323 j903_2.sm
189 j9027_8.sm 234 j9031_3.sm 279 j9035_8.sm 324 j903_3.sm
190 j9027_9.sm 235 j9031_4.sm 280 j9035_9.sm 325 j903_4.sm
191 j9028_1.sm 236 j9031_5.sm 281 j9036_1.sm 326 j903_5.sm
192 j9028_10.sm 237 j9031_6.sm 282 j9036_10.sm 327 j903_6.sm
193 j9028_2.sm 238 j9031_7.sm 283 j9036_2.sm 328 j903_7.sm
194 j9028_3.sm 239 j9031_8.sm 284 j9036_3.sm 329 j903_8.sm
195 j9028_4.sm 240 j9031_9.sm 285 j9036_4.sm 330 j903_9.sm
196 j9028_5.sm 241 j9032_1.sm 286 j9036_5.sm 331 j9040_1.sm
197 j9028_6.sm 242 j9032_10.sm 287 j9036_6.sm 332 j9040_10.sm
198 j9028_7.sm 243 j9032_2.sm 288 j9036_7.sm 333 j9040_2.sm
199 j9028_8.sm 244 j9032_3.sm 289 j9036_8.sm 334 j9040_3.sm
200 j9028_9.sm 245 j9032_4.sm 290 j9036_9.sm 335 j9040_4.sm
201 j9029_1.sm 246 j9032_5.sm 291 j9037_1.sm 336 j9040_5.sm
202 j9029_10.sm 247 j9032_6.sm 292 j9037_10.sm 337 j9040_6.sm
203 j9029_2.sm 248 j9032_7.sm 293 j9037_2.sm 338 j9040_7.sm
204 j9029_3.sm 249 j9032_8.sm 294 j9037_3.sm 339 j9040_8.sm
205 j9029_4.sm 250 j9032_9.sm 295 j9037_4.sm 340 j9040_9.sm
206 j9029_5.sm 251 j9033_1.sm 296 j9037_5.sm 341 j9041_1.sm
207 j9029_6.sm 252 j9033_10.sm 297 j9037_6.sm 342 j9041_10.sm
208 j9029_7.sm 253 j9033_2.sm 298 j9037_7.sm 343 j9041_2.sm
209 j9029_8.sm 254 j9033_3.sm 299 j9037_8.sm 344 j9041_3.sm
210 j9029_9.sm 255 j9033_4.sm 300 j9037_9.sm 345 j9041_4.sm
211 j902_1.sm 256 j9033_5.sm 301 j9038_1.sm 346 j9041_5.sm
212 j902_10.sm 257 j9033_6.sm 302 j9038_10.sm 347 j9041_6.sm
213 j902_2.sm 258 j9033_7.sm 303 j9038_2.sm 348 j9041_7.sm
214 j902_3.sm 259 j9033_8.sm 304 j9038_3.sm 349 j9041_8.sm
215 j902_4.sm 260 j9033_9.sm 305 j9038_4.sm 350 j9041_9.sm
216 j902_5.sm 261 j9034_1.sm 306 j9038_5.sm 351 j9042_1.sm
217 j902_6.sm 262 j9034_10.sm 307 j9038_6.sm 352 j9042_10.sm
218 j902_7.sm 263 j9034_2.sm 308 j9038_7.sm 353 j9042_2.sm
219 j902_8.sm 264 j9034_3.sm 309 j9038_8.sm 354 j9042_3.sm
220 j902_9.sm 265 j9034_4.sm 310 j9038_9.sm 355 j9042_4.sm
221 j9030_1.sm 266 j9034_5.sm 311 j9039_1.sm 356 j9042_5.sm
222 j9030_10.sm 267 j9034_6.sm 312 j9039_10.sm 357 j9042_6.sm
223 j9030_2.sm 268 j9034_7.sm 313 j9039_2.sm 358 j9042_7.sm
224 j9030_3.sm 269 j9034_8.sm 314 j9039_3.sm 359 j9042_8.sm
225 j9030_4.sm 270 j9034_9.sm 315 j9039_4.sm 360 j9042_9.sm
No. File Name No. File Name No. File Name No. File Name
361 j9043_1.sm 406 j9047_5.sm 451 j907_1.sm
362 j9043_10.sm 407 j9047_6.sm 452 j907_10.sm
363 j9043_2.sm 408 j9047_7.sm 453 j907_2.sm
364 j9043_3.sm 409 j9047_8.sm 454 j907_3.sm
365 j9043_4.sm 410 j9047_9.sm 455 j907_4.sm
366 j9043_5.sm 411 j9048_1.sm 456 j907_5.sm
367 j9043_6.sm 412 j9048_10.sm 457 j907_6.sm
368 j9043_7.sm 413 j9048_2.sm 458 j907_7.sm
369 j9043_8.sm 414 j9048_3.sm 459 j907_8.sm
370 j9043_9.sm 415 j9048_4.sm 460 j907_9.sm
371 j9044_1.sm 416 j9048_5.sm 461 j908_1.sm
372 j9044_10.sm 417 j9048_6.sm 462 j908_10.sm
373 j9044_2.sm 418 j9048_7.sm 463 j908_2.sm
374 j9044_3.sm 419 j9048_8.sm 464 j908_3.sm
375 j9044_4.sm 420 j9048_9.sm 465 j908_4.sm
376 j9044_5.sm 421 j904_1.sm 466 j908_5.sm
377 j9044_6.sm 422 j904_10.sm 467 j908_6.sm
378 j9044_7.sm 423 j904_2.sm 468 j908_7.sm
379 j9044_8.sm 424 j904_3.sm 469 j908_8.sm
380 j9044_9.sm 425 j904_4.sm 470 j908_9.sm
381 j9045_1.sm 426 j904_5.sm 471 j909_1.sm
382 j9045_10.sm 427 j904_6.sm 472 j909_10.sm
383 j9045_2.sm 428 j904_7.sm 473 j909_2.sm
384 j9045_3.sm 429 j904_8.sm 474 j909_3.sm
385 j9045_4.sm 430 j904_9.sm 475 j909_4.sm
386 j9045_5.sm 431 j905_1.sm 476 j909_5.sm
387 j9045_6.sm 432 j905_10.sm 477 j909_6.sm
388 j9045_7.sm 433 j905_2.sm 478 j909_7.sm
389 j9045_8.sm 434 j905_3.sm 479 j909_8.sm
390 j9045_9.sm 435 j905_4.sm 480 j909_9.sm
391 j9046_1.sm 436 j905_5.sm
392 j9046_10.sm 437 j905_6.sm
393 j9046_2.sm 438 j905_7.sm
394 j9046_3.sm 439 j905_8.sm
395 j9046_4.sm 440 j905_9.sm
396 j9046_5.sm 441 j906_1.sm
397 j9046_6.sm 442 j906_10.sm
398 j9046_7.sm 443 j906_2.sm
399 j9046_8.sm 444 j906_3.sm
400 j9046_9.sm 445 j906_4.sm
401 j9047_1.sm 446 j906_5.sm
402 j9047_10.sm 447 j906_6.sm
403 j9047_2.sm 448 j906_7.sm
404 j9047_3.sm 449 j906_8.sm
405 j9047_4.sm 450 j906_9.sm
J120 Networks
No. File Name No. File Name No. File Name No. File Name
1 j12010_1.sm 46 j12014_5.sm 91 j12019_1.sm 136 j12022_5.sm
2 j12010_10.sm 47 j12014_6.sm 92 j12019_10.sm 137 j12022_6.sm
3 j12010_2.sm 48 j12014_7.sm 93 j12019_2.sm 138 j12022_7.sm
4 j12010_3.sm 49 j12014_8.sm 94 j12019_3.sm 139 j12022_8.sm
5 j12010_4.sm 50 j12014_9.sm 95 j12019_4.sm 140 j12022_9.sm
6 j12010_5.sm 51 j12015_1.sm 96 j12019_5.sm 141 j12023_1.sm
7 j12010_6.sm 52 j12015_10.sm 97 j12019_6.sm 142 j12023_10.sm
8 j12010_7.sm 53 j12015_2.sm 98 j12019_7.sm 143 j12023_2.sm
9 j12010_8.sm 54 j12015_3.sm 99 j12019_8.sm 144 j12023_3.sm
10 j12010_9.sm 55 j12015_4.sm 100 j12019_9.sm 145 j12023_4.sm
11 j12011_1.sm 56 j12015_5.sm 101 j1201_1.sm 146 j12023_5.sm
12 j12011_10.sm 57 j12015_6.sm 102 j1201_10.sm 147 j12023_6.sm
13 j12011_2.sm 58 j12015_7.sm 103 j1201_2.sm 148 j12023_7.sm
14 j12011_3.sm 59 j12015_8.sm 104 j1201_3.sm 149 j12023_8.sm
15 j12011_4.sm 60 j12015_9.sm 105 j1201_4.sm 150 j12023_9.sm
16 j12011_5.sm 61 j12016_1.sm 106 j1201_5.sm 151 j12024_1.sm
17 j12011_6.sm 62 j12016_10.sm 107 j1201_6.sm 152 j12024_10.sm
18 j12011_7.sm 63 j12016_2.sm 108 j1201_7.sm 153 j12024_2.sm
19 j12011_8.sm 64 j12016_3.sm 109 j1201_8.sm 154 j12024_3.sm
20 j12011_9.sm 65 j12016_4.sm 110 j1201_9.sm 155 j12024_4.sm
21 j12012_1.sm 66 j12016_5.sm 111 j12020_1.sm 156 j12024_5.sm
22 j12012_10.sm 67 j12016_6.sm 112 j12020_10.sm 157 j12024_6.sm
23 j12012_2.sm 68 j12016_7.sm 113 j12020_2.sm 158 j12024_7.sm
24 j12012_3.sm 69 j12016_8.sm 114 j12020_3.sm 159 j12024_8.sm
25 j12012_4.sm 70 j12016_9.sm 115 j12020_4.sm 160 j12024_9.sm
26 j12012_5.sm 71 j12017_1.sm 116 j12020_5.sm 161 j12025_1.sm
27 j12012_6.sm 72 j12017_10.sm 117 j12020_6.sm 162 j12025_10.sm
28 j12012_7.sm 73 j12017_2.sm 118 j12020_7.sm 163 j12025_2.sm
29 j12012_8.sm 74 j12017_3.sm 119 j12020_8.sm 164 j12025_3.sm
30 j12012_9.sm 75 j12017_4.sm 120 j12020_9.sm 165 j12025_4.sm
31 j12013_1.sm 76 j12017_5.sm 121 j12021_1.sm 166 j12025_5.sm
32 j12013_10.sm 77 j12017_6.sm 122 j12021_10.sm 167 j12025_6.sm
33 j12013_2.sm 78 j12017_7.sm 123 j12021_2.sm 168 j12025_7.sm
34 j12013_3.sm 79 j12017_8.sm 124 j12021_3.sm 169 j12025_8.sm
35 j12013_4.sm 80 j12017_9.sm 125 j12021_4.sm 170 j12025_9.sm
36 j12013_5.sm 81 j12018_1.sm 126 j12021_5.sm 171 j12026_1.sm
37 j12013_6.sm 82 j12018_10.sm 127 j12021_6.sm 172 j12026_10.sm
38 j12013_7.sm 83 j12018_2.sm 128 j12021_7.sm 173 j12026_2.sm
39 j12013_8.sm 84 j12018_3.sm 129 j12021_8.sm 174 j12026_3.sm
40 j12013_9.sm 85 j12018_4.sm 130 j12021_9.sm 175 j12026_4.sm
41 j12014_1.sm 86 j12018_5.sm 131 j12022_1.sm 176 j12026_5.sm
42 j12014_10.sm 87 j12018_6.sm 132 j12022_10.sm 177 j12026_6.sm
43 j12014_2.sm 88 j12018_7.sm 133 j12022_2.sm 178 j12026_7.sm
44 j12014_3.sm 89 j12018_8.sm 134 j12022_3.sm 179 j12026_8.sm
45 j12014_4.sm 90 j12018_9.sm 135 j12022_4.sm 180 j12026_9.sm
No. File Name No. File Name No. File Name No. File Name
361 j12043_1.sm 406 j12047_5.sm 451 j12051_1.sm 496 j12055_5.sm
362 j12043_10.sm 407 j12047_6.sm 452 j12051_10.sm 497 j12055_6.sm
363 j12043_2.sm 408 j12047_7.sm 453 j12051_2.sm 498 j12055_7.sm
364 j12043_3.sm 409 j12047_8.sm 454 j12051_3.sm 499 j12055_8.sm
365 j12043_4.sm 410 j12047_9.sm 455 j12051_4.sm 500 j12055_9.sm
366 j12043_5.sm 411 j12048_1.sm 456 j12051_5.sm 501 j12056_1.sm
367 j12043_6.sm 412 j12048_10.sm 457 j12051_6.sm 502 j12056_10.sm
368 j12043_7.sm 413 j12048_2.sm 458 j12051_7.sm 503 j12056_2.sm
369 j12043_8.sm 414 j12048_3.sm 459 j12051_8.sm 504 j12056_3.sm
370 j12043_9.sm 415 j12048_4.sm 460 j12051_9.sm 505 j12056_4.sm
371 j12044_1.sm 416 j12048_5.sm 461 j12052_1.sm 506 j12056_5.sm
372 j12044_10.sm 417 j12048_6.sm 462 j12052_10.sm 507 j12056_6.sm
373 j12044_2.sm 418 j12048_7.sm 463 j12052_2.sm 508 j12056_7.sm
374 j12044_3.sm 419 j12048_8.sm 464 j12052_3.sm 509 j12056_8.sm
375 j12044_4.sm 420 j12048_9.sm 465 j12052_4.sm 510 j12056_9.sm
376 j12044_5.sm 421 j12049_1.sm 466 j12052_5.sm 511 j12057_1.sm
377 j12044_6.sm 422 j12049_10.sm 467 j12052_6.sm 512 j12057_10.sm
378 j12044_7.sm 423 j12049_2.sm 468 j12052_7.sm 513 j12057_2.sm
379 j12044_8.sm 424 j12049_3.sm 469 j12052_8.sm 514 j12057_3.sm
380 j12044_9.sm 425 j12049_4.sm 470 j12052_9.sm 515 j12057_4.sm
381 j12045_1.sm 426 j12049_5.sm 471 j12053_1.sm 516 j12057_5.sm
382 j12045_10.sm 427 j12049_6.sm 472 j12053_10.sm 517 j12057_6.sm
383 j12045_2.sm 428 j12049_7.sm 473 j12053_2.sm 518 j12057_7.sm
384 j12045_3.sm 429 j12049_8.sm 474 j12053_3.sm 519 j12057_8.sm
385 j12045_4.sm 430 j12049_9.sm 475 j12053_4.sm 520 j12057_9.sm
386 j12045_5.sm 431 j1204_1.sm 476 j12053_5.sm 521 j12058_1.sm
387 j12045_6.sm 432 j1204_10.sm 477 j12053_6.sm 522 j12058_10.sm
388 j12045_7.sm 433 j1204_2.sm 478 j12053_7.sm 523 j12058_2.sm
389 j12045_8.sm 434 j1204_3.sm 479 j12053_8.sm 524 j12058_3.sm
390 j12045_9.sm 435 j1204_4.sm 480 j12053_9.sm 525 j12058_4.sm
391 j12046_1.sm 436 j1204_5.sm 481 j12054_1.sm 526 j12058_5.sm
392 j12046_10.sm 437 j1204_6.sm 482 j12054_10.sm 527 j12058_6.sm
393 j12046_2.sm 438 j1204_7.sm 483 j12054_2.sm 528 j12058_7.sm
394 j12046_3.sm 439 j1204_8.sm 484 j12054_3.sm 529 j12058_8.sm
395 j12046_4.sm 440 j1204_9.sm 485 j12054_4.sm 530 j12058_9.sm
396 j12046_5.sm 441 j12050_1.sm 486 j12054_5.sm 531 j12059_1.sm
397 j12046_6.sm 442 j12050_10.sm 487 j12054_6.sm 532 j12059_10.sm
398 j12046_7.sm 443 j12050_2.sm 488 j12054_7.sm 533 j12059_2.sm
399 j12046_8.sm 444 j12050_3.sm 489 j12054_8.sm 534 j12059_3.sm
400 j12046_9.sm 445 j12050_4.sm 490 j12054_9.sm 535 j12059_4.sm
401 j12047_1.sm 446 j12050_5.sm 491 j12055_1.sm 536 j12059_5.sm
402 j12047_10.sm 447 j12050_6.sm 492 j12055_10.sm 537 j12059_6.sm
403 j12047_2.sm 448 j12050_7.sm 493 j12055_2.sm 538 j12059_7.sm
404 j12047_3.sm 449 j12050_8.sm 494 j12055_3.sm 539 j12059_8.sm
405 j12047_4.sm 450 j12050_9.sm 495 j12055_4.sm 540 j12059_9.sm
No. File Name No. File Name No. File Name No. File Name
541 j1205_1.sm 556 j12060_5.sm 571 j1207_1.sm 586 j1208_5.sm
542 j1205_10.sm 557 j12060_6.sm 572 j1207_10.sm 587 j1208_6.sm
543 j1205_2.sm 558 j12060_7.sm 573 j1207_2.sm 588 j1208_7.sm
544 j1205_3.sm 559 j12060_8.sm 574 j1207_3.sm 589 j1208_8.sm
545 j1205_4.sm 560 j12060_9.sm 575 j1207_4.sm 590 j1208_9.sm
546 j1205_5.sm 561 j1206_1.sm 576 j1207_5.sm 591 j1209_1.sm
547 j1205_6.sm 562 j1206_10.sm 577 j1207_6.sm 592 j1209_10.sm
548 j1205_7.sm 563 j1206_2.sm 578 j1207_7.sm 593 j1209_2.sm
549 j1205_8.sm 564 j1206_3.sm 579 j1207_8.sm 594 j1209_3.sm
550 j1205_9.sm 565 j1206_4.sm 580 j1207_9.sm 595 j1209_4.sm
551 j12060_1.sm 566 j1206_5.sm 581 j1208_1.sm 596 j1209_5.sm
552 j12060_10.sm 567 j1206_6.sm 582 j1208_10.sm 597 j1209_6.sm
553 j12060_2.sm 568 j1206_7.sm 583 j1208_2.sm 598 j1209_7.sm
554 j12060_3.sm 569 j1206_8.sm 584 j1208_3.sm 599 j1209_8.sm
555 j12060_4.sm 570 j1206_9.sm 585 j1208_4.sm 600 j1209_9.sm
************************************************************************
file with base data: j30_17.bas
initial value random generator: 28123
************************************************************************
projects: 1
jobs (incl. super source/sink): 32
horizon: 158
RESOURCES
- renewable: 4 R
- nonrenewable: 0 N
- doubly constrained: 0 D
************************************************************************
PROJECT INFORMATION:
pronr. #Jobs rel. date due date tard cost MPM-Time
1 30 0 38 26 38
************************************************************************
PRECEDENCE RELATIONS:
jobnr. #modes #successors successors
1 1 3 2 3 4
2 1 3 6 11 15
3 1 3 7 8 13
4 1 3 5 9 10
5 1 1 20
6 1 1 30
7 1 1 27
8 1 3 12 19 27
9 1 1 14
10 1 2 16 25
11 1 2 20 26
12 1 1 14
13 1 2 17 18
14 1 1 17
15 1 1 25
16 1 2 21 22
17 1 1 22
18 1 2 20 22
19 1 2 24 29
20 1 2 23 25
21 1 1 28
22 1 1 23
23 1 1 24
24 1 1 30
25 1 1 30
26 1 1 31
27 1 1 28
28 1 1 31
29 1 1 32
30 1 1 32
31 1 1 32
32 1 0
************************************************************************
Sub Convert_sm2txt()
'This procedure changes the extension of each file in folder
'PathToUse from .sm to .txt
%This code reads the contents of each text file, representing a network
%initially converted from .sm format to .txt format using a VBA code and
%saved in the DirFileA folder. All file names are listed in the
%filenameListFile file. The information obtained serves to calculate the
%triangular probabilistic durations for each network. New files are saved
%to a new folder called DirFileC.
for k=1:nFileN
filenameA=[DirFileA,MFileName{k}];
filenameC=[DirFileC,'\Tri_Dur_',MFileName{k}];
T1 = readtable(filenameA,'HeaderLines', 0, 'ReadVariableNames',
false,'Format', '%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s');
%T1 = readtable(filenameA,'HeaderLines', 0, 'ReadVariableNames',
false,'Format', '%s');
%T1 = readtable(filenameA);
%Tn=size(T1); %here calculate the size (row and column) of the table, then it
can generate a matrix the same size as the excel
sheet=1;
ID1=xlsread(filenameB,sheet,['B2:B',num2str(Tn1+1)]);
nID=size(ID1);
Tn1=nID(1);Tn2=nID(2);
[~,pd1] = xlsread(filenameB,sheet);
NameID=xlsread(filenameB,sheet,['B2:B',num2str(Tn1+1)]);
n1=i2; n2=n1+Tn1-1;
n3=i4; n4=n3+Tn1-1;
aa=T1(n1:n2,1); %9x1 cell
aaD=T1(n3:n4,1);
AA=table2array(aaD);
AAp=table2array(aa);%cell Tn1x1
nq=length(aa{1,:});
Cp1=cell(Tn1,Tn2); Cp=cell(Tn1,6); %6
C1=cell(Tn1,Tn2); C=cell(Tn1,7);%7
if nq==1
for i=1:Tn1
Mat=CC;Matp=CCp;
szC=size(CC);rC=szC(1,2); szCp=size(CCp);rCp=szCp(1,2);
Dur=zeros(Tn1,4);
Dur(:,1)=Mat(:,3);
Index=cell(Tn1,1);
for i=1:Tn1
if strcmpi(cell2mat(pd1(i+1,3)),'tri') %matches 'tri' or 'Tri'
if round(Dur(i,1),2)>0
Dur(i,2)=Dur(i,1)*0.9;
Dur(i,3)=Dur(i,1);
Succe_Index=Matp(:,4:rCp);
Successor=cell(Tn1,1);
for i=1:Tn1
j=1;
a=nonzeros(Succe_Index(i,:));
na=length(a);
if na~=0
b='';
while j<=na
if j<na
b1=[num2str(Succe_Index(i,j)),'-FTS-0;'];
b=[b,b1];
else
b1=[num2str(Succe_Index(i,j)),'-FTS-0'];
b=[b,b1];
end
j=j+1;
end
else
b='N/A';
end
Successor{i}=b;
end
Duration=Dur(:,1);
pd=pd1(2:Tn1+1,3);
ID=ID1(:,1);
Table_Net=table(ID,NameID,Duration,pd,Index,Successor);
writetable(Table_Net,filenameC,'Delimiter','\t','WriteRowNames',
true);
warning('off')
end
DetermineRows.m (subscript)
clc;
%This code finds the beginning of the 1st row of the activities/successors
%table. It also determines the total number of activities per file. It also
%finds the beginning of the 1st row of activities/durations
MFileName_1=table2cell(TFileName);
DirFileA=['j',jn,'.sm\J',jn,'_txt_Files\'];
nfile=length(MFileName_1);
%nfile=20;
pos=zeros(nfile,3);
for j=1:nfile
T1 = readtable(filenameA,'HeaderLines', 0, 'ReadVariableNames',
false,'Format', '%s%s%s%s%s%s%s%s');
i=1;c=0;cc=0;
n_act=str2num(jn)+2;
while c~=1
u=table2cell(T1(i,'Var1'));
v=cell2mat(u);
if isequal(u,"PRECEDENCE RELATIONS:")==1
i2=i+2;
%i3=i2+n_act-1;
c=0;
end
if isequal(u,"REQUESTS/DURATIONS:")==1
i4=i+3;
%i3=i4-u+1; c=1;
c=1;
end
i=i+1;
end
pos(j,:)=[i2,n_act,i4];
end
moda=mode(pos(:,1));
list1name=cell(nfile,1);list2name=cell(nfile,1);c1=0;c2=0;
for k=1:nfile
if pos(k,1)>moda
list1name(k)=MFileName_1(k);c1=c1+1;
else
list2name(k)=MFileName_1(k);c2=c2+1;k1=k;
end
end
nact=pos(k1,2);
NameFileIssue=cell(c1,1);NameFileFnl=cell(c2,1);c1=0;c2=0;
for k=1:nfile
if isequal(list2name(k),{''})
c1=c1+1;NameFileIssue(c1)=list1name(k);
end
if isequal(list1name(k),{''})
c2=c2+1;NameFileFnl(c2)=list2name(k);
end
end
Group #1 (CNC Values Provided for 120 Networks only out of 680)
Group #2 None
Group #3 (CNC Values Provided for 120 Networks only out of 680)
Group #4 None
Group #5 (CNC Values Provided for 120 Networks only out of 680)
Group #1 (Ratios provided for 135 Networks only out of 1659)
Group #2 (Ratios provided for 135 Networks only out of 284)
Group #3 (Ratios provided for 135 Networks only out of 67)
Group #4 (Ratios provided for 135 Networks only out of 24)
Group #5 (Ratios provided for 135 Networks only out of 6)
No. Net Name Ratio No. Net Name Ratio No. Net Name Ratio
1 j3043_8 0.64935 46 j3039_1 2.08333 91 j3040_9 3.19149
2 j3045_8 0.79365 47 j3029_9 2.12766 92 j3048_9 3.19149
3 j3037_9 0.87719 48 j3047_2 2.12766 93 j3028_9 3.22581
4 j3045_2 0.89286 49 j3046_2 2.15054 94 j3030_7 3.22581
5 j3035_8 0.95238 50 j3039_7 2.17391 95 j3043_5 3.22581
6 j3042_10 0.96154 51 j3037_4 2.19780 96 j3039_9 3.25203
7 j3036_3 0.99010 52 j3030_2 2.22222 97 j3034_7 3.30579
8 j3036_10 1.02041 53 j3046_6 2.24719 98 j3017_2 3.33333
9 j3042_5 1.02041 54 j3026_8 2.27273 99 j303_4 3.33333
10 j3045_6 1.02041 55 j3034_5 2.32558 100 j3035_1 3.36134
11 j3043_4 1.03093 56 j3046_1 2.32558 101 j3045_9 3.37079
12 j3033_4 1.05263 57 j3038_6 2.36220 102 j3032_1 3.40909
13 j3042_8 1.06383 58 j3019_3 2.38095 103 j3035_4 3.48837
14 j3035_3 1.07527 59 j3043_1 2.38095 104 j3043_3 3.52941
15 j3041_9 1.19048 60 j3047_1 2.45098 105 j3045_5 3.57143
16 j3044_1 1.20482 61 j3041_6 2.46914 106 j3039_5 3.61446
17 j3046_3 1.21951 62 j3034_4 2.50000 107 j3047_6 3.61446
18 j3041_3 1.25000 63 j3040_3 2.50000 108 j3018_4 3.63636
19 j3035_7 1.38889 64 j3048_7 2.50000 109 j3019_5 3.63636
20 j3042_9 1.38889 65 j3034_6 2.53165 110 j3034_9 3.63636
21 j3048_1 1.39860 66 j3042_1 2.53165 111 j3041_7 3.64964
22 j3038_4 1.51515 67 j3036_4 2.58621 112 j3040_6 3.67647
23 j3022_9 1.61290 68 j3048_2 2.58621 113 j3023_3 3.70370
24 j3040_5 1.63934 69 j3033_6 2.63158 114 j3033_9 3.70370
25 j3044_6 1.71429 70 j3035_6 2.63158 115 j3042_6 3.70370
26 j3039_2 1.72414 71 j3045_4 2.67857 116 j3018_3 3.77358
27 j3045_7 1.72414 72 j3031_1 2.70270 117 j3039_4 3.79747
28 j3028_5 1.75439 73 j3031_2 2.70270 118 j3043_7 3.79747
29 j3036_1 1.76991 74 j3043_6 2.70270 119 j3048_5 3.79747
30 j3037_8 1.76991 75 j3042_3 2.75229 120 j3039_8 3.80952
31 j3036_8 1.78571 76 j3023_8 2.77778 121 j3024_3 3.84615
32 j3043_10 1.81818 77 j3025_1 2.77778 122 j3037_6 3.84615
33 j3025_10 1.85185 78 j3038_7 2.77778 123 j3037_7 3.84615
34 j3038_1 1.86916 79 j3048_3 2.77778 124 j3025_5 3.92157
35 j3047_3 1.88679 80 j3044_9 2.81690 125 j3032_2 3.92157
36 j3035_5 1.90476 81 j3035_10 2.85714 126 j3040_2 4.00000
37 j3037_3 1.90476 82 j3045_1 2.85714 127 j3037_10 4.06504
38 j3042_7 1.90476 83 j3019_8 2.94118 128 j3028_6 4.08163
39 j3047_4 1.98020 84 j3046_9 3.00000 129 j3037_2 4.10959
40 j3021_3 2.00000 85 j3048_4 3.01205 130 j3023_1 4.16667
41 j3021_7 2.00000 86 j3026_1 3.03030 131 j303_3 4.16667
42 j3028_10 2.04082 87 j3040_8 3.03030 132 j3017_3 4.25532
43 j3024_4 2.08333 88 j3045_10 3.12500 133 j3030_5 4.25532
44 j3027_8 2.08333 89 j3046_8 3.17460 134 j3044_10 4.25532
45 j3036_5 2.08333 90 j3048_8 3.17460 135 j3041_5 4.26829
PSPLIB Networks - Paths Ratios (%) for 100 Simulation Runs Group#2
No. Net Name Ratio No. Net Name Ratio No. Net Name Ratio
1 j3047_8 7.00000 46 j308_2 8.69565 91 j3026_10 10.34483
2 j3018_5 7.14286 47 j3023_9 8.82353 92 j3012_5 10.52632
3 j3027_2 7.14286 48 j3036_7 8.82353 93 j3015_1 10.52632
4 j3027_6 7.14286 49 j3026_5 8.88889 94 j302_2 10.52632
5 j3031_8 7.14286 50 j3032_6 8.88889 95 j3032_4 10.52632
6 j3028_7 7.27273 51 j3010_2 9.09091 96 j304_3 10.52632
7 j3023_6 7.31707 52 j3013_9 9.09091 97 j305_10 10.52632
8 j3032_8 7.31707 53 j3015_5 9.09091 98 j305_2 10.52632
9 j3021_2 7.35294 54 j3019_6 9.09091 99 j306_8 10.52632
10 j3038_2 7.50000 55 j304_2 9.09091 100 j307_6 10.52632
11 j3044_7 7.56303 56 j304_5 9.09091 101 j308_7 10.52632
12 j3018_6 7.69231 57 j3033_5 9.19540 102 j304_8 10.71429
13 j3023_4 7.69231 58 j3039_3 9.25926 103 j3018_10 10.81081
14 j3023_7 7.69231 59 j3018_8 9.30233 104 j3025_4 10.81081
15 j3024_2 7.69231 60 j3027_1 9.30233 105 j3012_6 11.11111
16 j309_10 7.69231 61 j3011_1 9.52381 106 j3017_4 11.11111
17 j3029_4 7.84314 62 j3012_3 9.52381 107 j3018_9 11.11111
18 j3044_3 7.86517 63 j3013_6 9.52381 108 j3020_10 11.11111
19 j3047_5 7.86517 64 j3015_3 9.52381 109 j3020_4 11.11111
20 j3019_10 7.89474 65 j301_7 9.52381 110 j3020_9 11.11111
21 j309_7 8.00000 66 j303_9 9.52381 111 j3023_10 11.11111
22 j3041_1 8.06452 67 j305_3 9.52381 112 j3024_9 11.11111
23 j3036_2 8.10811 68 j309_5 9.52381 113 j302_7 11.11111
24 j3047_7 8.13953 69 j3021_1 9.61538 114 j3032_10 11.11111
25 j3022_10 8.16327 70 j3022_8 9.75610 115 j3041_10 11.11111
26 j3019_7 8.33333 71 j3025_3 9.75610 116 j305_9 11.11111
27 j3024_6 8.33333 72 j3030_1 9.75610 117 j306_2 11.11111
28 j3025_6 8.33333 73 j3047_9 9.82143 118 j3022_5 11.62791
29 j302_4 8.33333 74 j3041_8 9.89011 119 j3020_3 11.76471
30 j3030_6 8.33333 75 j3010_5 10.00000 120 j3012_7 12.00000
31 j3041_4 8.43373 76 j3011_4 10.00000 121 j3016_5 12.00000
32 j3022_2 8.47458 77 j3014_7 10.00000 122 j3030_4 12.00000
33 j3025_7 8.51064 78 j301_1 10.00000 123 j3017_1 12.12121
34 j3042_2 8.53659 79 j301_3 10.00000 124 j3022_6 12.12121
35 j3022_7 8.57143 80 j3025_8 10.00000 125 j3018_2 12.16216
36 j3037_5 8.57143 81 j3025_9 10.00000 126 j3029_7 12.19512
37 j3043_2 8.57143 82 j3026_7 10.00000 127 j3013_3 12.50000
38 j3044_4 8.62069 83 j3032_3 10.00000 128 j303_8 12.50000
39 j3021_4 8.64198 84 j308_10 10.00000 129 j304_1 12.50000
40 j3014_3 8.69565 85 j3034_2 10.09174 130 j3027_5 12.76596
41 j3015_10 8.69565 86 j3046_10 10.11236 131 j3014_5 13.04348
42 j3016_9 8.69565 87 j3021_10 10.20408 132 j306_9 13.04348
43 j3022_1 8.69565 88 j3038_5 10.22727 133 j3019_1 13.15789
44 j303_10 8.69565 89 j3021_6 10.25641 134 j3019_4 13.15789
45 j307_4 8.69565 90 j3021_5 10.34483 135 j3029_5 13.15789
PSPLIB Networks - Paths Ratios (%) for 100 Simulation Runs Group#3
No. Net Name Ratio No. Net Name Ratio No. Net Name Ratio
1 j3017_5 14.00000 24 j308_6 15.00000 47 j6012_6 16.66667
2 j3018_1 14.00000 25 j309_1 15.00000 48 j6012_9 16.66667
3 j3020_5 14.00000 26 j309_9 15.00000 49 j3022_4 17.07317
4 j3012_4 14.28571 27 j6011_1 15.00000 50 j3021_9 17.14286
5 j3013_4 14.28571 28 j608_2 15.15152 51 j3029_3 17.14286
6 j3016_2 14.28571 29 j3027_7 15.21739 52 j3011_5 17.39130
7 j306_5 14.28571 30 j3014_10 15.38462 53 j306_7 17.39130
8 j308_1 14.28571 31 j3010_10 15.78947 54 j3023_5 17.94872
9 j6014_1 14.28571 32 j3011_2 15.78947 55 j3010_4 18.18182
10 j6014_5 14.28571 33 j3013_5 15.78947 56 j3024_8 18.18182
11 j608_4 14.28571 34 j301_5 15.78947 57 j304_6 18.18182
12 j9015_2 14.58333 35 j302_5 15.78947 58 j3011_10 19.04762
13 j6016_9 14.63415 36 j3032_7 15.78947 59 j302_3 19.04762
14 j6013_5 14.70588 37 j307_7 15.78947 60 j3030_9 19.04762
15 j602_5 14.70588 38 j308_9 15.78947 61 j306_10 19.04762
16 j602_7 14.70588 39 j309_3 15.78947 62 j3029_6 19.51220
17 j3010_1 15.00000 40 j302_10 16.00000 63 j3013_8 20.00000
18 j3014_4 15.00000 41 j306_3 16.00000 64 j3015_2 20.00000
19 j301_9 15.00000 42 j3019_9 16.21622 65 j307_10 20.00000
20 j304_9 15.00000 43 j3028_3 16.21622 66 j6016_6 20.00000
21 j305_5 15.00000 44 j3015_9 16.66667 67 j3014_1 20.83333
22 j306_6 15.00000 45 j302_8 16.66667
23 j307_3 15.00000 46 j307_8 16.66667
PSPLIB Networks - Paths Ratios (%) for 100 Simulation Runs Group#4
No. Net Name Ratio No. Net Name Ratio No. Net Name Ratio
1 j3010_8 21.05263
2 j3012_2 21.05263
3 j3014_6 21.05263
4 j3016_3 21.05263
5 j3016_4 21.05263
6 j302_9 21.05263
7 j305_1 21.05263
8 j307_5 21.05263
9 j308_8 21.05263
10 j3011_7 22.22222
11 j301_2 22.22222
12 j305_6 22.22222
13 j307_2 22.72727
14 j3016_7 23.80952
15 j303_6 23.80952
16 j3020_6 24.48980
17 j301_10 25.00000
18 j3012_1 26.31579
19 j3014_2 26.31579
20 j305_4 26.31579
21 j301_6 26.92308
22 j3013_2 27.77778
23 j305_7 27.77778
24 j309_2 27.77778
PSPLIB Networks - Paths Ratios (%) for 100 Simulation Runs Group#5: Ratios > 28%
No. Net Name Ratio No. Net Name Ratio No. Net Name Ratio
1 j303_1 28.00000
2 j302_1 30.00000
3 j3013_1 30.43478
4 j303_5 30.43478
5 j3012_8 31.57895
6 j306_4 31.57895
Group #1: D values listed for only 135 of the 480 networks in this group
Group #2: D values listed for only 135 of the 639 networks in this group
Group #3: D values listed for only 135 of the 361 networks in this group
Group #4: D values listed for only 135 of the 360 networks in this group
Group #5: D values listed for only 135 of the 200 networks in this group
Group #1: Cn values listed for only 135 of the 520 networks in this group
Group #2: Cn values listed for only 135 of the 681 networks in this group
Group #3: Cn values listed for only 135 of the 519 networks in this group
Group #4: Cn values listed for only 135 of the 160 networks in this group
Group #5: Cn values listed for only 135 of the 160 networks in this group
PSPLIB Networks - Cn Values (%) Group#1
No. Net Name Cn (%) No. Net Name Cn (%) No. Net Name Cn (%)
1 j6010_1 15.2960 41 j6014_8 15.2960 81 j603_10 15.2960
2 j6010_10 15.2960 42 j6014_9 15.2960 82 j603_2 15.2960
3 j6010_2 15.2960 43 j6015_1 15.2960 83 j603_3 15.2960
4 j6010_3 15.2960 44 j6015_10 15.2960 84 j603_4 15.2960
5 j6010_4 15.2960 45 j6015_2 15.2960 85 j603_5 15.2960
6 j6010_5 15.2960 46 j6015_3 15.2960 86 j603_6 15.2960
7 j6010_6 15.2960 47 j6015_4 15.2960 87 j603_8 15.2960
8 j6010_7 15.2960 48 j6015_5 15.2960 88 j604_1 15.2960
9 j6010_8 15.2960 49 j6015_7 15.2960 89 j604_10 15.2960
10 j6011_10 15.2960 50 j6015_8 15.2960 90 j604_2 15.2960
11 j6011_3 15.2960 51 j6015_9 15.2960 91 j604_3 15.2960
12 j6011_4 15.2960 52 j6016_1 15.2960 92 j604_4 15.2960
13 j6011_5 15.2960 53 j6016_10 15.2960 93 j604_5 15.2960
14 j6011_6 15.2960 54 j6016_2 15.2960 94 j604_6 15.2960
15 j6011_7 15.2960 55 j6016_3 15.2960 95 j604_7 15.2960
16 j6011_8 15.2960 56 j6016_4 15.2960 96 j604_9 15.2960
17 j6011_9 15.2960 57 j6016_5 15.2960 97 j605_1 15.2960
18 j6012_1 15.2960 58 j6016_6 15.2960 98 j605_10 15.2960
19 j6012_10 15.2960 59 j6016_7 15.2960 99 j605_2 15.2960
20 j6012_2 15.2960 60 j6016_8 15.2960 100 j605_3 15.2960
21 j6012_3 15.2960 61 j601_1 15.2960 101 j605_4 15.2960
22 j6012_5 15.2960 62 j601_10 15.2960 102 j605_5 15.2960
23 j6012_6 15.2960 63 j601_2 15.2960 103 j605_6 15.2960
24 j6012_7 15.2960 64 j601_3 15.2960 104 j605_7 15.2960
25 j6012_9 15.2960 65 j601_4 15.2960 105 j605_8 15.2960
26 j6013_1 15.2960 66 j601_5 15.2960 106 j605_9 15.2960
27 j6013_10 15.2960 67 j601_6 15.2960 107 j606_1 15.2960
28 j6013_2 15.2960 68 j601_7 15.2960 108 j606_10 15.2960
29 j6013_4 15.2960 69 j601_8 15.2960 109 j606_2 15.2960
30 j6013_5 15.2960 70 j601_9 15.2960 110 j606_3 15.2960
31 j6013_6 15.2960 71 j602_1 15.2960 111 j606_4 15.2960
32 j6013_7 15.2960 72 j602_10 15.2960 112 j606_5 15.2960
33 j6013_8 15.2960 73 j602_2 15.2960 113 j606_6 15.2960
34 j6013_9 15.2960 74 j602_3 15.2960 114 j606_7 15.2960
35 j6014_2 15.2960 75 j602_4 15.2960 115 j606_8 15.2960
36 j6014_3 15.2960 76 j602_5 15.2960 116 j606_9 15.2960
37 j6014_4 15.2960 77 j602_6 15.2960 117 j607_1 15.2960
38 j6014_5 15.2960 78 j602_7 15.2960 118 j607_10 15.2960
39 j6014_6 15.2960 79 j602_8 15.2960 119 j607_2 15.2960
40 j6014_7 15.2960 80 j603_1 15.2960 120 j607_3 15.2960
PSPLIB Networks - Cn Values (%) Group#2
No. Net Name Cn (%) No. Net Name Cn (%) No. Net Name Cn (%)
1 j3010_1 20.7094 41 j3014_10 20.7094 81 j302_3 20.7094
2 j3010_10 20.7094 42 j3014_2 20.7094 82 j302_4 20.7094
3 j3010_2 20.7094 43 j3014_3 20.7094 83 j302_5 20.7094
4 j3010_3 20.7094 44 j3014_4 20.7094 84 j302_6 20.7094
5 j3010_4 20.7094 45 j3014_5 20.7094 85 j302_7 20.7094
6 j3010_5 20.7094 46 j3014_6 20.7094 86 j302_8 20.7094
7 j3010_6 20.7094 47 j3014_7 20.7094 87 j302_9 20.7094
8 j3010_7 20.7094 48 j3014_8 20.7094 88 j303_1 20.7094
9 j3010_8 20.7094 49 j3014_9 20.7094 89 j303_10 20.7094
10 j3010_9 20.7094 50 j3015_1 20.7094 90 j303_2 20.7094
11 j3011_1 20.7094 51 j3015_10 20.7094 91 j303_3 20.7094
12 j3011_10 20.7094 52 j3015_2 20.7094 92 j303_4 20.7094
13 j3011_2 20.7094 53 j3015_3 20.7094 93 j303_5 20.7094
14 j3011_3 20.7094 54 j3015_4 20.7094 94 j303_6 20.7094
15 j3011_4 20.7094 55 j3015_5 20.7094 95 j303_7 20.7094
16 j3011_5 20.7094 56 j3015_6 20.7094 96 j303_8 20.7094
17 j3011_6 20.7094 57 j3015_7 20.7094 97 j303_9 20.7094
18 j3011_7 20.7094 58 j3015_9 20.7094 98 j304_1 20.7094
19 j3011_8 20.7094 59 j3016_1 20.7094 99 j304_10 20.7094
20 j3011_9 20.7094 60 j3016_10 20.7094 100 j304_2 20.7094
21 j3012_1 20.7094 61 j3016_2 20.7094 101 j304_3 20.7094
22 j3012_10 20.7094 62 j3016_3 20.7094 102 j304_4 20.7094
23 j3012_2 20.7094 63 j3016_4 20.7094 103 j304_5 20.7094
24 j3012_3 20.7094 64 j3016_5 20.7094 104 j304_7 20.7094
25 j3012_4 20.7094 65 j3016_6 20.7094 105 j304_9 20.7094
26 j3012_5 20.7094 66 j3016_7 20.7094 106 j305_1 20.7094
27 j3012_6 20.7094 67 j3016_8 20.7094 107 j305_10 20.7094
28 j3012_7 20.7094 68 j3016_9 20.7094 108 j305_2 20.7094
29 j3012_8 20.7094 69 j301_1 20.7094 109 j305_3 20.7094
30 j3012_9 20.7094 70 j301_10 20.7094 110 j305_4 20.7094
31 j3013_1 20.7094 71 j301_2 20.7094 111 j305_5 20.7094
32 j3013_10 20.7094 72 j301_3 20.7094 112 j305_6 20.7094
33 j3013_2 20.7094 73 j301_4 20.7094 113 j305_7 20.7094
34 j3013_3 20.7094 74 j301_5 20.7094 114 j305_8 20.7094
35 j3013_4 20.7094 75 j301_6 20.7094 115 j305_9 20.7094
36 j3013_5 20.7094 76 j301_7 20.7094 116 j306_1 20.7094
37 j3013_7 20.7094 77 j301_8 20.7094 117 j306_2 20.7094
38 j3013_8 20.7094 78 j301_9 20.7094 118 j306_3 20.7094
39 j3013_9 20.7094 79 j302_1 20.7094 119 j306_4 20.7094
40 j3014_1 20.7094 80 j302_2 20.7094 120 j306_5 20.7094
PSPLIB Networks - Cn Values (%) Group#3
No. Net Name Cn (%) No. Net Name Cn (%) No. Net Name Cn (%)
1 j6017_1 22.0386 41 j6021_6 22.0386 81 j6026_2 22.0386
2 j6017_10 22.0386 42 j6021_8 22.0386 82 j6026_4 22.0386
3 j6017_2 22.0386 43 j6021_9 22.0386 83 j6026_5 22.0386
4 j6017_3 22.0386 44 j6022_1 22.0386 84 j6026_6 22.0386
5 j6017_4 22.0386 45 j6022_10 22.0386 85 j6026_7 22.0386
6 j6017_5 22.0386 46 j6022_2 22.0386 86 j6026_8 22.0386
7 j6017_6 22.0386 47 j6022_3 22.0386 87 j6026_9 22.0386
8 j6017_7 22.0386 48 j6022_4 22.0386 88 j6027_1 22.0386
9 j6017_9 22.0386 49 j6022_5 22.0386 89 j6027_10 22.0386
10 j6018_10 22.0386 50 j6022_6 22.0386 90 j6027_3 22.0386
11 j6018_2 22.0386 51 j6022_7 22.0386 91 j6027_4 22.0386
12 j6018_3 22.0386 52 j6022_8 22.0386 92 j6027_5 22.0386
13 j6018_4 22.0386 53 j6022_9 22.0386 93 j6027_6 22.0386
14 j6018_5 22.0386 54 j6023_1 22.0386 94 j6027_7 22.0386
15 j6018_6 22.0386 55 j6023_10 22.0386 95 j6027_8 22.0386
16 j6018_7 22.0386 56 j6023_2 22.0386 96 j6027_9 22.0386
17 j6018_8 22.0386 57 j6023_3 22.0386 97 j6028_1 22.0386
18 j6018_9 22.0386 58 j6023_5 22.0386 98 j6028_10 22.0386
19 j6019_1 22.0386 59 j6023_6 22.0386 99 j6028_2 22.0386
20 j6019_10 22.0386 60 j6023_8 22.0386 100 j6028_3 22.0386
21 j6019_2 22.0386 61 j6023_9 22.0386 101 j6028_4 22.0386
22 j6019_3 22.0386 62 j6024_1 22.0386 102 j6028_5 22.0386
23 j6019_4 22.0386 63 j6024_10 22.0386 103 j6028_7 22.0386
24 j6019_5 22.0386 64 j6024_3 22.0386 104 j6028_8 22.0386
25 j6019_6 22.0386 65 j6024_4 22.0386 105 j6028_9 22.0386
26 j6019_7 22.0386 66 j6024_5 22.0386 106 j6029_1 22.0386
27 j6019_8 22.0386 67 j6024_6 22.0386 107 j6029_10 22.0386
28 j6019_9 22.0386 68 j6024_7 22.0386 108 j6029_2 22.0386
29 j6020_10 22.0386 69 j6024_8 22.0386 109 j6029_3 22.0386
30 j6020_2 22.0386 70 j6024_9 22.0386 110 j6029_4 22.0386
31 j6020_3 22.0386 71 j6025_1 22.0386 111 j6029_6 22.0386
32 j6020_5 22.0386 72 j6025_2 22.0386 112 j6029_7 22.0386
33 j6020_7 22.0386 73 j6025_3 22.0386 113 j6029_8 22.0386
34 j6020_8 22.0386 74 j6025_4 22.0386 114 j6030_1 22.0386
35 j6020_9 22.0386 75 j6025_5 22.0386 115 j6030_10 22.0386
36 j6021_1 22.0386 76 j6025_6 22.0386 116 j6030_2 22.0386
37 j6021_10 22.0386 77 j6025_8 22.0386 117 j6030_3 22.0386
38 j6021_3 22.0386 78 j6025_9 22.0386 118 j6030_4 22.0386
39 j6021_4 22.0386 79 j6026_1 22.0386 119 j6030_5 22.0386
40 j6021_5 22.0386 80 j6026_10 22.0386 120 j6030_6 22.0386
PSPLIB Networks - Cn Values (%) Group#4
No. Net Name Cn (%) No. Net Name Cn (%) No. Net Name Cn (%)
1 j3017_1 29.6731 41 j3021_10 29.6731 81 j3025_4 29.6731
2 j3017_10 29.6731 42 j3021_2 29.6731 82 j3025_5 29.6731
3 j3017_2 29.6731 43 j3021_3 29.6731 83 j3025_6 29.6731
4 j3017_3 29.6731 44 j3021_4 29.6731 84 j3025_7 29.6731
5 j3017_4 29.6731 45 j3021_5 29.6731 85 j3025_8 29.6731
6 j3017_5 29.6731 46 j3021_6 29.6731 86 j3025_9 29.6731
7 j3017_6 29.6731 47 j3021_7 29.6731 87 j3026_1 29.6731
8 j3017_7 29.6731 48 j3021_8 29.6731 88 j3026_10 29.6731
9 j3017_8 29.6731 49 j3022_1 29.6731 89 j3026_2 29.6731
10 j3017_9 29.6731 50 j3022_10 29.6731 90 j3026_3 29.6731
11 j3018_1 29.6731 51 j3022_2 29.6731 91 j3026_4 29.6731
12 j3018_10 29.6731 52 j3022_3 29.6731 92 j3026_5 29.6731
13 j3018_2 29.6731 53 j3022_4 29.6731 93 j3026_6 29.6731
14 j3018_3 29.6731 54 j3022_5 29.6731 94 j3026_7 29.6731
15 j3018_4 29.6731 55 j3022_7 29.6731 95 j3026_8 29.6731
16 j3018_5 29.6731 56 j3022_8 29.6731 96 j3026_9 29.6731
17 j3018_6 29.6731 57 j3022_9 29.6731 97 j3027_1 29.6731
18 j3018_7 29.6731 58 j3023_1 29.6731 98 j3027_10 29.6731
19 j3018_8 29.6731 59 j3023_10 29.6731 99 j3027_2 29.6731
20 j3018_9 29.6731 60 j3023_2 29.6731 100 j3027_3 29.6731
21 j3019_1 29.6731 61 j3023_3 29.6731 101 j3027_4 29.6731
22 j3019_10 29.6731 62 j3023_4 29.6731 102 j3027_5 29.6731
23 j3019_2 29.6731 63 j3023_5 29.6731 103 j3027_6 29.6731
24 j3019_3 29.6731 64 j3023_6 29.6731 104 j3027_7 29.6731
25 j3019_4 29.6731 65 j3023_7 29.6731 105 j3027_8 29.6731
26 j3019_5 29.6731 66 j3023_8 29.6731 106 j3027_9 29.6731
27 j3019_6 29.6731 67 j3023_9 29.6731 107 j3028_1 29.6731
28 j3019_7 29.6731 68 j3024_10 29.6731 108 j3028_10 29.6731
29 j3019_9 29.6731 69 j3024_2 29.6731 109 j3028_2 29.6731
30 j3020_1 29.6731 70 j3024_3 29.6731 110 j3028_3 29.6731
31 j3020_10 29.6731 71 j3024_4 29.6731 111 j3028_4 29.6731
32 j3020_2 29.6731 72 j3024_5 29.6731 112 j3028_5 29.6731
33 j3020_3 29.6731 73 j3024_6 29.6731 113 j3028_6 29.6731
34 j3020_4 29.6731 74 j3024_7 29.6731 114 j3028_7 29.6731
35 j3020_5 29.6731 75 j3024_8 29.6731 115 j3028_8 29.6731
36 j3020_6 29.6731 76 j3024_9 29.6731 116 j3028_9 29.6731
37 j3020_7 29.6731 77 j3025_1 29.6731 117 j3029_1 29.6731
38 j3020_8 29.6731 78 j3025_10 29.6731 118 j3029_10 29.6731
39 j3020_9 29.6731 79 j3025_2 29.6731 119 j3029_2 29.6731
40 j3021_1 29.6731 80 j3025_3 29.6731 120 j3029_3 29.6731
PSPLIB Networks - Cn Values (%) Group#5
No. Net Name Cn (%) No. Net Name Cn (%) No. Net Name Cn (%)
1 j3033_1 37.2075 41 j3037_3 37.2075 81 j3041_5 37.2075
2 j3033_10 37.2075 42 j3037_4 37.2075 82 j3041_6 37.2075
3 j3033_2 37.2075 43 j3037_5 37.2075 83 j3041_7 37.2075
4 j3033_3 37.2075 44 j3037_6 37.2075 84 j3041_8 37.2075
5 j3033_4 37.2075 45 j3037_7 37.2075 85 j3042_1 37.2075
6 j3033_5 37.2075 46 j3037_8 37.2075 86 j3042_10 37.2075
7 j3033_6 37.2075 47 j3037_9 37.2075 87 j3042_2 37.2075
8 j3033_7 37.2075 48 j3038_1 37.2075 88 j3042_3 37.2075
9 j3033_9 37.2075 49 j3038_10 37.2075 89 j3042_4 37.2075
10 j3034_1 37.2075 50 j3038_2 37.2075 90 j3042_5 37.2075
11 j3034_10 37.2075 51 j3038_3 37.2075 91 j3042_6 37.2075
12 j3034_2 37.2075 52 j3038_4 37.2075 92 j3042_7 37.2075
13 j3034_3 37.2075 53 j3038_5 37.2075 93 j3042_8 37.2075
14 j3034_4 37.2075 54 j3038_6 37.2075 94 j3042_9 37.2075
15 j3034_5 37.2075 55 j3038_7 37.2075 95 j3043_1 37.2075
16 j3034_6 37.2075 56 j3038_8 37.2075 96 j3043_10 37.2075
17 j3034_7 37.2075 57 j3038_9 37.2075 97 j3043_2 37.2075
18 j3034_8 37.2075 58 j3039_1 37.2075 98 j3043_3 37.2075
19 j3034_9 37.2075 59 j3039_10 37.2075 99 j3043_4 37.2075
20 j3035_1 37.2075 60 j3039_2 37.2075 100 j3043_5 37.2075
21 j3035_10 37.2075 61 j3039_3 37.2075 101 j3043_6 37.2075
22 j3035_2 37.2075 62 j3039_4 37.2075 102 j3043_7 37.2075
23 j3035_3 37.2075 63 j3039_5 37.2075 103 j3043_8 37.2075
24 j3035_4 37.2075 64 j3039_7 37.2075 104 j3043_9 37.2075
25 j3035_5 37.2075 65 j3039_8 37.2075 105 j3044_1 37.2075
26 j3035_6 37.2075 66 j3039_9 37.2075 106 j3044_10 37.2075
27 j3035_7 37.2075 67 j3040_1 37.2075 107 j3044_2 37.2075
28 j3035_9 37.2075 68 j3040_10 37.2075 108 j3044_3 37.2075
29 j3036_1 37.2075 69 j3040_2 37.2075 109 j3044_4 37.2075
30 j3036_10 37.2075 70 j3040_3 37.2075 110 j3044_5 37.2075
31 j3036_2 37.2075 71 j3040_4 37.2075 111 j3044_6 37.2075
32 j3036_3 37.2075 72 j3040_6 37.2075 112 j3044_7 37.2075
33 j3036_4 37.2075 73 j3040_7 37.2075 113 j3044_8 37.2075
34 j3036_5 37.2075 74 j3040_8 37.2075 114 j3044_9 37.2075
35 j3036_6 37.2075 75 j3040_9 37.2075 115 j3045_1 37.2075
36 j3036_7 37.2075 76 j3041_1 37.2075 116 j3045_10 37.2075
37 j3036_8 37.2075 77 j3041_10 37.2075 117 j3045_2 37.2075
38 j3036_9 37.2075 78 j3041_2 37.2075 118 j3045_4 37.2075
39 j3037_1 37.2075 79 j3041_3 37.2075 119 j3045_5 37.2075
40 j3037_2 37.2075 80 j3041_4 37.2075 120 j3045_6 37.2075
Group #1: OS values listed for only 135 of the 920 networks in this group
Group #2: OS values listed for only 135 of the 480 networks in this group
Group #3: OS values listed for only 135 of the 160 networks in this group
Group #4: OS values listed for only 135 of the 160 networks in this group
Group #5: OS values listed for only 135 of the 320 networks in this group
Group #1: RT values listed for only 135 of the 480 networks in this group
Group #2: RT values listed for only 135 of the 639 networks in this group
Group #3: RT values listed for only 135 of the 361 networks in this group
Group #4: RT values listed for only 135 of the 360 networks in this group
Group #5: RT values listed for only 135 of the 200 networks in this group
Appendix F.2 – Plots of Deviations ∆μ,100 versus Sample Size n Required to Construct 𝑺 for the PSPLIB j60s
[Plots not reproduced. Panels: PSPLIB j60s. Caption: Deviations (∆μ,100) of 𝑙 ̅ from 𝜇, each found with 100 simulated project network 𝑺 matrices.]
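Assuming ∆μ,100 denotes the gap between the average of the 100 simulated normalized 1st eigenvalues and the mean of the TW1 law (𝜇 ≈ −1.2065), the plotted quantity can be sketched as follows (placeholder data; not the dissertation's simulation code):
mu_TW1=-1.2065; %mean of the TW1 (GOE Tracy-Widom) distribution
l1_sim=randn(100,1)-1.2; %placeholder for 100 simulated normalized 1st eigenvalues
l1_bar=mean(l1_sim); %l-bar over the 100 simulations
delta_mu_100=l1_bar-mu_TW1 %deviation, plotted against the sample size n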
Norm I, 1000 Simulations: Normalized 1st Eigenvalues (Norm I, 0.025)
KS-tests 1-4 at significance level α / probability P: 0.01/0.99, 0.05/0.95, 0.10/0.90, 0.20/0.80
Network n RT D₂ KS-test 1 KS-test 2 KS-test 3 KS-test 4
j3011-1 103 0.46 0.02 1 1 1 1
j3011-1 104 0.46 0.024 1 1 1 1
j3011-1 105 0.46 0.021 1 1 1 1
j3011-1 106 0.46 0.025 1 1 1 1
j3011-1 107 0.46 0.012 1 1 1 1
j3024-8 104 0.46 0.045 1 0 0 0
j3024-8 105 0.46 0.051 1 0 0 0
j3037-6 116 0.688 0.051 1 0 0 0
j3038-7 127 0.581 0.046 1 0 0 0
j3038-7 128 0.581 0.039 1 1 0 0
j3038-7 129 0.581 0.036 1 1 1 0
j3038-7 131 0.581 0.049 1 0 0 0
j3041-8 124 0.579 0.051 1 0 0 0
j3041-8 128 0.579 0.045 1 0 0 0
j3041-8 124 0.579 0.043 1 1 0 0
j3041-8 125 0.579 0.044 1 0 0 0
j3041-8 127 0.579 0.045 1 0 0 0
j3048-2 143 0.651 0.023 1 1 1 1
j3048-2 144 0.651 0.039 1 1 0 0
j3048-2 145 0.651 0.026 1 1 1 1
j3048-2 146 0.651 0.032 1 1 1 1
j3048-2 147 0.651 0.034 1 1 1 0
j305-7 118 0.339 0.036 1 1 1 0
j305-7 119 0.339 0.048 1 0 0 0
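The 1/0 entries in these tables are consistent with the large-sample K-S decision rule applied to the 1000 simulations: the TW1 hypothesis is retained at level α whenever D₂ < c(α)/√1000. A minimal sketch of that rule (the coefficients c(α) are the standard asymptotic Smirnov values, not taken from this appendix):
N=1000; %simulations per network, per the table headers
alphas=[0.01 0.05 0.10 0.20]; %significance levels of KS-tests 1-4
c_alpha=[1.628 1.358 1.224 1.073]; %asymptotic K-S coefficients c(alpha)
D2=0.045; %observed K-S statistic, e.g. network j3024-8 at n=104
ks_pass=double(D2<c_alpha./sqrt(N)) %returns [1 0 0 0], matching the table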
Norm II, 1000 Simulations: Normalized 1st Eigenvalues (Norm II, 0.025)
KS-tests 1-4 at significance level α / probability P: 0.01/0.99, 0.05/0.95, 0.10/0.90, 0.20/0.80
Network n RT D₂ KS-test 1 KS-test 2 KS-test 3 KS-test 4
j3038-5 53 0.617 0.0311 1 1 1 1
j3038-5 54 0.617 0.0482 1 0 0 0
j3038-5 55 0.617 0.0361 1 1 1 0
j3038-5 56 0.617 0.0495 1 0 0 0
j3038-5 57 0.617 0.0476 1 0 0 0
Norm I, 1000 Simulations: Normalized 1st Eigenvalues (Norm I, 0.025)
KS-tests 1-4 at significance level α / probability P: 0.01/0.99, 0.05/0.95, 0.10/0.90, 0.20/0.80
Network n RT D₂ KS-test 1 KS-test 2 KS-test 3 KS-test 4
j3024-8 104 0.46 0.017 1 1 1 1
j3024-8 105 0.46 0.022 1 1 1 1
j3024-8 106 0.46 0.018 1 1 1 1
j3024-8 107 0.46 0.029 1 1 1 1
j3024-8 108 0.46 0.023 1 1 1 1
j3041-8 124 0.579 0.018 1 1 1 1
j3041-8 125 0.579 0.014 1 1 1 1
j3041-8 126 0.579 0.028 1 1 1 1
j3041-8 127 0.579 0.02 1 1 1 1
j3041-8 128 0.579 0.019 1 1 1 1
Norm II, 1000 Simulations: Normalized 1st Eigenvalues (Norm II, 0.025)
KS-tests 1-4 at significance level α / probability P: 0.01/0.99, 0.05/0.95, 0.10/0.90, 0.20/0.80
Network n RT D₂ KS-test 1 KS-test 2 KS-test 3 KS-test 4
j3011-1 49 0.4597 0.0396 1 1 0 0
j3011-1 50 0.4597 0.027 1 1 1 1
j3011-1 51 0.4597 0.0317 1 1 1 1
j3012-6 59 0.3952 0.0512 1 0 0 0
j3034-10 50 0.5907 0.0451 1 0 0 0
j3034-10 51 0.5907 0.0343 1 1 1 0
j3034-10 52 0.5907 0.0207 1 1 1 1
j3034-10 53 0.5907 0.0376 1 1 1 0
j3037-6 50 0.6875 0.0425 1 1 0 0
j3037-6 51 0.6875 0.0418 1 1 0 0
j3037-6 52 0.6875 0.0393 1 1 0 0
j3037-6 53 0.6875 0.0398 1 1 0 0
j3038-7 51 0.5806 0.044 1 0 0 0
j3038-7 52 0.5806 0.0403 1 1 0 0
j3038-7 53 0.5806 0.0468 1 0 0 0
j6028-9 122 0.403 0.0461 1 0 0 0
j6028-9 124 0.403 0.0424 1 1 0 0
j6028-9 125 0.403 0.0411 1 1 0 0
j6028-9 126 0.403 0.0343 1 1 1 0
j6042-6 115 0.5738 0.0316 1 1 1 1
j6042-6 116 0.5738 0.0355 1 1 1 0
j6042-6 117 0.5738 0.0304 1 1 1 1
j6042-6 118 0.5738 0.0343 1 1 1 0
j6042-6 119 0.5738 0.027 1 1 1 1
j9010-5 174 0.2174 0.0467 1 0 0 0
j9010-5 175 0.2174 0.043 1 1 0 0
j9010-5 176 0.2174 0.0502 1 0 0 0
j12012-1 348 0.2179 0.0393 1 1 0 0
j12012-1 349 0.2179 0.0435 1 0 0 0
j12012-1 350 0.2179 0.0498 1 0 0 0
j12014-1 260 0.178 0.0377 1 1 1 0
j12014-1 261 0.178 0.0331 1 1 1 1
j12014-1 262 0.178 0.0452 1 0 0 0
K-S Test (Tests 1-4 at α = 0.01, 0.05, 0.10, 0.20)
Network n D₂ Test 1 Test 2 Test 3 Test 4
j9010_5 1007 0.1653 0 0 0 0
1008 0.1388 0 0 0 0
1009 0.1459 0 0 0 0
1010 0.1378 0 0 0 0
1011 0.1410 0 0 0 0
j9014_5 1050 0.1003 0 0 0 0
1051 0.0789 0 0 0 0
1052 0.0544 0 0 0 0
1053 0.0422 1 1 0 0
1054 0.0405 1 1 0 0
j901_3 689 0.0994 0 0 0 0
690 0.0936 0 0 0 0
691 0.0664 0 0 0 0
692 0.0716 0 0 0 0
693 0.0927 0 0 0 0
j90 - Norm I
Sample Statistics
Network n Median Mode Mean Variance Skewness Kurtosis
j9010_5 1007 -0.9146 -8.2506 -0.9090 4.3014 -0.0341 0.0358
1008 -1.0735 -7.0683 -1.0699 4.5727 0.1007 -0.0994
1009 -1.1035 -7.6716 -1.0869 4.7206 -0.0567 -0.1518
1010 -1.2844 -7.9919 -1.2241 4.4182 -0.0166 -0.1697
1011 -1.1573 -7.2551 -1.2111 4.5736 -0.0573 0.0222
j9014_5 1050 -1.0046 -5.5804 -0.9743 2.0825 0.0540 0.1337
1051 -1.0393 -5.9595 -1.0577 1.8915 -0.0614 -0.0233
1052 -1.1111 -4.7403 -1.0891 1.6937 0.0611 -0.2390
1053 -1.1600 -5.1489 -1.1970 1.7254 -0.0636 -0.1756
1054 -1.2869 -5.5193 -1.2866 1.7904 0.0286 0.2135
j901_3 689 -1.1022 -5.7798 -0.9368 2.6187 0.5464 0.4098
690 -1.1653 -5.6746 -1.0335 2.6891 0.4323 0.6950
691 -1.3215 -6.5441 -1.1254 2.9931 0.7407 1.3939
692 -1.2752 -6.1249 -1.1782 2.5958 0.3315 0.2867
693 -1.3921 -5.6574 -1.2924 2.7591 0.3605 0.1375
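The columns of this table correspond to standard MATLAB (Statistics Toolbox) sample statistics. The Kurtosis column appears to report excess kurtosis (≈ 0 for a normal sample), hence the −3 below; note also that MATLAB's mode() returns the smallest value when all observations are distinct, which would explain the far-left Mode entries. A minimal sketch with placeholder data:
l1=randn(1000,1); %placeholder for the normalized 1st eigenvalues (Norm I)
row=[median(l1),mode(l1),mean(l1),var(l1),skewness(l1),kurtosis(l1)-3] %excess kurtosis, as tabulated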
[𝑿(n×p) = (𝒙₁ 𝒙₂ … 𝒙ₚ), with each column vector 𝒙ⱼ = (EF₁ⱼ, …, EFₙⱼ)ᵀ representing the observed EF times of project activity "j".]
Standardizing 𝑿 yields the new matrix 𝑺 = 𝑿ᵀ𝑿, where each column of 𝑿 is first replaced by
𝒘ⱼ = (𝒙ⱼ − 𝒙̄ⱼ) / ‖𝒙ⱼ − 𝒙̄ⱼ‖, with 𝒙̄ⱼ denoting the mean of 𝒙ⱼ.
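A minimal MATLAB sketch of this standardization (placeholder data; implicit expansion requires R2016b+ and vecnorm R2017b+):
n=100; p=30; %illustrative sample size and network size
X=rand(n,p); %placeholder for n samples of the p EF times
Xc=X-mean(X); %center each column: x_j minus its mean
W=Xc./vecnorm(Xc); %scale each centered column to unit norm: w_j
S=W'*W; %the standardized sample covariance matrix
l=sort(eig(S),'descend'); %l(1) is the 1st eigenvalue tested against TW1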
𝑹 = 𝜒(𝑛, 𝑛, 𝑝, 𝐶𝑡, …)