DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SUBJECT: Advanced Computer Architecture SUBJECT CODE: CS903
Prepared By :
V.GOMATHI, AP/CSE
Verified by: Approved by :
UNIT – II
Principles of Scalable Performance - Performance Metrics and Measures, Parallel Processing
Applications, Speedup Performance Laws, Scalability Analysis and Approaches. Processor and
Memory Hierarchy - Advanced Processor Technology, Superscalar and Vector Processors,
Memory Hierarchy Technology, Virtual Memory Technology.
1. SPEEDUP PERFORMANCE LAWS, SCALABILITY ANALYSIS AND
APPROACHES
• Speedup is the ratio of the execution time of the best possible serial algorithm on a single
processor T(1) to the parallel execution time of the chosen algorithm on n-processor parallel
system T(n):
S(n) = T(1)/T(n)
• Speedup measures the absolute merit of a parallel algorithm with respect to the “optimal” sequential version.
Amdahl’s Law
• β is the probability that the system operates in pure sequential mode
• 1 - β is the probability that the system operates in fully parallel mode using n processors
T(n) = T(1)·β + T(1)·(1 - β)/n
S = T(1)/T(n) = 1 / (β + (1 - β)/n) = n / (βn + (1 - β))
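As a quick numerical check, Amdahl's law can be evaluated directly (a minimal sketch; the serial fraction and processor count used below are illustrative):

```python
def amdahl_speedup(beta, n):
    """Amdahl's law: S = n / (beta*n + (1 - beta)), where beta is the
    fraction of time in pure sequential mode and n the processor count."""
    return n / (beta * n + (1 - beta))

# Even a 5% sequential fraction caps the speedup well below n:
print(round(amdahl_speedup(0.05, 16), 2))   # 9.14
```

Note that with β = 1 the speedup is 1 regardless of n, and with β = 0 it is exactly n.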
Efficiency
• The system efficiency for an n-processor system:
• Efficiency is a measure of the speedup achieved per processor.
E(n) = S(n)/n = T(1) / (n·T(n))
Communication overhead
• tc is the communication overhead
• Speedup:
S = n / (βn + (1 - β) + n·tc/T(1))
• Efficiency:
E(n) = S(n)/n = 1 / (βn + (1 - β) + n·tc/T(1))
Parallelism Profile in Programs
• Degree of Parallelism: for each time period, the number of processors used to execute a program is defined as the degree of parallelism (DOP).
• The plot of the DOP as a function of time is called the parallelism profile of a given program.
• Fluctuation of the profile during an observation period depends on the algorithmic structure,
program optimization, resource utilization, and run-time conditions of a computer system.
Average Parallelism
• The average parallelism A is computed by:
A = (∑_{i=1}^{m} i·t_i) / (∑_{i=1}^{m} t_i)
• where:
– m is the maximum parallelism in the profile
– t_i is the total amount of time that DOP = i, with ∑_{i=1}^{m} t_i = t2 - t1 (the observation period)
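The average-parallelism formula can be applied directly to a profile recorded as (DOP, duration) pairs (a sketch; the profile values below are made up):

```python
def average_parallelism(profile):
    """A = (sum of i * t_i) / (sum of t_i), where t_i is the total time
    the program runs with degree of parallelism i."""
    total_work = sum(dop * t for dop, t in profile)
    total_time = sum(t for _, t in profile)
    return total_work / total_time

# Example profile: DOP 1 for 2 time units, DOP 4 for 3, DOP 2 for 1
print(round(average_parallelism([(1, 2), (4, 3), (2, 1)]), 2))  # 2.67
```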
Scalability of Parallel Algorithms
• Scalability analysis determines whether parallel processing of a given problem can offer the
desired improvement in performance.
• A parallel system is scalable if its efficiency can be kept fixed as the number of processors is increased, assuming that the problem size is also increased.
• Example: Adding m numbers using n processors. Communication and computation take one unit
time.
• Steps:
1. Each processor adds its m/n numbers: m/n time units.
2. The processors combine their partial sums in log2 n steps; each step takes one communication and one addition, i.e. 2·log2 n time units in total.
S = m / (m/n + 2·log2 n)
Scalability Example
• Efficiency for different values of m and n:

m \ n     2       4       8       16      32
64        0.94    0.80    0.57    0.33    0.167
128       0.97    0.888   0.73    0.50    0.285
256       0.985   0.94    0.84    0.67    0.444
512       0.99    0.97    0.91    0.80    0.62
1024      0.995   0.985   0.955   0.89    0.76
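The table can be regenerated from the speedup formula above, which also makes the trade-off easy to explore for other values of m and n (a sketch using the expressions given in the example):

```python
import math

def efficiency(m, n):
    """E = S/n for adding m numbers on n processors,
    with S = m / (m/n + 2*log2(n))."""
    speedup = m / (m / n + 2 * math.log2(n))
    return speedup / n

for m in (64, 128, 256, 512, 1024):
    print(m, [round(efficiency(m, n), 3) for n in (2, 4, 8, 16, 32)])
```

Holding n fixed and growing m drives the efficiency back up, which is exactly the scalability property described above.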
2. MEMORY HIERARCHY
3. VIRTUAL MEMORY
Provides illusion of very large memory
– sum of the memory of many jobs greater than physical memory
– address space of each job larger than physical memory
Allows available (fast and expensive) physical memory to be very well utilized
Simplifies memory management (main reason today)
Exploits memory hierarchy to keep average access time low.
Involves at least two storage levels: main and secondary
Virtual Address -- address used by the programmer
Virtual Address Space -- collection of such addresses
Memory Address -- address of word in physical memory also known as “physical address” or
“real address”
Basic Issues in VM System Design
size of information blocks that are transferred from secondary to main storage
when a block of information is brought into M and M is full, some region of M must be released to make room for the new block -->
replacement policy
which region of M is to hold the new block --> placement policy
missing item fetched from secondary memory only on the occurrence
of a fault --> fetch/load policy
Paging Organization
virtual and physical address spaces are partitioned into blocks of equal size: pages (virtual) and page frames (physical)
Page Table Address Mapping
the page table maps a virtual page number to a physical frame number; the physical address is then formed from the frame number and the page offset (actually, concatenation is more likely than addition)
Address Mapping Algorithm
Protection Fault: access-rights violation; causes a trap to a hardware, microcode, or software fault handler
Page Fault: page not resident in physical memory; also causes a trap, usually accompanied by a context switch: the current process is suspended while the page is fetched from secondary storage
Fragmentation & Relocation
Fragmentation is when areas of memory space become unavailable for allocation.
External Fragmentation: space left unused between blocks.
Relocation: move program or data to a new region of the address space (possibly fixing all the pointers)
Optimal Page Size
-- memories get larger as the price of RAM drops
-- the gap between processor speed and disk speed grows wider
-- programmers desire larger virtual address spaces
Page Replacement Algorithms
Least Recently Used:
-- selects the least recently used page for replacement
-- requires knowledge about past references; more difficult to implement (thread through the page table entries from most recently referenced to least recently referenced; when a page is referenced it is placed at the head of the list; the page at the end of the list is the one to replace)
-- good performance; recognizes the principle of locality
Not Recently Used:
Associated with each page is a reference flag such that
ref flag = 1 if the page has been referenced in recent past
= 0 otherwise
-- if replacement is necessary, choose any page frame such that its
reference bit is 0. This is a page that has not been referenced in the recent past
-- clock implementation of NRU:
An optimization is to search for a page that is both not recently referenced AND not dirty.
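The clock implementation of NRU described above can be sketched as follows (the class and method names, and returning the evicted page, are illustrative choices):

```python
class ClockNRU:
    """Clock (NRU) page replacement: a hand sweeps the page frames,
    clearing reference bits, and evicts the first frame whose
    reference bit is already 0."""
    def __init__(self, nframes):
        self.frames = [None] * nframes   # resident page in each frame
        self.ref = [0] * nframes         # per-frame reference bits
        self.hand = 0

    def access(self, page):
        if page in self.frames:                  # hit: mark as recently used
            self.ref[self.frames.index(page)] = 1
            return None
        while self.ref[self.hand] == 1:          # second chance for referenced frames
            self.ref[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.frames)
        victim = self.frames[self.hand]          # reference bit is 0: replace
        self.frames[self.hand] = page
        self.ref[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.frames)
        return victim                            # None while free frames remain
```

With two frames, accessing A, B, then C evicts A: both frames lose their reference bits during the sweep, and A's frame is reached first.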
Demand Paging and Prefetching Pages
Fetch Policy
when is the page brought into memory?
if pages are loaded solely in response to page faults, then the
policy is demand paging
An alternative is prefetching:
anticipate future references and load such pages before their actual use
+ reduces page transfer overhead
- removes pages already in page frames, which could adversely affect the page fault rate
- predicting future references usually difficult
Most systems implement demand paging without prepaging
Virtual Address and a Cache
It takes an extra memory access to translate a VA to a PA.
This makes cache access very expensive, and address translation is the "innermost loop" that you want to go as fast as possible.
synonym problem: two different virtual addresses map to same
physical address => two different cache entries holding data for the same physical address!
for update: must update all cache entries with the same physical address, or memory becomes inconsistent
determining this requires significant hardware: essentially an associative lookup on the physical address tags to see if there are multiple hits
TLBs (Translation Look aside Buffer)
A way to speed up translation is to use a special cache of recently
used page table entries -- this has many names, but the most
frequently used is Translation Lookaside Buffer or TLB
TLB access time is comparable to, though shorter than, cache access time
(and still much less than main memory access time)
TLB: acts as a specialized cache for address translation
Translation Look-Aside Buffers
TLBs are usually small, typically not more than 128 - 256 entries even on high end machines. This
permits fully associative lookup on these machines. Most mid-range machines use small
n-way set associative organizations.
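A fully associative TLB with LRU replacement, as described above, can be sketched like this (the entry count and the page-table interface are illustrative assumptions):

```python
from collections import OrderedDict

class TLB:
    """Small fully associative TLB caching page table entries,
    with LRU replacement."""
    def __init__(self, entries=128):
        self.entries = entries
        self.map = OrderedDict()   # virtual page number -> physical frame number

    def translate(self, vpn, page_table):
        if vpn in self.map:                  # TLB hit: no page-table access
            self.map.move_to_end(vpn)
            return self.map[vpn]
        pfn = page_table[vpn]                # TLB miss: consult the page table
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)     # evict the least recently used entry
        self.map[vpn] = pfn
        return pfn
```

Real TLBs do this lookup in hardware in parallel across all entries, which is why their sizes stay small.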
4. PERFORMANCE METRICS AND SCALABILITY ANALYSIS
Performance Testing
System process speed (Max./Min./Average)
o system processes, tasks, transactions, responses.
o data retrieval, data loading
System throughput (Max./Min./Average)
o loads, messages, tasks, processes
System latency (Max./Min./Average)
o Message/event, task/process
System utilization (Max./Min./Average)
o Network, server/client machines.
System availability (component-level/system-level)
o component/system, services/functions
o system network, computer hardware/software
System reliability (component-level/system-level)
o component/system, services/functions
o system network, computer hardware/software
System scalability (component-level/system-level)
o load/speed/throughput boundary
o improvements on process speed, throughput
System successes/failures rates for
o communications, transactions, connections
o call processing, recovery, …
Domain-specific/application-specific
o agent performance
o real-time report generation speed
o workflow performance
Performance Test - Tools
Performance test tools can be classified into:
Simulators and data generators:
o Message-based or table-based simulators
o State-based simulators
o Model-based data generators, such as pattern-based data generators and random data generators
Performance data collectors and tracking tools
o Performance tracking tools
Performance evaluation and analysis tools
o Performance metric computation
o Model-based performance evaluation tool
Performance monitors
o For example, sniffer, Microsoft performance monitor
o External third-party tools
Performance report generators
Performance Evaluation
What is performance evaluation?
Using a well-defined approach to study, analyze, and measure the performance of a given
system.
The basic tasks and scope:
Collect system performance data
Define system performance metrics
Model system performance
Measure, analyze, estimate system performance
Present and report system performance
Performance Evaluation
– Objectives and Needs
The major objectives:
Understand product capacity
Discover system performance issues
Measure and evaluate system performance
Estimate and predict system performance
The basic needs are:
Well-defined performance metrics
Well-defined performance evaluation models
Performance evaluation tools and supporting environment
Performance Evaluation - Approaches
Performance testing: (during production)
o measure and analyze the system performance based on performance test data and results
Performance simulation: (pre-production)
o study and estimate system performance using a simulation approach
Performance measurement at the customer site: (post-production)
o measure and evaluate system performance during system operations
Performance Evaluation - Models
What is a performance evaluation model?
A well-defined formal model that depicts different aspects of the performance of a system.
Why do we need performance evaluation models?
To present the system performance properties
To provide a guideline for engineers when choosing a performance evaluation strategy.
To set up a foundation to define performance metrics.
To identify the needs of the target performance environment.
Performance Evaluation - Models
Type of performance evaluation models:
Queuing model
Scenario-based evaluation model
Architecture-based evaluation model
o Component-based evaluation model
Process-based evaluation model
Transaction-based evaluation model
Transition-based models
o State-based evaluation model
o Domain-oriented evaluation model
Performance Evaluation - Models for Portal V5
Two performance evaluation models are defined:
Event-Based Function Scenario Model for System Performance Evaluation
Event-Based Transition Model for System Performance Evaluation
A Portal V5 system can be viewed as a component-based distributed system over the Internet. It accepts,
processes, and supports the following types of event messages.
System-caller interaction event messages
System-agent interaction event messages
Call-oriented event interactions between components
Event-Based Function Scenario Model
Model definition: An event-based function scenario model is an event-based scenario diagram with
tracked time stamps.
We use G(N, E, T, O) to represent a functional scenario diagram.
• N is a set of component nodes.
• E represents a set of directed links. Each link e is a directed edge (Cr, Cd), which represents an event that is sent by component Cr and received by component Cd.
• T is a set of tracked time stamps for links. Each link has a pair of time stamps (ti, to), where to indicates the outgoing time stamp of the event at the sender Cr, and ti indicates the incoming time stamp of the event at the receiver Cd.
• O is a set of pairs (Ep, order), each of which represents the order in which event Ep occurred in the scenario.
An event-based functional scenario Sq is a sequence of interaction events between components in a
system. Sq can be represented as a path in a functional scenario diagram. Thus, Sq can be denoted as Ei1,
…., Eim, where Ei1 is the starting event of Sq, and Eim is the ending event of Sq.
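Under the definitions above, a functional scenario can be represented as a list of timed links; the component names and time stamps below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Link:
    """A directed link (Cr, Cd) in G(N, E, T, O): event `event` is sent
    by component `src` at time t_out and received by `dst` at time t_in."""
    event: str
    src: str
    dst: str
    t_out: float   # outgoing time stamp at the sender Cr
    t_in: float    # incoming time stamp at the receiver Cd

def event_latency(link):
    """Event latency = incoming time stamp - outgoing time stamp (Ti - To)."""
    return link.t_in - link.t_out

# A two-event scenario Sq = E1, E2 between hypothetical components:
scenario = [Link("E1", "Gateway", "CallRouter", 0.0, 0.4),
            Link("E2", "CallRouter", "AgentDesktop", 0.5, 0.9)]
print([round(event_latency(l), 3) for l in scenario])
```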
Event-Based Transition Model: An Example
Performance Evaluation - Metrics
Performance metrics
o Call request process time
o Event interaction latency
Throughput metrics (component and system level)
o Call processing throughput
o Call load throughput rate
Availability metrics (component and system level)
Reliability metrics (component and system level)
Scalability metrics
Utilization metrics
Commonly used performance metrics: (for components/systems)
Functional or process speed metrics
User response time metric
Communication speed metric
Transaction speed metric
Latency metric
Performance metrics for Portal V.5 products:
Call process time metric
Call process speed metric
Event latency metric
Performance Metric – Call Process Time
Component Process Time Metric for call requests Req(media type):
Req(media type) = { req1, …, reqn }
Max-Process-Time_Sq (Ck, Req(media type)) = Max_i { Process-Time_Sq (Ck, req_i) }   (i = 1, …, n)
Min-Process-Time_Sq (Ck, Req(media type)) = Min_i { Process-Time_Sq (Ck, req_i) }   (i = 1, …, n)
Avg-Process-Time_Sq (Ck, Req(media type)) = [∑_i Process-Time_Sq (Ck, req_i)] / n   (i = 1, …, n)
System Call Process Time Metric for a specific media load:
Req(media type) = { req1, …, reqn }
Max-Process-Time (S, Req(media type)) = Max_i { Process-Time (S, req_i) }   (i = 1, …, n)
Min-Process-Time (S, Req(media type)) = Min_i { Process-Time (S, req_i) }   (i = 1, …, n)
Avg-Process-Time (S, Req(media type)) = [∑_i Process-Time (S, req_i)] / n   (i = 1, …, n)
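Both the component-level and system-level process-time metrics reduce to max/min/average over the measured per-request times, for example (the sample times below are hypothetical):

```python
def process_time_metrics(times):
    """Max-, Min-, and Avg-Process-Time over the measured
    Process-Time values for requests req_1 .. req_n."""
    return {
        "max": max(times),
        "min": min(times),
        "avg": sum(times) / len(times),
    }

# Hypothetical per-request process times (ms) for one media type:
print(process_time_metrics([12.0, 9.5, 15.2, 11.3]))
```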
Performance Metric – Latency
Latency metrics are used to measure the delays of:
- Messages, transactions, tasks, …..
We can define an event message latency metric for Portal V.5 to measure the delay of event messages between components:
Event-Latency_Sq (Ei, req_j) = Ei's incoming timestamp - Ei's outgoing timestamp = Ti - To
We can compute the Max, Min, and Average of this latency.
Throughput Metrics
Objective: To measure the call processing capacity of a system.
The four types of throughput metrics are defined:
System processing throughput for a special type of media requests
System processing throughput for all types of media requests
System-load throughput rate for a special type of media requests
System-load throughput rate for all types of media requests
Concerns:
Maximum of the throughput
Minimum of the throughput
Average of the throughput
Throughput Metric - Component Process Throughput
Definition: In a given functional scenario, the component-process throughput for a special type of media
requests Req(media-type) in a given performance test period Tp refers to the total number of its
outgoing events which are generated by processing its incoming events for these media requests.
Component-Process-Throughput_Tp (Req(media type)) = total number of outgoing events Ek
Definition: In a given functional scenario, the component-process throughput rate for a special type of
media requests Req(media-type) in a given performance test period Tp refers to the ratio of the total
number of its outgoing events (which are generated by processing its incoming events for these media
requests) to the total number of its incoming events.
Component-Process-Throughput-Rate_Tp (Req(media type)) = total number of outgoing events Ek+1 / total number of incoming events Ek
Availability Metrics
Availability metrics are defined to evaluate the availability of components and systems in providing their
specified functions and services to system users and customers.
Several availability metrics are defined here:
Component-level
Component Availability Metrics
Component Service Availability Metrics
System-level
System Availability Metric
System-Service Availability Metric
Definition: The component availability of a component Ck in a system during a time period T refers to the
ratio of the total available time of the component to the total time T, including both available and
unavailable time.
Component Availability Metric:
Component-Availability T (Ck)
= available-time(Ck) / (available-time(Ck) +unavailable-time(Ck))
= available-time(Ck) / |T|
where available-time(Ck) represents the available time of Ck during T, and unavailable-time(Ck) includes the time when the component cannot provide its specified functions and services.
Application: to measure the availability of components.
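The availability metrics are straightforward ratios; a sketch covering both the component-level metric and the single-failure system-level variant (the hour figures are hypothetical):

```python
def component_availability(available_time, unavailable_time):
    """Component-Availability_T(Ck) = available / (available + unavailable)."""
    return available_time / (available_time + unavailable_time)

def system_availability(component_available_times, total_time):
    """Single-failure criterion: the system is available only while every
    component is, so available-time(S) = min over the components."""
    return min(component_available_times) / total_time

# A component up 99 of 100 hours:
print(component_availability(99.0, 1.0))                 # 0.99
# Three components up 99, 95, and 98 hours of a 100-hour test period:
print(system_availability([99.0, 95.0, 98.0], 100.0))    # 0.95
```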
Availability Metrics - System Availability
Definition: For a system S, its system availability during a given test period Tp refers to the ratio of its
total available time to provide all specified function sets for its users in all given media types to the total
test time |Tp|.
Let System-Availability_Tp (S) denote the system availability for system S, where S = {C1, …, Cm} consists of
m components. To help engineers evaluate system availability, we define the following metric by
considering the single-failure criterion:
System-Availability_Tp (S)
= available-time(S) / (available-time(S) + unavailable-time(S))
= available-time(S) / |Tp|
available-time(S) = Min_k { available-time(Ck) },  k = 1, …, m
where available-time(S) corresponds to the system uptime in Tp, and unavailable-time(S) to the system downtime in Tp.
System Service Availability Model
Scalability Metrics
The major applications of scalability metrics:
To check the boundary and threshold of a system and its components.
To check the speed-up of a system and its components when more hardware machines and
application servers are added into a system.
To evaluate the throughput improvement of a system (or its components) when more hardware
machines and application servers are added into a system.
The scalability metrics defined here:
Speed-up
Throughput Improvement
Load Boundary and Threshold
Definition: The throughput improvement for a system (or a component cluster) for a fixed load Req(S) in
a given time Tp, denoted as Throughput-Improvement(A,B) refers to the throughput rate increase from
configuration setting A to configuration setting B. The detailed metric can be defined below:
Throughput-Improvement (A,B) = (system load throughput under B - system load throughput under A) / |Req(S)|
Application: to measure the improvement of system throughput
Speed-Up
Definition: The process speed-up for a system (or a component cluster) under a fixed load Req(S) in a
given time Tp, denoted as Speed-up(A,B), refers to the process speed increase from configuration setting
A to configuration setting B. The detailed metric can be defined below:
Speed-up (A,B) = (process speed under B - process speed under A) / process speed under A
Application: to measure the improvement of the system process speed
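Both scalability metrics are simple deltas between two configuration settings (the request rates below are hypothetical):

```python
def speed_up(speed_a, speed_b):
    """Speed-up(A,B) = (process speed under B - process speed under A)
    / process speed under A."""
    return (speed_b - speed_a) / speed_a

def throughput_improvement(throughput_a, throughput_b, total_requests):
    """Throughput-Improvement(A,B) = (system load throughput under B
    - system load throughput under A) / |Req(S)|."""
    return (throughput_b - throughput_a) / total_requests

# Config B processes 150 req/s where config A managed 100 req/s:
print(speed_up(100.0, 150.0))                   # 0.5
print(throughput_improvement(800, 900, 1000))   # 0.1
```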
5. SUPERSCALAR AND VECTOR PROCESSORS
Pipelining is an economical way to realize overlapped parallelism in the solution of a problem on a
digital computer. To understand the concept of pipelining, first consider the assembly lines of an
automated production plant, where items are assembled from separate parts (stages) and the output of
one stage becomes the input to the next stage. By this analogy, pipelining is the method to introduce
temporal parallelism into computer operations: the assembly line is the pipeline, and the separate parts
of the assembly line are the different stages through which the operands of an operation are passed.
To introduce pipelining in a processor P, the following steps must be followed:
• Sub-divide the input process into a sequence of subtasks. These subtasks form the stages of the pipeline,
which are also known as segments.
• Each stage Si of the pipeline performs some operation on a distinct set of operands, according to its
subtask.
• When stage Si has completed its operation, the results are passed to the next stage Si+1 for the next
operation.
• Stage Si then receives a new set of inputs from the previous stage Si-1.
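The timing benefit of this staged overlap can be worked out directly: assuming one cycle per stage and a new task entering every cycle, a k-stage pipeline completes n tasks in k + n - 1 cycles instead of k·n:

```python
def pipeline_cycles(stages, tasks):
    """Cycles for n tasks on a k-stage pipeline: the first task takes k
    cycles, and each later task completes one cycle after the previous."""
    return stages + tasks - 1

def pipeline_speedup(stages, tasks):
    """Speedup over the non-pipelined case, which needs stages * tasks cycles."""
    return stages * tasks / pipeline_cycles(stages, tasks)

print(pipeline_cycles(4, 10))             # 13
print(round(pipeline_speedup(4, 10), 2))  # 3.08
```

As the number of tasks grows, the speedup approaches the number of stages, which is the ideal case for a pipeline.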