Chapter 2: Computer Evolution and Performance
   Computer Architecture and Organization
   Department of Electrical and Computer Engineering
              Generations of Computer
• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small scale integration - 1965 onward
   • Up to 100 devices on a chip
• Medium scale integration - to 1971
   • 100-3,000 devices on a chip
• Large scale integration - 1971-1977
   • 3,000 - 100,000 devices on a chip
• Very large scale integration - 1978 -1991
   • 100,000 - 100,000,000 devices on a chip
• Ultra large scale integration - 1991 onward
   • Over 100,000,000 devices on a chip
                        ENIAC - background
• Electronic Numerical
  Integrator And Computer
• Eckert and Mauchly
• University of Pennsylvania
• Trajectory tables for weapons
• Started 1943
• Finished 1946
   • Too late for the war effort
• Used until 1955
                       ENIAC - details
•   Decimal (not binary)
•   20 accumulators of 10 digits
•   Programmed manually by switches
•   18,000 vacuum tubes
•   30 tons
•   15,000 square feet
•   140 kW power consumption
•   5,000 additions per second
•   Demonstrated the general-purpose potential of the computer
•   Its first task was to perform a series of complex calculations
    to study the feasibility of the hydrogen bomb
                      Von Neumann/Turing
•   Stored Program Concept
•   Main memory storing programs and data
•   The computer gets its instructions by reading them from memory
•   A program can be set or altered by setting the value of a portion of
    memory.
•   ALU operating on binary data
•   Control unit interpreting instructions from memory and executing
•   Input and output equipment operated by control unit
•   Princeton Institute for Advanced Studies
      • IAS computer - designed by von Neumann and his colleagues, begun in 1946
•   Completed 1952
• The first publication of the idea was in a 1945 proposal by von
    Neumann for a new computer, the EDVAC (Electronic Discrete
    Variable Computer)
             Structure of von Neumann machine
•   A main memory, which
    stores both data and
    instructions
•   An arithmetic and logic
    unit (ALU) capable of
    operating on binary data
•   A control unit, which
    interprets the instructions in
    memory and causes them to
    be executed
• Input/output (I/O) equipment
    operated by the control unit
        Brief Description of the IAS computer
• The IAS Computer is the prototype of all subsequent general purpose
  computers.
• The memory of the IAS consists of 1000 storage locations, called
  words, of 40 binary digits (bits) each.
• Both data and instructions are stored there.
• Numbers are represented in binary form, and each instruction is a
  binary code.
                       IAS Memory Format
•   Each number is represented by a sign bit and a 39-bit value.
•   A word may also contain two 20-bit instructions, with each instruction
    consisting of an 8-bit operation code (opcode) and a 12-bit address.
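As a quick illustration (a sketch, not part of the original IAS description), the C fragment below unpacks a 40-bit word held in the low bits of a 64-bit integer, assuming the left instruction occupies the high 20 bits; the example word value is made up:

   /* A minimal sketch, assuming an IAS word is kept in the low 40 bits of a
    * uint64_t, with the left instruction in the high 20 bits. */
   #include <stdint.h>
   #include <stdio.h>

   int main(void) {
       uint64_t word = 0x0106402164ULL;         /* hypothetical 40-bit word */
       unsigned l_op   = (word >> 32) & 0xFF;   /* left 8-bit opcode        */
       unsigned l_addr = (word >> 20) & 0xFFF;  /* left 12-bit address      */
       unsigned r_op   = (word >> 12) & 0xFF;   /* right 8-bit opcode       */
       unsigned r_addr =  word        & 0xFFF;  /* right 12-bit address     */
       printf("L: op=%02X addr=%03X  R: op=%02X addr=%03X\n",
              l_op, l_addr, r_op, r_addr);
       return 0;
   }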
                       IAS Main Registers
• Memory buffer register (MBR): Contains a word to be stored in
  memory or sent to the I/O unit, or is used to receive a word from
  memory or from the I/O unit.
• Memory address register (MAR): Specifies the address in memory of
  the word to be written from or read into the MBR.
• Instruction register (IR): Contains the 8-bit opcode of the instruction
  being executed.
• Instruction buffer register (IBR): Employed to hold temporarily the
  right-hand instruction from a word in memory.
• Program counter (PC): Contains the address of the next instruction
  pair to be fetched from memory.
• Accumulator (AC) and multiplier quotient (MQ): Employed to hold
  temporarily operands and results of ALU operations. For example, the
  result of multiplying two 40-bit numbers is an 80-bit number; the most
  significant 40 bits are stored in the AC and the least significant in the
  MQ.
Expanded Structure of IAS Computer
                               IAS operation
• Operates by repetitively performing an instruction cycle.
• Each instruction cycle consists of two subcycles: the instruction fetch
  cycle and the instruction execute cycle.
• In the fetch cycle, the opcode of the next instruction is loaded into the IR and
  the address portion is loaded into the MAR.
• This instruction may be taken from the IBR, or it may be obtained from memory
  by loading a word into the MBR, and then down to the IBR, IR, and MAR.
• A single register (the MAR) specifies the address in memory for each read or
  write, and a single register (the MBR) serves as the source or destination.
• Once the opcode is in the IR, the execute cycle is performed.
• Control circuitry interprets the opcode and executes the instruction.
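The following C sketch mirrors the fetch logic just described; the memory image, the LOAD/ADD opcode values, and the HALT case are illustrative assumptions for the demo, not a faithful IAS model:

   /* Hypothetical sketch of the IAS fetch/execute loop. Register names
    * follow the slides; opcode values and HALT are assumptions. */
   #include <stdint.h>

   #define LOAD_M 0x01   /* assumed encoding of LOAD M(X) */
   #define ADD_M  0x05   /* assumed encoding of ADD M(X)  */
   #define HALT   0x00   /* illustrative stop code        */

   static uint64_t M[1000];            /* 1000 40-bit words (low bits used) */
   static uint64_t MBR, AC;            /* data registers                    */
   static unsigned MAR, IR, IBR, PC;   /* control registers                 */

   void run(void) {
       int left = 1;                        /* start with the left instruction */
       for (;;) {
           unsigned insn;
           /* Fetch: take the instruction from the IBR if one is buffered,
            * otherwise load a word from memory into the MBR first. */
           if (left) {
               MBR  = M[PC];
               insn = (MBR >> 20) & 0xFFFFF; /* left 20-bit instruction       */
               IBR  =  MBR        & 0xFFFFF; /* buffer right-hand instruction */
           } else {
               insn = IBR;
               PC++;                         /* advance to the next word      */
           }
           left = !left;
           IR  = (insn >> 12) & 0xFF;        /* opcode portion -> IR          */
           MAR =  insn        & 0xFFF;       /* address portion -> MAR        */
           /* Execute: control circuitry interprets the opcode. */
           switch (IR) {
           case LOAD_M: AC = M[MAR];  break;
           case ADD_M:  AC += M[MAR]; break;
           case HALT:   return;
           }
       }
   }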
IAS Operation Partial Flow Chart
                            IAS Instruction Set
The IAS computer had a total of 21 instructions, which can be grouped as follows:
• Data transfer: Move data between memory and ALU registers or between two
  ALU registers.
• Unconditional branch: Normally, the control unit executes instructions in
  sequence from memory. This sequence can be changed by a branch
  instruction, which facilitates repetitive operations.
• Conditional branch: The branch can be made dependent on a condition, thus
  allowing decision points.
• Arithmetic: Operations performed by the ALU.
• Address modify: Permits addresses to be computed in the ALU and then
  inserted into instructions stored in memory. This allows a program considerable
  addressing flexibility.
IAS Instruction Set Details
IAS Program Example
• Write an IAS program for the following code; store sum at
  memory location 100.
   sum = 0;
   val = 20;
   sum = val + 30;
• Solution:
   Location             Instruction                   Comment
   ▪ 0                      20                        constant (val)
   ▪ 1                      30                        constant
   ▪ 2                       0                        constant
   ▪ 3L                  LOAD M(0)                    AC ← M(0), i.e. val
   ▪ 3R                  ADD M(1)                     AC ← AC + 30
   ▪ 4L                  STOR M(100)                  M(100), i.e. sum ← AC
IAS Program Example
• Write an IAS program for the following code; store sum and val
  at memory locations 200 and 300, respectively.
   sum = 0;
   val = 5;
   for ( int i = 10; i >= 0; i--) {
      sum = sum + (i + val);
      val = val + 1;
   }
• Solution:
   Location                           Instruction                Comment
   ▪ 0                                   10                      constant – i
   ▪ 1                                    0                      constant - sum
   ▪ 2                                    5                       constant – val
   ▪ 3                                    1                       constant - 1
IAS Program Example
  Location               Instruction                    Comment
  ▪ 4L                    LOAD M(1)                     AC ← M(1)
  ▪ 4R                    STOR M(200)                   M(200) ← AC
  ▪ 5L                    LOAD M(2)                     AC ← M(2)
  ▪ 5R                    STOR M(300)                   M(300) ← AC
  ▪ 6L                    LOAD M(0)                     AC ← M(0)
  ▪ 6R                    ADD M(300)                    AC ← AC + M(300)
  ▪ 7L                    ADD M(200)                    AC ← AC + M(200)
  ▪ 7R                    STOR M(200)                   M(200) ← AC
  ▪ 8L                    LOAD M(300)                   AC ← M(300)
  ▪ 8R                    ADD M(3)                      AC ← AC + M(3)
  ▪ 9L                    STOR M(300)                   M(300) ← AC
  ▪ 9R                    LOAD M(0)                     AC ← M(0)
  ▪ 10L                   SUB M(3)                      AC ← AC - M(3)
  ▪ 10R                   STOR M(0)                     M(0) ← AC
  ▪ 11L                   JUMP+ M(6, 0:19)              Jump to 6L if AC ≥ 0
  ▪ 11R                   JUMP M(11, 20:39)             Halt (jump to self)
                Commercial Computers
•   1947 - Eckert-Mauchly Computer Corporation
•   UNIVAC I (Universal Automatic Computer)
•   Used for the US Bureau of the Census 1950 calculations
•   Became part of Sperry-Rand Corporation
•   Late 1950s - UNIVAC II
     • Faster
     • More memory
                                 IBM
• Punched-card processing equipment
• 1953 - the 701
   • IBM’s first electronic stored-program computer
   • For scientific calculations
• 1955 - the 702
   • For business applications
• Led to the 700/7000 series
                               Transistors
• Replaced vacuum tubes
     • Smaller
     • Cheaper
     • Less heat dissipation
•   Solid-state device
•   Made from silicon (sand)
•   Invented 1947 at Bell Labs
•   William Shockley et al.
             Second Generation : Transistors
• Transistor Based Computers
• More complex arithmetic and logic units and control units
• High-level programming languages and system software that provided the
  ability to load programs (the beginnings of the operating system)
• Data channels: independent I/O modules with their own processor and
  instruction set; they relieve the CPU of a considerable processing burden
• Multiplexor: termination point for the data channels, the CPU, and memory;
  schedules access to memory from the CPU and the data channels
• NCR and RCA were the front-runners, producing small transistor machines
• IBM followed with the 7000 series
• DEC (Digital Equipment Corporation) - 1957
    • Produced PDP-1
        Third Generation - Microelectronics
• Literally - “small electronics”
• A computer is made up of gates, memory cells and
  interconnections
• These can be manufactured on a semiconductor material,
  e.g. a silicon wafer
• Jack Kilby invented the integrated circuit at Texas
  Instruments in 1958, creating a transistor, resistor and
  capacitor from a piece of germanium
• Robert Noyce also independently
  invented the integrated circuit
   Microelectronics- Modern IC Manufacturing
• Using the technique of photolithography
                          Moore’s Law
• Gordon Moore - co-founder of Intel
   • Predicted that the number of transistors on a chip would double every year
• Since the 1970s development has slowed a little
   • The number of transistors now doubles every 18 months (illustrated below)
• Cost of a chip has remained almost unchanged
• Consequence of Moore’s Law
   • Higher packing density means shorter electrical paths, giving
     higher performance
   • Smaller size gives increased flexibility
   • Reduced power and cooling requirements
   • Fewer interconnections increases reliability
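To see what the 18-month doubling rule implies numerically, here is a small sketch; the 1971 baseline of roughly 2,300 transistors (the Intel 4004) is an assumed starting point for illustration:

   /* Sketch of the 18-month doubling rule. The 1971 starting count of
    * ~2,300 transistors (Intel 4004) is an assumption. */
   #include <math.h>
   #include <stdio.h>

   int main(void) {
       const double base = 2300.0;                      /* transistors in 1971 */
       for (int year = 1971; year <= 2007; year += 6) {
           double n = base * pow(2.0, (year - 1971) / 1.5); /* doubles / 1.5 yr */
           printf("%d: ~%.0f transistors\n", year, n);
       }
       return 0;
   }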
Growth in Transistor Count on Integrated Circuits
                            IBM 360 series
•   In 1964 IBM introduced the 360 series
•   Replaced (and was not compatible with) the 7000 series
•   Covered a wide range of performance and cost
•   First planned “family” of computers
     •   Similar or identical instruction sets
     •   Similar or identical OS
     •   Increasing speed
     •   Increasing number of I/O ports (i.e. more terminals)
     •   Increased memory size
     •   Increased cost
                          DEC PDP-8
• 1964
• First minicomputer
  (after miniskirt!)
• Small enough to
  sit on a lab bench
• Lower cost & size
• $16,000
   • $100k+ for IBM 360
                                 Intel
• 1971 - 4004
   • First microprocessor
   • All CPU components
    on a single chip
   • 4 bit
                                 Intel …
• Followed in 1972 by 8008
   • 8 bit
   • Both designed for specific applications
• 1974 - 8080
   • Intel’s first general purpose microprocessor
• In 1981 IBM produced its PC using the Intel 8088
  microprocessor
• 8086: A 16-bit machine
• 80386: Intel’s first 32-bit machine.
                    Semiconductor Memory
• At first, ICs were used for processors only
• In 1970 Fairchild produced a semiconductor memory
• Size of a single core
     • i.e. 1 bit of magnetic core storage
•   Holds 256 bits
•   Non-destructive read
•   Much faster than core
•   Capacity approximately doubles each year
                          Speeding it up
• By increasing the hardware speed of the microprocessor
   • Fundamentally due to shrinking logic gate size
       • More gates, packed more tightly, reduce the propagation time for
         signals, which in turn increases the clock speed
• Chipmakers fabricate chips of greater density and speed,
  and processor designers must then come up with
  techniques that exploit the higher speed
                       Microprocessor Speed
• The raw speed of the microprocessor goes unused unless a constant stream
  of work is fed to the CPU
• Among the techniques used are:
• Pipelining: With pipelining, a processor can simultaneously work on multiple
  instructions, i.e. while one instruction is being executed, the processor is
  decoding the next instruction.
• Branch prediction: The processor looks ahead in the instruction code fetched
    from memory and predicts which branches, or groups of instructions, are likely
    to be processed next. The more sophisticated examples of this strategy predict
    not just the next branch but multiple branches ahead.
                     Microprocessor Speed …
•   Data flow analysis: The processor analyzes which instructions are dependent
    on each other’s results, or data, to create an optimized schedule of
    instructions. In fact, instructions are scheduled to be executed when ready,
    independent of the original program order. This prevents unnecessary delay.
• Speculative execution: Using branch prediction and data flow analysis, some
    processors speculatively execute instructions ahead of their actual appearance
    in the program execution, holding the results in temporary locations. This
    enables the processor to keep its execution engines as busy as possible by
    executing instructions that are likely to be needed.
                   Performance Balance
•   Processor speed increased
•   Memory capacity increased
•   Memory speed lags behind processor speed
•   If the flow of program instructions and data between memory
    chips and the processor fails to keep pace with the processor's
    incessant demands, the processor stalls in a wait state and
    valuable processing time is lost.
Logic and Memory Performance Gap
                 Ways to reduce the gap
• Increase number of bits retrieved at one time
   • Make DRAM “wider” rather than “deeper”
• Change DRAM interface
   • By including Cache
• Reduce frequency of memory access
   • More complex cache and cache on chip
• Increase interconnection bandwidth
   • High speed buses
   • Hierarchy of buses
     Approaches to Increase Processor Speed
• Increase the hardware speed of the processor.
   -pack more gates, reduce propagation time, increase clock rate
• Increase the size and speed of caches.
   - reduce cache access time by dedicating a portion of the processor chip to the cache
• Make changes to the processor organization and
  architecture
      -parallelize
 Problems with Clock Speed and Logic Density
• Power
   • Power density increases with density of logic and clock speed
   • Dissipating the large amounts of heat generated and controlling that
     heat is becoming difficult; excessive heat can permanently damage
     the transistors
• RC delay
   • Speed at which electrons flow limited by resistance and
     capacitance of metal wires connecting them
   • Delay increases as RC product increases
   • Wire interconnects thinner, increasing resistance
   • Wires closer together, increasing capacitance
• Solution:
   • More emphasis on other organizational and architectural
     approaches
              Increased Cache Capacity
• Typically two or three levels of cache between processor
  and main memory
• Chip density increased
   • More cache memory on chip
      • Faster cache access
• Pentium chip devoted about 10% of chip area to cache
• Pentium 4 devotes about 50%
            More Complex Execution Logic
• Enable parallel execution of instructions
• Pipeline works like assembly line
   • Different stages of execution of different instructions at same time
     along pipeline
   • Execution of the next instruction starts before the previous one has
     completed (see the timing sketch after this list)
• Superscalar allows multiple pipelines within single
  processor
   • Instructions that do not depend on one another can be executed in
     parallel
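As a back-of-the-envelope illustration of the assembly-line analogy (a simplified model assuming one stage per cycle and no stalls), a k-stage pipeline finishes n instructions in about k + (n - 1) cycles instead of k × n:

   /* Simplified pipeline timing model (assumes one stage per cycle, no
    * stalls): k + (n - 1) cycles pipelined versus k * n unpipelined. */
   #include <stdio.h>

   int main(void) {
       const int  k = 5;            /* pipeline stages */
       const long n = 1000000;      /* instructions    */
       long unpipelined = (long)k * n;
       long pipelined   = k + (n - 1);
       printf("speedup ~ %.2fx\n", (double)unpipelined / (double)pipelined);
       return 0;
   }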
                    Diminishing Returns
• As processors get more complex, further significant increases in
  parallelism are likely to be relatively modest
   • Designing and exploiting additional techniques becomes difficult
• Benefits from caches are reaching their limit
           New Approach – Multiple Cores
• Multiple processors on single chip
   • Large shared cache
• If software can use multiple processors, doubling number
  of processors almost doubles performance
• So, use two simpler processors on the chip rather than one
  more complex processor
• With two processors, larger caches are justified
   • Important because the power consumption of memory logic is lower
     than that of processing logic
Intel Microprocessor Performance
                           x86 Evolution (1)
• 8080
    • first general purpose microprocessor
    • 8 bit data path
    • Used in first personal computer – Altair
• 8086 – 5MHz – 29,000 transistors
    • much more powerful
    • 16 bit
    • instruction cache, prefetches a few instructions
    • 8088 (8 bit external data bus) used in first IBM PC
• 80286
    • 16 MBytes of memory addressable
    • up from 1 MByte
• 80386
    • 32 bit
    • Support for multitasking
• 80486
    • sophisticated powerful cache and instruction pipelining
    • built-in math co-processor
                           x86 Evolution (2)
• Pentium
    • Superscalar
    • Multiple instructions executed in parallel
• Pentium Pro
    •   Increased superscalar organization
    •   Aggressive register renaming
    •   branch prediction
    •   data flow analysis
    •   speculative execution
• Pentium II
    • MMX technology
    • graphics, video & audio processing
• Pentium III
    • Additional floating point instructions for 3D graphics
                             x86 Evolution (3)
• Pentium 4
     • Further floating point and multimedia enhancements
• Core
     • First x86 with dual core
• Core 2
     • 64 bit architecture
• Core 2 Quad – 3GHz – 820 million transistors
     • Four processors on chip
•   x86 architecture dominant outside embedded systems
•   Organization and technology changed dramatically
•   Instruction set architecture evolved with backwards compatibility
•   ~1 instruction per month added
•   500 instructions available
                  Performance Assessment
• Key parameters of computer assessment
   • Performance, cost, size, security, reliability, power consumption
Clock Speed
• System clock speed is given in terms of Hz or multiples thereof;
  related terms are clock rate, clock cycle, clock tick, and cycle time
• Signals in the CPU take time to settle down to 1 or 0, and the
  worst-case time taken by signals inside the CPU sets a lower bound
  on the clock cycle time
• Instruction execution in discrete steps
   • Fetch, decode, load and store, arithmetic or logical
   • Usually require multiple clock cycles per instruction
• Pipelining gives simultaneous execution of instructions
• So, clock speed is not the whole story in performance
  assessment
                 Instruction Execution Rate
• A processor is driven by a clock with a constant frequency f or,
  equivalently, a constant cycle time t, where t = 1/f.
• Instruction count, I_c: the number of machine instructions executed for
  the program until it runs to completion
• An important parameter is the average cycles per instruction (CPI) for a
  program.
• If all instructions required the same number of clock cycles, then CPI
  would be a constant value for a processor.
• The number of clock cycles required varies for different types of
  instructions, such as load, store, branch, and so on.
• Let CPI_i be the number of cycles required for instruction type i, and I_i
  the number of executed instructions of type i for a given program. The
  overall CPI is then:

      CPI = ( Σ_i CPI_i × I_i ) / I_c
                     Instruction Execution Rate
• The processor time T needed to execute a given program can be expressed as:

      T = I_c × CPI × t

• We can refine this formulation by recognizing that during the execution of an
  instruction, part of the work is done by the processor, and part of the time a
  word is being transferred to or from memory. In this latter case, the time to
  transfer depends on the memory cycle time, which may be greater than the
  processor cycle time:

      T = I_c × [ p + (m × k) ] × t

  where p is the number of processor cycles needed to decode and execute the
  instruction, m is the number of memory references needed, and k is the ratio
  between memory cycle time and processor cycle time.
• The five performance factors in the preceding equation (I_c, p, m, k, t) are influenced by
  four system attributes: the design of the instruction set (known as instruction set
  architecture), compiler technology (how effective the compiler is in producing an efficient
  machine language program from a high-level language program), processor
  implementation, and cache and memory hierarchy.
               Instruction Execution Rate …
• A common measure of performance for a processor is the rate at which
  instructions are executed: millions of instructions per second (MIPS),
  the MIPS rate:

      MIPS rate = I_c / (T × 10^6) = f / (CPI × 10^6)
          Instruction Execution Rate Example
• For example, consider the execution of a program that results in the execution
  of 2 million instructions on a 400-MHz processor. The program consists of four
  major types of instructions. The instruction mix and the CPI for each instruction
  type are given below, based on the result of a program trace experiment:

     Instruction type                        CPI     Instruction mix
     Arithmetic and logic                     1           60%
     Load/store with cache hit                2           18%
     Branch                                   4           12%
     Memory reference with cache miss         8           10%

• Calculate the average CPI and the MIPS rate.
         Instruction Execution Rate Example
The average CPI when the program is executed on a uniprocessor
with the above trace results is
CPI = 0.6 + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
The corresponding MIPS rate is

    MIPS rate = (400 × 10^6) / (2.24 × 10^6) ≈ 178
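For a quick cross-check of the arithmetic above, here is a small C sketch that recomputes the average CPI and the MIPS rate from the instruction mix:

   /* Recomputes the example: 400-MHz processor, four instruction types.
    * CPI = sum(CPI_i * fraction_i); MIPS = f / (CPI * 10^6). */
   #include <stdio.h>

   int main(void) {
       const double cpi[] = {1, 2, 4, 8};              /* cycles per type      */
       const double mix[] = {0.60, 0.18, 0.12, 0.10};  /* instruction mix      */
       const double f     = 400e6;                     /* clock frequency (Hz) */

       double avg_cpi = 0.0;
       for (int i = 0; i < 4; i++)
           avg_cpi += cpi[i] * mix[i];

       printf("CPI = %.2f, MIPS = %.1f\n", avg_cpi, f / (avg_cpi * 1e6));
       /* Prints: CPI = 2.24, MIPS = 178.6 */
       return 0;
   }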
                           Benchmarks
• Programs designed to test performance
• Written in high level language
   • Portable across different machines
• Represents style of task
   • Systems, numerical, commercial
• Easily measured
• Widely distributed
• E.g. System Performance Evaluation Corporation (SPEC)
   • CPU2006 for computation-bound tasks
       • 17 floating point programs in C, C++, Fortran
       • 12 integer programs in C, C++
       • 3 million lines of code
   • Speed and rate metrics
       • Single task and throughput
                    SPEC Speed Metric
• Single task
• Base runtime defined for each benchmark using
  reference machine
• Results are reported as the ratio of reference time to system
  run time:

      r_i = Tref_i / Tsut_i

   • Tref_i: execution time for benchmark i on the reference machine
   • Tsut_i: execution time of benchmark i on the system under test
• Overall performance is calculated by averaging the ratios
  for all 12 integer benchmarks
   — using the geometric mean:

      r_G = ( Π_i r_i )^(1/12)

      – appropriate for normalized numbers such as ratios
                       SPEC Speed Metric
Example: the Sun system executes a benchmark program in 934 seconds, while the
reference implementation requires 22,135 seconds. The ratio is calculated as
22,135 / 934 = 23.7.
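A short sketch of the averaging step; the 23.7 ratio comes from the example above, while the other three ratios are made-up placeholders:

   /* Computes a SPEC-style overall metric as the geometric mean of
    * per-benchmark ratios. Only 23.7 comes from the example above;
    * the remaining ratios are hypothetical. */
   #include <math.h>
   #include <stdio.h>

   int main(void) {
       const double r[] = {23.7, 18.2, 31.0, 25.4};
       const int n = sizeof r / sizeof r[0];

       double log_sum = 0.0;
       for (int i = 0; i < n; i++)
           log_sum += log(r[i]);          /* geometric mean via logarithms: */
       printf("overall = %.1f\n", exp(log_sum / n));  /* (prod r_i)^(1/n)  */
       return 0;
   }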
                         SPEC Rate Metric
• Measures throughput or rate of a machine carrying out a number of
  tasks
• Multiple copies of benchmarks run simultaneously
    • Typically, same as number of processors
• Ratio is calculated as follows:

      r_i = ( N × Tref_i ) / Tsut_i

    • Tref_i: reference execution time for benchmark i
    • N: number of copies run simultaneously
    • Tsut_i: elapsed time from the start of execution of the program on all N
      processors until the completion of all copies of the program
    • Again, a geometric mean is calculated
                     Amdahl’s Law
• Gene Amdahl [AMDA67]
• Speedup in one aspect of the technology or design does not
  result in a corresponding improvement in overall performance
• Consider any enhancement to a feature of a system that
  results in a speedup. The speedup can be expressed as:

      Speedup = Performance after enhancement / Performance before enhancement
              = Execution time before enhancement / Execution time after enhancement
                   Amdahl’s Law Formula
• Potential speedup of a program using multiple processors
• For a program running on a single processor:
   — Fraction f of the code is infinitely parallelizable with no scheduling
     overhead
   — Fraction (1 - f) of the code is inherently serial
   — T is the total execution time of the program on a single processor
   — N is the number of processors that fully exploit the parallel portions
     of the code
• Then:

      Speedup = T / ( (1 - f)T + fT/N ) = 1 / ( (1 - f) + f/N )

• Conclusions
   • When f is small, the use of parallel processors has little effect
   • As N → ∞, the speedup is bounded by 1/(1 - f)
       • Diminishing returns for using more processors
                          Amdahl’s Law
• These conclusions can be pessimistic for some tasks:
• Servers
   - execute tasks in parallel, up to the limit of the number of
     processors, to handle multiple clients
• Databases
   - many database applications involve computations on massive
     amounts of data that can be split up into multiple parallel tasks
• Example: Suppose that a task makes extensive use of floating-point
  operations, with 40% of the time consumed by floating-point
  operations. With a new hardware design, the floating-point module is
  sped up by a factor of K. What is the overall speedup?
Amdahl’s Law Formula …
• Applying Amdahl’s law to the example, with f = 0.4:

      Speedup = 1 / ( 0.6 + 0.4/K )

• As K → ∞, the overall speedup approaches 1/0.6 ≈ 1.67, no matter how
  much faster the floating-point module becomes.
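The effect of the bound is easy to tabulate; the sketch below evaluates the speedup for increasing values of K:

   /* Evaluates Amdahl's law for the floating-point example: f = 0.4 of
    * the time is sped up by factor K; speedup = 1 / ((1 - f) + f/K). */
   #include <stdio.h>

   int main(void) {
       const double f = 0.4;                   /* enhanced fraction */
       for (double K = 1; K <= 1024; K *= 4) {
           double speedup = 1.0 / ((1.0 - f) + f / K);
           printf("K = %6.0f  speedup = %.3f\n", K, speedup);
       }
       /* As K grows, the speedup approaches 1 / (1 - f) = 1.67. */
       return 0;
   }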
           End of Chapter 2
Computer Architecture
  and Organization
 Department of Electrical and Computer Engineering