I/O Systems:
Devices, Buses, & Queues
“You ARE the weakest link!”
“I/O certainly has been lagging in the last decade.”
- Seymour Cray (1976)
“Also, I/O needs a lot of work.”
- David Kuck, 15th ISCA (1988)
1
Today’s Menu:
I/O Systems
Design
Performance: Throughput vs Latency
Basic Disk Drive Anatomy
Busses
Types of busses
Design Choices
Arbitration
2
The Big Picture: Where are We Now?
Today’s Topic: I/O Systems
(Figure: two computers - each with Processor (Control + Datapath), Memory,
Input, and Output - connected by a Network)
3
I/O System Design Issues
Performance
Expandability
Resilience in the face of failure
(Figure: Processor + Cache on a Memory - I/O Bus with Main Memory and three
I/O Controllers driving Disks, Graphics, and Network; interrupts flow back
to the Processor)
4
Application Performance
Assume 90 seconds CPU & 10 seconds I/O
1996 - 1997
CPU performance improves by N = 400/200 = 2
program performance improves by N = 100/55 = 1.81
1997 - 1998
CPU performance - factor of 2
program performance: N = 55/32.5 = 1.7
1998 - 1999
CPU performance - factor of 2
program performance: N = 32.5/21.25 = 1.53
1999 - 2000
CPU performance - factor of 2
program performance: N = 21.25/15.6 = 1.36
(Figure: stacked bar chart of CPU Time and I/O Time, in seconds, 1996-2000)
5
Performance for Web Surfing
Assume 50 seconds CPU & 50 seconds I/O
1996 - 1997
CPU performance improves by N = 400/200 = 2
program performance improves by N = 100/75 = 1.33
1997 - 1998
CPU performance - factor of 2
program performance: N = 75/62.5 = 1.2
1998 - 1999
CPU performance - factor of 2
program performance: N = 62.5/56.25 = 1.11
(Figure: stacked bar chart of CPU Time and I/O Time, in seconds, 1996-2000)
6
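The arithmetic on the last two slides can be checked with a short script. A sketch, using the slides' assumptions: CPU time halves every year while I/O time stays fixed.

```python
def yearly_speedups(cpu_s, io_s, years, cpu_factor=2.0):
    """CPU time shrinks by cpu_factor each year; I/O time stays fixed.
    Returns a list of (new_total_time, overall_speedup) pairs."""
    results = []
    for _ in range(years):
        old_total = cpu_s + io_s
        cpu_s /= cpu_factor              # CPU gets 2x faster...
        results.append((cpu_s + io_s,    # ...but I/O is unchanged
                        old_total / (cpu_s + io_s)))
    return results

# 90 s CPU + 10 s I/O (slide 5) vs. 50 s CPU + 50 s I/O (slide 6)
for label, cpu, io in (("90/10", 90, 10), ("50/50", 50, 50)):
    for total, n in yearly_speedups(cpu, io, 4):
        print(f"{label}: {total:6.2f} s, N = {n:.2f}")
```

Note how the overall speedup N decays toward 1 in both cases: the fixed I/O time becomes the bottleneck, which is Amdahl's Law at work.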
I/O Device Examples

Device            Behavior          Partner   Data Rate (KB/sec)
Keyboard          Input             Human     0.01
Mouse             Input             Human     0.02
Printer           Output            Human     3.00
Floppy disk       Storage           Machine   50.00
Laser Printer     Output            Human     100.00
Optical Disk      Storage           Machine   500.00
Magnetic Disk     Storage           Machine   5,000.00
Network-LAN       Input or Output   Machine   20 – 1,000.00
Graphics Display  Output            Human     30,000.00
7
I/O System Performance
I/O System performance depends on many aspects of the system
(“limited by weakest link in the chain”):
The CPU
The memory system:
Internal and external caches
Main Memory
The underlying interconnection (buses)
The I/O controller
The I/O device
The speed of the I/O software (Operating System)
The efficiency of the software’s use of the I/O devices
Two common performance metrics:
Throughput: I/O bandwidth
Response time: Latency
8
Throughput versus Response Time
(Figure: response time in ms (100-300) vs. percentage of maximum throughput
(20%-100%); response time grows sharply as the system approaches its
maximum throughput)
9
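The shape of that curve falls out of even the simplest queueing model. A sketch, assuming an M/M/1 server and a hypothetical 100 ms service time (neither number is from the slide):

```python
def response_time_ms(service_ms, utilization):
    """M/M/1 queue: average response time = service time / (1 - utilization).
    As utilization approaches 100% of max throughput, latency blows up."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)

for u in (0.2, 0.4, 0.6, 0.8, 0.9):
    print(f"{u:.0%} of max throughput -> {response_time_ms(100, u):6.1f} ms")
```

This is why systems are usually operated well below their peak throughput: the last 20% of bandwidth costs far more latency than the first 80%.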
What’s Inside A Disk Drive?
(Figure: drive internals - spindle, platters, arm, actuator, electronics,
SCSI connector)
Image courtesy of Seagate Technology Corporation
10
Magnetic Disk
Purpose:
Long term, nonvolatile storage
Large, inexpensive, and slow
Lowest level in the memory hierarchy (Registers - Cache - Memory - Disk)
Two major types:
Floppy disk
Hard disk
Both types of disks:
Rely on a rotating platter coated with a magnetic surface
Use a moveable read/write head to access the disk
Advantages of hard disks over floppy disks:
Platters are more rigid (metal or glass) so they can be larger
Higher density because the head can be controlled more precisely
Higher data rate because the disk spins faster
Can incorporate more than one platter
11
And If You Look More Closely
Platters
Tracks
Sectors
Two sides, write
on top and bottom
Cylinders: the set of corresponding
tracks on all the platters.
12
Organization of a Hard Magnetic Disk
(Figure: platters divided into tracks; each track divided into sectors)
Typical numbers (depending on the disk size):
500 to 2,000 tracks per surface
32 to 128 sectors per track
A sector is the smallest unit that can be read or written
Traditionally all tracks have the same number of sectors
Recently relaxed: constant bit density - record more sectors on the outer
tracks, so the data rate varies with track location
13
Disk Drive Performance: the Numbers
Seek time
move head to the desired track
today’s drives - 5 to 15 ms
average seek = (time for all possible seeks) / (no. of possible seeks)
actual average seek = 25% to 33% of that, due to locality
Rotational latency
today’s drives - 5,400 to 12,000 RPM
approximately 11 ms to 5 ms per rotation
average rotational latency = 0.5 x (time for one rotation)
Transfer time
time to transfer a sector (1 KB/sector)
function of rotation speed and recording density
today’s drives - 10 to 40 MBytes/second
Controller time
overhead the drive electronics add to manage the drive
but also gives prefetching and caching
(Figure: disk geometry - platter, head, track, sector, cylinder)
14
Disk Drive Performance (cont.)
Average access time =
(seek time) + (rotational latency) + (transfer) + (controller time)
Track and cylinder skew
cylinder switch time
delay to change from one cylinder to the next
may have to wait an extra rotation
solution - drives incorporate skew
offset sectors between cylinders to account for switch time
head switch time
change heads to go from one track to next on same cylinder
incur additional settling time
Prefetching
disks usually read an entire track at a time
assumes that request for the next sector will come soon
Caching
limited amount of caching across requests, but prefetching is preferred
15
Example
Disk characteristics
512 byte sector, rotates at 5,400 RPM, advertised seek is 12 ms,
transfer rate is 4 MB/sec, controller overhead is 1 ms,
queue idle so no service time
Disk access time = ?
Access Time = Seek time + Rotational Latency + Transfer time
+ Controller Time + Queuing Delay
Access Time = 12 ms + 0.5 rotation / 5,400 RPM + 0.5 KB / 4 MB/s + 1 ms + 0 ms
= 12 ms + 0.5 rotation / 90 RPS + 0.125 / 1024 s + 1 ms + 0 ms
= 12 ms + 5.5 ms + 0.1 ms + 1 ms + 0 ms
= 18.6 ms
Be very, very careful about the units on things. For example, above,
rotations per minute is transformed into rotations per second so we can
cancel the “rotations” part and get out “seconds”.
16
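The same arithmetic, unit conversions and all, can be packaged as a function. A sketch; it keeps more precision than the slide's rounded intermediate terms, so it lands at ≈18.7 ms rather than 18.6 ms:

```python
def disk_access_ms(seek_ms, rpm, sector_bytes, transfer_mb_s,
                   controller_ms, queue_ms=0.0):
    """Access time = seek + avg rotational latency + transfer
                   + controller overhead + queuing delay (all in ms)."""
    rot_per_sec = rpm / 60.0                     # RPM -> rotations/second
    rotational_ms = 0.5 / rot_per_sec * 1000.0   # wait half a rotation on average
    transfer_ms = sector_bytes / (transfer_mb_s * 2**20) * 1000.0  # MB = 2^20 B
    return seek_ms + rotational_ms + transfer_ms + controller_ms + queue_ms

t = disk_access_ms(seek_ms=12, rpm=5400, sector_bytes=512,
                   transfer_mb_s=4, controller_ms=1)
print(f"{t:.2f} ms")
```

Notice that seek and rotation dominate: the 0.12 ms transfer is noise next to 17.5 ms of mechanical delay.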
ASIDE: Disk I/O Performance
(Figure: Processor issues requests at rate λ to per-controller queues;
each Disk Controller serves its disks at rate µ)
Disk Access Time
Access time = Seek time + Rotational Latency + Transfer time
+ Controller Time + Queuing Delay
17
I/O Benchmarks for Magnetic Disks
Supercomputer application:
Large-scale scientific problems => large files
One large read and many small writes to snapshot computation
Data Rate: MB/second between memory and disk
Transaction processing:
Examples: Airline reservations systems and bank ATMs
Small changes to large shared databases
I/O Rate: No. disk accesses / second given upper limit for latency
File system:
Measurements of UNIX file systems in an engineering environment:
80% of accesses are to files less than 10 KB
90% of all file accesses are to data with sequential addresses on disk
67% of the accesses are reads, 27% writes, 6% read-write
I/O Rate & Latency: No. disk accesses /second and response time
18
Magnetic Storage Is Cheaper Than Paper
File cabinet:
cabinet (four drawer)       $250
paper (24,000 sheets)       $250
space (2x3 ft @ $10/ft2)    $180
total                       $700
=> 3¢/sheet
Disk:
disk (40 GB)                $100
ASCII: 40 GB = 20 million pages
=> 0.0005¢/sheet (6,000x cheaper)
Capacity (per unit area) doubles every 12 months!
Conclusion - Store Everything on Disk
Courtesy of Jim Gray, Microsoft Research
19
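A quick check of the per-sheet arithmetic (the slide rounds; exact values below, with the ~2 KB/page figure implied by "40 GB = 20 million pages"):

```python
paper_total = 250 + 250 + 180   # cabinet + paper + floor space, dollars
sheets = 24_000
paper_cents = paper_total / sheets * 100
print(f"paper: {paper_cents:.2f} cents/sheet")   # slide rounds to 3 cents

disk_dollars = 100              # 40 GB drive
pages = 20_000_000              # 40 GB of ASCII at ~2 KB/page
disk_cents = disk_dollars / pages * 100
print(f"disk:  {disk_cents:.4f} cents/sheet")
print(f"disk is {paper_cents / disk_cents:,.0f}x cheaper")  # slide rounds to 6000x
```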
But What Do We Have To Store?
Databases
One popular suggestion:
Information at Your Fingertips™
Information Network™
Knowledge Navigator™
You might record everything you
read - 10 MB/day, 400 GB/lifetime
(eight tapes today)
hear - 400 MB/day, 16 TB/lifetime
(three tapes/year today)
see - 1 MB/s, 40GB/day, 1.6 PB/lifetime
(maybe someday)
All information will be in an online database (somewhere)
Courtesy of Jim Gray, Microsoft Research 20
System-Level View - Bandwidth
(Figure: Processor and Memory share a 1600 MB/s System Bus; a 133 MB/s PCI
bus bridges to a 40 MB/s SCSI bus; the disk itself delivers 10 MB/s)
Disks are pretty far away...
21
System-Level View - Latency
(Figure: Processor ~1 ns and Memory ~40 ns sit on the System Bus; across
PCI and SCSI, the Disk is ~7 ms away)
And slow too...
22
Busses
Lots of sub-systems need to communicate
(Figure: CPU, Mem, Video, and Disk all attached to one shared Bus)
Busses: shared wires for common communication
23
Other Bus Issues
PRO: System flexibility
Buy new components and integrate
Build and integrate new components
PRO: Shared resource
Avoids point-to-point interconnects that might not be fully utilized
CON: Physical constraints
Performance is limited by physical design
CON: Standards trail the state of the art
By the time it’s fully adopted, it is five years old
CON: Shared Resource
Simultaneous usage not possible
24
Bus Classifications
CPU-memory busses
Fast
Proprietary
Closed and controlled
Support only memory transactions
IO busses
Standardized (SCSI, PCI)
More diversity
More length
Bus Bridges/Adapters
Cross from one bus to another
(Figure: CPU and Cache on a fast CPU-Memory Bus with Main Memory; a Bus
Adapter bridges to an I/O Bus holding the IO controllers)
25
Bus Design Decisions

                  High Performance       Low Performance
Structure         Split Addr & Data      Multiplexed Addr & Data
Width             Wide                   Narrow
Transfer Size     Large / Flexible       Small
Split Transact.   Yes                    No
Mastering         Multiple bus masters   Single bus master
Clocking          Synchronous            Asynchronous
26
Bus Clocking: Synchronous
Synchronous
Sample the control signals at the edge of the clock
(Figure: timing diagram - each clock edge carries a new Addr/Data pair
(Addr 0/Data 0 through Addr 2/Data 2), with R/~W sampled alongside)
Pro: Fast and High Performance
Con:
Can’t be long (skew) or fast at the same time
All bus members must run at the right speed
27
Bus Clocking: Asynchronous
Asynchronous
Edge of control signals determines communication
“Handshake Protocol”
(Figure: timing diagram - Write Req, Addr, and Data for two transfers
(Addr 0/Data 0, Addr 1/Data 1), with Ack below and the edges numbered 1-4)
1. Request (with actual transaction)
2. Acknowledge causes de-assert of Request
3. De-assert of Request causes de-assert of Ack
4. De-assert of Ack allows re-assertion of Request
28
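The four steps above can be sketched as a toy event trace; the function and names are made up for illustration, but each transfer really does cost four signal edges (two round trips):

```python
def four_phase_handshake(payloads):
    """Walk one sender (Req) and one receiver (Ack) through the
    4-phase protocol for each payload; returns every signal edge."""
    edges = []
    for data in payloads:
        edges.append(("req_up", data))    # 1. Request, with addr/data valid
        edges.append(("ack_up", data))    # 2. Ack asserts once data is latched
        edges.append(("req_down", data))  # 3. seeing Ack, sender drops Request
        edges.append(("ack_down", data))  # 4. seeing Req low, receiver drops Ack
    return edges

ev = four_phase_handshake(["Data 0", "Data 1"])
print(len(ev), "signal edges for 2 transfers")  # 4 edges per transfer
```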
Asynchronous Busses
Pros:
No clock
Slow and fast components on the same bus
Con:
Inefficient: two round trips
Like somebody who always repeats what was said to them
(Figure: synchronous busses win when clock skew / bus length is small;
asynchronous wins with a mixture of IO speeds)
29
Structure, Width, and Transfer Length
Separate vs. Multiplexed Address/Data
Multiplexed: saves wires
Separate: more performance
Wide words: higher throughput, less control overhead per transfer
On-chip cache to CPU busses: 256 bits wide
Serial busses
Data Transfer Length
More data per address/control transfer
Example: Multiplexed Addr/Data with a data transfer length of 4
(Figure: timing - Addr, then Data 0, Data 1, Data 2, Data 3 on the shared
Addr/Data lines)
30
Split Transactions
Problem: Long wait times
(Figure: timing diagram - Addr is driven, then the bus sits idle until
Data returns; 6 cycles per transaction)
Solution: Split Transaction Bus
(Figure: timing diagram - Addr 0 through Addr 3 issue back-to-back while
tagged replies (Data 0/Tag 0, Data 1/Tag 1, ...) come back later and share
the bus)
31
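A back-of-the-envelope model of the win, assuming the slide's 6-cycle round trip plus a guessed 2 bus-busy cycles per transaction (one for the address, one for the tagged reply):

```python
def bus_cycles(n_txns, mem_latency=6, split=False, busy_per_txn=2):
    """Cycles to finish n transactions.
    Non-split: the bus is held for the whole 6-cycle round trip.
    Split: requests pipeline, so only the first waits out the full latency."""
    if not split:
        return n_txns * mem_latency
    return mem_latency + (n_txns - 1) * busy_per_txn

print(bus_cycles(4), "cycles without split transactions")
print(bus_cycles(4, split=True), "cycles with split transactions")
```

Splitting buys nothing for a single transaction; the payoff comes from overlapping many outstanding requests.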
Bus Mastering
Bus Master: a device that can initiate a bus transfer
(Figure: CPU, Mem, and Disk on one shared bus, with transfers 1-3 marked)
Example:
1. CPU makes memory request
2. Page Fault in VM requires disk access to load page
3. Move data from disk to memory
If the CPU is master, does it have to check to see if the disk is
ready to transfer?
32
Multiple Bus Masters
What if multiple devices could initiate transfers?
Update might take place in background while CPU operates
Multiple CPUs on shared memory systems
Challenge: Arbitration
If two or more masters want the bus at the same time, who gets it?
33
Arbitration Goals
Functionality
Prevent bus conflicts (two simultaneous bus drivers)
Performance
Need to make decisions quickly
Priority
Some masters are more desperate than others
Example: DRAM refresh
Fairness
Every equal priority master should get equal service
No “starvation”: Every requestor should eventually get bus
34
Arbitration Options
Bus signals: Bus Request, Bus Grant, Bus Release
Option 1: Daisy Chain
(Figure: the Grant line threads Device 1 -> Device 2 -> Device 3 -> Device 4)
Problems:
Not fair
Not fast, especially for lowest priority
35
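The daisy chain's behavior, and its unfairness, fit in a few lines: the grant is only passed along by devices that don't want the bus, so position is priority. A sketch (the function is hypothetical, not real arbiter logic):

```python
def daisy_chain_grant(requests):
    """requests[i] is True if device i+1 wants the bus. The grant enters
    at device 1 and is consumed by the first requester it reaches, so a
    low-numbered device can starve everyone behind it."""
    for device, wants_bus in enumerate(requests, start=1):
        if wants_bus:
            return device
    return None  # nobody requested the bus

# Devices 2 and 4 both request; device 2 always wins -> not fair
print(daisy_chain_grant([False, True, False, True]))
```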
Centralized and Distributed Arbitration
Centralized: Arbiter
Requires roundtrip communication
(Figure: one central Arbiter wired to Devices 1-4)
Distributed:
Self-selection
Faster
Requires duplicated state
(Figure: each of Devices 1-4 carries its own Arb logic on the shared bus)
36
Summary:
I/O performance…
… is limited by weakest link in chain between OS and device
Disk I/O Benchmarks
I/O rate vs. Data rate vs. latency
Three Components of Disk Access Time:
Seek Time: advertised to be 5 to 15 ms. May be lower in real life.
Rotational Latency: 4 ms at 7200 RPM and 6 ms at 5400 RPM
Transfer Time: 10 to 40 MB per second
Busses
Synchronous vs. Asynchronous
Serial and Parallel
Bus Mastering and Arbitration
37