Os - 10 - IO Subsystems

The document discusses I/O subsystems in operating systems, focusing on device management, including device drivers, caching, spooling, scheduling, and buffering. It covers the naming and discovery of devices, the distinction between block and character devices, and the dynamic configuration of devices in modern Linux systems. Additionally, it outlines the network I/O interface and the internal workings of packet transmission and reception in BSD and Linux environments.


ADRIAN PERRIG & TORSTEN HOEFLER

Networks and Operating Systems (252-0062-00)


Chapter 10: I/O Subsystems (2)

BE CAREFUL WITH I/O DEVICES!



Our Small Quiz


 True or false (raise hand)
 Open files are part of the process’ address-space
 Unified buffer caches improve access times
 A partition table can unify the view of multiple disks
 Unix allows binding arbitrary file systems to arbitrary locations
 The virtual file system interface improves modularity of OS code
 Programmed I/O is efficient for the CPUs
 DMA enables devices to access virtual memory of processes
 IOMMUs enable memory protection for devices
 IOMMUs improve memory access performance
 First level interrupt handlers process the whole request from the hardware
 Software interrupts reduce the request processing latency
 Deferred procedure calls execute second-level interrupt handlers


The I/O subsystem



Generic I/O functionality


 Device drivers essentially move data to and from I/O devices
 Abstract hardware
 Manage asynchrony

 The OS I/O subsystem includes generic functions for dealing with this data
 Such as…

The I/O Subsystem


 Caching - fast memory holding copy of data
 Always just a copy
 Key to performance

 Spooling - hold output for a device


 If device can serve only one request at a time
 E.g., printing

The I/O Subsystem


 Scheduling
 Some I/O request ordering via per-device queue
 Some OSs try fairness

 Buffering - store data in memory while transferring between devices or memory
 To cope with device speed mismatch
 To cope with device transfer size mismatch
 To maintain “copy semantics”

Naming and Discovery


 What are the devices the OS needs to manage?
 Discovery (bus enumeration)
 Hotplug / unplug events
 Resource allocation (e.g., PCI BAR programming)

 How to match driver code to devices?


 Driver instance ≠ driver module
 One driver typically manages many models of device

 How to name devices inside the kernel?

 How to name devices outside the kernel?



Matching drivers to devices

 Devices have unique (model) identifiers


 E.g., PCI vendor/device identifiers

 Drivers recognize particular identifiers


 Typically a list…

 Kernel offers a device to each driver in turn


 Driver can “claim” a device it can handle
 Creates driver instance for it.
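
As an illustration of "claiming", here is a minimal sketch of how a Linux PCI driver declares the IDs it recognizes and gets offered matching devices (the vendor/device numbers and all my_* names are made up for the example):

/* Sketch: a Linux PCI driver declaring and claiming supported devices. */
#include <linux/module.h>
#include <linux/pci.h>

static const struct pci_device_id my_ids[] = {
    { PCI_DEVICE(0x1234, 0x5678) },  /* vendor/device IDs (made up) */
    { 0, }
};
MODULE_DEVICE_TABLE(pci, my_ids);    /* lets the kernel match devices to us */

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    /* Called when the kernel offers us a matching device:
     * returning 0 "claims" it and creates a driver instance. */
    return pci_enable_device(pdev);
}

static void my_remove(struct pci_dev *pdev)
{
    pci_disable_device(pdev);
}

static struct pci_driver my_driver = {
    .name     = "mydrv",
    .id_table = my_ids,
    .probe    = my_probe,
    .remove   = my_remove,
};
module_pci_driver(my_driver);
MODULE_LICENSE("GPL");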

Naming devices in the Unix kernel


(Actually, naming device driver instances)

 Kernel creates identifiers for


 Block devices
 Character devices
 [ Network devices – see later… ]

 Major device number:


 Class of device (e.g., disk, CD-ROM, keyboard)

 Minor device number:


 Specific device within a class

Unix Block Devices


 Used for “structured I/O”
 Deal in large “blocks” of data at a time

 Often look like files (seekable, mappable)


 Often use Unix’ shared buffer cache

 Mountable:
 File systems implemented above block devices

Character Devices
 Used for “unstructured I/O”
 Byte-stream interface – no block boundaries
 Single character or short strings get/put
 Buffering implemented by libraries

 Examples:
 Keyboards, serial lines, mice

 Distinction with block devices somewhat arbitrary…



Mid-lecture mini-quiz
 Character or block device (raise hand)
 Video card
 USB stick
 Microphone
 Screen (graphics adapter)
 Network drive


Naming devices outside the kernel


 Device files: special type of file
 Inode encodes <type, major num, minor num>
 Created with mknod() system call

 Devices are traditionally put in /dev


 /dev/sda – First SCSI/SATA/SAS disk
 /dev/sda5 – Fifth partition on the above
 /dev/cdrom0 – First DVD-ROM drive
 /dev/ttyS1 – Second UART
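
For illustration, a node like the ones above can be created programmatically via the mknod() system call (the path and device numbers here are hypothetical):

/* Sketch: create a character-device node, major 42, minor 0 (made up). */
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() */

int main(void)
{
    /* The inode of the special file encodes <type, major, minor>. */
    return mknod("/dev/mydev", S_IFCHR | 0600, makedev(42, 0)) ? 1 : 0;
}

(Running this normally requires root privileges; on a modern system udev creates such nodes automatically, as described later.)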

Pseudo-devices in Unix
 Devices with no hardware!
 Still have major/minor device numbers. Examples:

/dev/stdin (historically a real device node, now a symlink)


/dev/kmem
/dev/random
/dev/null
/dev/loop0

etc.

Old-style Unix device configuration


 All drivers compiled into the kernel
 Each driver probes for any supported devices
 System administrator populates /dev
 Manually types mknod when a new device is purchased!
 Pseudo devices similarly hard-wired in kernel

Linux device configuration today


 Physical hardware configuration readable from /sys
 Special fake file system: sysfs
 Plug events delivered by a special socket
 Drivers dynamically loaded as kernel modules
 Initial list given at boot time
 User-space daemon can load more if required
 /dev populated dynamically by udev
 User-space daemon which polls /sys

Interface to network I/O



Unix interface to network I/O


 You will soon know the data path
 BSD sockets
 bind(), listen(), accept(), connect(), send(), recv(), etc.
 Have not yet seen:
 Device driver interface
 Configuration
 Routing

Software routing

[Diagram: a routing daemon in user space exchanges routing-protocol control messages with the network and installs entries into the FIB inside the kernel's protocol stack.]

 OS protocol stacks include routing functionality
 Routing protocols typically run in a user-space daemon
 Non-critical
 Easier to change
 Forwarding information (the FIB, forwarding information base) typically lives in the kernel
 Needs to be fast
 Integrated into the protocol stack

Network stack implementation



Networking stack
 Probably most important peripheral
 GPU is increasingly not a peripheral
 Disk interfaces look increasingly like a network
 But…
 NO standard OS textbook talks about the network stack!
 Good references:
 The 4.4BSD book (for Unix at least)
 George Varghese: “Network Algorithmics” (up to a point)

Receiving a packet in BSD

[Diagram: applications read from stream (TCP) and datagram (UDP) sockets; TCP, UDP and ICMP sit above IP, which drains a receive queue filled by the network interface inside the kernel. The next three slides annotate the inbound path.]

Receiving a packet in BSD

1. Interrupt (from the network interface)
1.1 Allocate buffer
1.2 Enqueue packet on the receive queue
1.3 Post software interrupt

Receiving a packet in BSD

2. Software interrupt (high priority, runs in any process context)
 Defragmentation
 TCP processing
 Enqueue on socket

Receiving a packet in BSD

3. Application
 Copy buffer to user space
 Runs in the application's process context

Sending a packet in BSD

[Diagram: the same stack, now with a send queue between IP and the network interface. The next three slides annotate the outbound path.]

Sending a packet in BSD

1. Application
 Copy from user space to buffer
 Call TCP code and process
 Possibly enqueue on socket queue

Sending a packet in BSD

2. Software interrupt (any process context)
 Remaining TCP processing
 IP processing
 Enqueue on interface send queue

Sending a packet in BSD

3. Interrupt
 Send packet
 Free buffer

The TCP state machine

[Diagram: the TCP state machine; transitions are labeled event / action.
 Closed → SYN_sent: active open / SYN (also Listen → SYN_sent: send / SYN)
 Closed → Listen: passive open
 Listen → SYN_rcvd: SYN / SYNACK
 SYN_sent → SYN_rcvd: SYN / SYNACK
 SYN_sent → Established: SYNACK / ACK
 SYN_rcvd → Established: ACK
 Established → FIN_wait_1: close / FIN
 Established → Close_wait: FIN / ACK
 FIN_wait_1 → FIN_wait_2: ACK
 FIN_wait_1 → Closing: FIN / ACK
 FIN_wait_2 → Time_wait: FIN / ACK
 Closing → Time_wait: ACK
 Close_wait → Last_Ack: close / FIN
 Last_Ack → Closed: ACK
 Time_wait → Closed: timeout after 2 segment lifetimes]

OS TCP state machine


 More complex! Also needs to handle:
 Congestion control state (window, slow start, etc.)
 Flow control window
 Retransmission timeouts
 Etc.
 State transitions triggered when:
 User request: send, recv, connect, close
 Packet arrives
 Timer expires
 Actions include:
 Set or cancel a timer
 Enqueue a packet on the transmit queue
 Enqueue a packet on the socket receive queue
 Create or destroy a TCP control block
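
A minimal sketch of this event-driven structure (states and events abbreviated; a real control block also tracks windows, timers, and queues - this is not the BSD implementation):

/* Sketch: an event-driven TCP state machine over a control block. */
enum tcp_state { CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED /* ... */ };
enum tcp_event { E_CONNECT, E_RECV_SYN, E_RECV_SYNACK, E_RECV_ACK /* ... */ };

struct tcb {                  /* TCP control block */
    enum tcp_state state;
    /* windows, timers, queues, ... */
};

static void tcp_input(struct tcb *tcb, enum tcp_event ev)
{
    switch (tcb->state) {
    case CLOSED:
        if (ev == E_CONNECT)     { /* send SYN, set timer */ tcb->state = SYN_SENT; }
        break;
    case LISTEN:
        if (ev == E_RECV_SYN)    { /* send SYNACK */ tcb->state = SYN_RCVD; }
        break;
    case SYN_SENT:
        if (ev == E_RECV_SYNACK) { /* send ACK, cancel timer */ tcb->state = ESTABLISHED; }
        break;
    case SYN_RCVD:
        if (ev == E_RECV_ACK)    { tcb->state = ESTABLISHED; }
        break;
    default:
        break;
    }
}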

In-kernel protocol graph

[Diagram: TCP, UDP and ICMP sit above IP; below IP are e.g. ARP, tunneling, and Ethernet, terminating at the Ethernet device. Interfaces between nodes can be standard (e.g. x-kernel, Windows) or protocol-specific (e.g. Unix).]

Protocol graphs
Graph nodes can be:
 Per-protocol (handle all flows)
 Packets are “tagged” with demux tags
 Per-connection (instantiated dynamically)
 Multiple interfaces as well as connections
 Ethernet → Ethernet: bridging
 IP → IP: IP routing

Memory management

Memory management
 Problem: how to ship packet data around
 Need a data structure that can:
 Easily add, remove headers
 Avoid copying lots of payload
 Uniformly refer to half-defined packets
 Fragment large datasets into smaller units
 Solution:
 Data is held in a linked list of “buffer structures”

BSD Unix mbufs (Linux equivalent: sk_buffs)

[Diagram: an mbuf holds a small header - next (next mbuf for this object), offset and length (describing which bytes of the buffer are valid, e.g. 36 bytes starting at offset 24), type (e.g. DATA) - followed by a 112-byte data area and a next-object pointer (next object in a list).]
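
As a sketch, the classic 4.2BSD-era mbuf looked roughly like this (field names follow old BSD kernel headers; the exact layout varied between releases):

/* Rough sketch of a classic BSD mbuf - historic code, not compilable today. */
#include <sys/types.h>

struct mbuf {
    struct mbuf *m_next;     /* next mbuf holding data for this object */
    u_long       m_off;      /* offset of valid data within m_dat */
    short        m_len;      /* number of valid bytes */
    short        m_type;     /* DATA, HEADER, ... */
    u_char       m_dat[112]; /* the data itself */
    struct mbuf *m_act;      /* next object (e.g. next packet) in a list */
};

Prepending a header is then just a matter of chaining another mbuf in front of the list, and stripping one means advancing m_off - no payload is copied.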

Case Study: Linux 3.x

• Implementing a simple protocol over Ethernet


• Why?
 You want to play with networking equipment (well, RAW sockets are easier)
 You want to develop specialized protocols
E.g., application-specific “TCP”
E.g., for low-latency cluster computing
 You’ll understand how it works!

Sending Data in Linux 3.x

• Many layers
• Most use the sk_buff struct

[Diagram: the send path - socket (char* from the application) → tcp_send_msg → tcp_transmit_skb (the sk_buff now carries the TCP header) → ip_queue_xmit → ip_route_output_flow / ip_fragment / ip_forward → dev_queue_xmit → driver.]

Register a receive hook

• Fill a packet_type struct:
 .type = your ethertype
 .func = your receive function
• Receive handler recv_hook(...)
 Gets sk_buff, packet_type, net_device, ...
 Called for each incoming frame
 Reads the skb->data field and processes protocols

[Table: receive hook table mapping ethertypes to handlers, e.g. 0x0800 → IPv4 handler, 0x8864 → PPPoE handler, 0x8915 → RoCE handler, …]
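
A minimal sketch of such a registration (0x88B5 is an ethertype reserved for local experiments; error handling omitted):

/* Sketch: register a receive hook for one ethertype. */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int my_recv_hook(struct sk_buff *skb, struct net_device *dev,
                        struct packet_type *pt, struct net_device *orig_dev)
{
    /* Called once per incoming frame of our type; payload is at skb->data. */
    kfree_skb(skb);   /* consume the frame */
    return 0;
}

static struct packet_type my_ptype = {
    .type = htons(0x88B5),   /* "local experimental" ethertype */
    .func = my_recv_hook,
};

/* In module init:  dev_add_pack(&my_ptype);    */
/* In module exit:  dev_remove_pack(&my_ptype); */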

Interaction with applications

 Socket Interface
 Need to implement handlers for connect(), bind(), listen(), etc.

 Call sock_register(struct net_proto_family*)


 Register a protocol family
 Enables user to create socket of this type
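
A hedged sketch of that registration (the family number and names are made up; a real family must also supply struct proto_ops implementing the socket calls above):

/* Sketch: register a new protocol family (illustrative numbers/names). */
#include <linux/module.h>
#include <linux/net.h>
#include <net/sock.h>

static int my_create(struct net *net, struct socket *sock,
                     int protocol, int kern)
{
    /* Allocate a struct sock, attach our proto_ops (connect/bind/...). */
    return -EAFNOSUPPORT;   /* placeholder */
}

static struct net_proto_family my_family = {
    .family = 27,           /* illustrative protocol family number */
    .create = my_create,
    .owner  = THIS_MODULE,
};

/* In module init:  sock_register(&my_family);                 */
/* Afterwards user space can call socket(27, SOCK_DGRAM, 0).   */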

Anatomy of struct sk_buff

 Called “skb” in Linux jargon


 Allocate via alloc_skb() (or dev_alloc_skb() if in driver)
 Free with kfree_skb() (dev_kfree_skb())
 Use pskb_may_pull(skb, len) to check if data is available
 skb_pull(skb, len) to advance the data pointer
... it even has a webpage! http://www.skbuff.net/
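
For instance, a receive handler that strips a hypothetical 4-byte protocol header might look roughly like this:

/* Sketch: consume a 4-byte header from an incoming skb. */
#include <linux/skbuff.h>

#define MYPROTO_HLEN 4   /* hypothetical header length */

static int my_handle(struct sk_buff *skb)
{
    /* Ensure at least MYPROTO_HLEN contiguous bytes are present. */
    if (!pskb_may_pull(skb, MYPROTO_HLEN)) {
        kfree_skb(skb);
        return -1;
    }
    /* Our header starts at skb->data; parse it here. */
    skb_pull(skb, MYPROTO_HLEN);   /* advance data pointer past our header */
    /* Hand the remaining payload to the next layer... */
    return 0;
}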

SKB fields
 Doubly-linked list; each skb has .next/.prev
 .data contains the payload (the size of the data area is set by alloc_skb())
 .sk is the socket this skb is owned by
 .mac_header, .network_header, .transport_header contain the headers of the various layers
 .dev is the device this skb uses
 ... 58 member fields total

Case study: IP fragmenting

 Linux <2.0.32, fragment reassembly, annotated with two well-formed fragments (#1: offset 0, length 100; #2: offset 100, length 100):

// Determine the position of this fragment.
end = offset + iph->tot_len - ihl;                 // #1: 100, #2: 200

// Check for overlap with preceding fragment, and, if needed,
// align things so that any overlaps are eliminated.
if (prev != NULL && offset < prev->end) {
    i = prev->end - offset;
    offset += i;    /* ptr into datagram */
    ptr += i;       /* ptr into fragment data */
}

// Initialize segment structure.
fp->offset = offset;                               // #1: 0,   #2: 100
fp->end = end;                                     // #1: 100, #2: 200
fp->len = end - offset;                            // #1: 100, #2: 100

// ... collect multiple such fragments in a queue ...

// Process each fragment.
if (count + fp->len > skb->len) {
    error_to_big;
}
memcpy((ptr + fp->offset), fp->ptr, fp->len);
count += fp->len;
fp = fp->next;

Case study: IP fragmenting

 The same code fed overlapping fragments (#1: offset 0, length 100; #2: offset 10, length 20) - the "teardrop" attack:

// Determine the position of this fragment.
end = offset + iph->tot_len - ihl;                 // #1: 100, #2: 30

// Check for overlap with preceding fragment, and, if needed,
// align things so that any overlaps are eliminated.
if (prev != NULL && offset < prev->end) {
    i = prev->end - offset;                        // #2: 100 - 10 = 90
    offset += i;    /* ptr into datagram */        // #2: 100
    ptr += i;       /* ptr into fragment data */
}

// Initialize segment structure.
fp->offset = offset;                               // #1: 0,   #2: 100
fp->end = end;                                     // #1: 100, #2: 30
fp->len = end - offset;                            // #1: 100, #2: -70 (!)

// ... collect multiple such fragments in a queue ...

// Process each fragment.
if (count + fp->len > skb->len) {                  // (size_t)-70 = 4294967226
    error_to_big;
}
memcpy((ptr + fp->offset), fp->ptr, fp->len);      // copies a "huge" length
count += fp->len;
fp = fp->next;
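
A sketch of the kind of sanity check that fixes this (not the literal 2.0.32 patch): treat a fragment whose aligned end does not lie past its aligned offset as malformed, and drop it before the length is ever used:

if (end <= offset) {
    /* The overlap alignment swallowed the whole fragment:
     * malformed (malicious) packet - drop it rather than queue it. */
    kfree_skb(skb);
    return;
}
fp->len = end - offset;   /* now guaranteed positive */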

2.0.32 … that’s so last century!



Performance issues

Life cycle of an I/O request

[Diagram, request path down the left, completion path back up the right:
 User process: requests I/O via a system call; eventually returns from the system call when the I/O is complete.
 I/O subsystem: if it can already satisfy the request (e.g. from the cache), it transfers data to/from user space and returns a completion code; otherwise it sends the request to the device driver and blocks the process if needed. On completion, it performs the transfer and returns.
 Device driver: issues commands to the device and blocks until interrupted; on the way back it demultiplexes the I/O-complete event and determines the source of the request.
 Interrupt handler: handles the interrupt and signals the device driver.
 Physical device: performs the I/O and generates an interrupt when it completes.]

Consider 10 Gb/s Ethernet



At full line rate for 1 x 10 Gb port


 ~1GB (gigabyte) per second
 ~ 700k full-size Ethernet frames per second
 At 2GHz, must process a packet in ≤ 3000 cycles

 This includes:
 IP and TCP checksums
 TCP window calculations and flow control
 Copying packet to user space
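
Rough arithmetic behind these numbers (assuming full-size frames of roughly 1.5 kB):

10 Gb/s ÷ 8 bits/byte ≈ 1.25 GB/s  (≈ 1 GB per second)
1.25 GB/s ÷ ~1.5 kB/frame ≈ 800k frames/s  (on the order of the ~700k above)
2×10^9 cycles/s ÷ ~7×10^5 frames/s ≈ 3000 cycles per packet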

A few numbers…
 L3 cache miss (64-byte lines) ≈ 300 cycles
 At most 10 cache misses per packet
Note: DMA ensures cache is cold for the packet!

 Interrupt latency ≈ 500 cycles


 Kernel entry/exit
 Hardware access
 Context switch / DPC
 Etc.

Plus…
 You also have to send packets.
 Card is full duplex → it can simultaneously send at 10 Gb/s
 You have to do something useful with the packets!
 Can an application make use of 1.5 kB of data every 1000 machine cycles or so?
 This card has two 10 Gb/s ports.

And Plus …

 And if you thought that was fast …
 Mellanox EDR 100 Gb/s Adapter
 Impossible to use without advanced features:
 RDMA
 SR-IOV
 TOE
 Interrupt coalescing

What to do?
 TCP offload (TOE)
 Put TCP processing into hardware on the card
 Buffering
 Transfer lots of packets in a single transaction
 Interrupt coalescing / throttling
 Don’t interrupt on every packet
 Don’t interrupt at all if load is very high
 Receive-side scaling
 Parallelize: direct interrupts and data to different cores

Linux New API (NAPI)


 Mitigate interrupt pressure
1. Each packet interrupts the CPU
2. Versus: the CPU polls the driver
 NAPI switches between the two!
 NAPI-compliant drivers
 Offer a poll() function
 Which calls back into the receive path …

Linux NAPI Balancing

 Driver enables polling with netif_rx_schedule(struct net_device *dev)
 Disables interrupts
 Driver deactivates polling with netif_rx_complete(struct net_device *dev)
 Re-enables interrupts
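
A rough sketch of the resulting driver pattern, using the older API named above (modern kernels use napi_schedule()/napi_complete() on a struct napi_struct; the my_* helpers are hypothetical device-specific functions):

/* Sketch: NAPI-style switch between interrupts and polling. */
#include <linux/interrupt.h>
#include <linux/netdevice.h>

static void my_disable_rx_interrupts(struct net_device *dev); /* hypothetical */
static void my_enable_rx_interrupts(struct net_device *dev);  /* hypothetical */
static int  my_process_rx_ring(struct net_device *dev, int n);/* hypothetical */
static int  my_rx_ring_empty(struct net_device *dev);         /* hypothetical */

/* Interrupt handler: stop per-packet interrupts, switch to polling. */
static irqreturn_t my_isr(int irq, void *dev_id)
{
    struct net_device *dev = dev_id;

    my_disable_rx_interrupts(dev);
    netif_rx_schedule(dev);          /* ask the kernel to poll us */
    return IRQ_HANDLED;
}

/* Poll function: the kernel calls this instead of taking interrupts. */
static int my_poll(struct net_device *dev, int *budget)
{
    int done = my_process_rx_ring(dev, *budget);

    *budget -= done;
    if (my_rx_ring_empty(dev)) {
        netif_rx_complete(dev);      /* leave polling mode */
        my_enable_rx_interrupts(dev);
        return 0;                    /* no more work */
    }
    return 1;                        /* more work: keep polling */
}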

 … but where does the data go?



Buffering
Key ideas:

 Decouple sending and receiving


 Neither side should wait for the other
 Only use interrupts to unblock host

 Batch requests together


 Spread cost of transfer over several packets

Producer-consumer buffer descriptor rings

[Diagram: a circular ring of descriptors. A consumer pointer and a producer pointer divide it into full descriptors (produced, not yet consumed) and free descriptors. Descriptor format: physical address, size in bytes, misc. flags.]
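
A minimal sketch of such a descriptor and ring in C (the layout and names are illustrative, not any specific NIC's format):

#include <stdint.h>

/* One DMA descriptor: what the device reads/writes. */
struct desc {
    uint64_t phys_addr;   /* physical address of the buffer */
    uint32_t size;        /* size in bytes */
    uint32_t flags;       /* misc. flags (ownership, end-of-packet, ...) */
};

#define RING_SIZE 256     /* a power of two simplifies wrapping */

struct ring {
    struct desc d[RING_SIZE];
    uint32_t prod;        /* producer pointer (index) */
    uint32_t cons;        /* consumer pointer (index) */
};

/* Host side of the transmit path: enqueue one full descriptor. */
static int ring_produce(struct ring *r, uint64_t pa, uint32_t len)
{
    uint32_t next = (r->prod + 1) % RING_SIZE;

    if (next == r->cons)
        return -1;        /* ring full: host must block or back off */
    r->d[r->prod] = (struct desc){ .phys_addr = pa, .size = len, .flags = 1 };
    r->prod = next;       /* on real hardware: write the NIC's producer register */
    return 0;
}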

Buffering for network cards

Producer and consumer pointers are NIC registers

 Transmit path:
 Host updates producer pointer, adds packets to ring
 Device updates consumer pointer
 Receive path:
 Host updates consumer pointer, adds empty buffers to ring
 Device updates producer pointer, fills buffers with received packets

More complex protocols are possible…

Example transmit state machine

[Diagram: three states - Running, Idle, and Running/host blocked.
 Running → Running: sends packet; more packets in descriptor ring
 Running → Idle: sends last packet; none left in descriptor ring
 Idle → Running: host updates producer pointer
 Running → Running/host blocked: host updates producer pointer; ring now full
 Running/host blocked → Running/host blocked: sends packet; ring still nearly full
 Running/host blocked → Running: sends packet; ring occupancy below threshold]

Transmit interrupts

[Diagram: the same transmit state machine as on the previous slide.]

 Interrupt when the ring becomes empty
 All packets sent
 Device going idle
 Interrupt when ring occupancy drops below the threshold
 Host can now send again
 Device continues running

Exercise: devise a similar state machine for receive!

Buffering summary
 DMA used twice
 Data transfer
 Reading and writing descriptors
 Similar schemes used for any fast DMA device
 SATA/SAS interfaces (such as AHCI)
 USB2/USB3 controllers
 etc.
 Descriptors transfer ownership of memory regions
 Flexible – many variations possible:
 Host can send lots of regions in advance
 Device might allocate out of regions, send back subsets
 Buffers might be used out-of-order
 Particularly powerful with multiple send and receive queues…

Receive-side scaling
 Insight:
 Too much traffic for one core to handle
 Cores aren’t getting any faster
 Must parallelize across cores
 Key idea: handle different flows on different cores
 But: how to determine flow for each packet?
 Can’t do this on a core: same problem!
 Solution: demultiplex on the NIC
 DMA packets to per-flow buffers / queues
 Send interrupt only to core handling flow
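
Conceptually the NIC does something like the following for each packet (a sketch; real NICs typically use a Toeplitz hash over the 4-tuple and an indirection table):

#include <stdint.h>

#define NQUEUES 8   /* one RX ring + MSI(-X) vector per core (illustrative) */

/* Sketch of NIC-side demultiplexing: same flow -> same queue/core. */
static unsigned pick_queue(uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);

    h ^= h >> 16;        /* cheap mixing; real hardware differs */
    return h % NQUEUES;  /* selects the DMA ring and interrupt target */
}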

Receive-side scaling

[Diagram: a hash of the received packet's header (IP source + destination, TCP source + destination, etc.) indexes a flow table; each entry points to per-flow state: a ring buffer (the DMA address) and a message-signaled interrupt targeting the core that handles the flow.]

Receive-side scaling
 Can balance flows across cores
 Note: doesn’t help with one big flow!
 Assumes:
 n cores processing m flows is faster than one core
 Hence:
 Network stack and protocol graph must scale on a multiprocessor.
 Multiprocessor scaling: topic for later

Next (final) week


 Virtual machines
 Multiprocessor operating systems
