OS - 10 - I/O Subsystems
Mountable: file systems are implemented above block devices
Character Devices
Used for “unstructured I/O”
Byte-stream interface – no block boundaries
Single characters or short strings are get/put
Buffering is implemented by libraries
Examples: keyboards, serial lines, mice
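Character devices are consumed through the ordinary read()/write() byte-stream calls. A minimal sketch, assuming a serial line is present at /dev/ttyS0:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/ttyS0", O_RDONLY);   /* assumed device path */
    if (fd < 0) { perror("open"); return 1; }

    char c;
    /* No block boundaries: read() may return any number of bytes;
     * here we fetch them one at a time. */
    while (read(fd, &c, 1) == 1)
        putchar(c);

    close(fd);
    return 0;
}
```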
Mid-lecture mini-quiz
Character or block device? (raise your hand)
Video card
USB stick
Microphone
Screen (graphics adapter)
Network drive
Pseudo-devices in Unix
Devices with no hardware!
Still have major/minor device numbers. Examples:
/dev/null, /dev/zero, /dev/random, etc.
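The major/minor numbers of such a pseudo-device can be inspected with stat(2); a small sketch using the major()/minor() macros (on Linux, from sys/sysmacros.h):

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(void)
{
    struct stat st;
    /* st_rdev holds the device number for a device special file. */
    if (stat("/dev/null", &st) == 0)
        printf("/dev/null: major %u, minor %u\n",
               major(st.st_rdev), minor(st.st_rdev));
    return 0;
}
```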
Software routing
OS protocol stacks include routing functionality.
Routing protocols typically run in a user-space routing daemon:
Non-critical
Easier to change
Handles routing-protocol control messages
Forwarding information typically lives in the kernel:
FIB (forwarding information base)
Needs to be fast
Integrated into the protocol stack
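To illustrate what the FIB does, here is a deliberately naive longest-prefix-match lookup (a linear scan; real FIBs use tries or hash tables, and all names here are invented):

```c
#include <stdint.h>
#include <stddef.h>

struct fib_entry {
    uint32_t prefix;     /* network prefix as a host-order uint32 */
    uint32_t mask;       /* e.g. 0xffffff00 for a /24 */
    int      out_ifidx;  /* outgoing interface index */
};

/* Return the interface for the longest matching prefix, or -1.
 * For contiguous masks, a longer prefix means a numerically
 * larger mask, so comparing masks picks the longest match. */
int fib_lookup(const struct fib_entry *fib, size_t n, uint32_t dst)
{
    int best_if = -1;
    uint32_t best_mask = 0;
    for (size_t i = 0; i < n; i++) {
        if ((dst & fib[i].mask) == fib[i].prefix &&
            (best_if < 0 || fib[i].mask > best_mask)) {
            best_mask = fib[i].mask;
            best_if = fib[i].out_ifidx;
        }
    }
    return best_if;
}
```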
Networking stack
Probably the most important peripheral:
The GPU is increasingly not a peripheral
Disk interfaces look increasingly like a network
But…
No standard OS textbook talks about the network stack!
Good references:
The 4.4BSD book (for Unix, at least)
George Varghese, “Network Algorithmics” (up to a point)
[Figure: packet receive path; applications sit on stream and datagram sockets, below them the IP layer with its receive queue in the kernel, and the network interface at the bottom.]
Receive, step 1: hardware interrupt
1.1 Allocate buffer
1.2 Enqueue packet on the receive queue
1.3 Post s/w interrupt
Receive, step 2: s/w interrupt (IP up to TCP/UDP/ICMP)
High priority, runs in any process context
IP defragmentation
TCP processing
Enqueue on the socket receive queue
Receive, step 3: application
Copy buffer to user space
Runs in the application’s process context
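The three stages can be mocked up in user space to show the shape of the control flow; every name below is invented and nothing here is the actual kernel API:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define QLEN 16
static char *ip_rx_queue[QLEN];
static int rxq_head, rxq_tail;
static int softirq_pending;

/* 1. "Hardware interrupt": do as little as possible at high priority. */
static void nic_rx_interrupt(const char *wire_data)
{
    char *buf = strdup(wire_data);          /* 1.1 allocate buffer */
    ip_rx_queue[rxq_tail++ % QLEN] = buf;   /* 1.2 enqueue packet */
    softirq_pending = 1;                    /* 1.3 post s/w interrupt */
}

/* 2. "Software interrupt": protocol processing, any process context. */
static void rx_softirq(char *sock_queue[], int *sock_tail)
{
    while (rxq_head != rxq_tail) {          /* defragmentation and TCP */
        char *pkt = ip_rx_queue[rxq_head++ % QLEN]; /* processing here */
        sock_queue[(*sock_tail)++] = pkt;   /* enqueue on socket queue */
    }
    softirq_pending = 0;
}

/* 3. Application context: copy to "user space", free the buffer. */
int main(void)
{
    char *sock_queue[QLEN];
    int sock_tail = 0;

    nic_rx_interrupt("hello");
    if (softirq_pending)
        rx_softirq(sock_queue, &sock_tail);
    for (int i = 0; i < sock_tail; i++) {
        printf("application received: %s\n", sock_queue[i]);
        free(sock_queue[i]);
    }
    return 0;
}
```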
[Figure: packet send path; applications write to stream and datagram sockets, the IP layer feeds the send queue in the kernel, and the network interface drains it.]
Send, step 1: application
Copy data from user space into a buffer
Call the TCP code and process
Possibly enqueue on the socket queue
Send, step 2: s/w interrupt (any process context)
Remaining TCP processing
IP processing
Enqueue on the interface queue
Send, step 3: interrupt
Send packet
Free buffer
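The send-side counterpart, with the same caveats (invented names, control-flow shape only):

```c
#include <stdlib.h>
#include <string.h>

#define QLEN 16
static char *if_queue[QLEN];
static int ifq_head, ifq_tail;

static void tx_softirq(char *buf);

/* 1. Application context: copy from user space, run the TCP send code. */
static void socket_send(const char *user_buf, size_t len)
{
    char *buf = malloc(len);
    memcpy(buf, user_buf, len);   /* copy from user space into a buffer */
    /* TCP processing; the buffer may instead be parked on the socket
     * queue here (e.g. while the send window is closed). */
    tx_softirq(buf);
}

/* 2. "Software interrupt": remaining TCP work, IP processing. */
static void tx_softirq(char *buf)
{
    if_queue[ifq_tail++ % QLEN] = buf;       /* enqueue on i/f queue */
}

/* 3. "Hardware interrupt": the NIC has sent the packet. */
static void nic_tx_interrupt(void)
{
    while (ifq_head != ifq_tail)
        free(if_queue[ifq_head++ % QLEN]);   /* send packet, free buffer */
}

int main(void)
{
    socket_send("hello", 6);
    nic_tx_interrupt();
    return 0;
}
```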
[Figure: fragment of the TCP state machine; from Closed, a passive open leads to Listen while an active open sends a SYN; from Listen, an incoming SYN is answered with a SYN-ACK and a send triggers a SYN; Close returns to Closed.]
[Figure: example protocol graph with ARP, tunneling, and Ethernet nodes above an Ethernet device.]
Protocol graphs
Graph nodes can be:
Per-protocol (handle all flows); packets are “tagged” with demux tags
Per-connection (instantiated dynamically)
Multiple interfaces as well as connections:
bridging connects Ethernet nodes, routing connects IP nodes
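A protocol-graph node might be represented roughly like this (a sketch with invented types; per-connection nodes would be instantiated dynamically, each with its own demux tag):

```c
#include <stdint.h>

struct pkt;                          /* a packet carrying a demux tag */

struct proto_node {
    const char *name;                /* e.g. "ip", "tcp-conn" */
    uint32_t demux_tag;              /* tag this node accepts */
    void (*input)(struct proto_node *self, struct pkt *p);
    struct proto_node *next_hop[4];  /* edges to nodes above/below */
};
```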
Memory management
Problem: how to ship packet data around
Need a data structure that can:
Easily add and remove headers
Avoid copying lots of payload
Uniformly refer to half-defined packets
Fragment large datasets into smaller units
Solution: data is held in a linked list of “buffer structures”
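A minimal sketch of such a buffer structure, loosely modelled on BSD mbufs (names invented): prepending a header only moves an offset inside reserved headroom, so the payload is never copied.

```c
#include <stddef.h>
#include <string.h>

#define BUF_DATA 112            /* data bytes per buffer, as in the figure */

struct netbuf {
    struct netbuf *next;        /* next buffer of this packet */
    size_t offset;              /* start of valid data within data[] */
    size_t length;              /* valid bytes at data + offset */
    int type;                   /* e.g. header vs. payload */
    char data[BUF_DATA];
    struct netbuf *next_object; /* next packet in a queue */
};

/* Prepend a header without touching the payload: move offset back.
 * Buffers are allocated with data placed high to leave headroom. */
static int netbuf_prepend(struct netbuf *b, const void *hdr, size_t hlen)
{
    if (b->offset < hlen)
        return -1;              /* no headroom: would need a new buffer */
    b->offset -= hlen;
    b->length += hlen;
    memcpy(b->data + b->offset, hdr, hlen);
    return 0;
}
```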
[Figure: one buffer structure; fields next, offset, length, and type precede 112 bytes of data, and a next-object pointer links whole packets into queues.]
[Figure: a struct sk_buff carrying 24 bytes of TCP header in front of 112 bytes of data on the Linux transmit path: tcp_transmit_skb, ip_route_output_flow, ip_queue_xmit (ip_forward for forwarded packets), dev_queue_xmit, then the driver.]
Socket Interface
Need to implement handlers for connect(), bind(), listen(), etc.
SKB fields
Doubly-linked list: each skb has .next/.prev
.data contains the payload (the size of the data field is set by alloc_skb)
.sk is the socket this skb is owned by
.mac_header, .network_header, .transport_header contain the headers of the various layers
.dev is the device this skb uses
… 58 member fields in total
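For orientation, an abbreviated and simplified rendering of the fields just listed; the real definition lives in include/linux/skbuff.h, has ~58 members, and its exact types vary across kernel versions:

```c
/* Simplified sketch, not the real kernel definition. */
struct sk_buff_simplified {
    struct sk_buff_simplified *next, *prev; /* doubly-linked list */
    struct sock *sk;                  /* socket owning this skb */
    struct net_device *dev;           /* device this skb uses */
    unsigned char *data;              /* payload; sized at allocation */
    unsigned short transport_header;  /* offsets of the layer headers */
    unsigned short network_header;
    unsigned short mac_header;
};
```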
Performance issues
[Figure: timeline of interrupt handling; an interrupt arrives, the interrupt handler handles it and signals the device driver.]
Per-packet protocol processing includes:
IP and TCP checksums
TCP window calculations and flow control
Copying the packet to user space
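The IP and TCP checksums use the 16-bit one's-complement sum of RFC 1071; a simplified version (assuming an even byte count):

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071-style Internet checksum over 16-bit words. */
static uint16_t inet_checksum(const uint16_t *words, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += words[i];
    while (sum >> 16)                    /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```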
A few numbers…
L3 cache miss (64-byte lines) ≈ 300 cycles
At most 10 cache misses per packet
Note: DMA ensures the cache is cold for the packet!
Plus…
You also have to send packets: the card is full duplex, so it can send at 10 Gb/s as well.
And you have to do something useful with the packets!
Can an application make use of 1.5 kB of data every 1000 machine cycles or so?
This card has two 10 Gb/s ports.
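A back-of-the-envelope check of these numbers, assuming for illustration a 3 GHz core (the clock rate is not stated on the slide):

```latex
% time on the wire for one 1500-byte packet at 10 Gb/s:
t_{\mathrm{pkt}} = \frac{1500 \times 8\,\mathrm{bit}}{10\,\mathrm{Gbit/s}}
                 = 1.2\,\mu\mathrm{s} \approx 3600\ \text{cycles at } 3\,\mathrm{GHz}
% cache-miss budget per packet:
\frac{3600\ \text{cycles}}{300\ \text{cycles/miss}} = 12\ \text{misses}
```

So one 10 Gb/s direction allows roughly a dozen cold misses per packet, matching the budget above; with two full-duplex ports there are four such streams, i.e. about 900 cycles per packet, which is where the "1000 machine cycles or so" comes from.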
What to do?
TCP offload (TOE): put TCP processing into hardware on the card
Buffering: transfer lots of packets in a single transaction
Interrupt coalescing / throttling:
Don’t interrupt on every packet
Don’t interrupt at all if load is very high
Receive-side scaling: parallelize, directing interrupts and data to different cores
Buffering
Key ideas:
[Figure: descriptor ring with a host-updated producer pointer and a device-side consumer pointer; free descriptors lie between them. The device runs while descriptors are outstanding and goes idle when they drain; if the host fills the ring completely it blocks until sent packets bring occupancy back below a threshold.]
Transmit interrupts
[Figure: transmit state machine; the host updates the producer pointer, and when the ring is full the host blocks while the device keeps running. Each sent packet frees a slot, but the ring can stay nearly full.]
Exercise: devise a similar state machine for receive!
Buffering summary
DMA is used twice:
Data transfer
Reading and writing descriptors
Similar schemes are used for any fast DMA device:
SATA/SAS interfaces (such as AHCI)
USB2/USB3 controllers
etc.
Descriptors transfer ownership of memory regions
Flexible – many variations are possible:
The host can send lots of regions in advance
The device might allocate out of regions and send back subsets
Buffers might be used out of order
Particularly powerful with multiple send and receive queues…
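A minimal sketch of the host side of such a descriptor ring (invented layout; real NICs differ in detail, and the device side is omitted):

```c
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 256                  /* power of two */

struct tx_desc {
    uint64_t buf_addr;                 /* DMA address of the packet buffer */
    uint16_t len;
    volatile uint8_t owned_by_device;  /* set by host, cleared by device */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    unsigned prod;                     /* host-updated producer index */
    unsigned cons;                     /* device-advanced consumer index */
};

/* Host side: post a buffer, transferring ownership of the region. */
static bool tx_post(struct tx_ring *r, uint64_t dma_addr, uint16_t len)
{
    unsigned next = (r->prod + 1) % RING_SIZE;
    if (next == r->cons)
        return false;                  /* ring full: host must block */
    r->desc[r->prod] = (struct tx_desc){ dma_addr, len, 1 };
    r->prod = next;                    /* then written to a device register */
    return true;
}
```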
Receive-side scaling
Insight:
Too much traffic for one core to handle
Cores aren’t getting any faster
Must parallelize across cores
Key idea: handle different flows on different cores
But: how do we determine the flow for each packet?
Can’t do this on a core: same problem!
Solution: demultiplex on the NIC
DMA packets to per-flow buffers / queues
Send the interrupt only to the core handling the flow
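The demultiplexing step amounts to hashing the packet's 4-tuple into a queue index. Real NICs use the Toeplitz hash; the mixing function below is an invented stand-in:

```c
#include <stdint.h>

struct flow_key {
    uint32_t ip_src, ip_dst;
    uint16_t tcp_src, tcp_dst;
};

/* Pick a receive queue (and hence a core) from the flow's 4-tuple. */
static unsigned rss_queue(const struct flow_key *k, unsigned nqueues)
{
    uint32_t h = k->ip_src
               ^ (k->ip_dst * 2654435761u)      /* multiplicative mix */
               ^ ((uint32_t)k->tcp_src << 16)
               ^ k->tcp_dst;
    h ^= h >> 16;
    h *= 2654435761u;
    return (h >> 16) % nqueues;   /* index into the indirection table */
}
```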
Receive-side scaling
[Figure: the NIC hashes the packet header (IP source + destination, TCP source + destination, etc.) to index a flow table; each flow-state entry supplies the ring buffer / DMA address and a message-signaled interrupt aimed at the core handling that flow.]
Receive-side scaling
Can balance flows across cores
Note: doesn’t help with one big flow!
Assumes: n cores processing m flows is faster than one core
Hence: the network stack and protocol graph must scale on a multiprocessor
Multiprocessor scaling: a topic for later