Network Congestion Solutions

DOI:10.1145/3009824
Article development led by queue.acm.org

BBR: Congestion-Based Congestion Control

Measuring bottleneck bandwidth and round-trip propagation time.

BY NEAL CARDWELL, YUCHUNG CHENG, C. STEPHEN GUNN, SOHEIL HASSAS YEGANEH, AND VAN JACOBSON

Today TCP's loss-based congestion control is the primary cause of these problems. When bottleneck buffers are large, loss-based congestion control keeps them full, causing bufferbloat. When bottleneck buffers are small, loss-based congestion control misinterprets loss as a signal of congestion, leading to low throughput. Fixing these problems requires an alternative to loss-based congestion control. Finding this alternative requires an understanding of where and how network congestion originates.

At any time, a (full-duplex) TCP connection has exactly one slowest link or bottleneck in each direction. The bottleneck is important because:
• It determines the connection's maximum data-delivery rate.
• It is where persistent queues form.
Figure 1 plots round-trip time and delivery rate against the amount of data in flight: when inflight is below the BDP, RTprop determines behavior; otherwise, BtlBw dominates. Constraint lines intersect at inflight = BtlBw × RTprop, a.k.a. the pipe's BDP (bandwidth-delay product). Since the pipe is full past this point, the inflight − BDP excess creates a queue at the bottleneck, which shows up as a linear dependence of RTT on inflight, and packets are dropped once the excess exceeds the bottleneck buffer. Congestion is just sustained operation to the right of the BDP line, and congestion control is some scheme to bound how far to the right a connection operates on average.

Loss-based congestion control operates at the right edge of the bandwidth-limited region, delivering full bottleneck bandwidth at the cost of high delay and frequent packet loss. When memory was expensive, buffer sizes were only slightly larger than the BDP, which minimized loss-based congestion control's excess delay. Subsequent memory price decreases resulted in buffers orders of magnitude larger than ISP link BDPs, and the resulting bufferbloat yielded RTTs of seconds instead of milliseconds.9

The left edge of the bandwidth-limited region is a better operating point than the right.
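To put numbers on these operating points: on a 10Mbps, 40ms path (parameters used in several experiments later in this article), the BDP is 10Mbps × 0.04s = 400Kbit = 50KB, about 33 full-sized 1,500-byte packets. A sender keeping roughly 33 packets in flight gets full throughput at minimal RTT; pushing more inflight into the bottleneck buffer adds delay but no extra delivery rate.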
In 1979, Leonard Kleinrock16 showed this operating point was optimal, maximizing delivered bandwidth while minimizing delay and loss, both for individual connections and for the network as a whole.8 Unfortunately, around the same time Jeffrey M. Jaffe14 proved it was impossible to create a distributed algorithm that converged to this operating point. This result changed the direction of research from finding a distributed algorithm that achieved Kleinrock's optimal operating point to investigating different approaches to congestion control.

Our group at Google spends hours each day examining TCP packet header captures from all over the world, making sense of behavior anomalies and pathologies. Our usual first step is finding the essential path characteristics, RTprop and BtlBw. That these can be inferred from traces suggests that Jaffe's result might not be as limiting as it once appeared. His result rests on fundamental measurement ambiguities (for example, whether a measured RTT increase is caused by a path-length change, a bottleneck bandwidth decrease, or a queuing delay increase from another connection's traffic). Although it is impossible to disambiguate any single measurement, a connection's behavior over time tells a clearer story, suggesting the possibility of measurement strategies designed to resolve ambiguity.

Combining these measurements with a robust servo loop using recent control systems advances12 could result in a distributed congestion-control protocol that reacts to actual congestion, not packet loss or transient queue delay, and converges with high probability to Kleinrock's optimal operating point. Thus began our three-year quest to create a congestion control based on measuring the two parameters that characterize a path: bottleneck bandwidth and round-trip propagation time, or BBR.

Characterizing the Bottleneck
A connection runs with the highest throughput and lowest delay when (rate balance) the bottleneck packet arrival rate equals BtlBw and (full pipe) the total data in flight is equal to the BDP (= BtlBw × RTprop).

[Figure 1. Delivery rate and round-trip time vs. inflight. Regions left to right: app limited, bandwidth limited, buffer limited; region boundaries at inflight = BDP and inflight = BDP + BtlneckBufSize.]

The first condition guarantees that the bottleneck can run at 100% utilization. The second guarantees there is enough data to prevent bottleneck starvation but not overfill the pipe. The rate balance condition alone does not ensure there is no queue, only that it cannot change size (for example, if a connection starts by sending its 10-packet Initial Window into a five-packet BDP, it runs at exactly the bottleneck rate but with a persistent five-packet queue).
BtlBw and RTprop vary over the life of a connection, so they must be estimated continuously. TCP already tracks RTT (the time from sending a data packet until it is acknowledged) since it is required for loss detection. At any time t, RTTt = RTprop + ηt, where ηt ≥ 0 is noise from queues along the path, the receiver's delayed-ack strategy, ack aggregation, and so on. RTprop is a physical property of the path and changes only when the path changes; since path changes happen on time scales much longer than RTprop, an unbiased, efficient estimator at time T is

$\widehat{RTprop} = RTprop + \min(\eta_t) = \min(RTT_t) \quad \forall t \in [T - W_R, T]$

(That is, a running min over a time window W_R, which is typically tens of seconds to minutes.)

Unlike RTT, nothing in the TCP spec requires implementations to track bottleneck bandwidth, but a good estimate results from tracking delivery rate. When the ack for some packet arrives back at the sender, it conveys that packet's RTT and announces the delivery of data inflight when that packet departed. Average delivery rate between send and ack is the ratio of data delivered to time elapsed: deliveryRate = Δdelivered/Δt. This rate must be ≤ the bottleneck rate (the arrival amount is known exactly, so all the uncertainty is in the Δt, which must be ≥ the true arrival interval; thus, the ratio must be ≤ the true delivery rate, which is, in turn, upper-bounded by the bottleneck capacity). Therefore, a windowed max of delivery rate is an efficient, unbiased estimator of BtlBw:

$\widehat{BtlBw} = \max(deliveryRate_t) \quad \forall t \in [T - W_B, T]$

where the time window W_B is typically six to 10 RTTs.

TCP must record the departure time of each packet to compute RTT. BBR augments that record with the total data delivered, so each ack arrival yields both an RTT and a delivery rate measurement that the filters convert to RTprop and BtlBw estimates.
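To make the two filters concrete, here is a minimal sketch in Python of the windowed min/max estimator described above. The class and field names are illustrative (the pseudocode in Figure 3 calls them RTpropFilter.currentMin and BtlBwFilter.currentMax), and the window lengths are stand-in values for W_R and W_B:

import time
from collections import deque

class WindowedFilter:
    """Running min or max of timestamped samples over a sliding window."""
    def __init__(self, window, is_max):
        self.window = window    # seconds: W_R (tens of s) or W_B (6-10 RTTs)
        self.is_max = is_max
        self.samples = deque()  # (time, value), kept monotone front-to-back

    def update(self, t, value):
        # Drop samples the new one dominates; the front then always
        # holds the windowed max (or min).
        while self.samples and (
                value >= self.samples[-1][1] if self.is_max
                else value <= self.samples[-1][1]):
            self.samples.pop()
        self.samples.append((t, value))
        # Expire samples older than the window.
        while self.samples[0][0] < t - self.window:
            self.samples.popleft()

    @property
    def current(self):
        return self.samples[0][1] if self.samples else 0.0

now = time.monotonic
rtprop_filter = WindowedFilter(window=10.0, is_max=False)  # min RTT -> RTprop
btlbw_filter = WindowedFilter(window=1.0, is_max=True)     # max rate -> BtlBw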
Note that these values are completely independent: RTprop can change (for example, on a route change) but still have the same bottleneck, or BtlBw can change (for example, when a wireless link changes rate) without the path changing. (This independence is why both constraints have to be known to match sending behavior to delivery path.) Since RTprop is visible only to the left of BDP and BtlBw only to the right in Figure 1, they obey an uncertainty principle: whenever one can be measured, the other cannot. Intuitively, this is because the pipe has to be overfilled to find its capacity, which creates a queue that obscures the length of the pipe. For example, an application running a request/response protocol might never send enough data to fill the pipe and observe BtlBw. A multi-hour bulk data transfer might spend its entire lifetime in the bandwidth-limited region and have only a single sample of RTprop from the first packet's RTT. This intrinsic uncertainty means that in addition to estimators to recover the two path parameters, there must be states that track both what can be learned at the current operating point and, as information becomes stale, how to get to an operating point where it can be relearned.

Matching the Packet Flow to the Delivery Path
The core BBR algorithm has two parts:

When an ack is received. Each ack provides new RTT and delivery rate measurements that update the RTprop and BtlBw estimates, as illustrated in Figure 2. The if checks address the uncertainty issue described in the last paragraph: senders can be application limited, meaning the application runs out of data to fill the network. This is quite common because of request/response traffic. When there is a send opportunity but no data to send, BBR marks the corresponding bandwidth sample(s) as application limited (see the send() pseudocode to follow). The code here decides which samples to include in the bandwidth model so it reflects network, not application, limits. BtlBw is a hard upper bound on the delivery rate, so a measured delivery rate larger than the current BtlBw estimate must mean the estimate is too low, whether or not the sample was app-limited. Otherwise, application-limited samples are discarded. (Figure 1 shows that in the app-limited region deliveryRate underestimates BtlBw. These checks prevent filling the BtlBw filter with underestimates that would cause data to be sent too slowly.)
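A sketch of that ack-processing logic, in the same Python style, reusing the WindowedFilter objects above and the per-packet fields (sendtime, delivered, delivered_time, app_limited) that send() in Figure 3 records; the published Figure 2 may differ in detail:

def on_ack(packet):
    # Each ack yields one RTT sample and one delivery-rate sample.
    global delivered, delivered_time, app_limited_until
    rtt = now() - packet.sendtime
    rtprop_filter.update(now(), rtt)

    delivered += packet.size
    delivered_time = now()

    # Average rate over this packet's send-to-ack interval:
    # deliveryRate = Δdelivered / Δt.
    delivery_rate = ((delivered - packet.delivered) /
                     (delivered_time - packet.delivered_time))

    # App-limited samples underestimate BtlBw, so keep one only if it
    # exceeds the current estimate (which proves the estimate too low).
    if delivery_rate > btlbw_filter.current or not packet.app_limited:
        btlbw_filter.update(now(), delivery_rate)

    if app_limited_until > 0:
        app_limited_until -= packet.size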
When data is sent. To match the packet-arrival rate to the bottleneck link's departure rate, BBR paces every data packet. BBR must match the bottleneck rate, which means pacing is integral to the design and fundamental to operation: pacing_rate is BBR's primary control parameter. A secondary parameter, cwnd_gain, bounds inflight to a small multiple of the BDP to handle common network and receiver pathologies (as we will discuss). Conceptually, the TCP send routine looks like the pseudocode in Figure 3.

Figure 3. Packet-send half of the BBR algorithm.

function send(packet)
   bdp = BtlBwFilter.currentMax * RTpropFilter.currentMin
   if (inflight >= cwnd_gain * bdp)
      // wait for ack or retransmission timeout
      return
   if (now >= nextSendTime)
      packet = nextPacketToSend()
      if (! packet)
         // out of data: remember how much inflight was app-limited
         app_limited_until = inflight
         return
      packet.app_limited = (app_limited_until > 0)
      // record state for this packet's future RTT and rate samples
      packet.sendtime = now
      packet.delivered = delivered
      packet.delivered_time = delivered_time
      ship(packet)
      // pace the next departure at pacing_gain × the BtlBw estimate
      nextSendTime = now + packet.size /
                     (pacing_gain * BtlBwFilter.currentMax)
   timerCallbackAt(send, nextSendTime)
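For a sense of the pacing arithmetic: with a BtlBw estimate of 10Mbps, a 1,500-byte packet, and pacing_gain = 1, send() sets nextSendTime 1,500 × 8 / 10,000,000 = 1.2ms after the previous send, so departures match the estimated bottleneck rate.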
[Figure 4. RTT (blue), inflight (green), and delivery rate (red) detail. Annotations: the cycle gain sequence (..., 1.00, 1.25, 0.75, 1.00, ...); max BtlBw × cycle gain used as sending rate; an ack from a send updates the filter one RTT later; gain > 1, so inflight increases; pipe full, so RTT increases.]

The pacing_gain for each part of the cycle is shown time-aligned with the data it influenced. The gain is applied an RTT earlier, when the data is sent. This is indicated by the horizontal jog in the event sequence description running up the left side.

BBR minimizes delay by spending most of its time with one BDP in flight, paced at the BtlBw estimate. This moves the bottleneck to the sender so it cannot observe BtlBw increases. Consequently, BBR periodically spends an RTprop interval at a pacing_gain > 1, which increases the sending rate and inflight. If BtlBw hasn't changed, then a queue is created at the bottleneck, increasing RTT, which keeps deliveryRate constant. (This queue is removed by sending at a compensating pacing_gain < 1 for the next RTprop.) If BtlBw has increased, deliveryRate increases and the new max immediately increases the BtlBw filter output, increasing the base pacing rate. Thus, BBR converges quickly to the new bottleneck rate: Figure 5 shows a 10Mbps, 40ms flow adapting as BtlBw steps up to 20Mbps (top graph) and then back down to 10Mbps after steady operation at 20Mbps (bottom graph).

(BBR is a simple instance of a Max-plus control system, a new approach to control based on nonstandard algebra.12 This approach allows the adaptation rate [controlled by the max gain] to be independent of the queue growth [controlled by the average gain]. Applied to this problem, it results in a simple, implicit control loop where the adaptation to physical constraints is automatic.)
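The cycle itself is simple enough to state in a few lines. A sketch, again in Python, using the eight-phase gain pattern visible in Figure 4's gain row (1.25 and 0.75 followed by six 1.0 intervals); the function and state layout are illustrative, and now() is as in the earlier sketch:

PROBE_BW_GAINS = (1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)

def advance_probe_bw(phase, phase_start, rtprop):
    # Advance one phase per RTprop interval; return the pacing_gain
    # to apply plus the updated phase state.
    if now() - phase_start >= rtprop:
        phase = (phase + 1) % len(PROBE_BW_GAINS)
        phase_start = now()
    return PROBE_BW_GAINS[phase], phase, phase_start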
[Figure 5. Bandwidth change. Traces of inflight and RTT as the bottleneck bandwidth steps.]

The Startup and Drain states are used at connection start (Figure 6). To handle Internet link bandwidths spanning 12 orders of magnitude, Startup implements a binary search for BtlBw by using a gain of 2/ln 2 to double the sending rate while delivery rate is increasing. This discovers BtlBw in log2(BDP) RTTs but creates up to 2BDP of excess queue in the process. Once Startup finds BtlBw, BBR transitions to Drain, which uses the inverse of Startup's gain to get rid of the excess queue.

Unfairness can persist, though, when late starters overestimate RTprop as a result of starting when other flows have (temporarily) created a queue.

To learn the true RTprop, a flow moves to the left of BDP using ProbeRTT state: when the RTprop estimate has not been updated (that is, by measuring a lower RTT) for many seconds, BBR enters ProbeRTT, which reduces the inflight to four packets for at least one round trip, then returns to the previous state. Large flows entering ProbeRTT drain many packets from the queue, so several flows see a new RTprop (new minimum RTT). This makes their RTprop estimates expire at the same time, so they enter ProbeRTT together, which makes the total queue dip larger and causes more flows to see a new RTprop, and so on. This distributed coordination is the key to both fairness and stability.
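Assembled as a state machine, the behavior described in the last few paragraphs looks roughly like the sketch below. The gains come from the text; the helper methods, the Drain exit test, and the exact staleness threshold (the text says only 'many seconds') are illustrative assumptions:

import math

STARTUP_GAIN = 2 / math.log(2)   # ≈ 2.89; doubles the sending rate each RTT
DRAIN_GAIN = 1 / STARTUP_GAIN    # inverse gain drains the Startup queue
RTPROP_STALE = 10.0              # seconds; assumed value for 'many seconds'

def bbr_next_state(bbr):
    if bbr.state == "STARTUP" and not bbr.delivery_rate_increasing():
        bbr.state = "DRAIN"          # BtlBw found: now remove excess queue
    elif bbr.state == "DRAIN" and bbr.inflight <= bbr.bdp():
        bbr.state = "PROBE_BW"       # queue gone: steady-state gain cycling
    elif bbr.state == "PROBE_BW" and bbr.rtprop_age() > RTPROP_STALE:
        bbr.state = "PROBE_RTT"      # RTprop stale: drain pipe to remeasure
        bbr.cwnd = 4                 # four packets for at least one round trip
    elif bbr.state == "PROBE_RTT" and bbr.probe_rtt_done():
        bbr.state = "PROBE_BW"       # return to the previous state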
BBR synchronizes flows around the desirable event of an empty bottleneck queue. By contrast, loss-based congestion control synchronizes around the undesirable events of periodic queue growth and overflow, amplifying delay and packet loss.

[Figure 7. First eight seconds of 10Mbps, 40ms CUBIC and BBR flows. RTT (ms) over time; annotation marks packet loss and recovery episodes.]

Google B4 WAN Deployment Experience
Each prober opens persistent BBR and CUBIC connections to remote datacenters, then transfers 8MB of data every minute. Probers communicate via many B4 paths within and between North America, Europe, and Asia.

[Figure: distribution of prober throughput (Mbps, log scale), CUBIC vs. BBR.]

[Figure: per-path BBR improvement vs. CUBIC throughput (Mbps, log scale), annotated "2x Improvement".]

The huge improvement is a direct consequence of BBR not using loss as a congestion indicator. To achieve full bandwidth, existing loss-based congestion controls require the loss rate to be less than the inverse square of the BDP17 (for example, fewer than one loss per 30 million packets for a 10Gbps/100ms path). Figure 10 compares measured goodput at various loss rates.
CUBIC's loss tolerance is a structural property of the algorithm, while BBR's is a configuration parameter. As BBR's loss rate approaches the ProbeBW peak gain, the probability of measuring a delivery rate of the true BtlBw drops sharply, causing the max filter to underestimate.

[Figure 10. BBR vs. CUBIC goodput under loss. Throughput (Mbps) vs. loss rate (%, log scale, 0.001 to 50).]

Figure 10 shows BBR vs. CUBIC goodput for 60-second flows on a 100Mbps/100ms link with 0.001 to 50% random loss. CUBIC's throughput decreases by 10 times at 0.1% loss and totally stalls above 1%. The maximum possible throughput is the link rate times the fraction delivered (= 1 − lossRate). BBR meets this limit up to a 5% loss and is close up to 15%.
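As a cross-check of that limit: at 5% random loss, the 100Mbps link can deliver at most 100 × (1 − 0.05) = 95Mbps of goodput, the ceiling BBR is described as meeting.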
YouTube Edge Deployment Experience

[Figure 11. BBR vs. CUBIC median RTT improvement. Ratio of CUBIC RTT to BBR RTT.]

Figure 11 shows BBR vs. CUBIC median RTT improvement from more than 200 million YouTube playback connections measured on five continents over a week.

More than half of the world's seven billion mobile Internet subscriptions connect via 8kbps to 114kbps 2.5G systems,5 which suffer well-documented problems because of loss-based congestion control's buffer-filling propensities.3 The bottleneck link for these systems is usually between the SGSN (serving GPRS support node)18 and the mobile device. SGSN software runs on a standard PC platform with ample memory, so there are frequently megabytes of buffer between the Internet and the mobile device. Figure 12 compares (emulated) SGSN Internet-to-mobile delay for BBR and CUBIC. The horizontal lines mark one of the more serious consequences: TCP adapts to long RTT delay except on the connection-initiation SYN packet, which has an OS-dependent fixed timeout. When the mobile device is receiving bulk data (for example, from automatic app updates) via a large-buffered SGSN, the device cannot connect to anything on the Internet until the queue empties (the SYN ACK accept packet is delayed for longer than the fixed SYN timeout).

[Figure 12: median RTT vs. bottleneck buffer (KB, 150 to 9750); horizontal lines mark where new connections fail in Linux/Android and in Windows/Mac OS/iOS.]

Figure 12 shows steady-state median RTT variation with link buffer size on a 128Kbps/40ms link with eight BBR (green) or CUBIC (red) flows. BBR keeps the queue near its minimum, independent of both bottleneck buffer size and number of active flows. CUBIC flows always fill the buffer, so the delay grows linearly with buffer size.
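That linear growth is just the drain time of a full buffer: at 128Kbps, a 1,500KB buffer holds 1,500 × 8 / 128 ≈ 94 seconds of queue, which dwarfs any OS's fixed SYN timeout and explains why new connections fail at the buffer sizes marked in the figure.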