
practice

DOI:10.1145/3009824
Article development led by queue.acm.org

BBR: Congestion-Based Congestion Control

Measuring bottleneck bandwidth and round-trip propagation time.

BY NEAL CARDWELL, YUCHUNG CHENG, C. STEPHEN GUNN, SOHEIL HASSAS YEGANEH, AND VAN JACOBSON

BY ALL ACCOUNTS, today's Internet is not moving data as well as it should. Most of the world's cellular users experience delays of seconds to minutes; public Wi-Fi in airports and conference venues is often worse. Physics and climate researchers need to exchange petabytes of data with global collaborators but find their carefully engineered multi-Gbps infrastructure often delivers at only a few Mbps over intercontinental distances.6

These problems result from a design choice made when TCP congestion control was created in the 1980s—interpreting packet loss as "congestion."13 This equivalence was true at the time, but only because of technology limitations, not first principles. As NICs (network interface controllers) evolved from Mbps to Gbps and memory chips from KB to GB, the relationship between packet loss and congestion became more tenuous.

Today TCP's loss-based congestion control—even with the current best of breed, CUBIC11—is the primary cause of these problems. When bottleneck buffers are large, loss-based congestion control keeps them full, causing bufferbloat. When bottleneck buffers are small, loss-based congestion control misinterprets loss as a signal of congestion, leading to low throughput. Fixing these problems requires an alternative to loss-based congestion control. Finding this alternative requires an understanding of where and how network congestion originates.

Congestion and Bottlenecks

At any time, a (full-duplex) TCP connection has exactly one slowest link, or bottleneck, in each direction. The bottleneck is important because:

• It determines the connection's maximum data-delivery rate. This is a general property of incompressible flow (for example, picture a six-lane freeway at rush hour where an accident has reduced one short section to a single lane; the traffic upstream of the accident moves no faster than the traffic through that lane).

• It is where persistent queues form. Queues shrink only when a link's departure rate exceeds its arrival rate. For a connection running at maximum delivery rate, all links upstream of the bottleneck have a faster departure rate, so their queues migrate to the bottleneck.

Regardless of how many links a connection traverses or what their individual speeds are, from TCP's viewpoint an arbitrarily complex path behaves as a single link with the same RTT (round-trip time) and bottleneck rate. Two physical constraints, RTprop (round-trip propagation time) and BtlBw (bottleneck bandwidth), bound transport performance. (If the network path were a physical pipe, RTprop would be its length and BtlBw its minimum diameter.)

Figure 1 shows RTT and delivery rate variation with the amount of data in flight (data sent but not yet acknowledged). Blue lines show the RTprop constraint, green lines the BtlBw constraint, and red lines the bottleneck buffer. Operation in the shaded regions is not possible since it would violate at least one constraint. Transitions between constraints result in three different regions (app-limited, bandwidth-limited, and buffer-limited) with qualitatively different behavior.

Figure 1. Delivery rate and round-trip time vs. inflight.

When there isn't enough data in flight to fill the pipe, RTprop determines behavior; otherwise, BtlBw dominates. Constraint lines intersect at inflight = BtlBw × RTprop, a.k.a. the pipe's BDP (bandwidth-delay product). Since the pipe is full past this point, the inflight−BDP excess creates a queue at the bottleneck, which results in the linear dependence of RTT on inflight data shown in the upper graph. Packets are dropped when the excess exceeds the buffer capacity. Congestion is just sustained operation to the right of the BDP line, and congestion control is some scheme to bound how far to the right a connection operates on average.
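As a concrete illustration of these quantities, the short Python sketch below computes the BDP for an assumed 10Mbps, 40ms path (the example parameters used later in the article) and classifies an inflight amount into the app-limited, bandwidth-limited, or buffer-limited region. The helper names and the specific buffer size are illustrative, not part of BBR itself.

    # Illustrative sketch: BDP and operating regions (assumed example values).
    def bdp_bytes(btlbw_bps, rtprop_s):
        # BDP = bottleneck bandwidth x round-trip propagation time
        return btlbw_bps / 8 * rtprop_s

    def region(inflight_bytes, btlbw_bps, rtprop_s, btlneck_buf_bytes):
        bdp = bdp_bytes(btlbw_bps, rtprop_s)
        if inflight_bytes < bdp:
            return "app-limited"          # pipe not yet full; RTT stays at RTprop
        elif inflight_bytes < bdp + btlneck_buf_bytes:
            return "bandwidth-limited"    # pipe full; the excess queues at the bottleneck
        else:
            return "buffer-limited"       # queue overflows; packets are dropped

    bdp = bdp_bytes(10e6, 0.040)          # 10Mbps, 40ms -> 50,000 bytes (~33 x 1,500-byte packets)
    print(bdp, region(75_000, 10e6, 0.040, btlneck_buf_bytes=100_000))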
Loss-based congestion control operates at the right edge of the bandwidth-limited region, delivering full bottleneck bandwidth at the cost of high delay and frequent packet loss. When memory was expensive, buffer sizes were only slightly larger than the BDP, which minimized loss-based congestion control's excess delay. Subsequent memory price decreases resulted in buffers orders of magnitude larger than ISP link BDPs, and the resulting bufferbloat yielded RTTs of seconds instead of milliseconds.9

The left edge of the bandwidth-limited region is a better operating point than the right. In 1979, Leonard Kleinrock16 showed this operating point was optimal, maximizing delivered bandwidth while minimizing delay and loss, both for individual connections and for the network as a whole.8 Unfortunately, around the same time Jeffrey M. Jaffe14 proved it was impossible to create a distributed algorithm that converged to this operating point. This result changed the direction of research from finding a distributed algorithm that achieved Kleinrock's optimal operating point to investigating different approaches to congestion control.

Our group at Google spends hours each day examining TCP packet header captures from all over the world, making sense of behavior anomalies and pathologies. Our usual first step is finding the essential path characteristics, RTprop and BtlBw. That these can be inferred from traces suggests that Jaffe's result might not be as limiting as it once appeared. His result rests on fundamental measurement ambiguities (for example, whether a measured RTT increase is caused by a path-length change, a bottleneck bandwidth decrease, or a queuing delay increase from another connection's traffic). Although it is impossible to disambiguate any single measurement, a connection's behavior over time tells a clearer story, suggesting the possibility of measurement strategies designed to resolve ambiguity.

Combining these measurements with a robust servo loop using recent control systems advances12 could result in a distributed congestion-control protocol that reacts to actual congestion, not packet loss or transient queue delay, and converges with high probability to Kleinrock's optimal operating point. Thus began our three-year quest to create a congestion control based on measuring the two parameters that characterize a path: bottleneck bandwidth and round-trip propagation time, or BBR.

Characterizing the Bottleneck

A connection runs with the highest throughput and lowest delay when (rate balance) the bottleneck packet arrival rate equals BtlBw and (full pipe) the total data in flight is equal to the BDP (= BtlBw × RTprop).

The first condition guarantees that the bottleneck can run at 100% utilization. The second guarantees there is enough data to prevent bottleneck starvation but not overfill the pipe. The rate balance condition alone does not ensure there is no queue, only that it cannot change size (for example, if a connection starts by sending its 10-packet Initial Window into a five-packet BDP, then runs at exactly the bottleneck rate, five of the 10 initial packets fill the pipe, so the excess forms a standing queue at the bottleneck that cannot dissipate). Similarly, the full pipe condition does not guarantee there is no queue (for example, a connection sending a BDP in BDP/2 bursts gets full bottleneck utilization, but with an average queue of BDP/4). The only way to minimize the queue at the bottleneck and all along the path is to meet both conditions simultaneously.

BtlBw and RTprop vary over the life of a connection, so they must be continuously estimated. TCP currently tracks RTT (the time interval from sending a data packet until it is acknowledged) since it is required for loss detection. At any time t,

    RTTt = RTpropt + ηt

where η ≥ 0 represents the "noise" introduced by queues along the path, the receiver's delayed ack strategy, ack aggregation, and so on. RTprop is a physical property of the connection's path and changes only when the path changes. Since path changes happen on time scales ≫ RTprop, an unbiased, efficient estimator at time T is


    RTprop_hat = RTprop + min(ηt) = min(RTTt)   ∀t ∈ [T − WR, T]

(That is, a running min over a time window WR, which is typically tens of seconds to minutes.)

Unlike RTT, nothing in the TCP spec requires implementations to track bottleneck bandwidth, but a good estimate results from tracking delivery rate. When the ack for some packet arrives back at the sender, it conveys that packet's RTT and announces the delivery of data inflight when that packet departed. Average delivery rate between send and ack is the ratio of data delivered to time elapsed: deliveryRate = Δdelivered/Δt. This rate must be ≤ the bottleneck rate (the arrival amount is known exactly, so all the uncertainty is in the Δt, which must be ≥ the true arrival interval; thus, the ratio must be ≤ the true delivery rate, which is, in turn, upper-bounded by the bottleneck capacity). Therefore, a windowed max of delivery rate is an efficient, unbiased estimator of BtlBw:

    BtlBw_hat = max(deliveryRatet)   ∀t ∈ [T − WB, T]

where the time window WB is typically six to 10 RTTs.
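The two estimators above amount to a windowed min filter over RTT samples and a windowed max filter over delivery-rate samples. The minimal Python sketch below shows one way such filters could be kept up to date as each (time, value) sample arrives; the class and parameter names are illustrative and deliberately simpler than what a production implementation would use.

    from collections import deque

    class WindowedFilter:
        """Running min or max of (time, value) samples over a sliding time window."""
        def __init__(self, window, better):
            self.window = window      # WR (seconds) for RTprop, roughly WB RTTs for BtlBw
            self.better = better      # min for the RTprop filter, max for the BtlBw filter
            self.samples = deque()    # monotone in value, so samples[0] holds the current best

        def update(self, now, value):
            # Drop samples that have aged out of the window.
            while self.samples and now - self.samples[0][0] > self.window:
                self.samples.popleft()
            # Drop samples that can never be the best again once the new one is in the window.
            while self.samples and self.better(self.samples[-1][1], value) == value:
                self.samples.pop()
            self.samples.append((now, value))
            return self.samples[0][1]  # current filtered estimate

    rtprop_filter = WindowedFilter(window=10.0, better=min)       # running min of RTT (WR ~ 10s, assumed)
    btlbw_filter  = WindowedFilter(window=10 * 0.040, better=max)  # running max of deliveryRate (~10 RTTs of an assumed 40ms RTT)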
TCP must record the departure time of each packet to compute RTT. BBR augments that record with the total data delivered, so each ack arrival yields both an RTT and a delivery rate measurement that the filters convert to RTprop and BtlBw estimates.

Note that these values are completely independent: RTprop can change (for example, on a route change) but still have the same bottleneck, or BtlBw can change (for example, when a wireless link changes rate) without the path changing. (This independence is why both constraints have to be known to match sending behavior to the delivery path.) Since RTprop is visible only to the left of BDP and BtlBw only to the right in Figure 1, they obey an uncertainty principle: whenever one can be measured, the other cannot. Intuitively, this is because the pipe has to be overfilled to find its capacity, which creates a queue that obscures the length of the pipe. For example, an application running a request/response protocol might never send enough data to fill the pipe and observe BtlBw. A multi-hour bulk data transfer might spend its entire lifetime in the bandwidth-limited region and have only a single sample of RTprop, from the first packet's RTT. This intrinsic uncertainty means that in addition to estimators to recover the two path parameters, there must be states that track both what can be learned at the current operating point and, as information becomes stale, how to get to an operating point where it can be relearned.

Matching the Packet Flow to the Delivery Path

The core BBR algorithm has two parts:

When an ack is received. Each ack provides new RTT and delivery rate measurements that update the RTprop and BtlBw estimates, as illustrated in Figure 2.

Figure 2. Ack-arrival half of the BBR algorithm.

    function onAck(packet)
      rtt = now - packet.sendtime
      update_min_filter(RTpropFilter, rtt)            // RTprop estimate: windowed min of RTT
      delivered += packet.size
      delivered_time = now
      deliveryRate = (delivered - packet.delivered) /
                     (delivered_time - packet.delivered_time)
      if (deliveryRate > BtlBwFilter.currentMax || ! packet.app_limited)
        update_max_filter(BtlBwFilter, deliveryRate)  // BtlBw estimate: windowed max of delivery rate
      if (app_limited_until > 0)
        app_limited_until = app_limited_until - packet.size

The if checks address the uncertainty issue described in the last paragraph: senders can be application limited, meaning the application runs out of data to fill the network. This is quite common because of request/response traffic. When there is a send opportunity but no data to send, BBR marks the corresponding bandwidth sample(s) as application limited (see the send() pseudocode to follow). The code here decides which samples to include in the bandwidth model so it reflects network, not application, limits. BtlBw is a hard upper bound on the delivery rate, so a measured delivery rate larger than the current BtlBw estimate must mean the estimate is too low, whether or not the sample was app-limited. Otherwise, application-limited samples are discarded. (Figure 1 shows that in the app-limited region deliveryRate underestimates BtlBw. These checks prevent filling the BtlBw filter with underestimates that would cause data to be sent too slowly.)

When data is sent. To match the packet-arrival rate to the bottleneck link's departure rate, BBR paces every data packet. BBR must match the bottleneck rate, which means pacing is integral to the design and fundamental to operation—pacing_rate is BBR's primary control parameter. A secondary parameter, cwnd_gain, bounds inflight to a small multiple of the BDP to handle common network and receiver pathologies (as we will discuss). Conceptually, the TCP send routine looks like
the code in Figure 3. (In Linux, sending uses the efficient FQ/pacing queuing discipline,4 which gives BBR line-rate single-connection performance on multigigabit links and handles thousands of lower-rate paced connections with negligible CPU overhead.)

Figure 3. Packet-send half of the BBR algorithm.

    function send(packet)
      bdp = BtlBwFilter.currentMax * RTpropFilter.currentMin
      if (inflight >= cwnd_gain * bdp)
        // wait for ack or retransmission timeout
        return
      if (now >= nextSendTime)
        packet = nextPacketToSend()
        if (! packet)
          app_limited_until = inflight   // no data to send: mark upcoming samples app-limited
          return
        packet.app_limited = (app_limited_until > 0)
        packet.sendtime = now
        packet.delivered = delivered
        packet.delivered_time = delivered_time
        ship(packet)
        nextSendTime = now + packet.size /
                       (pacing_gain * BtlBwFilter.currentMax)
      timerCallbackAt(send, nextSendTime)
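In concrete terms, the two control parameters translate into a per-packet send interval and an inflight cap. The short Python sketch below works through the arithmetic for the article's example 10Mbps, 40ms path; the gain values shown (a pacing_gain of 1.0 in steady state and a cwnd_gain of 2) are of the kind the article discusses, but the exact numbers here are only illustrative.

    # Illustrative sketch: translating BBR's control parameters into concrete numbers.
    btlbw_bps   = 10e6          # estimated bottleneck bandwidth (BtlBwFilter.currentMax)
    rtprop_s    = 0.040         # estimated round-trip propagation time (RTpropFilter.currentMin)
    pacing_gain = 1.0           # steady-state gain; > 1 only while probing for more bandwidth
    cwnd_gain   = 2             # bounds inflight to a small multiple of the BDP
    packet_size = 1500          # bytes

    pacing_rate_bps = pacing_gain * btlbw_bps
    send_interval_s = packet_size * 8 / pacing_rate_bps      # ~1.2 ms between 1,500-byte packets
    bdp_bytes       = btlbw_bps / 8 * rtprop_s               # 50,000 bytes
    inflight_cap    = cwnd_gain * bdp_bytes                  # 100,000 bytes

    print(send_interval_s, bdp_bytes, inflight_cap)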
Steady-state behavior. The rate and amount BBR sends is solely a function of the estimated BtlBw and RTprop, so the filters control adaptation in addition to estimating the bottleneck constraints. This creates the novel control loop shown in Figure 4, which illustrates the RTT (blue), inflight (green), and delivery rate (red) detail from 700ms of a 10Mbps, 40ms flow. The thick gray line above the delivery-rate data is the state of the BtlBw max filter. The triangular structures result from BBR cycling pacing_gain to determine if BtlBw has increased. The gain used for each part of the cycle is shown time-aligned with the data it influenced. The gain is applied an RTT earlier, when the data is sent. This is indicated by the horizontal jog in the event sequence description running up the left side.

Figure 4. RTT (blue), inflight (green), and delivery rate (red) detail.

BBR minimizes delay by spending most of its time with one BDP in flight, paced at the BtlBw estimate. This moves the bottleneck to the sender so it cannot observe BtlBw increases. Consequently, BBR periodically spends an RTprop interval at a pacing_gain > 1, which increases the sending rate and inflight. If BtlBw hasn't changed, then a queue is created at the bottleneck, increasing RTT, which keeps deliveryRate constant. (This queue is removed by sending at a compensating pacing_gain < 1 for the next RTprop.) If BtlBw has increased, deliveryRate increases and the new max immediately increases the BtlBw filter output, increasing the base pacing rate. Thus, BBR converges to the new bottleneck rate exponentially fast. Figure 5 shows the effect on a 10Mbps, 40ms flow of BtlBw abruptly doubling to 20Mbps after 20 seconds of steady operation (top graph) then dropping to 10Mbps after another 20 seconds of steady operation at 20Mbps (bottom graph).

Figure 5. Bandwidth change.

(BBR is a simple instance of a Max-plus control system, a new approach to control based on nonstandard algebra.12 This approach allows the adaptation rate [controlled by the max gain] to be independent of the queue growth [controlled by the average gain]. Applied to this problem, it results in a simple, implicit control loop where the adaptation to physical constraint changes is automatically handled by the filters representing those constraints. A conventional control system would require multiple loops connected by a complex state machine to accomplish the same result.)
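The gain cycling that produces Figure 4's triangular structures can be sketched as a simple repeating schedule: one RTprop interval at a gain above 1 to probe for more bandwidth, one below 1 to drain any queue the probe created, then several intervals at 1.0 cruising at the estimate. The Python sketch below uses the 1.25/0.75/1.00 values visible in Figure 4's annotations (and the 1.25 peak gain mentioned later in the article); treat it as an illustration of the idea rather than the exact production schedule.

    # Illustrative ProbeBW gain cycle: probe up, drain, then cruise at the estimate.
    PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

    def pacing_rate(btlbw_estimate_bps, cycle_index):
        """Pacing rate for the current phase; each phase lasts roughly one RTprop."""
        gain = PROBE_BW_GAINS[cycle_index % len(PROBE_BW_GAINS)]
        return gain * btlbw_estimate_bps

    # One full cycle on a 10Mbps estimate: 12.5Mbps, 7.5Mbps, then 10Mbps for six phases.
    rates = [pacing_rate(10e6, i) for i in range(len(PROBE_BW_GAINS))]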

Single BBR Flow Startup Behavior

Existing implementations handle events such as startup, shutdown, and loss recovery with event-specific algorithms and many lines of code. BBR uses the code detailed earlier for everything, handling events by sequencing through a set of "states" that are defined by a table containing one or more fixed gains and exit criteria. Most of the time is spent in the ProbeBW state described in the section on Steady-state Behavior.

The Startup and Drain states are used at connection start (Figure 6). To handle Internet link bandwidths spanning 12 orders of magnitude, Startup implements a binary search for BtlBw by using a gain of 2/ln2 to double the sending rate while delivery rate is increasing. This discovers BtlBw in log2(BDP) RTTs but creates up to 2BDP excess queue in the process. Once Startup finds BtlBw, BBR transitions to Drain, which uses the inverse of Startup's gain to get rid of the excess queue, then to ProbeBW once the inflight drops to a BDP.
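A minimal Python sketch of these two states follows. The gains come straight from the text (2/ln 2 for Startup and its inverse for Drain); the specific test for "delivery rate is no longer increasing" is simplified here to a growth-threshold check and is an assumption about how one might code it, not the exact criterion BBR uses.

    import math

    STARTUP_GAIN = 2 / math.log(2)     # ~2.89: doubles the sending rate each RTT
    DRAIN_GAIN   = 1 / STARTUP_GAIN    # ~0.35: inverse gain, used to drain the startup queue

    def startup_step(state, delivery_rate_bps):
        """Stay in Startup while delivery rate keeps growing; otherwise switch to Drain."""
        # Simplified plateau test; the growth threshold here is an assumption.
        if delivery_rate_bps > 1.25 * state["best_rate"]:
            state["best_rate"] = delivery_rate_bps
            return "Startup", STARTUP_GAIN
        return "Drain", DRAIN_GAIN

    def drain_step(inflight_bytes, bdp_bytes):
        """Leave Drain for ProbeBW once the excess queue is gone (inflight <= one BDP)."""
        return ("ProbeBW", 1.0) if inflight_bytes <= bdp_bytes else ("Drain", DRAIN_GAIN)

    state = {"best_rate": 1.0}         # start with a tiny nonzero rate so the first sample registers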
Figure 6 shows the first second of a 10Mbps, 40ms BBR flow. The time/sequence plot shows the sender (green) and receiver (blue) progress vs. time. The red line shows a CUBIC sender under identical conditions. Vertical gray lines mark BBR state transitions. The lower figure shows the RTT of the two connections vs. time. Note that the time reference for this data is ack arrival (blue) so, while they appear to be time shifted, events are shown at the point where BBR learns of them and acts.

Figure 6. First second of a 10Mbps, 40ms BBR flow.

The lower graph of Figure 6 contrasts BBR and CUBIC. Their initial behavior is similar, but BBR completely drains its startup queue while CUBIC can't. Without a path model to tell it how much of the inflight is excess, CUBIC makes inflight growth less aggressive, but growth continues until either the bottleneck buffer fills and drops a packet or the receiver's inflight limit (TCP's receive window) is reached. Figure 7 shows RTT behavior during the first eight seconds of the connections shown in Figure 6. CUBIC (red) fills the available buffer, then cycles from 70% to 100% full every few seconds. After startup, BBR (green) runs with essentially no queue.

Behavior of Multiple BBR Flows Sharing a Bottleneck

Figure 8 shows how individual throughputs for several BBR flows sharing a 100Mbps/10ms bottleneck converge to a fair share. The downward-facing triangular structures are connection ProbeRTT states whose self-synchronization accelerates final convergence.

ProbeBW gain cycling (Figure 4) causes bigger flows to yield bandwidth to smaller flows, resulting in each learning its fair share. This happens fairly quickly (a few ProbeBW cycles), though unfairness can persist when late starters overestimate RTprop as a result of starting when other flows have (temporarily) created a queue.

To learn the true RTprop, a flow moves to the left of BDP using the ProbeRTT state: when the RTprop estimate has not been updated (that is, by measuring a lower RTT) for many seconds, BBR enters ProbeRTT, which reduces the inflight to four packets for at least one round trip,
then returns to the previous state. Large flows entering ProbeRTT drain many packets from the queue, so several flows see a new RTprop (new minimum RTT). This makes their RTprop estimates expire at the same time, so they enter ProbeRTT together, which makes the total queue dip larger and causes more flows to see a new RTprop, and so on. This distributed coordination is the key to both fairness and stability.

BBR synchronizes flows around the desirable event of an empty bottleneck queue. By contrast, loss-based congestion control synchronizes around the undesirable events of periodic queue growth and overflow, amplifying delay and packet loss.
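The ProbeRTT trigger described above can be sketched as a small piece of bookkeeping on top of the RTprop filter. In this Python sketch the "many seconds" threshold is given a concrete value of 10 seconds and the dwell time a concrete value of one estimated round trip purely for illustration; the four-packet inflight cap comes from the text.

    PROBE_RTT_INTERVAL_S = 10.0        # "many seconds" without a lower RTT sample (assumed value)
    PROBE_RTT_CWND_PKTS  = 4           # inflight reduced to four packets while probing

    def maybe_enter_probe_rtt(now, rtprop_last_updated, current_state):
        """Enter ProbeRTT if the RTprop estimate is stale; otherwise keep the current state."""
        if current_state != "ProbeRTT" and now - rtprop_last_updated > PROBE_RTT_INTERVAL_S:
            return "ProbeRTT"
        return current_state

    def probe_rtt_done(now, probe_rtt_entry_time, rtprop_estimate_s):
        """Leave ProbeRTT (back to the previous state) after at least one round trip at low inflight."""
        return now - probe_rtt_entry_time >= rtprop_estimate_s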
Figure 7. First eight seconds of 10Mbps, 40ms CUBIC and BBR flows.

Figure 8. Throughputs of five BBR flows sharing a bottleneck.

Google B4 WAN Deployment Experience

Google's B4 network is a high-speed WAN (wide-area network) built using commodity switches.15 Losses on these shallow-buffered switches result mostly from coincident arrivals of small traffic bursts. In 2015, Google started switching B4 production traffic from CUBIC to BBR. No issues or regressions were experienced, and since 2016 all B4 TCP traffic uses BBR. Figure 9 shows one reason for switching: BBR's throughput is consistently 2 to 25 times greater than CUBIC's. We had expected even more improvement but discovered that 75% of BBR connections were limited by the kernel's TCP receive buffer, which the network operations team had deliberately set low (8MB) to prevent CUBIC flooding the network with megabytes of excess inflight (8MB/200ms intercontinental RTT ⇒ 335Mbps max throughput). Manually raising the receive buffer on one U.S.-Europe path caused BBR immediately to reach 2Gbps, while CUBIC remained at 15Mbps—the 133x relative improvement predicted by Mathis et al.17
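The receive-buffer ceiling mentioned in parentheses is just the bandwidth-delay relation run in reverse: with at most one receive buffer of data in flight per round trip, throughput cannot exceed buffer/RTT. A quick Python check of the article's numbers:

    # Receive-window-limited throughput: at most one buffer of data per round trip.
    buffer_bytes = 8 * 2**20            # 8MB receive buffer
    rtt_s        = 0.200                # ~200ms intercontinental RTT
    max_throughput_bps = buffer_bytes * 8 / rtt_s
    print(max_throughput_bps / 1e6)     # ~335 Mbps, matching the figure quoted in the text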
Figure 9. BBR vs. CUBIC relative throughput improvement.

Figure 9 shows BBR vs. CUBIC relative throughput improvement; the inset shows throughput CDFs (cumulative distribution functions). Measurements are from an active prober service that opens persistent BBR and CUBIC connections to remote datacenters, then transfers 8MB of data every minute. Probers communicate via many B4 paths within and between North America, Europe, and Asia.

The huge improvement is a direct consequence of BBR not using loss as a congestion indicator. To achieve full bandwidth, existing loss-based congestion controls require the loss rate to be less than the inverse square of the BDP17 (for example, < one loss per 30 million packets for a 10Gbps/100ms path). Figure 10 compares measured goodput at various
loss rates. CUBIC's loss tolerance is a structural property of the algorithm, while BBR's is a configuration parameter. As BBR's loss rate approaches the ProbeBW peak gain, the probability of measuring a delivery rate of the true BtlBw drops sharply, causing the max filter to underestimate.

Figure 10. BBR vs. CUBIC goodput under loss.

Figure 10 shows BBR vs. CUBIC goodput for 60-second flows on a 100Mbps/100ms link with 0.001 to 50% random loss. CUBIC's throughput decreases by 10 times at 0.1% loss and totally stalls above 1%. The maximum possible throughput is the link rate times the fraction delivered (= 1 − lossRate). BBR meets this limit up to a 5% loss and is close up to 15%.
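The "maximum possible throughput" ceiling in Figure 10 is simple to reproduce. A short Python check, using the link rate of the experiment described above:

    # Loss-limited goodput ceiling: link rate times the fraction of packets delivered.
    link_rate_mbps = 100
    for loss_rate in (0.001, 0.01, 0.05, 0.15, 0.50):
        print(loss_rate, link_rate_mbps * (1 - loss_rate))   # e.g., 95 Mbps at 5% loss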

YouTube Edge Deployment Experience

BBR is being deployed on Google.com and YouTube video servers. Google is running small-scale experiments in which a small percentage of users are randomly assigned either BBR or CUBIC. Playbacks using BBR show significant improvement in all of YouTube's quality-of-experience metrics, possibly because BBR's behavior is more consistent and predictable. BBR only slightly improves connection throughput because YouTube already adapts the server's streaming rate to well below BtlBw to minimize bufferbloat and rebuffer events. Even so, BBR reduces median RTT by 53% on average globally and by more than 80% in the developing world. Figure 11 shows BBR vs. CUBIC median RTT improvement from more than 200 million YouTube playback connections measured on five continents over a week.

Figure 11. BBR vs. CUBIC median RTT improvement.

More than half of the world's seven billion mobile Internet subscriptions connect via 8kbps to 114kbps 2.5G systems,5 which suffer well-documented problems because of loss-based congestion control's buffer-filling propensities.3 The bottleneck link for these systems is usually between the SGSN (serving GPRS support node)18 and the mobile device. SGSN software runs on a standard PC platform with ample memory, so there are frequently megabytes of buffer between the Internet and the mobile device. Figure 12 compares (emulated) SGSN Internet-to-mobile delay for BBR and CUBIC. The horizontal lines mark one of the more serious consequences: TCP adapts to long RTT delay except on the connection-initiation SYN packet, which has an OS-dependent fixed timeout. When the mobile device is receiving bulk data (for example, from automatic app updates) via a large-buffered SGSN, the device cannot connect to anything on the Internet until the queue empties (the SYN ACK accept packet is delayed for longer than the fixed SYN timeout).

Figure 12. Steady-state median RTT variation with link buffer size.

Figure 12 shows steady-state median RTT variation with link buffer size on a 128Kbps/40ms link with eight BBR (green) or CUBIC (red) flows. BBR keeps the queue near its minimum, independent of both bottleneck buffer size and the number of active flows. CUBIC flows always fill the buffer, so the delay grows linearly with buffer size.

Mobile Cellular Adaptive Bandwidth

Cellular systems adapt per-subscriber bandwidth based partly on a demand estimate that uses the queue of packets destined for the subscriber. Early versions of BBR were tuned to create very small queues, resulting in connections getting stuck at low rates. Raising the peak ProbeBW pacing_gain to create bigger queues resulted in fewer stuck connections, indicating it is possible to be too nice to some networks. With the current 1.25 × BtlBw peak gain, no degradation is apparent compared with CUBIC on any network.

Delayed and stretched acks. Cellular, Wi-Fi, and cable broadband networks often delay and aggregate ACKs.1 When inflight is limited to one BDP, this results in throughput-reducing stalls. Raising ProbeBW's cwnd_gain to two allowed BBR to continue sending smoothly at the estimated delivery rate, even when ACKs are delayed by up to one RTT. This largely avoids stalls.

Token-bucket policers. BBR's initial YouTube deployment revealed that most of the world's ISPs mangle traffic with token-bucket policers.7 The bucket is typically full at connection startup, so BBR learns the underlying network's BtlBw, but once the bucket empties, all packets sent faster than the (much lower than BtlBw) bucket fill rate are dropped. BBR eventually learns this new delivery rate, but the ProbeBW gain cycle results in continuous moderate losses. To minimize the upstream bandwidth waste and application latency increase from these losses, we added policer detection and an explicit policer model to BBR. We are also actively researching better ways to mitigate the policer damage.

Competition with loss-based congestion control. BBR converges toward a fair share of the bottleneck bandwidth whether competing with other BBR flows or with loss-based congestion control. Even as loss-based congestion control fills the available buffer, ProbeBW still robustly moves the BtlBw estimate toward the flow's fair share, and ProbeRTT finds an RTprop estimate just high enough for tit-for-tat convergence to a fair share. Unmanaged router buffers exceeding several BDPs, however, cause long-lived loss-based competitors to bloat the queue and grab more than their fair share. Mitigating this is another area of active research.

Conclusion

Rethinking congestion control pays big dividends. Rather than using events such as loss or buffer occupancy, which are only weakly correlated with congestion, BBR starts from Kleinrock's formal model of congestion and its associated optimal operating point. A pesky "impossibility" result that the crucial parameters of delay and bandwidth cannot be determined simultaneously is side-stepped by observing they can be estimated sequentially. Recent advances in control and estimation theory are then used to create a simple distributed control loop that verges on the optimum, fully utilizing the network while maintaining a small queue. Google's BBR implementation is available in the open source Linux kernel TCP.

BBR is deployed on Google's B4 backbone, improving throughput by orders of magnitude compared with CUBIC. It is also being deployed on Google and YouTube Web servers, substantially reducing latency on all five continents tested to date, most dramatically in developing regions. BBR runs purely on the sender and does not require changes to the protocol, receiver, or network, making it incrementally deployable. It depends only on RTT and packet-delivery acknowledgment, so it can be implemented for most Internet transport protocols.

Acknowledgments

Thanks to Len Kleinrock for pointing out the right way to do congestion control, and to Larry Brakmo for pioneering work on Vegas2 and New Vegas congestion control that presaged many elements of BBR and for advice and guidance during BBR's early development. We also thank Eric Dumazet, Nandita Dukkipati, Jana Iyengar, Ian Swett, M. Fitz Nowlan, David Wetherall, Leonidas Kontothanassis, Amin Vahdat, and the Google BwE and YouTube infrastructure teams.

Related articles on queue.acm.org

Sender-side Buffers and the Case for Multimedia Adaptation
Aiman Erbad and Charles "Buck" Krasic
http://queue.acm.org/detail.cfm?id=2381998

You Don't Know Jack about Network Performance
Kevin Fall and Steve McCanne
http://queue.acm.org/detail.cfm?id=1066069

A Guided Tour through Data-center Networking
Dennis Abts and Bob Felderman
http://queue.acm.org/detail.cfm?id=2208919

References

1. Abrahamsson, M. TCP ACK suppression. IETF AQM mailing list, 2015; https://www.ietf.org/mail-archive/web/aqm/current/msg01480.html.
2. Brakmo, L.S., Peterson, L.L. TCP Vegas: End-to-end congestion avoidance on a global Internet. IEEE J. Selected Areas in Communications 13, 8 (1995), 1465–1480.
3. Chakravorty, R., Cartwright, J., Pratt, I. Practical experience with TCP over GPRS. IEEE GLOBECOM, 2002.
4. Corbet, J. TSO sizing and the FQ scheduler. LWN.net, 2013; https://lwn.net/Articles/564978/.
5. Ericsson. Ericsson Mobility Report (June 2015); https://www.ericsson.com/res/docs/2015/ericsson-mobility-report-june-2015.pdf.
6. ESnet. Application tuning to optimize international astronomy workflow from NERSC to LFI-DPC at INAF-OATs; http://fasterdata.es.net/data-transfer-tools/case-studies/nersc-astronomy/.
7. Flach, T. et al. An Internet-wide analysis of traffic policing. SIGCOMM, 2016, 468–482.
8. Gail, R., Kleinrock, L. An invariant property of computer network power. In Proceedings of the International Conference on Communications (1981), 63.1.1–63.1.5.
9. Gettys, J., Nichols, K. Bufferbloat: Dark buffers in the Internet. acmqueue 9, 11 (2011); http://queue.acm.org/detail.cfm?id=2071893.
10. Ha, S., Rhee, I. Taming the elephants: New TCP slow start. Computer Networks 55, 9 (2011), 2092–2110.
11. Ha, S., Rhee, I., Xu, L. CUBIC: A new TCP-friendly high-speed TCP variant. ACM SIGOPS Operating Systems Review 42, 5 (2008), 64–74.
12. Heidergott, B., Olsder, G.J., Van Der Woude, J. Max Plus at Work: Modeling and Analysis of Synchronized Systems: A Course on Max-Plus Algebra and its Applications. Princeton University Press, 2014.
13. Jacobson, V. Congestion avoidance and control. ACM SIGCOMM Computer Communication Review 18, 4 (1988), 314–329.
14. Jaffe, J. Flow control power is nondecentralizable. IEEE Transactions on Communications 29, 9 (1981), 1301–1306.
15. Jain, S. et al. B4: Experience with a globally-deployed software defined WAN. ACM SIGCOMM Computer Communication Review 43, 4 (2013), 3–14.
16. Kleinrock, L. Power and deterministic rules of thumb for probabilistic problems in computer communications. In Proceedings of the International Conference on Communications (1979), 43.1.1–43.1.10.
17. Mathis, M., Semke, J., Mahdavi, J., Ott, T. The macroscopic behavior of the TCP congestion avoidance algorithm. ACM SIGCOMM Computer Communication Review 27, 3 (1997), 67–82.
18. Wikipedia. GPRS core network: Serving GPRS support node; https://en.wikipedia.org/wiki/GPRS_core_network#Serving_GPRS_support_node_.28SGSN.29.

Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, and Van Jacobson are members of Google's make-tcp-fast project, whose goal is to evolve Internet transport via fundamental research and open source software. Project contributions include TFO (TCP Fast Open), TLP (Tail Loss Probe), RACK loss recovery, fq/pacing, and a large fraction of the git commits to the Linux kernel TCP code for the past five years. They can be contacted at https://googlegroups.com/d/forum/bbr-dev.

Copyright held by owners/authors.