Cost-Effective DDoS Mitigation
for Stateful Real-Time Cloud Applications
Andrew Garzon, Christy Westmoreland, Finn Bear
UW CSE {4,M5}53 Lab 4
Intro
Background
Distributed Denial of Service (DDoS) loads are a conceptually simple yet practically dangerous
threat to the availability of web applications. In the best case, DDoS arises when an excessive
number of legitimate users collectively overloads a service. In what is both the common and
worst case, DDoS is weaponized by malicious actors to take down services of their choosing.
At first glance, DDoS mitigation seems like a solved problem, with the industry valued at $3.5B1
and protection services bundled by popular Virtual Private Server (VPS) vendors. Enterprise
DDoS protection can be understood as distributed proxying infrastructure that sanitizes requests
and caches responses on behalf of an origin server, at the expense of added latency. Bundled
DDoS mitigation is a lighter-weight option, advertised as an automatic shield against the most
common forms of DDoS.
Motivation
There is at least one type of application that can’t readily benefit from DDoS mitigation as a
service. Real-time applications must serve dynamic content at low latencies, such that caching
and proxying are counterproductive. When a real-time application is stateful on a per-server
basis, such as a massively-multiplayer online (MMO) game server, horizontal scale is not a
defense, as attackers need only overwhelm one server at a time to cause unavailability. Vertically
scaling VPSs beyond legitimate demand turns out to be cost-prohibitive: price per nominal
CPU and RAM scales superlinearly, and both network interfaces and applications experience
diminishing returns. Furthermore, relying on bundled DDoS protection is an exercise in wishful
thinking; it is incapable2 of magically deciding which packets are malicious or excessive. As a
result, bundled DDoS protection does not absolve the developer, as a VPS customer, of the
responsibility to harden their stateful real-time application and operating system (OS).
Ethics
We comply with cloud vendor policies by only performing stress testing with meaningful requests
and responses, and at aggregate bandwidths significantly below 1Gbps.
1
https://www.mordorintelligence.com/industry-reports/ddos-protection-market
2
In a customer support ticket, one VPS provider admitted that their bundled, real-time, machine-learning-based
DDoS protection system requires five minutes and human intervention to mitigate SYN floods.
Design
Discussion
DDoS mitigation, as with all security measures, requires a layered approach. It is always better
to filter traffic as early as possible to reduce resource consumption. To that end, our design
relies on cloud provider and operating system utilities, in addition to library- and application-layer
constructs.
We start by limiting the scope of our application to TCP/IPv4, specifically the HTTP(S) and SSH port
numbers. Conveniently, VPS providers offer bundled firewall services that are capable of
rejecting other traffic (UDP, ICMP, IPv6, other TCP ports). Unlike bundled DDoS protection,
these firewalls are capable of working as advertised, on account of their better-defined task. We
focus on HTTP servers, using WebSockets for real-time communication, although our approach
would easily generalize to native TCP socket servers.
The next task is to limit incoming packets to a manageable rate. Doing so in user-space would
have unacceptable overhead, so our design relies on the firewall built into our OS. We
distinguish between spoofable (false source IP address) and non-spoofable packets. In TCP,
only the initial SYN packet can be spoofed at scale, as other packets must fall within an existing
connection’s sequence-number window, which is unknown to attackers. To avoid using unbounded
RAM to store half-open connections, operating systems offer a feature known as SYN cookies. We
enable them, but impose a global rate limit (irrespective of IP address) to avoid CPU exhaustion,
along with a global limit on connection counts. The remainder of our limits are per IPv4 address,
taking advantage of IPv4 address scarcity. We limit new connections per IPv4, total connections
per IPv4, and packets per second per IPv4. For observability, we log dropped packets, but use
batching and rate limiting to keep logging from becoming a bottleneck. Finally, we configure the
firewall to enqueue a very small volume of packets for user-space analysis, for early-warning
purposes.
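Concretely, using Linux’s nftables (the firewall adopted in our implementation), the kind of ruleset our library renders can be sketched as follows; the rates, ports, set sizes, and queue number are illustrative placeholders rather than our deployed values:

```rust
// Sketch only: a simplified stand-in for the hard-coded nftables template
// described above. Rates, set sizes, port lists, and the queue number are
// illustrative placeholders; the deployed template is more elaborate.
fn render_ruleset(max_conns_per_ip: u32, packets_per_sec_per_ip: u32) -> String {
    format!(
        r#"
table inet ddos {{
    set conn_per_ip {{
        type ipv4_addr
        flags dynamic
        size 65536
    }}
    set rate_per_ip {{
        type ipv4_addr
        flags dynamic
        timeout 1m
        size 65536
    }}
    chain input {{
        type filter hook input priority 0; policy drop;
        iif "lo" accept
        ct state invalid drop
        # Per-IPv4 packet rate, applied before any accepts.
        add @rate_per_ip {{ ip saddr limit rate over {packets_per_sec_per_ip}/second }} counter drop
        ct state established,related accept
        # Global cap on new SYNs; SYN cookies absorb the surviving backlog.
        tcp flags syn limit rate over 2000/second drop
        # Per-IPv4 connection count for new connections.
        tcp dport {{ 22, 80, 443 }} ct state new add @conn_per_ip {{ ip saddr ct count over {max_conns_per_ip} }} reject with tcp reset
        # Enqueue a tiny sample of new connections for user-space early warning.
        tcp dport {{ 80, 443 }} ct state new limit rate 1/second queue num 0
        tcp dport {{ 22, 80, 443 }} accept
        # Rate-limited logging of anything that falls through to the drop policy.
        limit rate 10/second log prefix "ddos-drop: "
    }}
}}
"#
    )
}
```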
After being admitted by the firewall, packets are handled by the OS TCP stack. Our design
adopts a much more stringent set of configuration options than the defaults, tightening down
memory and CPU usage at the expense of being less generous with retransmissions, timeouts,
and buffer sizes. This configuration is static, reused in all of our experiments, and not the main
focus of our design.
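For illustration, this static tightening can be applied by writing sysctls directly. The keys below are standard Linux TCP sysctls, but the values shown are examples rather than our exact configuration:

```rust
use std::fs;
use std::io;

// Sketch only: an illustrative subset of the static TCP-stack tightening.
// The keys are standard Linux sysctls; the values are examples, not the
// exact configuration used in our experiments.
const SYSCTLS: &[(&str, &str)] = &[
    ("net/ipv4/tcp_syncookies", "1"),        // enable SYN cookies
    ("net/ipv4/tcp_max_syn_backlog", "512"), // cap half-open connections
    ("net/ipv4/tcp_synack_retries", "2"),    // fewer SYN-ACK retransmissions
    ("net/ipv4/tcp_retries2", "5"),          // give up on dead peers sooner
    ("net/ipv4/tcp_fin_timeout", "15"),      // reclaim FIN-WAIT-2 sockets faster
    ("net/ipv4/tcp_rmem", "4096 65536 1048576"), // smaller receive buffers
    ("net/ipv4/tcp_wmem", "4096 65536 1048576"), // smaller send buffers
];

fn apply_sysctls() -> io::Result<()> {
    for (key, value) in SYSCTLS {
        fs::write(format!("/proc/sys/{key}"), value)?;
    }
    Ok(())
}
```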
In user-space, our design operates as a library. It gathers input from the firewall queue,
measures system resource utilization statistics, and can also receive certain optional hints from
the application. The library computes a new firewall configuration every 10s, or sooner if conditions
(namely new IPv4 addresses) change too much since the last computation. The configuration is a
function of several parameters, most importantly the maximum number of clients per IPv4 address
and packets per second per client, which we decrease as utilization or signs of abuse increase.
By default, these limits apply to new connections but, in extreme circumstances, we terminate
existing connections of IPs that recently started exceeding the newly computed limits.
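The feedback loop can be sketched as follows; the field names, thresholds, and scaling factors are illustrative rather than the tuned values used in our experiments:

```rust
// Sketch only: how the dynamic limits might be derived from utilization.
// Field names, thresholds, and scaling are illustrative placeholders.
struct Utilization {
    cpu: f64,             // 0.0..=1.0
    memory: f64,          // 0.0..=1.0
    abusive_ipv4s: usize, // addresses recently observed exceeding limits
}

struct Limits {
    max_conns_per_ip: u32,
    packets_per_sec_per_ip: u32,
}

const DEFAULT_LIMITS: Limits = Limits {
    max_conns_per_ip: 16,
    packets_per_sec_per_ip: 200,
};

fn compute_limits(u: &Utilization) -> Limits {
    // Pressure grows with whichever resource is most constrained and with
    // observed abuse; limits shrink multiplicatively as pressure rises.
    let pressure = u.cpu.max(u.memory).max((u.abusive_ipv4s as f64 / 64.0).min(1.0));
    let scale = (1.0 - pressure).clamp(0.1, 1.0);
    Limits {
        max_conns_per_ip: ((DEFAULT_LIMITS.max_conns_per_ip as f64 * scale) as u32).max(1),
        packets_per_sec_per_ip: ((DEFAULT_LIMITS.packets_per_sec_per_ip as f64 * scale) as u32)
            .max(10),
    }
}
```

The resulting limits serve as the computed variables that fill in the firewall template described above.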
Strengths and Limitations
Our design costs nothing beyond the VPS itself, leaving efficacy as the only concern when evaluating it
as a cost-effective DDoS mitigation technique. The scope of TCP/IPv4-based stateful real-time
applications is certainly limited. Certain application developers may be willing and able to invest
in enterprise-grade solutions, including those that embrace WebSockets as opposed to cached
HTTP content. These solutions would likely offer better protection against larger DDoS attacks.
Future Questions
IPv6 support could be added by rate-limiting address ranges (prefixes) large enough to cover individual
abusive entities, since a single entity can control a vast number of IPv6 addresses. The library
could be optimized against multiple real applications and DDoS
loads (e.g. simulated in a software-defined virtual network) for maximum efficacy. Finally, the
prospect of automatically switching to a higher-latency but more capable enterprise solution
during an attack offers a compromise between price and efficacy.
Implementation Overview
We chose Debian 12 Linux as our OS, its built-in nftables firewall, and its dsniff
package for terminating existing TCP connections. Our DDoS protection library is implemented
in Rust and uses the nfq library to listen for packets enqueued by the firewall. When a new
firewall configuration is compiled, based on a hard-coded template and computed variables, it is
installed by invoking the firewall’s command-line interface (CLI). The library maintains various
in-memory data structures to keep track of clients and system utilization. Instead of actually
provisioning a cloud firewall, we limit our stress testing to packets that it would accept.
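A minimal sketch of the user-space side, assuming the nfq crate’s queue API and ruleset installation through the nft CLI (error handling and the actual early-warning analysis are elided):

```rust
use std::io::Write;
use std::process::{Command, Stdio};

use nfq::{Queue, Verdict};

// Sketch only: receive packets the firewall enqueued for early-warning
// analysis, and install a freshly rendered ruleset via the nft CLI.
fn watch_queue() -> std::io::Result<()> {
    let mut queue = Queue::open()?;
    queue.bind(0)?; // must match the queue number in the nftables template
    loop {
        let mut msg = queue.recv()?;
        inspect(msg.get_payload()); // e.g. record the source address as a hint
        msg.set_verdict(Verdict::Accept);
        queue.verdict(msg)?;
    }
}

fn install_ruleset(ruleset: &str) -> std::io::Result<()> {
    // `nft -f -` reads a ruleset from standard input. A real implementation
    // would also flush or replace the previous table before applying it.
    let mut child = Command::new("nft")
        .args(["-f", "-"])
        .stdin(Stdio::piped())
        .spawn()?;
    child
        .stdin
        .take()
        .expect("stdin was piped")
        .write_all(ruleset.as_bytes())?;
    // The stdin handle is dropped above, closing the pipe so nft sees EOF.
    child.wait()?;
    Ok(())
}

fn inspect(_payload: &[u8]) {
    // Placeholder for early-warning bookkeeping.
}
```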
Evaluation
We designed a Rust application that simulates a stateful, real-time HTTP and WebSocket server.
Clients perform several HTTP GET requests that, in practice, would load HTML,
CSS, and JavaScript. These requests are primarily bandwidth-sensitive. Then, each client opens
a relatively long-lasting WebSocket stream and sends relatively small but frequent messages,
which are primarily latency-sensitive. The server’s task is to fulfill GET requests and to echo
WebSocket messages (after passing them through a shared queue to enforce hypothetical
synchronization with shared state). As real-time stateful applications typically involve interaction
between clients, such as players chatting or fighting in an MMO, we simulate additional periodic
blocking CPU work that scales as O(n²), where n is the number of currently active WebSockets.
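The quadratic interaction cost can be sketched as follows; the per-pair work is a stand-in rather than our server’s exact logic:

```rust
// Sketch only: periodic blocking work quadratic in the number of active
// WebSockets, standing in for interactions between every pair of clients.
fn simulate_interactions(active_websockets: usize) -> u64 {
    let mut checksum = 0u64;
    for a in 0..active_websockets {
        for b in 0..active_websockets {
            // Stand-in for per-pair game logic (chat routing, combat, etc.).
            checksum = checksum.wrapping_add((a as u64).wrapping_mul(b as u64 + 1));
        }
    }
    checksum
}
```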
For clients, we prioritize performance isolation and unique IPv4 addresses by running them via
Functions as a Service (FaaS) in the cloud. Each function (a Go program) may run
a random number of concurrent clients, as is the case with a user opening multiple browser tabs
or multiple users sharing the same IPv4 address. Each client is instrumented to measure GET
request and WebSocket echo latency.
A central command and control script invokes multiple FaaS instances, scheduling each of them
to begin simulating clients at a normally distributed random time whose mean falls in the middle of
the experiment. All initial randomness is seeded for reproducibility. The script also periodically
requests resource usage statistics from the server, to the extent that the server is reachable.
When all functions complete, the script compiles the data they collected for analysis.
Our test covers three scenarios: no firewall, static firewall, and dynamic firewall. The static
firewall is equivalent to the default state of the dynamic firewall, which strikes a balance between
generosity and preserving availability. We conducted each test three times and selected the
worst case result (lowest number of happy WebSocket IPv4 addresses), as an attacker only has
to succeed once. We simultaneously evaluate both GET request latency (averaged over the
multiple GET requests each client performs, with a line per IPv4 address) and WebSocket echo
latency (averaged over multiple clients in each 1s interval). We define “served” as not
experiencing any errors, where exceeding a 5s timeout on any GET request or WebSocket echo
counts as an error. We define “happy” as “served” while never waiting longer than 200ms for an
individual GET request or WebSocket echo. An IPv4 address is considered served/happy if at
least one of its clients is served/happy, as we prioritize distinct users over a few users’ multiple
browser tabs or DDoS attack tools.
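For clarity, this classification can be sketched as follows; the type and function names are illustrative rather than taken verbatim from our analysis scripts:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;
use std::time::Duration;

// Sketch only: per-client measurements and the served/happy classification
// described above.
struct ClientResult {
    ip: Ipv4Addr,
    errored: bool,         // any failure, including a 5s timeout
    max_latency: Duration, // worst individual GET or WebSocket echo latency
}

fn served(c: &ClientResult) -> bool {
    !c.errored
}

fn happy(c: &ClientResult) -> bool {
    served(c) && c.max_latency <= Duration::from_millis(200)
}

// An IPv4 address counts as happy if at least one of its clients is happy.
fn happy_ipv4s(results: &[ClientResult]) -> usize {
    let mut by_ip: HashMap<Ipv4Addr, bool> = HashMap::new();
    for c in results {
        *by_ip.entry(c.ip).or_insert(false) |= happy(c);
    }
    by_ip.values().filter(|&&h| h).count()
}
```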
We intentionally provision a VPS with the minimum specifications necessary to handle normal
volumes of legitimate traffic, namely 1 vCPU and 1 GB RAM, which happens to be the cheapest
plan offered by the VPS provider we chose. All infrastructure, from our VPS to our FaaS
template, is provisioned with Terraform. Reproduction of our results is possible by installing the
tools and following the steps listed in the README included with our code.
Results (GET)
Results (WebSocket)
Discussion (GET)
Across all four of our aggregate metrics, our dynamic firewall significantly outperformed our
static firewall, which in turn outperformed the absence of a firewall. In the case without a firewall,
there is a notable period of downtime (15-22s) during which no progress is made and GET
requests fail. Immediately before and after, the average latency of GET requests is abnormally
high (~500ms), far in excess of our happiness threshold. The static firewall successfully kept the
server responsive to GET requests throughout the experiment, except for a short disruption from
20-24s. However, the average latency of several clients’ GET requests was above our
threshold. Finally, with the dynamic firewall, there is no observable period of service disruption
and the vast majority of average GET request latencies are within our threshold.
Both firewalls successfully prioritize the first client of each IPv4 address, as indicated by the
larger points falling below our threshold, whereas the lack of a firewall does not. The dynamic
firewall is the only case in which at least one client from each IPv4 address is served (i.e.
without any errors). A tradeoff is that IPv4 addresses with many clients tend to experience
higher latencies for their last few clients with the dynamic firewall, which deprioritizes them
regardless of load.
Discussion (WebSocket)
The dynamic firewall maximizes the number of happy clients and IPv4 addresses, followed by
the static firewall. Without a firewall, WebSocket latencies quickly degrade by the 14s mark. The
server’s CPU usage stays pinned at 100% for a sustained period, during which no WebSocket
progress is made. The static firewall postpones similar performance degradation until the 19s
mark, at which point the server grinds to a halt so catastrophically that the command and control
script can no longer reach it for resource utilization metrics. Only the dynamic firewall is able to
prevent runaway CPU usage, which translates into significantly lower WebSocket echo latencies
(all averages within our threshold).
Conclusion
Our implementation successfully increased both the quantity and quality of service for clients of
our simulated stateful real-time cloud application. This result offers a promising alternative to
DDoS protection as a service, especially for mitigating abnormally high application-layer loads. It
also shows how static firewalls may fail to adapt to changing network conditions, either
over-constraining traffic most of the time or falling short of adequate protection during an attack.
A logical next step would be to open-source our DDoS protection library and, when possible,
use it to protect real applications.