Agility Requires Safety

Every startup has the
same story:

“We don’t have time for
best practices.”

You can’t go faster by being
reckless

What happens if everyone jams
down on the gas?

To go fast, a car needs not only a
powerful engine…

But also seat belts, airbags,
bumpers, and autopilot

For cars and for software, speed
is limited by safety

What are the seat belts, brakes, &
self-driving cars of software?

This talk is about
safety mechanisms

That make it possible to
build software quickly

I’m
Yevgeniy
Brikman
ybrikman.com

Founder
of
Atomic
Squirrel
atomic-squirrel.net

Author of
Hello,
Startup
hello-startup.net

Outline
1. Brakes
2. Bulkheads
3. Autopilot
4. Safety catch
5. Speedometer
6. Warning lights
7. Seat belt

Good brakes stop your car before
you run into something

Continuous integration stops buggy
code before it goes into production

Imagine your goal is to build the
International Space Station

Each team designs and builds
their component in isolation

You launch everything into space
and hope it all comes together

I thought the Russians were going
to build the bathrooms?

Weren’t the French supposed to
do the wiring?

Everyone is using the metric
system, right?

Teams working for a long time
with incorrect assumptions

Finding this out when you’re in
outer space is too late

This is the result of
“late integration”

Lots of teams working in isolation
on separate branches

Before attempting a massive
merge at the very end

The alternative is
“continuous integration”

Where everyone regularly merges
their work

The most common approach is
trunk-based development

Everyone works on a
single branch (trunk)

That can’t possibly scale to a lot
of developers, can it?

Uses trunk-based development for
1,000+ developers

4,000+ developers

20,000+ developers

Wouldn’t you have merge conflicts
all the time?

If you merge (commit) regularly,
conflicts are rare.

And those that happen are from a
day of work—not months.

Small commits are easier to
merge, test, revert, review

Wouldn’t there constantly be
broken code in trunk?

Not if you run a self-testing build
after every commit

It should compile your code and
run your automated tests

If a build fails, a developer must
fix it ASAP or revert the commit
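
A minimal sketch of a self-testing build, assuming the build is just a list of commands run in order after each commit. The step names and commands are hypothetical placeholders, not a real CI configuration.

```python
import subprocess
import sys

def run_build(steps):
    """Run each (name, command) step in order; stop at the first failure.

    Returns None if every step passed, or the name of the failing step.
    """
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            return name
    return None

if __name__ == "__main__":
    # Hypothetical steps: swap in your real compile and test commands.
    steps = [
        ("compile", [sys.executable, "-c", "print('compiling')"]),
        ("test", [sys.executable, "-c", "print('running tests')"]),
    ]
    failed = run_build(steps)
    if failed:
        print(f"Build failed at step: {failed}. Fix it or revert the commit.")
        sys.exit(1)
    print("Build passed.")
```

A non-zero exit code is what lets a CI server mark the commit as broken.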

Of course, this depends on
having good automated tests

Tests give you the confidence to
make changes quickly

JUnit version 4.11
...
Time: 6.063
OK (259 tests)
How long would it take you to do
259 tests manually?

It’s a trade-off between:
1. Likelihood of bugs
2. Cost of bugs
3. Cost of testing

Likelihood of bugs is higher for
complex code and large teams

Cost of bugs is higher for some
systems (payments, security)

Cost of tests is higher for
integration and UI tests

“Without continuous
integration, your software is
broken until somebody
proves it works, usually
during a testing or
integration stage.

With continuous integration,
your software is proven to
work (assuming a sufficiently
comprehensive set of
automated tests) with every
new change—and you know
the moment it breaks and can
fix it immediately.”

Ships have bulkheads to try to
contain flooding to one area.

You can split up a codebase to
contain problems to one area.

Code is the enemy: the more you
have, the slower you go

Project size (lines of code)    Bug density (bugs per thousand lines of code)
< 2K                            0 – 25
2K – 16K                        0 – 40
16K – 64K                       0.5 – 50
64K – 512K                      2 – 70
> 512K                          4 – 100

As the code grows, the number of
bugs grows even faster
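
A rough worked example of why bugs grow faster than code, using three density ranges from the table. The sample project sizes are assumptions picked to fall inside each bucket.

```python
# Bug density ranges (bugs per thousand lines of code) from the table.
DENSITY = {
    "under 2K": (0, 25),
    "16K-64K": (0.5, 50),
    "64K-512K": (2, 70),
}

def bug_range(lines_of_code, size_bucket):
    """Return the (low, high) expected bug count for a project."""
    low, high = DENSITY[size_bucket]
    kloc = lines_of_code / 1000
    return (low * kloc, high * kloc)

# Sample sizes are assumptions chosen to fall inside each bucket.
samples = [(1_000, "under 2K"), (32_000, "16K-64K"), (256_000, "64K-512K")]
for loc, bucket in samples:
    low, high = bug_range(loc, bucket)
    print(f"{loc:>7} lines: {low:g} to {high:g} bugs")
```

At the upper bounds, 256 times as much code goes from 25 bugs to 17,920 bugs, roughly 700 times as many.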

“Software
development doesn't
happen in a chart, an
IDE, or a design tool;
it happens in your
head.”

The mind can only handle so
much complexity at once

One solution is to break the code
into multiple codebases

Instead of depending on the
source of another module
/moduleA
/moduleB /moduleC /moduleD
/moduleE

You depend on a versioned
artifact from that module
moduleA-0.3.1.jar
moduleB-3.1.0.jar moduleC-9.8.0.jar moduleD-1.4.3.jar
moduleE-0.5.6.jar

This provides isolation from
changes in other modules
moduleA-0.3.1.jar
moduleB-3.1.0.jar moduleC-9.8.0.jar moduleD-1.4.3.jar
moduleE-0.5.6.jar

You already do this:
guava-18.0.jar
jquery-2.2.0.js

Advantages of artifacts:
1. Isolation
2. Decoupling
3. Faster builds

Disadvantages of artifacts:
1. Dependency hell
2. No continuous integration
3. Hard to make global changes
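
One way to picture dependency hell: two modules pin different versions of the same artifact. The sketch below is a hypothetical conflict check, not how any particular build tool resolves versions. The pins are invented for illustration.

```python
from collections import defaultdict

def find_conflicts(requirements):
    """Find libraries that different modules pin to different versions.

    requirements maps a module name to {library: pinned version}.
    Returns {library: {version: [modules that want it]}} for conflicts only.
    """
    wanted = defaultdict(lambda: defaultdict(list))
    for module, pins in requirements.items():
        for library, version in pins.items():
            wanted[library][version].append(module)
    return {
        library: dict(versions)
        for library, versions in wanted.items()
        if len(versions) > 1
    }

# Hypothetical pins: moduleB and moduleC disagree about moduleE.
requirements = {
    "moduleB": {"moduleE": "0.5.6"},
    "moduleC": {"moduleE": "0.4.0", "guava": "18.0"},
    "moduleD": {"guava": "18.0"},
}
print(find_conflicts(requirements))
```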

Another option is to break the
codebase into services

In a monolith, you use function
calls within one process
A.a()
B.b() C.c() D.d()
E.e()

With services, you pass messages
between processes
http://A/a
http://B/b
http://C/c
http://D/d
http://E/e
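
A small sketch of the difference, assuming a single hypothetical service B that exposes one function over HTTP. It uses only the Python standard library and runs the "service" in a background thread, so it stands in for a separate process rather than being one.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

def b():
    """The business logic, callable directly inside a monolith."""
    return {"result": "hello from B"}

class ServiceB(BaseHTTPRequestHandler):
    """Exposes b() over HTTP at the path /b."""

    def do_GET(self):
        if self.path == "/b":
            body = json.dumps(b()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet

def call_b_over_http(port):
    """What a caller in another process does instead of calling b()."""
    connection = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", "/b")
        response = connection.getresponse()
        return json.loads(response.read())
    finally:
        connection.close()

# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), ServiceB)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    print(b())                                   # monolith: function call
    print(call_b_over_http(server.server_port))  # services: message passing
finally:
    server.shutdown()
    server.server_close()
```

The HTTP version needs serialization, a timeout, and cleanup that the plain function call does not.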

Advantages of services:
1. Technology agnostic
2. Scalability
3. Isolation

Disadvantages of services:
1. Operational overhead
2. Performance overhead
3. I/O, error handling
4. Backwards compatibility
5. Hard to make global changes

Autopilot prevents accidents
caused by human error

Automated deployments prevent
accidents caused by human error

“If it hurts, do it
more often.”
– Martin Fowler

The deployment process should
be:

That means you should never
deploy or configure manually

> ssh ec2-user@12.34.56.78
__| __| __|
_| ( __ Amazon ECS-Optimized Amazon Linux AMI 2015.09.d
____|___|____/
[ec2-user ~]$ sudo yum install ruby
Don’t do this

The gold standard is the
blue-green deployment

Let’s say you have version 0.0.1 of
your app deployed

First, deploy version 0.0.2 on a
duplicate set of servers

If everything looks good, switch
the load balancer over to 0.0.2
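
The switch can be sketched as a toy load balancer that only moves traffic once the new servers pass a health check. The class, server addresses, and health check here are all hypothetical stand-ins.

```python
class LoadBalancer:
    """Toy load balancer that sends all traffic to one active server set."""

    def __init__(self, active):
        self.active = active

    def switch_to(self, candidate, is_healthy):
        """Point traffic at candidate only if every server passes the check.

        Returns the previous server set so it can be kept for rollback.
        """
        if not all(is_healthy(server) for server in candidate["servers"]):
            raise RuntimeError(f"{candidate['version']} failed health checks")
        previous, self.active = self.active, candidate
        return previous

# Hypothetical server sets; the addresses are placeholders.
blue = {"version": "0.0.1", "servers": ["10.0.0.1", "10.0.0.2"]}
green = {"version": "0.0.2", "servers": ["10.0.1.1", "10.0.1.2"]}

lb = LoadBalancer(active=blue)
old = lb.switch_to(green, is_healthy=lambda server: True)  # stand-in check
print(lb.active["version"], "is live;", old["version"], "kept for rollback")
```

Keeping the old set around means a rollback is just another switch.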

Four main categories of
deployment automation tools:

1. Configuration management:
Chef, Puppet, Ansible, Salt

- name: Install httpd and php
  yum: name={{ item }} state=present
  with_items:
    - httpd
    - php
- name: start httpd
  service: name=httpd state=started enabled=yes
- name: Copy the code from repository
  git: repo={{ repository }} dest=/var/www/html/
Imperative scripts to configure
servers and deploy code

2. Provisioning tools: Terraform,
CloudFormation, Heat

resource "aws_instance" "example" {
  ami           = "ami-b960b1d"
  instance_type = "t2.micro"
}
resource "aws_eip" "ip" {
  instance   = "${aws_instance.example.id}"
  depends_on = ["aws_instance.example"]
}
Declarative templates that define
your infrastructure

3. Virtual machines: VMware,
VirtualBox, Packer, Vagrant

{
  "builders": [{
    "type": "amazon-ebs",
    "source_ami": "ami-de0d9eb7",
    "instance_type": "m1.medium",
    "ami_name": "example-packer-ami-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get -y update",
      "sudo apt-get -y install httpd php"
    ]
  }]
}
Images of configured servers

4. Containers: Docker, rkt, LXD

FROM ubuntu:12.04
RUN apt-get update && apt-get install -y apache2 php
ENV APACHE_RUN_USER www-data
ENV APACHE_LOG_DIR /var/log/apache2
EXPOSE 80
CMD ["/usr/sbin/apache2", "-D", "FOREGROUND"]
Lightweight images of configured
servers

These tools allow you to define
your infrastructure as code

That way, you can version it,
review it, test it, and reuse it.

Elisha Otis
demoing
elevator
free-fall
safety in 1854

The safety
catches are
locked by
default

Only an
intact cable
can unlock
the
latches

This
elevator
provides
safety by
default

Feature
toggles
provide
safety by
default

New feature, part 1
New feature, part 2
New feature, part 3
If a large new feature takes many
commits, wouldn’t a user see it in
an unfinished state?

<section id="new-section">
  ...
</section>
<section id="original-section">
  ...
</section>
Let’s say you were adding a new
section to your website.

<% if toggles.enabled("new-section") %>
  <section id="new-section">
    ...
  </section>
<% end %>
<section id="original-section">
  ...
</section>
Wrap new code in a conditional
that looks up a feature toggle

<% if toggles.enabled("new-section") %>
  <section id="new-section">
    ...
  </section>
<% end %>
<section id="original-section">
  ...
</section>
Toggles are off by default, so
users won’t see unfinished work

development:
  feature_toggles:
    new-section: true
production:
  feature_toggles:
    new-section: false
You can enable feature toggles in
a config file.
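
A minimal sketch of the lookup side, assuming the config has already been loaded into a dict with the same shape as the file. The FeatureToggles class is hypothetical and is not the toggles object used in the templates.

```python
class FeatureToggles:
    """Looks up toggles for one environment; unknown toggles are off."""

    def __init__(self, config, environment):
        env_config = config.get(environment, {})
        self._toggles = env_config.get("feature_toggles", {})

    def enabled(self, name):
        # Safety by default: only an explicit true turns a feature on.
        return self._toggles.get(name, False) is True

# Same shape as the config file, written as a Python dict.
config = {
    "development": {"feature_toggles": {"new-section": True}},
    "production": {"feature_toggles": {"new-section": False}},
}

print(FeatureToggles(config, "development").enabled("new-section"))    # True
print(FeatureToggles(config, "production").enabled("new-section"))     # False
print(FeatureToggles(config, "production").enabled("not-configured"))  # False
```

A toggle that is missing, misspelled, or in an unknown environment stays off.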

> curl http://feature.toggles/
{
"development": { "new-section": true },
"production": { "new-section": false }
}
Or you could create a web service
for feature toggles.

> curl http://feature.toggles/?user=123
{
"development": { "new-section": "A" },
"production": { "new-section": "B" }
}
It could return different, complex
values for each user.

And provide a web UI for
configuring toggles.

This allows you to quickly turn
features on or off.

<% if toggles.get("new-section") == "A" %>
  <section id="new-section-bucket-a">
    ...
  </section>
<% elsif toggles.get("new-section") == "B" %>
  <section id="new-section-bucket-b">
    ...
  </section>
<% end %>
This allows A/B testing
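
One common way to pick a bucket per user is to hash the user ID, so the same user always sees the same variant. This is a sketch under that assumption. The function name and the choice of SHA-256 are illustrative, not something the toggle service above is known to do.

```python
import hashlib

def bucket_for(user_id, toggle_name, buckets=("A", "B")):
    """Deterministically assign a user to a bucket for one toggle."""
    key = f"{toggle_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(buckets)
    return buckets[index]

# The same user always lands in the same bucket for a given toggle.
print(bucket_for(123, "new-section"))
print(bucket_for(123, "new-section") == bucket_for(123, "new-section"))
```

Including the toggle name in the hash keeps one experiment's split independent of another's.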

A speedometer tells you how fast
you’re driving

Monitoring tells you how your
product is performing

“If you can’t
measure it, you
can’t fix it.”
– David Henke

There are many types of
monitoring

Availability metrics: is my product
up or down?

Useful tools: Keynote, Pingdom,
Uptime Robot, Route53

Business metrics: what are my
users doing in the product?

Useful tools: Google Analytics,
KISSMetrics, Mixpanel

Application metrics: how is my
application performing?

Useful tools: New Relic,
CloudWatch, Datadog

127.0.0.1 - - [10/Oct/2000:13:55:36] "GET /apache_pb.gif HTTP/1.0" 200 2326
64.242.88.10 - - [07/Mar/2004:16:05:49] "GET /twiki/bin/ HTTP/1.1" 401 12846
127.0.0.1 - - [28/Jul/2006:10:22:04] "GET / HTTP/1.0" 200 2216
64.242.88.10 - - [07/Mar/2004:16:06:51] "GET /twiki/bin/Twiki/" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02] "GET /mailman HTTP/1.1" 200 6291
127.0.0.1 - - [28/Jul/2006:10:27:32] "GET /hidden/ HTTP/1.0" 404 7218
192.168.2.20 - - [28/Jul/2006:10:27:10] "GET /cgi-bin/try HTTP/1.0" 200 3395
64.242.88.10 - - [07/Mar/2004:16:11:58] "GET /twiki/bin/view/" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55] "GET /twiki HTTP/1.1" 200 5253
Log files are also a form of
application-level monitoring
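
For example, a few lines of code can turn access logs like these into a metric. This sketch assumes the exact line layout shown on the slide and simply skips anything that does not match.

```python
import re
from collections import Counter

# Matches: client - - [timestamp] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+)$'
)

def status_counts(lines):
    """Count responses per HTTP status code, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("status")] += 1
    return counts

sample = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '64.242.88.10 - - [07/Mar/2004:16:05:49] "GET /twiki/bin/ HTTP/1.1" 401 12846',
    '127.0.0.1 - - [28/Jul/2006:10:27:32] "GET /hidden/ HTTP/1.0" 404 7218',
]
print(status_counts(sample))
```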

Useful tools: Loggly, Logstash,
Papertrail, Sumo Logic

Server metrics: how is my server
performing?

Useful tools: Nagios, Icinga,
Munin, collectd, CloudWatch

Warning lights notify you if
something is wrong

Alerting systems notify you if
something is wrong

You can’t look at metrics 24/7.
Alerting systems can.
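
At its core, an alerting system compares metrics against rules and notifies someone when a rule breaks. A minimal sketch, with hypothetical metric names and thresholds, and print standing in for a real pager:

```python
def check_alerts(metrics, rules, notify):
    """Call notify(message) for every metric that breaks its rule.

    rules maps a metric name to (comparison, threshold), where comparison
    is "above" or "below". Returns the list of messages sent.
    """
    sent = []
    for name, (comparison, threshold) in rules.items():
        value = metrics.get(name)
        if value is None:
            message = f"{name}: no data"
        elif comparison == "above" and value > threshold:
            message = f"{name}: {value} is above {threshold}"
        elif comparison == "below" and value < threshold:
            message = f"{name}: {value} is below {threshold}"
        else:
            continue
        notify(message)
        sent.append(message)
    return sent

# Hypothetical metric names and thresholds.
rules = {"error_rate": ("above", 0.05), "free_disk_gb": ("below", 10)}
metrics = {"error_rate": 0.12, "free_disk_gb": 42}
check_alerts(metrics, rules, notify=print)
```

Treating a missing metric as an alert matters: a server that stops reporting is often the one that is down.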

Useful tools: PagerDuty,
VictorOps

For a full list of monitoring and
alerting tools, see:
hello-startup.net/resources

Seat belts help you survive
crashes

High availability helps you survive
crashes

Stateless servers: multiple
instances, multiple zones

Load balancer routes around
server or zone outages
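
Routing around an outage can be sketched as round-robin selection that skips unhealthy instances. The server names, zones, and health check are hypothetical. A real load balancer would probe servers itself.

```python
import itertools

class RoundRobinBalancer:
    """Cycles through servers, skipping any that fail the health check."""

    def __init__(self, servers, is_healthy):
        self._servers = list(servers)
        self._is_healthy = is_healthy
        self._cycle = itertools.cycle(self._servers)

    def next_server(self):
        # Try each server at most once per call.
        for _ in range(len(self._servers)):
            server = next(self._cycle)
            if self._is_healthy(server):
                return server
        raise RuntimeError("no healthy servers available")

# Hypothetical instances spread over two zones; zone-a is down.
servers = ["zone-a/web-1", "zone-a/web-2", "zone-b/web-1", "zone-b/web-2"]
down = {"zone-a/web-1", "zone-a/web-2"}
balancer = RoundRobinBalancer(servers, is_healthy=lambda s: s not in down)
print([balancer.next_server() for _ in range(3)])
```

With every zone-a instance down, all traffic lands on zone-b, which is why the instances need to span more than one zone.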

Auto-recovery mechanism brings
server back after outage

Stateful servers: multiple
instances, multiple zones

Replication to one or more
standby servers

Load balancer switches to
standby server in case of outage

Test your recovery process
regularly.

Two cars can drive at 80 mph in
opposite directions safely…

It’s worth the time to put these
safety mechanisms in place

For more
info, see
Hello,
Startup
hello-startup.net

Image credits
F1 racecar: Takayuki Suzuki
Highway traffic: Oran Viriyincy
Car accident: ER24 EMS (Pty) Ltd.
Road: Nicolas Raymond
BMW: Andy Durst
Self-driving car: Steve Jurvetson
Bus: Roland Tanglao
Tail lights: Tony Webster
USS South Dakota: Wikimedia
Crash test dummy: Wikimedia
Elisha Otis: Wikimedia
Otis Elevator: Wikimedia
Speedometer: Dawn Hopkins
Dashboard lights: Jim Larrison
Seat belt: Wikimedia
Google repo stats: Rachel Potvin
ISS: Wikimedia
Fire: Pete
Martin Fowler: Wikimedia
