AGILITY
requires
SAFETY
Every startup has the
same story:
“We don’t have time for
best practices.”
You can’t go faster by being
reckless
Think of cars on a highway
What happens if everyone jams
down on the gas?
To go fast, a car needs not only a
powerful engine…
But also powerful brakes.
As well as seat belts, airbags,
bumpers, and auto-pilot
For cars and for software, speed
is limited by safety
What are the seat belts, brakes, &
self-driving cars of software?
This talk is about
safety mechanisms
That make it possible to
build software quickly
I’m Yevgeniy Brikman
ybrikman.com
Founder of Atomic Squirrel
atomic-squirrel.net
PAST LIVES
Author of Hello, Startup
hello-startup.net
Outline
1. Brakes
2. Bulkheads
3. Autopilot
4. Safety catch
5. Speedometer
6. Warning lights
7. Seat belt
1. Brakes
Good brakes stop your car before
you run into something
Continuous integration stops buggy
code before it goes into production
Imagine your goal is to build the
International Space Station
Each team designs and builds
their component in isolation
You launch everything into space
and hope it all comes together
I thought the Russians were going
to build the bathrooms?
Weren’t the French supposed to
do the wiring?
Everyone is using the metric
system, right?
Teams working for a long time
with incorrect assumptions
Finding this out when you’re in
outer space is too late
This is the result of
“late integration”
Lots of teams working in isolation
on separate branches
Before attempting a massive
merge at the very end
MERGE
CONFLICT
The alternative is
“continuous integration”
Where everyone regularly merges
their work
The most common approach is
trunk-based development
Everyone works on a
single branch (trunk)
That can’t possibly scale to a lot
of developers, can it?
Uses trunk-based development for
1,000+ developers
Uses trunk-based development for
4,000+ developers
Uses trunk-based development for
20,000+ developers
Wouldn’t you have merge conflicts
all the time?
If you merge (commit) regularly,
conflicts are rare.
And those that happen are from a
day of work—not months.
Commit early and often.
Small commits are easier to
merge, test, revert, review
Wouldn’t there constantly be
broken code in trunk?
Not if you run a self-testing build
after every commit
It should compile your code and
run your automated tests
If a build fails, a developer must
fix it ASAP or revert the commit
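A minimal sketch of a self-testing build, assuming the build is a list of named shell commands. The `run_build` function and the step names are hypothetical, not part of any CI product: the point is only that the build stops at the first failing step and reports it.

```python
import subprocess
import sys


def run_build(steps):
    """Run each (name, command) step in order; stop at the first failure.

    Returns (ok, failed_step_name). failed_step_name is None on success.
    """
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            return False, name
    return True, None


# Hypothetical steps: swap in your real compile and test commands.
EXAMPLE_STEPS = [
    ("compile", [sys.executable, "-c", "compile('x = 1', 'app', 'exec')"]),
    ("test", [sys.executable, "-c", "assert 1 + 1 == 2"]),
]
```

A CI server would call something like `run_build(EXAMPLE_STEPS)` after every commit and mark the commit as broken if it returns `False`.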
Of course, this depends on
having good automated tests
Tests give you the confidence to
make changes quickly
JUnit version 4.11
...
Time: 6.063
OK (259 tests)
How long would it take you to do
259 tests manually?
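For illustration, here is roughly what a few of those automated tests could look like, sketched with Python's built-in `unittest` rather than JUnit. The `apply_discount` function is a made-up example of code under test.

```python
import unittest


def apply_discount(price_cents, percent):
    """Hypothetical function under test: price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_no_discount(self):
        self.assertEqual(apply_discount(1000, 0), 1000)

    def test_half_off(self):
        self.assertEqual(apply_discount(1000, 50), 500)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 150)


def run_tests():
    """Run the suite programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test takes milliseconds, so the whole suite can run on every commit.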
What should you test?
Everything? Not necessarily.
It’s a trade-off between:
1. Likelihood of bugs
2. Cost of bugs
3. Cost of testing
Likelihood of bugs is higher for
complex code and large teams
Cost of bugs is higher for some
systems (payments, security)
Cost of tests is higher for
integration and UI tests
“Without continuous
integration, your software is
broken until somebody
proves it works, usually
during a testing or
integration stage.
With continuous integration,
your software is proven to
work (assuming a sufficiently
comprehensive set of
automated tests) with every
new change—and you know
the moment it breaks and can
fix it immediately.”
– Jez Humble and David Farley, Continuous Delivery
2. Bulkheads
Ships have bulkheads to try to
contain flooding to one area.
You can split up a codebase to
contain problems to one area.
Code is the enemy: the more you
have, the slower you go
Project Size          Bug Density
(lines of code)       (bugs per thousand lines of code)
< 2K                  0 – 25
2K – 16K              0 – 40
16K – 64K             0.5 – 50
64K – 512K            2 – 70
> 512K                4 – 100
As the code grows, the number of
bugs grows even faster
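A rough worked example of what the table implies, assuming the second bucket spans 2K to 16K lines and treating each boundary as belonging to the larger bucket. The `expected_bug_range` helper is hypothetical; it just multiplies the density range by the project size.

```python
# (exclusive upper bound in lines of code, low density, high density).
# Densities are bugs per 1,000 lines, taken from the table in the talk.
# Assumption: the second bucket covers 2K to 16K lines.
DENSITY_BY_SIZE = [
    (2_000, 0.0, 25.0),
    (16_000, 0.0, 40.0),
    (64_000, 0.5, 50.0),
    (512_000, 2.0, 70.0),
    (float("inf"), 4.0, 100.0),
]


def expected_bug_range(lines_of_code):
    """Return (low, high) estimated bug counts for a project of this size."""
    for upper_bound, low, high in DENSITY_BY_SIZE:
        if lines_of_code < upper_bound:
            kloc = lines_of_code / 1000
            return low * kloc, high * kloc
    raise ValueError("unreachable: the last bucket has no upper bound")
```

Under these assumptions, a 1,000-line project has at most 25 expected bugs, while a 100,000-line project has up to 7,000: the code grew 100x, but the worst case grew 280x.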
“Software
development doesn't
happen in a chart, an
IDE, or a design tool;
it happens in your
head.”
– Venkat Subramaniam and Andy Hunt, Practices of an Agile Developer
The mind can only handle so
much complexity at once
One solution is to break the code
into multiple codebases
Instead of depending on the
source of another module
/moduleA
/moduleB /moduleC /moduleD
/moduleE
You depend on a versioned
artifact from that module
moduleA-0.3.1.jar
moduleB-3.1.0.jar moduleC-9.8.0.jar moduleD-1.4.3.jar
moduleE-0.5.6.jar
This provides isolation from
changes in other modules
You already do this:
guava-18.0.jar
jquery-2.2.0.js
Advantages of artifacts:
1. Isolation
2. Decoupling
3. Faster builds
Disadvantages of artifacts:
1. Dependency hell
2. No continuous integration
3. Hard to make global changes
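To make "dependency hell" concrete, here is a small sketch that finds artifacts pinned to different versions by different modules. The `find_version_conflicts` function and the input shape are assumptions for illustration; real build tools resolve this in their own ways.

```python
from collections import defaultdict


def find_version_conflicts(requirements):
    """Return artifacts that different modules pin to different versions.

    requirements maps a module name to {artifact name: pinned version}.
    The result maps each conflicting artifact to {module: version}.
    """
    versions_by_artifact = defaultdict(dict)
    for module, pins in requirements.items():
        for artifact, version in pins.items():
            versions_by_artifact[artifact][module] = version

    return {
        artifact: by_module
        for artifact, by_module in versions_by_artifact.items()
        if len(set(by_module.values())) > 1
    }
```

For example, if moduleA pins moduleE at 0.5.6 while moduleC still pins it at 0.4.0 (a made-up version), the function reports moduleE as a conflict that someone has to resolve by hand.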
Another option is to break the
codebase into services
In a monolith, you use function
calls within one process
A.a()
B.b() C.c() D.d()
E.e()
With services, you pass messages
between processes
http://A/a
http://B/b
http://C/c
http://D/d
http://E/e
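A minimal sketch of the difference, using only the Python standard library. The function `b()` stands in for module B; `BHandler`, `start_service`, and `call_b_remotely` are hypothetical names. In a monolith you call `b()` directly; as a service, the same logic sits behind `GET /b` and the caller has to deal with the network.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import ProxyHandler, build_opener


def b():
    """The logic of hypothetical module B; a monolith calls it directly."""
    return {"result": "hello from B"}


class BHandler(BaseHTTPRequestHandler):
    """Exposes b() as GET /b so other processes can reach it over HTTP."""

    def do_GET(self):
        if self.path != "/b":
            self.send_error(404)
            return
        body = json.dumps(b()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_service():
    """Start the service on a free local port; return (server, base_url)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), BHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"


# Talk to the local service directly, ignoring any proxy settings.
_opener = build_opener(ProxyHandler({}))


def call_b_remotely(base_url, timeout=2.0):
    """The service-style call: a network request that can fail or time out."""
    with _opener.open(f"{base_url}/b", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Both paths return the same data, but the remote call adds serialization, a timeout, and new failure modes that a plain function call does not have.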
Advantages of services:
1. Technology agnostic
2. Scalability
3. Isolation
Disadvantages of services:
1. Operational overhead
2. Performance overhead
3. I/O, error handling
4. Backwards compatibility
5. Hard to make global changes
3. Autopilot
Autopilot prevents accidents
caused by human error
Automated deployments prevent
accidents caused by human error
Deploying code can be painful
“If it hurts, do it
more often.”
– Martin Fowler
The deployment process should
be:
That means you should never
deploy or configure manually
> ssh ec2-user@12.34.56.78
   __|  __|  __|
   _|  (   \__ \   Amazon ECS-Optimized Amazon Linux AMI 2015.09.d
 ____|\___|____/
[ec2-user ~]$ sudo yum install ruby
Don’t do this
Or this
Instead, automate everything
The gold standard is the
blue-green deployment
Let’s say you have version 0.0.1 of
your app deployed
First, deploy version 0.0.2 on a
duplicate set of servers
If everything looks good, switch
the load balancer over to 0.0.2
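The steps above can be sketched as a small simulation. Everything here is hypothetical: `LoadBalancer` is a stand-in object, and `deploy` and `is_healthy` are functions you would supply for your own infrastructure.

```python
class LoadBalancer:
    """Hypothetical load balancer that sends all traffic to one server set."""

    def __init__(self, live):
        self.live = live

    def switch_to(self, target):
        self.live = target


def blue_green_deploy(load_balancer, environments, new_version, deploy, is_healthy):
    """Deploy new_version to the idle environment, then switch traffic to it.

    environments maps names ("blue", "green") to the version each one runs.
    deploy(env, version) and is_healthy(env) are caller-supplied functions.
    Returns True if traffic was switched, False if the old version stays live.
    """
    idle = next(name for name in environments if name != load_balancer.live)
    deploy(idle, new_version)
    environments[idle] = new_version
    if not is_healthy(idle):
        return False  # users never saw the new version
    load_balancer.switch_to(idle)
    return True
```

If the new version misbehaves after the switch, rolling back is just another `switch_to` call, because the old servers are still running.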
Four main categories of
deployment automation tools:
1. Configuration management:
Chef, Puppet, Ansible, Salt
- name: Install httpd and php
  yum: name={{ item }} state=present
  with_items:
    - httpd
    - php
- name: start httpd
  service: name=httpd state=started enabled=yes
- name: Copy the code from repository
  git: repo={{ repository }} dest=/var/www/html/
Imperative scripts to configure
servers and deploy code
2. Provisioning tools: Terraform,
CloudFormation, Heat
resource "aws_instance" "example" {
  ami           = "ami-b960b1d"
  instance_type = "t2.micro"
}
resource "aws_eip" "ip" {
  instance   = "${aws_instance.example.id}"
  depends_on = ["aws_instance.example"]
}
Declarative templates that define
your infrastructure
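One way to picture "declarative": you describe the desired end state, and the tool works out the changes. The sketch below is a toy illustration of that idea, not how Terraform or any other tool is actually implemented; the `plan` function and its input shape are assumptions.

```python
def plan(desired, actual):
    """Compare desired resources with actual ones and list the changes needed.

    Both arguments map a resource name to a dict of its attributes.
    Returns a list of (action, resource_name) tuples.
    """
    changes = []
    for name, attrs in desired.items():
        if name not in actual:
            changes.append(("create", name))
        elif actual[name] != attrs:
            changes.append(("update", name))
    for name in actual:
        if name not in desired:
            changes.append(("destroy", name))
    return changes
```

Running the same plan twice against an up-to-date environment yields no changes, which is what makes declarative templates safe to re-apply.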
3. Virtual machines: VMware,
VirtualBox, Packer, Vagrant
{
  "builders": [{
    "type": "amazon-ebs",
    "source_ami": "ami-de0d9eb7",
    "instance_type": "m1.medium",
    "ami_name": "example-packer-ami-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get -y update",
      "sudo apt-get -y install httpd php"
    ]
  }]
}
Images of configured servers
4. Containers: Docker, rkt, LXD
FROM ubuntu:12.04
RUN apt-get update && apt-get install -y apache2 php
ENV APACHE_RUN_USER www-data
ENV APACHE_LOG_DIR /var/log/apache2
EXPOSE 80
CMD ["/usr/sbin/apache2", "-D", "FOREGROUND"]
Lightweight images of configured
servers
These tools allow you to define
your infrastructure as code
That way, you can version it,
review it, test it, and reuse it.
4. Safety catch
Elisha Otis demoing elevator free-fall safety in 1854
The safety elevator patent
The safety catches are locked by default
Only an intact cable can unlock the latches
This elevator provides safety by default
Feature toggles provide safety by default
New feature, part 1
New feature, part 2
New feature, part 3
If a large new feature takes many
commits, wouldn’t a user see it in
an unfinished state?
<section id="new-section">
  <!-- Code for new section -->
</section>
<section id="original-section">
  <!-- Code for original section -->
</section>
Let’s say you were adding a new
section to your website.
<% if toggles.enabled("new-section") %>
  <section id="new-section">
    <!-- Code for new section -->
  </section>
<% end %>
<section id="original-section">
  <!-- Code for original section -->
</section>
Wrap new code in a conditional
that looks up a feature toggle
Toggles are off by default, so
users won’t see unfinished work
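A minimal sketch of a toggle lookup that is off by default. The `FeatureToggles` class and the nested config shape (environment, then `feature_toggles`) are assumptions for illustration, not the API of a particular library.

```python
class FeatureToggles:
    """Looks up toggles for one environment; unknown toggles are off."""

    def __init__(self, config, environment):
        env_config = config.get(environment, {})
        self._toggles = env_config.get("feature_toggles", {})

    def enabled(self, name):
        # Only an explicit True turns a feature on, so a missing or
        # misspelled toggle hides the feature rather than exposing it.
        return self._toggles.get(name) is True

    def get(self, name, default=None):
        """Return the raw toggle value, for toggles that are not booleans."""
        return self._toggles.get(name, default)
```

Like the elevator's safety catch, the failure mode is the safe one: if the config is missing or wrong, users see the original section.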
development:
  feature_toggles:
    new-section: true
production:
  feature_toggles:
    new-section: false
You can enable feature toggles in
a config file.
> curl http://feature.toggles/
{
"development": { "new-section": true },
"production": { "new-section": false }
}
Or you could create a web service
for feature toggles.
> curl http://feature.toggles/?user=123
{
"development": { "new-section": "A" },
"production": { "new-section": "B" }
}
It could return different, complex
values for each user.
And provide a web UI for
configuring toggles.
This allows you to quickly turn
features on or off.
<% if toggles.get("new-section") == "A" %>
  <section id="new-section-bucket-a">
    <!-- Code for new section, version A -->
  </section>
<% elsif toggles.get("new-section") == "B" %>
  <section id="new-section-bucket-b">
    <!-- Code for new section, version B -->
  </section>
<% end %>
This allows A/B testing
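One possible way a toggle service could pick a bucket per user is to hash the user ID, so the same user always lands in the same bucket without storing any state. This is a sketch under that assumption; `bucket_for` is a hypothetical helper, and real A/B testing systems may assign buckets differently.

```python
import hashlib


def bucket_for(user_id, experiment, buckets=("A", "B")):
    """Deterministically assign a user to a bucket for an experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as a number, then pick a bucket.
    index = int.from_bytes(digest[:8], "big") % len(buckets)
    return buckets[index]
```

Including the experiment name in the hash keeps a user's bucket in one experiment independent of their bucket in another.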
5. Speedometer
A speedometer tells you how fast
you’re driving
Monitoring tells you how your
product is performing
“If you can’t
measure it, you
can’t fix it.”
– David Henke
There are many types of
monitoring
Availability metrics: is my product
up or down?
Useful tools: Keynote, Pingdom,
Uptime Robot, Route 53
Business metrics: what are my
users doing in the product?
Useful tools: Google Analytics,
KISSMetrics, Mixpanel
Application metrics: how is my
application performing?
Useful tools: New Relic,
CloudWatch, Datadog
127.0.0.1 - - [10/Oct/2000:13:55:36] "GET /apache_pb.gif HTTP/1.0" 200 2326
64.242.88.10 - - [07/Mar/2004:16:05:49] "GET /twiki/bin/ HTTP/1.1" 401 12846
127.0.0.1 - - [28/Jul/2006:10:22:04] "GET / HTTP/1.0" 200 2216
64.242.88.10 - - [07/Mar/2004:16:06:51] "GET /twiki/bin/Twiki/" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02] "GET /mailman HTTP/1.1" 200 6291
127.0.0.1 - - [28/Jul/2006:10:27:32] "GET /hidden/ HTTP/1.0" 404 7218
192.168.2.20 - - [28/Jul/2006:10:27:10] "GET /cgi-bin/try HTTP/1.0" 200 3395
64.242.88.10 - - [07/Mar/2004:16:11:58] "GET /twiki/bin/view/" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55] "GET /twiki HTTP/1.1" 200 5253
Log files are also a form of
application-level monitoring
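As a rough sketch of how log lines like the ones above turn into a metric, the code below counts responses per status code and derives an error rate. The regular expression assumes the simplified access-log layout shown in the examples; `count_statuses` and `error_rate` are hypothetical helpers.

```python
import re
from collections import Counter

# Assumed layout: host, two fields, [timestamp], "request", status, size.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+)$'
)


def count_statuses(lines):
    """Count responses per HTTP status code, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("status")] += 1
    return counts


def error_rate(counts):
    """Fraction of responses with a 4xx or 5xx status."""
    total = sum(counts.values())
    errors = sum(n for status, n in counts.items() if status[0] in "45")
    return errors / total if total else 0.0
```

The log tools listed in this section do this kind of aggregation at scale, with search and dashboards on top.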
Useful tools: Loggly, Logstash,
Papertrail, Sumo Logic
Server metrics: how is my server
performing?
Useful tools: Nagios, Icinga,
Munin, collectd, CloudWatch
6. Warning lights
Warning lights notify you if
something is wrong
Alerting systems notify you if
something is wrong
You can’t look at metrics 24/7.
Alerting systems can.
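At its simplest, an alerting rule is a threshold check that runs on a schedule. The sketch below assumes metrics and thresholds are plain dictionaries and that `notify` is a function you provide (for example, one that pages whoever is on call); all names are hypothetical.

```python
def check_and_alert(metrics, thresholds, notify):
    """Call notify(message) for each metric above its threshold.

    metrics and thresholds map metric names to numbers. Metrics with no
    current value are skipped. Returns the list of messages sent.
    """
    messages = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            message = f"ALERT: {name} is {value}, above threshold {limit}"
            notify(message)
            messages.append(message)
    return messages
```

Real alerting tools add escalation, deduplication, and on-call schedules on top of checks like this.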
Useful tools: PagerDuty,
VictorOps
For a full list of monitoring and
alerting tools, see:
hello-startup.net/resources
7. Seat belt
Seat belts help you survive
crashes
High availability helps you survive
crashes
Stateless servers: multiple
instances, multiple zones
Load balancer routes around
server or zone outages
Auto-recovery mechanism brings
server back after outage
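A minimal sketch of routing around an outage, assuming a round-robin load balancer with a caller-supplied health check. `RoundRobinBalancer` and the server names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through servers, skipping any that fail the health check."""

    def __init__(self, servers, is_healthy):
        self._servers = list(servers)
        self._is_healthy = is_healthy
        self._cycle = itertools.cycle(self._servers)

    def next_server(self):
        # Try each server at most once per call before giving up.
        for _ in range(len(self._servers)):
            server = next(self._cycle)
            if self._is_healthy(server):
                return server
        raise RuntimeError("no healthy servers available")
```

Because the health check runs on every call, a server that recovers is picked up again automatically on its next turn.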
Stateful servers: multiple
instances, multiple zones
Replication to one or more
standby servers
Load balancer switches to
standby server in case of outage
Auto-recovery mechanism brings
server back after outage
Test your recovery process
regularly.
Speed is limited by safety
Two cars can drive at 80 mph in
opposite directions safely…
Because of two yellow lines
It’s worth the time to put these
safety mechanisms in place
For more
info, see
Hello,
Startup
hello-startup.net
Questions?
Image credits
F1 racecar: Takayuki Suzuki
Highway traffic: Oran Viriyincy
Car accident: ER24 EMS (Pty) Ltd.
Road: Nicolas Raymond
BMW: Andy Durst
Self-driving car: Steve Jurvetson
Bus: Roland Tanglao
Tail lights: Tony Webster
USS South Dakota: Wikimedia
Crash test dummy: Wikimedia
Elisha Otis: Wikimedia
Otis Elevator: Wikimedia
Speedometer: Dawn Hopkins
Dashboard lights: Jim Larrison
Seat belt: Wikimedia
Google repo stats: Rachel Potvin
ISS: Wikimedia
Fire: Pete
Martin Fowler: Wikimedia
