Jon Todd, Chief Architect
ECS and Docker @Okta
August 2, 2016 @JonToddDotCom
Background
Millions of people use Okta every day
Thousands of enterprises use Okta to
connect to Adobe’s Creative Cloud
jim@designer.com
Thousands of Enterprise Customers
Ed, Gov, Non-Profit • Services • Media • Consumer • Technology • Manufacturing, Energy • Finance • Cloud • Health
Okta Application Network
Single Sign On: Secure SSO for All Your Web Apps, On-prem and Cloud, with Flexible Policy, from Any Device
Adaptive MFA: Contextual Access Policies, Modern Factors, Adaptive Authentication, Integrations for Apps and VPNs
Provisioning: Lifecycle Management, Cloud & On-prem App Integration, Mastering from Apps, Directory Provisioning, Rules, Workflow, Reporting
Mobility Management: Tight User Identity Integration, Device-Based Contextual Access, Light-weight Management
Universal Directory: Extensible Profiles, Attribute Transformations, Directory Integration and AD Password Management
Okta IT & Platform products
The most reliable IDaaS available
Never taken offline for upgrades
Redundant and scalable
[Diagram: redundant A/B/C cells across datacenters DC1 and DC2]
okta.com/trust
A Platform Architecture For Scale
[Diagram: load balancers → app servers → data tier]
Global Datacenters
Our stack
stackshare.io/okta/okta
The Problem
Defining a pattern for micro-services
https://www.pinterest.com/pin/205828645447534387/
http://www.bennysbaker.com/poop-emoji-cupcakes/
DevOps abstraction layer
Inspired by: http://dev2ops.org/2010/02/what-is-devops/
Dev vs Ops: Wall of turmoil
Dev: I want change • Ops: I want stability
Domain boundary
Repeatability through immutability
• Same runtime environment
dev / test / prod
• Runtime versioned w/ code
• Easy reproducibility
• All changes use same
release process
Additional requirements
• 0-downtime deployments
• Support for our multi-AZ & multi-region architecture
• Compliance – SOC 2 Type 2, HIPAA, ISO 27001
• Separation of duties – a.k.a. no developer access to production hosts
• Push button deployment
• Rollback and canary support
Technology Selection
Building blocks
Dev: I want change • Ops: I want stability
Between them: container frameworks, cluster schedulers, continuous integration
Options
Container frameworks: Docker, LXC, …
Cluster schedulers: Amazon EC2 Container Service, …
Our problems solved
Docker:
• Repeatability
• Declarative & composable Dockerfile
• Images are immutable
• Stability
• Massive community with production adoption
• Initial release > 3 years ago
EC2 Container Service:
• Compliant
• ECS isn’t in the flow, EC2 is already compliant
• DevOps Abstraction
• Hosts and underlying resources abstracted away
• Task Definition allows developers to schedule deploys
• Stability
• 0-downtime services
• Fully managed!
• Works with existing AWS tooling
ECS Refresher
Source: All Things Distributed – a.k.a. Werner Vogels
Additional concepts
• Task Definitions define one or more containers to run
• Services define a long-running task and run inside a cluster
• Clusters define a set of EC2 resources that can be shared by more than one service
• Auto Scaling groups can be used to define the size and launch configuration of a cluster
CI with ECS Tasks
CI Workflow
Artifactory
(Maven, NPM, Docker, YUM)
Topic builds – topic repo
Promoted builds – release repo
CI Workflow
Why ECS – Isolation & Versioning
Why ECS – Dynamic worker scaling
1. Lambda: Task which scales the cluster based on queue depth
2. Lambda: Inspects running tasks and bin-packs new tasks where possible
• This is one of the changes we had to make in order to use ECS for long-running tasks, rather than long-running services spread across many stateless instances
• Disconnects unneeded nodes from the cluster, allowing them to self-terminate when idle
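The first Lambda's sizing decision can be sketched as a pure function. This is an illustration of the queue-driven approach described above, not Okta's actual code; the parameter names and the clamping rule are assumptions.

```python
import math

def desired_cluster_size(queue_depth: int, tasks_per_instance: int,
                         min_size: int, max_size: int) -> int:
    """Sketch of a queue-driven sizing rule: enough instances to run one
    task per queued job, bin-packing tasks_per_instance onto each host,
    clamped to the Auto Scaling Group's bounds."""
    needed = math.ceil(queue_depth / tasks_per_instance) if queue_depth else 0
    return max(min_size, min(needed, max_size))

# Quiet queue: stay at the floor; busy queue: scale up, but respect the cap.
assert desired_cluster_size(0, 8, 2, 150) == 2
assert desired_cluster_size(100, 8, 2, 150) == 13     # ceil(100 / 8)
assert desired_cluster_size(10000, 8, 2, 150) == 150
```

The second Lambda would then place tasks onto the fullest instances first, so lightly loaded nodes drain and can disconnect and self-terminate.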
Dynamic Scaling
Cost Savings With Spot Instances
Feature Requests
• Ability to have spot and on-demand in same Auto Scaling Group (ASG)
• Built-in bin packing scheduler
• Give ASG a termination policy based on ECS status
• i.e. prefer instances with no running tasks
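The last request, preferring idle instances at scale-in, can be sketched as a selection function one might run from a Lambda before detaching instances; since neither ECS nor ASG offered such a policy at the time, this is a hypothetical workaround, not a feature of either service.

```python
def pick_instances_to_terminate(running_tasks_by_instance, count):
    """Prefer instances with the fewest running tasks (ideally zero) when
    the cluster scales in. Input maps instance id -> running task count."""
    by_idleness = sorted(running_tasks_by_instance,
                         key=lambda iid: running_tasks_by_instance[iid])
    return by_idleness[:count]

cluster = {"i-aaa": 3, "i-bbb": 0, "i-ccc": 1, "i-ddd": 0}
victims = pick_instances_to_terminate(cluster, 2)
assert set(victims) == {"i-bbb", "i-ddd"}   # the two idle instances go first
```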
Termination policy
• OldestInstance. Auto Scaling terminates the oldest instance in the group. This
option is useful when you're upgrading the instances in the Auto Scaling group to a
new EC2 instance type, so you can gradually replace instances of the old type with
instances of the new type.
• NewestInstance. Auto Scaling terminates the newest instance in the group. This
policy is useful when you're testing a new launch configuration but don't want to
keep it in production.
• OldestLaunchConfiguration. Auto Scaling terminates instances that have the
oldest launch configuration. This policy is useful when you're updating a group and
phasing out the instances from a previous configuration.
• ClosestToNextInstanceHour. Auto Scaling terminates instances that are closest to
the next billing hour. This policy helps you maximize the use of your instances and
manage costs.
• Default. Auto Scaling uses its default termination policy. This policy is useful when
you have more than one scaling policy associated with the group.
Takeaways
• ECS is running well for us in a 150+ instance cluster
• Bake AMI with large files and common images into host machines
• Spot instances give a 2-minute warning, so keep jobs short
Micro-services with
ECS Services
Due diligence
0-Downtime Testing
https://github.com/jontodd/aries
Test Assumptions
• ECS config
• Agent version 1.11.0
• Docker version 1.11.2
• Cluster config
• 8 instances backed by ASG
• ASG config
• 8 instances across 3 AZs
• Default termination policy
• 5 min health check grace period
• ELB
• Timeout 4s
• Interval 5s
• Unhealthy threshold 2
• Healthy threshold 10
• Enable connection draining 300s timeout
• Load generation
• 16 threads
• Throughput
• Interactive: 490 r/s
• 10s long poll: 1.5 r/s
Operation                                           Interactive errors        Long-poll errors
                                                    (~70ms latency, 490 rps)  (~10s latency, 1.5 rps)
Upsize ECS service 4 → 8                            0                         0
Downsize ECS service 8 → 4                          0                         0
Deploy ECS service – 50% min healthy                0                         0
Stop task*                                          0                         0
Downsize Auto Scaling Group (ASG)                   0                         0
Terminate EC2 instance                              0                         0
Stop Docker daemon (service docker stop)*           0                         0
Stop EC2 instance**                                 0                         0
Kill Docker container (docker kill <containerId>)*  2                         2
Fail health check                                   450                       5

* No intention of running operation in practice  ** Caused inconsistent state
Our architecture
Workflow
[Diagram] Dev: a Git repo holds the Dockerfile and docker_compose.yml; the CI pipeline publishes promoted artifacts to the Docker Registry (Artifactory). Test / Preview / Production: Conductor (a bastion ECS controller) deploys new versions described by the application YAML; an Auto Scaling Group with a Launch Config provides EC2 instances for the ECS Cluster, which runs the ECS Service and an ECS Canary Service behind an ELB. Images are pulled when tasks start.
Application definition
• Developers define YAML for
their application
• Deploy time configuration is
supplied to the ECS task
definition
• Secrets are pulled by the
application at startup
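A hypothetical sketch of such an application definition is below. The field names and values are invented for illustration; Okta's actual YAML schema is not shown in the talk. Note the deploy-time configuration holds no secrets, matching the point above.

```yaml
# Hypothetical application definition; field names are illustrative only.
application: example-service
components:
  backend:
    image: registry.example.com/example-service   # internal registry only
    version: 1.2.0-abc1234                        # pegged SEMVER-SHORT_SHA tag
    count: 4                                      # desired running tasks
    cpu: 256
    memory: 512
    environment:                                  # deploy-time config, no secrets
      LOG_LEVEL: info
    elb: example-service-elb
```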
Security conventions
• Container repository
• Only allow containers from internal repository
• IAM separation per service
• Either one service per cluster, or use the new IAM roles for ECS tasks functionality
• Security scanning of containers – JFrog Xray
• Process monitoring on the Docker host – cAdvisor from Google
• Secrets or any form of config NEVER baked into containers
• Start from a minimal, audited base OS
• Run containers as a non-privileged user w/ user namespaces (Docker 1.10+)
• Monitor alas.aws.amazon.com for critical updates
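Several of these conventions show up directly in a Dockerfile. The fragment below is a minimal, hypothetical sketch, not Okta's actual image: the base image name, tag, and user are invented, and secrets are left to arrive at application startup rather than build time.

```dockerfile
# FROM must point at the internal artifact server, with a pegged tag
FROM registry.example.com/base-os:1.4.2-9f8e7d6

# Run as a non-privileged user (pair with Docker 1.10+ user namespaces)
RUN useradd --no-create-home --shell /bin/false svc
USER svc

COPY app/ /opt/app/

# No secrets or config baked in; the app pulls its secrets at startup
CMD ["/opt/app/run"]
```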
Source Conventions
• 3 categories of container definitions
1. "Library" definitions used as the basis for building other images
2. Third-party service definitions e.g. Zookeeper or Elasticsearch
3. Internal service definitions
• Repo per internal service
• Dockerfile in same repo => image versioned with code
• Docker compose for running dependent services
• Pegged versions (no builds)
• Single repo for library and third-party service definitions
Build Conventions
• Integration tests run against code running in container
• Build owns creating immutable version and publishing to artifact server
• Strict rules around the "FROM" clause
• Must point at internal artifact server
• Must be tagged following the SEMVER-SHORT_SHA convention
• Never allow a missing tag or use of the "latest" tag, for repeatable builds
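The tagging rule above can be sketched as a check a build step might run. The regex and internal registry host are illustrative assumptions (and the sketch ignores registries with port numbers); the point is rejecting external images, missing tags, and the mutable "latest" tag.

```python
import re

# SEMVER-SHORT_SHA, e.g. 1.2.0-abc1234, pulled only from the internal registry.
TAG_RE = re.compile(r"^\d+\.\d+\.\d+-[0-9a-f]{7}$")
INTERNAL_REGISTRY = "registry.example.com"   # hypothetical host

def valid_from_line(image_ref: str) -> bool:
    """Accept 'registry.example.com/name:1.2.0-abc1234'; reject external
    registries, missing tags, and the mutable 'latest' tag."""
    registry, _, rest = image_ref.partition("/")
    if registry != INTERNAL_REGISTRY or ":" not in rest:
        return False
    tag = rest.rsplit(":", 1)[1]
    return bool(TAG_RE.match(tag))

assert valid_from_line("registry.example.com/base-os:1.4.2-9f8e7d6")
assert not valid_from_line("registry.example.com/base-os:latest")
assert not valid_from_line("docker.io/library/ubuntu:16.04")
assert not valid_from_line("registry.example.com/base-os")   # missing tag
```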
Logging and monitoring
• Logging
• All output streams pipe to STDOUT/STDERR of the running process
• Log forwarding is provided by underlying host
• Log entries contain
• Host
• Container Id
• Image name & version
• Request Id
• Metrics
• Host level, generic container metrics provided by host
• App level metrics published directly to well defined endpoints
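A log entry carrying the context fields listed above can be sketched as follows; the JSON format and field names are our illustration, not Okta's actual log schema.

```python
import json

def log_entry(host, container_id, image, request_id, message):
    """Build a structured log line with the context fields the host-level
    forwarder attaches to every entry."""
    return json.dumps({
        "host": host,
        "containerId": container_id,
        "image": image,            # name & version, e.g. example-app:1.2.0-abc1234
        "requestId": request_id,
        "message": message,
    }, sort_keys=True)

line = log_entry("ip-10-0-0-12", "8f9920cf", "example-app:1.2.0-abc1234",
                 "req-42", "request handled")
assert json.loads(line)["requestId"] == "req-42"
```

Because the container writes only to STDOUT/STDERR, the host can enrich and forward these lines without the application knowing anything about the logging pipeline.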
Feature requests
• ELB
• Dynamic port mapping to containers
• Fail health based on HTTP return code
• Different health endpoint for adding vs removing
• Service level security groups
• Service discovery w/o ELB
• Ability to mark container instances as un-schedulable
• Remove sharp edges around the stopped state
• Give ASG ability to set EC2 "shutdown behavior"
• Periodic cleanup process in ECS to deregister stopped instances
Takeaways
• /etc/ecs/ecs.config
• ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION for forensics (default 1hr)
• ECS_LOGLEVEL=debug
• Beware of running services in same cluster that use the same ports
• Tune ELB health check
• Docker 1.10 for security enhancements
• Canary & Blue/Green: a separate service attached to the same ELB
• Rollback is trivial
• ECS is incredibly easy to get up and running
• The ecosystem is changing quickly; we are moving cautiously
• ECS team has made a lot of improvements
Dev + Ops: Wall of turmoil → Automated pipeline of awesomeness
Questions?
Thank You
Follow me @JonToddDotCom
Join us @Okta - www.okta.com/company/careers/


Editor's Notes

  • #3 How many have heard of Okta? Used it?
  • #7 This is the full set of Okta IT and Platform products, 100% cloud-based and integrated. Each of these is a full-featured product you could use to replace CA, RSA, or Airwatch.
  • #10 Java backend, JS front end, entirely hosted in the cloud in AWS. In general we like using and giving back to open source.
  • #14 Same environment dev / test / prod; the environment should be versioned with code. Problems with Chef mutating production with a bad or incorrect version of config. Easy reproducibility: a security audit can be done on the artifact, then just monitor the runtime for the correct version.
  • #19 All together we get a PATTERN FOR MICROSERVICES
  • #21 We run on the ECS-optimized AMI: reduced packages, easier upgrades.
  • #24 Developers can run all CI tests on any topic branch. Master is locked down; Bacon is the gatekeeper. Jenkins is used for job definition and lifecycle. The slave pool is ECS! Jobs run as short-lived ECS tasks. Each day we get between 100 & 150 containers at peak load.
  • #25 This is Bacon
  • #26 From before: the main goal is repeatability and immutability. Not only are the artifact and its runtime immutable, but the container which builds the artifact for testing is itself containerized. Solves a classic problem: changes to the environment in CI.
  • #27 Who has the knowledge about sizing?
  • #29 We presently respond to spot price termination notices (you get 2 minutes warning) by placing tasks running on a node to be terminated back into the queue, to immediately get picked up by another node. Currently working on recognition of spot price instance pool cascades, so we can switch to on-demand. There is no ability to have both spot and on-demand in the same ASG. Something to worry about: if prices spike and cause large outages, what is the availability of on-demand instances?
  • #32 We auto scale daily to around 150 instances and back down to under 20. Preloading Maven, NPM, & Git repositories saved us about 4 minutes on container start time.
  • #40 The integration point between CI and deployment is Artifactory; any sign-off or approval happens there. Auto Scaling Groups control the pool of EC2 instances. The launch config sets environment variables for ECS config like cluster and ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION. ECS cluster per service due to IAM issues; looking forward to using the new feature. One or more services registered to a service; ELB supports canary. Conductor is the bastion service; it allows non-operators to perform deployments.
  • #45 2016-07-26T18:56:08Z [INFO] Redundant container state change for task op1-sage:15 arn:aws:ecs:us-east-1:011750033084:task/8f9920cf-a289-44bb-ac43-e436d6fb84d7, Status: (RUNNING->RUNNING) Containers: [op1-sage-app (RUNNING->RUNNING),]: op1-sage-app(docker.aue1d.saasure.com/okta-sage:1_1_0_029796_ec67fd3) (RUNNING->RUNNING) to RUNNING, but already RUNNING