Jon Todd, Chief Architect
ECS and Docker @Okta
August 2, 2016 @JonToddDotCom
Background
Millions of people use Okta every day
Thousands of enterprises use Okta to
connect to Adobe’s Creative Cloud
jim@designer.com
Thousands of Enterprise Customers
Ed, Gov, Non-Profit • Services • Media • Consumer • Technology • Manufacturing, Energy • Finance • Cloud • Health
Okta Application Network
Single Sign On: Secure SSO for All Your Web Apps, On-prem and Cloud, with Flexible Policy, from Any Device
Adaptive MFA: Contextual Access Policies, Modern Factors, Adaptive Authentication, Integrations for Apps and VPNs
Provisioning: Lifecycle Management, Cloud & On-prem App Integration, Mastering from Apps, Directory Provisioning, Rules, Workflow, Reporting
Mobility Management: Tight User Identity Integration, Device-Based Contextual Access, Light-weight Management
Universal Directory: Extensible Profiles, Attribute Transformations, Directory Integration and AD Password Management
Okta IT & Platform products
The most reliable IDaaS available
Never taken offline for upgrades
Redundant and scalable
[Diagram: redundant A/B/C cells across datacenters DC1 and DC2]
okta.com/trust
A Platform Architecture For Scale
[Diagram: load balancers → app servers → data tier]
Global Datacenters
Our stack
stackshare.io/okta/okta
The Problem
Defining a pattern for micro-services
https://www.pinterest.com/pin/205828645447534387/
http://www.bennysbaker.com/poop-emoji-cupcakes/
DevOps abstraction layer
Inspired by: http://dev2ops.org/2010/02/what-is-devops/
Dev vs Ops: Wall of turmoil
Dev: I want change • Ops: I want stability
Domain boundary
Repeatability through immutability
• Same runtime environment
dev / test / prod
• Runtime versioned w/ code
• Easy reproducibility
• All changes use same
release process
Additional requirements
• 0-downtime deployments
• Support for our multi-AZ & multi-region architecture
• Compliance – SOC 2 Type 2, HIPAA, ISO 27001
• Separation of duties – a.k.a. no developer access to production hosts
• Push button deployment
• Rollback and canary support
Technology Selection
Building blocks
Dev: I want change • Ops: I want stability
Between them: container frameworks, cluster schedulers, continuous integration
Options
Container frameworks: Docker, LXC, …
Cluster schedulers: Amazon EC2 Container Service, …
Our problems solved
Docker:
• Repeatability
• Declarative & composable Dockerfile
• Images are immutable
• Stability
• Massive community with production adoption
• Initial release > 3 years ago
EC2 Container Service:
• Compliant
• ECS isn’t in the flow, EC2 is already compliant
• DevOps Abstraction
• Hosts and underlying resources abstracted away
• Task Definition allows developers to schedule deploys
• Stability
• 0-downtime services
• Fully managed!
• Works with existing AWS tooling
ECS Refresher
Source: All Things Distributed – a.k.a. Werner Vogels
Additional concepts
• Task Definitions define one or more containers to run
• Services define a long-running task and run inside a cluster
• Clusters define a set of EC2 resources that can be shared by more than one service
• Auto Scaling groups can be used to define the size and launch configuration of a cluster
CI with ECS Tasks
CI Workflow
Artifactory
(Maven, NPM, Docker, YUM)
Topic builds – topic repo
Promoted builds – release repo
CI Workflow
Why ECS – Isolation & Versioning
Why ECS – Dynamic worker scaling
1. Lambda: Task which scales the cluster based on queue depth
2. Lambda: Inspects running tasks and bin-packs new tasks where possible
• This is one of the changes we had to make in order to use ECS for long-running tasks, rather than long-running services spread across many stateless instances
• Disconnects unneeded nodes from the cluster, allowing them to self-terminate when idle
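The first Lambda's sizing decision can be sketched as a pure function. This is an illustration of the queue-driven approach described above, not Okta's actual code; the parameter names and the clamping rule are assumptions.

```python
import math

def desired_cluster_size(queue_depth: int, tasks_per_instance: int,
                         min_size: int, max_size: int) -> int:
    """Sketch of a queue-driven sizing rule: enough instances to run one
    task per queued job, bin-packing tasks_per_instance onto each host,
    clamped to the Auto Scaling Group's bounds."""
    needed = math.ceil(queue_depth / tasks_per_instance) if queue_depth else 0
    return max(min_size, min(needed, max_size))

# Quiet queue: stay at the floor; busy queue: scale up, but respect the cap.
assert desired_cluster_size(0, 8, 2, 150) == 2
assert desired_cluster_size(100, 8, 2, 150) == 13     # ceil(100 / 8)
assert desired_cluster_size(10000, 8, 2, 150) == 150
```

The second Lambda would then place tasks onto the fullest instances first, so lightly loaded nodes drain and can disconnect and self-terminate.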
Dynamic Scaling
Cost Savings With Spot Instances
Feature Requests
• Ability to have spot and on-demand in same Auto Scaling Group (ASG)
• Built-in bin packing scheduler
• Give ASG a termination policy based on ECS status
• i.e. prefer instances with no running tasks
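The last request, preferring idle instances at scale-in, can be sketched as a selection function one might run from a Lambda before detaching instances; since neither ECS nor ASG offered such a policy at the time, this is a hypothetical workaround, not a feature of either service.

```python
def pick_instances_to_terminate(running_tasks_by_instance, count):
    """Prefer instances with the fewest running tasks (ideally zero) when
    the cluster scales in. Input maps instance id -> running task count."""
    by_idleness = sorted(running_tasks_by_instance,
                         key=lambda iid: running_tasks_by_instance[iid])
    return by_idleness[:count]

cluster = {"i-aaa": 3, "i-bbb": 0, "i-ccc": 1, "i-ddd": 0}
victims = pick_instances_to_terminate(cluster, 2)
assert set(victims) == {"i-bbb", "i-ddd"}   # the two idle instances go first
```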
Termination policy
• OldestInstance. Auto Scaling terminates the oldest instance in the group. This
option is useful when you're upgrading the instances in the Auto Scaling group to a
new EC2 instance type, so you can gradually replace instances of the old type with
instances of the new type.
• NewestInstance. Auto Scaling terminates the newest instance in the group. This
policy is useful when you're testing a new launch configuration but don't want to
keep it in production.
• OldestLaunchConfiguration. Auto Scaling terminates instances that have the
oldest launch configuration. This policy is useful when you're updating a group and
phasing out the instances from a previous configuration.
• ClosestToNextInstanceHour. Auto Scaling terminates instances that are closest to
the next billing hour. This policy helps you maximize the use of your instances and
manage costs.
• Default. Auto Scaling uses its default termination policy. This policy is useful when
you have more than one scaling policy associated with the group.
Takeaways
• ECS is running well for us in a 150+ instance cluster
• Bake AMI with large files and common images into host machines
• Spot instances give a 2-minute warning, so keep jobs short
Micro-services with
ECS Services
Due diligence
0-Downtime Testing
https://github.com/jontodd/aries
Test Assumptions
• ECS config
• Agent version 1.11.0
• Docker version 1.11.2
• Cluster config
• 8 instances backed by ASG
• ASG config
• 8 instances across 3 AZs
• Default termination policy
• 5 min health check grace period
• ELB
• Timeout 4s
• Interval 5s
• Unhealthy threshold 2
• Healthy threshold 10
• Enable connection draining 300s timeout
• Load generation
• 16 threads
• Throughput
• Interactive: 490 r/s
• 10s long poll: 1.5 r/s
Operation                                           Interactive errors        Long-poll errors
                                                    (~70ms latency, 490 rps)  (~10s latency, 1.5 rps)
Upsize ECS service 4 → 8                            0                         0
Downsize ECS service 8 → 4                          0                         0
Deploy ECS service – 50% min healthy                0                         0
Stop task*                                          0                         0
Downsize Auto Scaling Group (ASG)                   0                         0
Terminate EC2 instance                              0                         0
Stop Docker daemon (service docker stop)*           0                         0
Stop EC2 instance**                                 0                         0
Kill Docker container (docker kill <containerId>)*  2                         2
Fail health check                                   450                       5

* No intention of running operation in practice  ** Caused inconsistent state
Our architecture
Workflow
[Diagram] Dev: a Git repo holds the Dockerfile and docker_compose.yml; the CI pipeline publishes promoted artifacts to the Docker Registry (Artifactory). Test / Preview / Production: Conductor (a bastion ECS controller) deploys new versions described by the application YAML; an Auto Scaling Group with a Launch Config provides EC2 instances for the ECS Cluster, which runs the ECS Service and an ECS Canary Service behind an ELB. Images are pulled when tasks start.
Application definition
• Developers define YAML for
their application
• Deploy time configuration is
supplied to the ECS task
definition
• Secrets are pulled by the
application at startup
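A hypothetical sketch of such an application definition is below. The field names and values are invented for illustration; Okta's actual YAML schema is not shown in the talk. Note the deploy-time configuration holds no secrets, matching the point above.

```yaml
# Hypothetical application definition; field names are illustrative only.
application: example-service
components:
  backend:
    image: registry.example.com/example-service   # internal registry only
    version: 1.2.0-abc1234                        # pegged SEMVER-SHORT_SHA tag
    count: 4                                      # desired running tasks
    cpu: 256
    memory: 512
    environment:                                  # deploy-time config, no secrets
      LOG_LEVEL: info
    elb: example-service-elb
```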
Security conventions
• Container repository
• Only allow containers from internal repository
• IAM separation per service
• Either one service per cluster, or use the new IAM roles for ECS tasks functionality
• Security scanning of containers – JFrog Xray
• Process monitoring on the Docker host – cAdvisor from Google
• Secrets or any form of config NEVER baked into containers
• Start from a minimal, audited base OS
• Run containers as a non-privileged user w/ user namespaces (Docker 1.10+)
• Monitor alas.aws.amazon.com for critical updates
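Several of these conventions show up directly in a Dockerfile. The fragment below is a minimal, hypothetical sketch, not Okta's actual image: the base image name, tag, and user are invented, and secrets are left to arrive at application startup rather than build time.

```dockerfile
# FROM must point at the internal artifact server, with a pegged tag
FROM registry.example.com/base-os:1.4.2-9f8e7d6

# Run as a non-privileged user (pair with Docker 1.10+ user namespaces)
RUN useradd --no-create-home --shell /bin/false svc
USER svc

COPY app/ /opt/app/

# No secrets or config baked in; the app pulls its secrets at startup
CMD ["/opt/app/run"]
```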
Source Conventions
• 3 categories of container definitions
1. "Library" definitions used as the basis for building other images
2. Third-party service definitions e.g. Zookeeper or Elasticsearch
3. Internal service definitions
• Repo per internal service
• Dockerfile in same repo => image versioned with code
• Docker compose for running dependent services
• Pegged versions (no builds)
• Single repo for library and third-party service definitions
Build Conventions
• Integration tests run against code running in container
• Build owns creating immutable version and publishing to artifact server
• Strict rules around the "FROM" clause
• Must point at internal artifact server
• Must be tagged following the SEMVER-SHORT_SHA convention
• Never allow a missing tag or use of the "latest" tag, for repeatable builds
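The tagging rule above can be sketched as a check a build step might run. The regex and internal registry host are illustrative assumptions (and the sketch ignores registries with port numbers); the point is rejecting external images, missing tags, and the mutable "latest" tag.

```python
import re

# SEMVER-SHORT_SHA, e.g. 1.2.0-abc1234, pulled only from the internal registry.
TAG_RE = re.compile(r"^\d+\.\d+\.\d+-[0-9a-f]{7}$")
INTERNAL_REGISTRY = "registry.example.com"   # hypothetical host

def valid_from_line(image_ref: str) -> bool:
    """Accept 'registry.example.com/name:1.2.0-abc1234'; reject external
    registries, missing tags, and the mutable 'latest' tag."""
    registry, _, rest = image_ref.partition("/")
    if registry != INTERNAL_REGISTRY or ":" not in rest:
        return False
    tag = rest.rsplit(":", 1)[1]
    return bool(TAG_RE.match(tag))

assert valid_from_line("registry.example.com/base-os:1.4.2-9f8e7d6")
assert not valid_from_line("registry.example.com/base-os:latest")
assert not valid_from_line("docker.io/library/ubuntu:16.04")
assert not valid_from_line("registry.example.com/base-os")   # missing tag
```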
Logging and monitoring
• Logging
• All output streams pipe to STDOUT/STDERR of the running process
• Log forwarding is provided by underlying host
• Log entries contain
• Host
• Container Id
• Image name & version
• Request Id
• Metrics
• Host level, generic container metrics provided by host
• App level metrics published directly to well defined endpoints
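A log entry carrying the context fields listed above can be sketched as follows; the JSON format and field names are our illustration, not Okta's actual log schema.

```python
import json

def log_entry(host, container_id, image, request_id, message):
    """Build a structured log line with the context fields the host-level
    forwarder attaches to every entry."""
    return json.dumps({
        "host": host,
        "containerId": container_id,
        "image": image,            # name & version, e.g. example-app:1.2.0-abc1234
        "requestId": request_id,
        "message": message,
    }, sort_keys=True)

line = log_entry("ip-10-0-0-12", "8f9920cf", "example-app:1.2.0-abc1234",
                 "req-42", "request handled")
assert json.loads(line)["requestId"] == "req-42"
```

Because the container writes only to STDOUT/STDERR, the host can enrich and forward these lines without the application knowing anything about the logging pipeline.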
Feature requests
• ELB
• Dynamic port mapping to containers
• Fail health based on HTTP return code
• Different health endpoint for adding vs removing
• Service level security groups
• Service discovery w/o ELB
• Ability to mark container instances as un-schedulable
• Remove sharp edges around the stopped state
• Give ASG ability to set EC2 "shutdown behavior"
• Periodic cleanup process in ECS to deregister stopped instances
Takeaways
• /etc/ecs/ecs.config
• ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION for forensics (default 1hr)
• ECS_LOGLEVEL=debug
• Beware of running services in same cluster that use the same ports
• Tune ELB health check
• Docker 1.10 for security enhancements
• Canary & Blue/Green: a separate service attached to the same ELB
• Rollback is trivial
• ECS is incredibly easy to get up and running
• The ecosystem is changing quickly; we are moving cautiously
• ECS team has made a lot of improvements
Dev + Ops: Wall of turmoil → Automated pipeline of awesomeness
Questions?
Thank You
Follow me @JonToddDotCom
Join us @Okta - www.okta.com/company/careers/


Editor's Notes

  • #3 How many have heard of Okta? Used it?
  • #7 This is the full set of Okta IT and Platform products, 100% cloud-based and integrated. Each of these is a full-featured product you could use to replace CA, RSA, or Airwatch.
  • #10 Java backend, JS front end, entirely hosted in the cloud in AWS. In general we like using and giving back to open source.
  • #14 Same environment dev / test / prod; the environment should be versioned with code. Problems with Chef mutating production with a bad or incorrect version of config. Easy reproducibility: a security audit can be done on the artifact, then just monitor the runtime for the correct version.
  • #19 All together we get a PATTERN FOR MICROSERVICES
  • #21 We run on the ECS-optimized AMI: reduced packages, easier upgrades.
  • #24 Developers can run all CI tests on any topic branch. Master is locked down; Bacon is the gatekeeper. Jenkins is used for job definition and lifecycle. The slave pool is ECS! Jobs run as short-lived ECS tasks. Each day we get between 100 & 150 containers at peak load.
  • #25 This is Bacon
  • #26 From before: the main goal is repeatability and immutability. Not only are the artifact and its runtime immutable, but the container which builds the artifact for testing is itself containerized. Solves a classic problem: changes to the environment in CI.
  • #27 Who has the knowledge about sizing?
  • #29 We presently respond to spot price termination notices (you get 2 minutes warning) by placing tasks running on a node to be terminated back into the queue, to immediately get picked up by another node. Currently working on recognition of spot price instance pool cascades, so we can switch to on-demand. There is no ability to have both spot and on-demand in the same ASG. Something to worry about: if prices spike and cause large outages, what is the availability of on-demand instances?
  • #32 We auto scale daily to around 150 instances and back down to under 20. Preloading Maven, NPM, & Git repositories saved us about 4 minutes on container start time.
  • #40 The integration point between CI and deployment is Artifactory; any sign-off or approval happens there. Auto Scaling Groups control the pool of EC2 instances. The launch config sets environment variables for ECS config like cluster and ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION. ECS cluster per service due to IAM issues; looking forward to using the new feature. One or more services registered to a service; ELB supports canary. Conductor is the bastion service; it allows non-operators to perform deployments.
  • #45 2016-07-26T18:56:08Z [INFO] Redundant container state change for task op1-sage:15 arn:aws:ecs:us-east-1:011750033084:task/8f9920cf-a289-44bb-ac43-e436d6fb84d7, Status: (RUNNING->RUNNING) Containers: [op1-sage-app (RUNNING->RUNNING),]: op1-sage-app(docker.aue1d.saasure.com/okta-sage:1_1_0_029796_ec67fd3) (RUNNING->RUNNING) to RUNNING, but already RUNNING