This is a repository for deploying Hatchet via sst.dev + Pulumi in AWS.
This is aimed at someone who is looking to integrate Hatchet into their stack and needs self-hosting but is not an expert in AWS (or who is familiar with AWS/ECS but not EKS).
The Hatchet managed cloud offers a free tier; however, if you do the kind of embarrassingly parallel simulation work I do, the limitations on simultaneous worker counts will prevent it from being relevant to you. You would likely need to go with a custom plan for the kind of workload pattern I have - e.g. running experiments with a few million simulations over a few thousand workers, but only once or twice a month, if that. I recommend you get in touch with the team to discuss pricing, because (a) they are super helpful and (b) if you are an academic like me, you would probably prefer to use managed infra rather than worrying about your own. You can compare pricing at the end of the document.
In any case, it's relatively easy to self-deploy Hatchet (and inexpensive, assuming your needs are infrequent but highly bursty and you don't mind standing up and tearing down infra each time you run an experiment). It's only a few resources really - an AmazonMQ RabbitMQ broker, an Aurora/RDS Postgres database, and an ECS service with the actual Hatchet engine, API, and web UI dashboard. Having said that, there are also several conveniences pre-configured so that you can easily deploy in private subnets or with/without an internet-facing load balancer. If you have cloud infra experience, you may prefer to roll your own deployment; if not, this should be enough to get you off the ground with Hatchet relatively quickly.
Hatchet's official self-hosting docs include lots more information, including official support for Kubernetes w/ Helm charts or Glasskube, but as someone with no real experience with K8s, I found it easier to translate the Docker Compose deployment instructions into ECS.
You will need an AWS account with credentials, as well as Docker, Node and SST installed.
Visit Docker and install the relevant version for your system if you do not have it already. This will be necessary when deploying since containers are built on your machine before being pushed to ECR.
Follow the instructions here to get Node/npm installed on your machine if you do not have them already. I recommend using nvm regardless of whether you are on Windows or OSX/Linux.
You can follow the official AWS instructions or just use a pre-existing account with relevant permissions; however, sst.dev's instructions are actually pretty helpful in that they guide you through setting up an organization with different isolated accounts for specific environments, so if you are standing up a new project, following them is not a bad idea!
```shell
cd path/to/your/repos/
git clone https://github.com/szvsw/hatchet-sst.git
cd hatchet-sst
npm i
```

You can skip this step if you only want the engine available to worker nodes in the same VPC (or via tunneling). If you do not know what this means, then you should buy a domain!
In most cases, you will want to make the engine available over the open internet so that you can visit the Hatchet dashboard to check task progress and allow worker nodes on your local machine to easily connect to the engine.
The easiest way to do this is to purchase a domain through AWS Route53, and let sst.dev automatically configure all of the relevant DNS settings, certs for SSL, load balancer config etc. Depending on how luxurious you are feeling with your choice of domain, this is probably approx. $50/yr for the domain + the monthly LB costs (approx. $30/mo, but if you are just standing up the engine for infrequent experiment runs, e.g. once or twice a month and then tearing down, it's much less).
- Log in to your AWS console.
- Navigate to Route53.
- Purchase a domain and write down its name, e.g. `acmelab.com`.
If you have an externally managed domain, you will need to create a certificate in ACM and add it to the env vars - more documentation coming soon. It's pretty easy though! Essentially you just need to add one or two records to your DNS config via your DNS provider's console and wait 20 min. TODO: enable certificate referencing
sst lets you manage different stages (aka environments) when you deploy, including some cool functionality around dev deployments, but we will not worry about that for now. By default, when you run a command like `sst deploy`, it will deploy to a stage named after your current OS username - e.g. for me that's `szvsw` on my work computer but `sam` on my home computer. You can always override which stage you want to deploy to by passing the `--stage <stage-name>` flag to the CLI. By default, sst will also load any configuration variables you set in a corresponding `.env.<stage-name>` file.
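For example (the stage name `production` here is just an illustration):

```shell
# Deploys to a stage named after your OS username,
# loading .env.<your-os-username> if it exists
sst deploy

# Deploys to an explicit stage, loading .env.production if it exists
sst deploy --stage production
```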
- Copy `.env.example` to `.env.<stage-name>` (e.g. `<your-os-username>` or `production`)
- Update `ROOT_DOMAIN` (or delete it if the engine will not be accessible over the internet)
- Update any other configuration variables which might be relevant (e.g. cpu/mem size)
| EnvVar | Type | Description |
|---|---|---|
| `ROOT_DOMAIN` | `undefined` or valid domain in Route53 | The root domain which will be used for making Hatchet accessible. The dashboard will be available at `hatchet-<stage-name>.<root-domain>`, e.g. `hatchet-production.acmelab.com`. If omitted or `false`, the engine will only be accessible inside the same VPC. |
| `DB_STORAGE` | `[number]` GB | Size of the Postgres database storage. |
| `DB_INSTANCE_TYPE` | supported instances | What type of AWS instance to use for the Aurora Postgres database. nb: omit the `db.` prefix from the instance type name. |
| `BROKER_INSTANCE_TYPE` | supported instances | What type of AWS instance to use for the AmazonMQ RabbitMQ broker. nb: do NOT omit the `mq.` prefix from the instance type name. |
| `ENGINE_CPU` | supported vCPU count | How many vCPUs the Hatchet engine service should use. nb: the combination of cpu/mem must be valid. |
| `ENGINE_MEMORY` | supported memory amount | How much memory the Hatchet engine service should use. nb: the combination of cpu/mem must be valid. |
| `ENGINE_PRIVATE_SUBNET` | boolean | Whether or not to deploy the engine inside a private subnet. nb: if `true`, additional monthly costs will be incurred because either a NAT Gateway or PrivateLink VPC Endpoints will be added in order to pull containers from ECR. |
| `NAT_GATEWAY` | boolean | Whether to add a NAT Gateway to the VPC. If `false` and `ENGINE_PRIVATE_SUBNET=true`, then PrivateLink VPC Endpoints will be added so that containers can still be pulled. |
| `BASTION_ENABLED` | boolean | Whether to add a Bastion instance in your VPC which gives you remote access/tunneling capabilities. |
| `OVERWRITE_CONFIG` | boolean | Whether to regenerate the base Hatchet config before redeploying the engine. |
nb: the default instance sizes in .env.example are relatively large and sized for decent throughput. See the cost estimate at the end of the document. If you want to start cheaper, consider dropping down to something with 1 vCPU for the broker, 2 vCPU for the DB, and 2 vCPU for the engine.
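For concreteness, here is an illustrative `.env.production`. The variable names come from the table above, but the value formats are my assumptions - defer to `.env.example` for the authoritative formats:

```shell
# Illustrative .env.production - values are examples, not recommendations;
# check .env.example for the exact expected formats.
ROOT_DOMAIN=acmelab.com
DB_STORAGE=100                     # GB
DB_INSTANCE_TYPE=r6g.xlarge        # nb: no "db." prefix
BROKER_INSTANCE_TYPE=mq.m7g.large  # nb: "mq." prefix required
ENGINE_CPU=4                       # cpu/mem combination must be valid for Fargate
ENGINE_MEMORY=8192                 # units are a guess - confirm against .env.example
ENGINE_PRIVATE_SUBNET=false
NAT_GATEWAY=false
BASTION_ENABLED=false
OVERWRITE_CONFIG=false
```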
TODO: considerations when deploying workers in a private subnet
- `sst secret set DatabasePassword <your-password> --stage <stage-name>` (nb: must be 12+ chars)
- `sst secret set BrokerPassword <your-password> --stage <stage-name>` (nb: must be 12+ chars)
- `sst secret set AdminPassword <your-password> --stage <stage-name>` (nb: must be 12+ chars and contain an uppercase letter, a lowercase letter, and a number)
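If you don't want to invent passwords by hand, something like the following works (assuming `openssl` is available; note that `AdminPassword` additionally requires an uppercase letter, a lowercase letter, and a number - a 24-character base64 string almost always contains all three, but eyeball it before using it):

```shell
# 18 random bytes encode to exactly 24 base64 characters,
# comfortably above the 12-character minimum.
DB_PASS=$(openssl rand -base64 18)
echo "${#DB_PASS}"  # prints 24

# Then hand it to sst, e.g.:
# sst secret set DatabasePassword "$DB_PASS" --stage <stage-name>
```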
- `sst deploy --stage <stage-name>`
- Visit `hatchet-<your-stage-name>.<your-root-domain>`, e.g. `hatchet-production.acmelab.com`, and log into the default admin tenant with `hatchet@<your-root-domain>` and the specified password.
If you have set `ROOT_DOMAIN=your-domain.com`, a load balancer is automatically configured, and the Hatchet engine tells workers (via fields encoded in the JWT API token) to send HTTP(S) and gRPC traffic to `hatchet-<your-stage-name>.<root-domain>` and `hatchet-<your-stage-name>.<root-domain>:8443` respectively. These resolve to the load balancer, which then routes traffic to the appropriate containers.
There's a good chance you might be spinning up thousands of worker nodes, in which case you probably want to skip the load balancer altogether, which you can do by deploying the worker nodes in the same VPC as the engine (TODO: auto-deploy docs coming soon) and using the cloudmap namespace domains.
However, because the client JWTs you generate still have the load balancer URLs encoded in the relevant fields, you need to override some environment variables when deploying the worker.
In addition to setting `HATCHET_CLIENT_TOKEN`, you will also need to set:

```shell
HATCHET_CLIENT_SERVER_URL=http://Engine.<your-stage-name>.hatchet.sst
HATCHET_CLIENT_HOST_PORT=Engine.<your-stage-name>.hatchet.sst:7070
HATCHET_CLIENT_TLS_STRATEGY=none
```
You can find the relevant URLs in the results of `sst deploy` under `EngineAddresses` in `internalServerUrl` and `internalGrpcBroadcastAddress`.
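Putting that together, launching a worker container in the same VPC might look something like this sketch (the image name, stage name, and token are all placeholders, and I'm assuming your worker reads its Hatchet client config from environment variables):

```shell
# Sketch: run a worker against the internal cloudmap addresses.
# <token-from-dashboard>, <your-stage-name>, and your-worker-image are placeholders.
docker run \
  -e HATCHET_CLIENT_TOKEN="<token-from-dashboard>" \
  -e HATCHET_CLIENT_SERVER_URL="http://Engine.<your-stage-name>.hatchet.sst" \
  -e HATCHET_CLIENT_HOST_PORT="Engine.<your-stage-name>.hatchet.sst:7070" \
  -e HATCHET_CLIENT_TLS_STRATEGY="none" \
  your-worker-image:latest
```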
If you need to deploy without ingress from the internet, simply omit the `ROOT_DOMAIN` env var or set it to `false`. This will cause the deployment to skip configuring a load balancer for the Hatchet service. However, it also means that you will not be able to connect local workers to Hatchet or check the dashboard from your machine, at least not without some networking-fu. By default, the service will still be deployed in the public subnets of your VPC, but there will be no ingress pathway from your local machine to the service.
Fortunately, sst makes it relatively easy to get connected to the VPC.
nb: your choice of private/public subnets for the engine containers is irrelevant here, since the tunnel we establish in the VPC will already have ingress rules which allow traffic to reach the engine.
nb: if you are on windows, you will need to use WSL for this part
- First, you will need to set `BASTION_ENABLED=true` and redeploy (`sst deploy --stage <your-stage-name>`). Copy the Bastion Instance ID (something like `i-asdf1348`) to your clipboard for use later.
- Install tunneling via `sudo sst tunnel install` if you have not already.
- Open up a tunnel with `sst tunnel --stage <your-stage-name>`.
- Open Firefox, then open `Settings > Network Settings > Settings`.
- Select `Manual proxy configuration`.
- Configure the `SOCKS Proxy` host field as `localhost` and the port field as `1080`.
- Make sure that `SOCKS v5` is selected.
- Click `OK` to save settings.
- Open a shell on your Bastion instance: `aws ssm start-session --target <Bastion-instance-id>`.
- Run `dig +short engine.<your-stage-name>.hatchet.sst` to print out the IP address of the engine service within the VPC (you can also check this from the AWS console).
- Open your `hosts` file in a text editor (on Mac/Linux, this is at `/etc/hosts`; on Windows it's at `C:/Windows/System32/drivers/etc/hosts`) and add a record at the end which says `<ip-address> Engine.<your-stage-name>.hatchet.sst`, e.g. `10.0.10.136 Engine.szvsw.hatchet.sst`. This will tell your computer to route the url to the IP address, while the proxy we configured in Firefox will tell your computer to route the IP address through the tunnel into the VPC.
- You can now access the dashboard via the internal cloudmap namespace server url, which should be something like `Engine.<your-stage-name>.hatchet.sst`.
- The default log-in email will be `hatchet@example.com` with your specified password from `sst secret`.
nb: though it's not particularly problematic to leave it there, it's probably a good idea to remove the record you added to your `hosts` file as well as the proxy settings in Firefox when you are done, lest you confuse yourself in the future.
You can remotely access your Bastion instance by running:

```shell
aws ssm start-session --target <Bastion-instance-id>
```
You will of course need to deploy your workers in the same VPC. By default, a client token generated from the dashboard following the instructions above should work fine - it will use the internal cloudmap namespace correctly. However, you will need to set one additional env var on the worker:

```shell
HATCHET_CLIENT_TLS_STRATEGY=none
```
TODO: example of worker deployment
The cost estimate presented here is sized for moderately high throughput and DOES NOT include your worker node compute costs - just the engine, database, queue, etc.
- Aurora/RDS: r6g.xlarge, $0.2016/hr
- MQ: m7g.large, $0.0816/hr
- Fargate: 4vCPU/8GB, $0.19/hr
- ALB: ~$30/mo (depends on if Workers connect thru ALB or within VPC)
- NAT (optional), 2 AZs, ~$65/mo
- Not included: some negligible ECR costs, Domain registration cost (e.g. $50/yr)
About $13/day or $370/month without a NAT, or about $430/month with a NAT or PrivateLink VPC Endpoints.
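As a sanity check, the monthly figure follows directly from the hourly line items above (using 730 hours/month and adding the ~$30/mo ALB; NAT excluded):

```shell
# RDS + MQ + Fargate hourly rates from the list above, times 730 hrs/month, plus ALB
awk 'BEGIN {
  hourly = 0.2016 + 0.0816 + 0.19
  printf "$%.0f/month\n", hourly * 730 + 30
}'
# prints "$375/month" - the same ballpark as the ~$370 figure above
```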
Note that the managed Hatchet pricing for the Growth plan is currently $425/month, but it includes $100/month in worker node compute credits, meaning the effective infrastructure price is $325/month, which already beats this. Given that you can get set up with managed Hatchet Cloud in literally seconds AND you can very easily auto-deploy worker nodes via managed compute with auto-scaling and CI/CD already configured, I would say that pricing seems very attractive versus self-hosting for an actual persistent application (as compared to my typical use case, where I can just stand up and tear down the whole stack since I only need it once or twice a month).
Of course you can tune those instance sizes to your needs (and maybe even use spot capacity for the engine, though that seems risky), skip the load balancer entirely, and so on, so you might see costs anywhere in the $100-300/month range depending on your settings - but then you might be competing with the managed Hatchet Starter Plan @ $180/mo.
To me this suggests that you probably need a pretty strong argument to go the self-hosting route, which is probably just that you actually need to own your infra for one business/dev reason or another.
TODO: document credentials, Hatchet login/token generation, using pgAdmin through the tunnel, etc.