Timur Galeev Blog

Building vibestack: how I stopped re-explaining myself to my AI

Timur Galeev — Wed, 29 Apr 2026 14:21:44 GMT

A small confession, before anything else

The thing that pushed me to build vibestack was not a strategy meeting. It was a Tuesday evening, and I was tired.

I had spent maybe forty minutes in a Claude Code session walking through a Terraform module - the kind of slow, careful walk where you read the file, then the parent module, then the variable file, then the locals, then a mental diff against what production looks like in the AWS console. After all that, I asked for a small fix. Two lines. And the model, very politely, helped me. And then "improved" three other things I had not asked about.

I closed the laptop. I made tea. I sat down again and looked at the diff. Honestly, the changes were fine. They might even have been good ideas. But they were not what I asked for, and now I had to think about each of them, decide if I trusted them, run tests, and so on. By the time I was done, the small fix had become a thirty-minute review.

That night I did not write code. I wrote a list. The list said: the next time I sit down with this thing, what would I want it to remember about how I work?

That list became vibestack.

If you have ever finished a session with an AI and felt vaguely unhappy without being able to say why, the next few pages may be familiar.

Why the personal layer matters

The conversation about AI coding tools spends a lot of time on which model to pick, which IDE, which framework. It spends very little time on the layer above all of that - the small, specific, slightly opinionated set of conventions that turns "an AI in your terminal" into "an AI that fits the way this particular work gets done."

That layer is the part of the stack most people skip. It is also the part that makes the difference between collaborating with a useful colleague and shouting instructions at someone who doesn't quite get it. The model is a commodity. The IDE is a commodity. The personal layer is the bit that's actually yours.

vibestack is one shape that layer can take. Forty-four small slash commands, a handful of bash hooks, an install script, and a flat state directory in ~/.vibestack/. The rest of this article walks through what's in there, why each piece exists, and what it has to do with where the industry is heading in 2026.

What vibestack actually is (in one sentence, then several)

vibestack is a personal pack of 44 specialised workflows for Claude Code, exposed as slash commands. That's the elevator version.

The slightly longer version: each workflow is a folder under skills/ with a single SKILL.md file inside it. The file has a small YAML header (the name, what it does, which tools it's allowed to touch, the trigger phrases) and then a body written in plain English. No bash, no DSL - just a careful set of instructions to a smart colleague. Claude Code discovers these files automatically and lets me invoke them as /review, /ship, /investigate, /cso, /freeze, and so on.
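To make that shape concrete, here is a minimal Python sketch that splits a SKILL.md into its header and its plain-English body. The sample skill, the field names (`name`, `description`, `allowed-tools`) and the tiny `key: value` parser are illustrative assumptions, not copied from the vibestack repo; a real loader would use a proper YAML library.

```python
# Hypothetical sample skill; not taken from the vibestack repository.
SAMPLE_SKILL = """\
---
name: review
description: Review the current diff for bugs and risky changes.
allowed-tools: Read, Grep, Bash
---
Read the diff carefully, the way a senior colleague would.
Flag anything that changes behaviour without a test.
"""


def split_skill(text):
    """Return (header_dict, body) for a SKILL.md-style document.

    Only flat `key: value` header lines are handled; nested YAML is out
    of scope for this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no header at all: everything is body
    # Find the line that closes the header block.
    end = next((i for i, line in enumerate(lines[1:], 1)
                if line.strip() == "---"), None)
    if end is None:
        return {}, text  # unterminated header: treat it all as body
    header = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return header, body


header, body = split_skill(SAMPLE_SKILL)
print(header["name"], "->", header["allowed-tools"])
```

The point of the split is that the header is for the tool and the body is for the model: one is metadata, the other is a briefing.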

That's it, really. The whole repository is around 50 small markdown files, an install script, and a handful of bash hook scripts. There is no framework. There is no SDK. If you delete vibestack tomorrow, Claude Code still works - you just lose 44 small habits I've taught it.

Here are the rough buckets of what's in there:

Planning and product - /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /plan-devex-review, /autoplan, /plan-tune. These are the ones I reach for when I want a structured second pair of eyes on something before I write code.
Code quality and shipping - /review, /ship, /investigate, /cso, /pr-summary. The "is this actually good and safe to merge" ones.
QA and design - /qa, /qa-only, /canary, /land-and-deploy, /design-review, /design-html, /design-shotgun, /design-consultation. I spent fifteen years doing infra. Then I built one user-facing product and realised how bad I was at telling typography apart from a polite mess. These are my crutches.
Operations and learning - /retro, /learn, /health, /benchmark, /document-release. These keep me honest week to week.
Safety - /careful, /freeze, /unfreeze, /guard. We'll get to these.
Session and context - /context-save, /context-restore, /setup-memory. The "stop forgetting what we agreed on" ones.
Meta-tooling - /codex, /claude, /benchmark-models, /browse, /open-browser, /pair-agent, /make-pdf, /setup-deploy, and a couple of sillier things.

If that list looks long: yes. It's long because it grew organically over a few months of "I keep doing this thing - let me make it a command." It is not a curriculum. It's a habit pile.

Why I built it instead of using something off-the-shelf

The honest answer is that I tried. I read the awesome-claude-code lists. I copied skills from a few public packs. They were great - and they were not me.

There's a particular kind of friction that hits when your tools are someone else's habits in disguise. A skill that's almost right is sometimes worse than no skill at all, because you don't notice the gap until the wrong thing has already happened. A "review" command that doesn't check the things I actually care about gives me a green light I shouldn't have. A "ship" command that uses a versioning convention I don't follow drags me into manual cleanup.

So I started writing my own. The rule I gave myself early on, written down in a file called ETHOS.md in the repo: if I won't reach for this command at least once a week, it doesn't belong here.

That is also the rule I'd give anyone else thinking of doing this. Don't build a skill pack. Build the five commands you actually use, and let the rest emerge.

The five principles, the way I'd say them out loud

I have these in ETHOS.md and they sound a little serious there. Here they are translated into how I'd say them to a colleague:

1. Write to the model like it's a smart human, not a regex engine. A skill is not a script. If your skill body is full of bash, your skill is wrong - that bash should be in a hook. The body should read like a clear briefing.

2. Search before you build. Half the "new skill" ideas I get are actually existing skills I forgot about, plus a different trigger phrase. Adding more files multiplies confusion. Adding fewer files multiplies clarity.

3. The user is in charge. Always. I don't want a skill that decides for me. I want a skill that surfaces a consequence and lets me decide. /careful warns. /freeze enforces what I told it to enforce. Nothing in vibestack overrides me silently.

4. Hooks are powerful - be quiet with them. A hook intercepts every matching tool call. That is a real footprint. If a skill body would do the job, don't reach for a hook. And if you do, the hook must fail open: if the script itself crashes, the tool call goes through. Crash → allow. Never crash → block.

5. Build what you actually use. Skills written speculatively rot. They never get invoked, the trigger phrases drift, and one day you read your own SKILL.md and don't recognise it. Better to delete a skill than to keep one you don't use.

That's it. Five rules. They sound obvious. They're not, until you've built the wrong skill twice.

The part where hooks earn their keep

Let me show you what a hook actually does, because this is the part most people don't see.

Claude Code lets a skill register a hook on certain events. The one I use most is PreToolUse - it fires before the model is allowed to run a tool like Bash, Edit, or Write. The hook is a small script that reads JSON on stdin (the proposed tool call) and writes JSON on stdout (a decision). Three possible decisions:

{} - fine, let it through.
{"permissionDecision":"ask","message":"..."} - pause, surface this to me, let me approve or refuse.
{"permissionDecision":"deny","message":"..."} - block, don't even ask.

That sounds like nothing. It is the whole game.
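As a rough illustration of that stdin-to-stdout contract, here is a minimal Python sketch of a hook. The decision shape mirrors the simplified one described above; the input field names (`tool_name`, `tool_input`, `command`) and the sudo policy are assumptions for the example, and the exact schema Claude Code expects may nest these fields differently, so check the hooks documentation before copying it.

```python
import json


def decide(tool_call):
    """Return a decision dict for one proposed tool call.

    Placeholder policy for illustration only: ask before any Bash
    command that uses sudo, allow everything else.
    """
    if tool_call.get("tool_name") == "Bash":
        command = tool_call.get("tool_input", {}).get("command", "")
        if "sudo" in command.split():
            return {"permissionDecision": "ask",
                    "message": "Command uses sudo; please confirm."}
    return {}


def run_hook(stdin_text):
    """Parse the hook input and return the JSON string to print."""
    try:
        return json.dumps(decide(json.loads(stdin_text)))
    except Exception:
        # Fail open: a broken hook must never block the session.
        return "{}"


# A real hook script would end with: print(run_hook(sys.stdin.read()))
sample = '{"tool_name": "Bash", "tool_input": {"command": "sudo reboot"}}'
print(run_hook(sample))
```

Everything that can go wrong (bad JSON, a missing field, a bug in the policy) lands in the same `except` branch and turns into `{}`, which is the "crash → allow" rule from the principles above.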

Two examples from vibestack.

/careful registers a Bash hook that scans the proposed command. If it matches rm -rf, DROP TABLE, git push --force, kubectl delete, git reset --hard, or one of a small list of similar things, it returns ask with a short explanation. I get a chance to look at the thing before I lose the thing. The hook script is around forty lines of bash, mostly safe-listing harmless cases like rm -rf node_modules or dist/ so it doesn't cry wolf.
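The real script is bash; the same idea can be sketched in Python as below. The pattern list is taken from the paragraph above, while the exact regexes and the safe-list rule are assumptions made for the example, not the ones vibestack ships.

```python
import re

# Risky patterns named in the text; a real hook would carry a longer list.
RISKY = [
    r"\brm\s+-rf\b",
    r"\bDROP\s+TABLE\b",
    r"\bgit\s+push\s+--force\b",
    r"\bkubectl\s+delete\b",
    r"\bgit\s+reset\s+--hard\b",
]

# Assumed safe-list: a lone `rm -rf` of node_modules or dist is harmless.
SAFE_RM = re.compile(r"^\s*rm\s+-rf\s+(\./)?(node_modules|dist)/?\s*$")


def check_command(command):
    """Return {} to allow, or an 'ask' decision for a risky command."""
    if SAFE_RM.match(command):
        return {}
    for pattern in RISKY:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return {"permissionDecision": "ask",
                    "message": f"Potentially destructive command: {command}"}
    return {}


print(check_command("rm -rf node_modules"))
print(check_command("rm -rf node_modules && rm -rf ~"))
```

The safe-list is anchored to the whole command on purpose: a harmless prefix followed by `&&` and something destructive should still trigger the prompt.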

/freeze is more ambitious. When I run it, I tell Claude "only edit files inside src/api/auth/ for the rest of this session." It writes that path to ~/.vibestack/freeze-dir.txt. From then on, every Edit and Write runs through check-freeze.sh, which compares the proposed file path against the boundary. Outside? Deny. Inside? Allow. The state file is plain text. You can cat it, rm it, edit it. Nothing magic.

Here's a small story about that script that taught me a lesson.

The first version of check-freeze.sh resolved symlinks for the file being edited but not for the boundary itself. That's fine on Linux. On macOS, /tmp is a symlink to /private/tmp. If you froze edits to /tmp/something, then asked to edit /tmp/something/foo.txt, the file path got resolved to /private/tmp/something/foo.txt, which did not start with the boundary /tmp/something/, and the hook denied your own edits. To your own freeze. Inside the directory you said was OK.

The fix is the kind I love: a five-line refactor (resolve both sides) and a one-paragraph commit message. It shipped in v1.1.0. And the lesson - apply your transformation to both sides of a comparison, always - is now living rent-free in my head.
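The "resolve both sides" rule is easy to show in a few lines. This is a Python sketch of the comparison, not the actual check-freeze.sh; the function names are made up for the example.

```python
import os


def is_inside(boundary, target):
    """True if target lies inside boundary once BOTH paths are resolved."""
    real_boundary = os.path.realpath(boundary)
    real_target = os.path.realpath(target)
    # commonpath compares whole path components, so /tmp/api-evil is
    # not mistaken for a child of /tmp/api the way a string prefix
    # check would.
    return os.path.commonpath([real_boundary, real_target]) == real_boundary


def check_edit(boundary, target):
    """Return {} to allow the edit, or a 'deny' decision outside the boundary."""
    if is_inside(boundary, target):
        return {}
    return {"permissionDecision": "deny",
            "message": f"{target} is outside the frozen directory {boundary}"}


print(check_edit("/tmp/something", "/tmp/something/foo.txt"))
```

With both sides passed through `realpath`, the macOS case from the story resolves to /private/tmp/something on each side and the edit is allowed.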

The other small fix has the same flavour. The /careful script used \s in a sed regex. macOS BSD sed does not support \s. The fix: use [[:space:]] and anchor with ^. POSIX-portable. Works everywhere. Ten characters of change, hours of "but it works on my colleague's machine" avoided.

These are not exciting fixes. They are the kind of fixes that mean you can rely on the thing.

The other kind of skill: thinking partners

Not every vibestack skill is a hook. Some don't run any bash at all. They're pure markdown - a few thousand words of carefully tuned prose that turn the conversation itself into the tool. Two of them get reached for more than anything else, and they deserve their own section because they do something the rest of vibestack doesn't.

`/office-hours` - the skill to run before writing a single line

/office-hours opens with one question - what's your goal with this? - and based on the answer it routes into one of two modes.

Startup mode is the hard one. It asks six "forcing questions" designed to expose whether the thing about to be built is real or imaginary:

What's the strongest evidence someone actually wants this - not "is interested," not "signed up for a waitlist," but would be genuinely upset if it disappeared tomorrow?
What are users doing right now to solve this - even badly? What does that workaround cost them?
Name the actual human who needs this. Not a category. A name, a role, a consequence they face if the problem isn't solved.
What's the smallest possible version someone would pay real money for this week - not after the platform is built?
Have you sat down and watched someone use this without helping them? What did they do that surprised you?
If the world looks meaningfully different in three years - and it will - does this become more essential or less?

The skill is direct to the point of discomfort. It refuses to accept polished first answers - it pushes once, then pushes again. It will not let "everyone needs this" pass. It has an explicit anti-pattern list - "interest is not demand," "growth rate is not a vision," "surveys lie, demos are theater" - and it will name the failure mode out loud the moment it spots one. Reading the prompt that drives this skill feels like reading a senior product manager's notebook from after a bad week.

Builder mode is the gentler sibling - same questioning structure, but tuned for side projects, hackathons, learning, open source. The currency there is delight, not demand. What's the coolest version of this? Who would you show this to that would say "whoa"? What would the 10× version look like if there were no time limits?

Both modes produce the same artifact: a markdown design doc, written automatically to ~/.vibestack/projects//. Problem statement, demand evidence (or "what makes this cool"), the premises that have been agreed to, two or three alternative approaches, the recommended one, and one concrete next-step assignment. No code. Not even scaffolding. The skill has a hard gate against starting implementation - its only output is the document.

That document then becomes the input to the next skill on this list.

`/plan-ceo-review` - the dispassionate reread

/plan-ceo-review picks up where /office-hours leaves off. It reads the design doc automatically (or works without one if there isn't one) and reviews the plan in what it calls founder mode - the posture of someone who is not there to rubber-stamp anything.

The skill asks for a mode up front, and there are four:

Scope expansion - dream bigger. What would make this 10× better for 2× the effort? Push scope up, present every expansion as an opt-in.
Selective expansion - hold the line, but cherry-pick wins where they're cheap.
Hold scope - no drift in either direction. Just maximum rigor on what's already there.
Scope reduction - find the minimum viable cut and ship it.

Once a mode is chosen, the skill commits to it. No silent drift halfway through. That single rule is more useful than it sounds - it stops the review from quietly becoming a different review when the conversation gets long.

The body of the review is structured around nine prime directives that read like a grumpy senior engineer's checklist:

Zero silent failures. Every error has a name. Data flows have shadow paths. Interactions have edge cases. Observability is scope, not afterthought. Diagrams are mandatory. Everything deferred must be written down. Optimise for the six-month future. You have permission to say "scrap it and do this instead."

Behind those is a deeper layer - eighteen cognitive patterns borrowed from how strong founders think. Bezos's one-way vs. two-way doors. Munger's inversion reflex (for every "how do we win?" also ask "what would make us fail?"). Jobs's subtraction default. Grove's paranoid scanning. None of those are checklist items. They are lenses for reading the plan.

What comes back is the part of the work that is hardest to do for yourself: the dispassionate reread of your own plan, with the quiet failure modes you missed marked in red.

Why these two together

/office-hours and /plan-ceo-review are the part of vibestack that has changed actual output the most. Not because they make code faster - they don't make code at all. They make the right thing get built on the first attempt more often, and that is a much larger lever than any amount of generation speed.

A diff that ships in two days but solves the wrong problem still solves the wrong problem. The most expensive code is the code that gets thrown away after a quarter because nobody asked the six questions before it was written. These two skills are an attempt to keep that from happening.

If only two ideas from this whole article are worth taking, take those.

How vibestack installs itself, and why I'm proud of it

This is going to sound small, but I want to dwell on it because it tells you something about how the whole thing is designed.

The install script does one thing. It walks skills/, finds every SKILL.md, and creates a symlink in ~/.claude/skills//. Not a copy. A symlink. The canonical source stays in the repo. If I git pull && ./install on Monday morning, every change is immediately live - no rebuild, no sync, no cache to bust.
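The actual installer is a shell script; the walk-and-link idea looks roughly like this in Python. The sketch assumes each skill folder is linked as a whole and that a real file or directory already sitting at the destination should be left alone. Both are guesses about sensible behaviour, not a description of the vibestack script.

```python
from pathlib import Path


def install_skills(repo_skills, target_dir):
    """Symlink every folder under repo_skills that holds a SKILL.md.

    Returns the skill names that were linked. Existing symlinks are
    replaced; real files or directories are skipped, never clobbered.
    """
    repo_skills = Path(repo_skills).resolve()
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for skill_file in sorted(repo_skills.glob("*/SKILL.md")):
        source = skill_file.parent
        link = target_dir / source.name
        if link.is_symlink():
            link.unlink()  # refresh a link left by an earlier run
        elif link.exists():
            continue  # something real is in the way: leave it alone
        link.symlink_to(source, target_is_directory=True)
        linked.append(source.name)
    return linked
```

Calling it twice in a row gives the same result, which is what makes re-running the installer after every pull a non-event.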

That decision is not technically clever. It is organisationally clever. It means there is exactly one place where my skills live, exactly one history of how they changed, and exactly zero "I edited the installed copy and lost it on the next pull" moments. I have lost too many afternoons to that pattern in other tools to want to repeat it here.

Hook scripts are also symlinked. State lives in ~/.vibestack/, a flat directory of .txt and .jsonl files I can grep. Nothing about this setup will surprise you in five years. Nothing requires explanation.

Here is the install philosophy in one sentence: the source of truth is the git repo; everything else is a pointer.

The sibling: vibekit

While vibestack lives in ~/.claude/skills/vibestack/, there's a quieter sibling repo over at github.com/timurgaleev/vibekit. It does a different job, and I want to talk about it because the relationship between the two is the actual story.

vibestack is workflows - the slash commands. vibe-config is settings - the always-loaded shape of how Claude Code, Cursor, and Kiro behave when I open a session. Three subfolders, three targets:

vibe-config/claude/  →  ~/.claude/
vibe-config/kiro/    →  ~/.kiro/
vibe-config/cursor/  →  ~/.cursor/

Inside claude/ you'll find the things every session of mine starts with: a CLAUDE.md with my coding philosophy, a rules/ folder with files like language.md, security.md, tests.md, git.md, obsidian.md. There are sub-agent definitions (a planner, a builder, a debugger, a quality reviewer). There's a statusline.py that renders my context window, model, cost, and token usage in the bottom bar. And there's a hooks/vibenotif.py which broadcasts the session state - thinking, working, waiting, done - to a small Electron app and, optionally, to a tiny ESP32 device with an LCD screen that sits on my desk and tells me, in colour, whether the agent needs me.

If that last bit sounds silly, fair enough. It also turns out to be useful. You get up, you make coffee, you glance at the desk on the way back, and you know without opening the laptop whether the agent is grinding or waiting on an answer. A two-dollar screen, doing one thing, doing it well.

The split between vibestack and vibe-config matters. vibestack is the active layer - commands I invoke. vibe-config is the passive layer - guidelines that are always in scope. Mixing them was tempting at first. Keeping them separate has paid off every time I've had to update one without touching the other.

One install command handles vibe-config:

./install.sh        # sync everything
./install.sh -n     # dry-run, show me what would change

It uses MD5 hashes to diff before writing, so re-running is cheap and idempotent. The same script knows how to disable VibeNotif, how to merge Cursor's cli-config.json without overwriting my personal model preferences, and how to warn me when Cursor's settings.json has drifted on disk. It is the kind of script you only write after the fifth time you have hand-fixed something it should have automated.
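The hash-before-write step is what makes the sync idempotent. A minimal Python sketch of that one step follows; the function names and return values are invented for the example, and the real install.sh does considerably more (merging, drift warnings) than this shows.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path):
    """Hex MD5 of a file's bytes; used as a change detector, not for security."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def sync_file(source, dest, dry_run=False):
    """Copy source to dest only when the content differs.

    Returns 'unchanged', 'would-update' (dry run) or 'updated'.
    """
    source, dest = Path(source), Path(dest)
    if dest.is_file() and md5_of(source) == md5_of(dest):
        return "unchanged"
    if dry_run:
        return "would-update"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, dest)
    return "updated"
```

Because the comparison happens before any write, a dry run and a real run share the same code path right up to the copy, which keeps the `-n` output honest.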

Why this matters in 2026, and not in some abstract way

Now the part where I look up from the keyboard.

If you read Anthropic's 2026 Agentic Coding Trends Report - and I think you should, even if you've already read three takes on it - there is one phrase that keeps coming back: context engineering is the load-bearing skill of 2026.

That sentence is doing a lot of work. Translated into something I'd say to a junior engineer over lunch: the bottleneck has moved. It used to be that the model was the bottleneck. The model couldn't write the function, so we wrote it, and the model autocompleted. Now the model can write the function. What it can't do - at least not reliably - is figure out which function you want, in which file, with which conventions, against which constraints, by Thursday. That work has to come from somewhere. Increasingly, it comes from the way you set up the session.

The numbers in the report are striking. Projects with well-maintained context files saw something like 40% fewer agent errors and 55% faster task completion. MCP - Model Context Protocol, Anthropic's spec for connecting tools to models - crossed 97 million installs in March. Skills (SKILL.md files following the universal format) now work across Claude Code, Cursor, Gemini CLI, Codex CLI, and more. Anthropic Academy is running 17 courses and people are showing up for them.

This is what people mean when they say agentic engineering is no longer experimental. The wires have set. The patterns we use today - skills, hooks, MCP, sub-agents, status hooks - are going to be the boring infrastructure of the next decade.

And in that picture, here is the thing I keep thinking about: the most valuable layer is the personal one.

Not because individual taste matters more than team standards. It doesn't. But because the team standards have to be embodied somewhere, and the only place they actually run is on your machine, in your session, against your habits. A team can publish a CLAUDE.md. The CLAUDE.md does nothing until it's loaded. It loads when you set it up. The personal layer is the surface where every other layer lands.

vibestack is mine. It's mine the way my keyboard layout is mine. If you take it as-is, you'll get most of the value of the structure and miss the value of the customisation. The interesting move is not "install vibestack." The interesting move is "fork vibestack, delete half of it, and write three skills that reflect how you actually work."

That's the bit that the trend reports keep almost saying and not quite saying out loud. So I'll say it: start a personal skills pack. It can be five files. It will save you a thousand small re-explanations.

What I'm doing next, and what I'd do differently

A few things are on my list.

I want a /morning-briefing skill that reads my git logs, my Linear tickets (when I'm allowed), and my Obsidian inbox, and gives me a one-page "here is what is on fire" report at 8:30am. Right now I do this by hand. It takes maybe ten minutes. I'd rather it took thirty seconds.

I want to push more learnings into ~/.vibestack/projects//learnings.jsonl. The /learn skill already exists, but I haven't been disciplined about it. I'm hoping that as the file fills up, the next conversation about the same project gets sharper. If that doesn't happen, the skill is wrong and I'll redesign it.

I want to write a smaller, opinionated skill template - not a framework, definitely not an SDK - that I can hand to teammates who say "I'd love to set this up but I don't know where to start." Three commands. Maybe four. The minimum viable habit.

If I were starting again from scratch, I would do two things differently. I would write the install script first, before any skill, because almost every regression I hit was an install-time issue. And I would write the hook conventions document before writing any hooks, because I learned the macOS BSD sed lesson the painful way.

These are small regrets. The thing largely works. I use it every day. It saves me time I don't have.

Wrapping up

If there's one thing I'd want you to take away from all of this, it's that the most useful tooling around AI right now isn't the model itself, or the IDE, or even the framework someone wrote a viral blog post about last week. It's the small, specific, slightly opinionated layer you build on top of the rest - the one that knows how you work.

You don't need 44 commands. I didn't start with 44. I started with three. Whatever you build, build it because you keep typing the same thing into the chat window and you're tired of it.

Both repos are MIT-licensed and live on GitHub:

github.com/timurgaleev/vibestack - the slash commands
github.com/timurgaleev/vibekit - the configuration sibling

If something here was useful, or if you've built your own version and want to compare notes, I'd genuinely like to hear about it.

Thanks for reading.

Sources and further reading

Anthropic, 2026 Agentic Coding Trends Report - resources.anthropic.com
Anthropic, Extend Claude with skills - code.claude.com/docs/en/skills
Claude Code: Hooks, Subagents, and Skills - Complete Guide, March 2026 - ofox.ai/blog
A Mental Model for Claude Code: Skills, Subagents, and Plugins, Level Up Coding, March 2026 - levelup.gitconnected.com
Awesome Claude Code, community-curated list - github.com/hesreallyhim/awesome-claude-code
Pento, A Year of MCP: From Internal Experiment to Industry Standard - pento.ai/blog

Working with AWS European Sovereign Cloud (ESC): Terraform, IaC, and what's different

Timur Galeev — Wed, 28 Jan 2026 09:30:00 GMT

If you manage AWS infrastructure with code, the European Sovereign Cloud adds a new partition to think about. Different endpoints, separate IAM, its own console. This guide covers what works out of the box, what needs changes, and the patterns that help when you deploy across both ESC and commercial AWS.

Why This Exists

AWS has had European regions since 2007. Ireland came first, then Frankfurt, London, Paris, Stockholm, Milan, Zurich, Spain. Eight regions across Europe. Data stays in Europe. GDPR compliant. Problem solved, right?

Not quite.

Here's the thing about eu-central-1 (Frankfurt) — your data sits in Germany, sure. But AWS operations? Support tickets? Billing metadata? That stuff flows through global systems. American employees can access it. The control plane lives in the US. When you call support at 3am, someone in Seattle might answer.

For plenty of companies, that's fine. You're running a SaaS product, your customers don't care where the ops team sits. But for German government agencies processing citizen data? French hospitals handling patient records? Banks under BaFin scrutiny? They've been asking harder questions.

The US CLOUD Act made it worse. Passed in 2018, it lets American authorities compel US companies to hand over data, even if that data sits on servers in Frankfurt. It doesn't matter where the bits are physically stored: if an American company controls them, American courts can demand them. AWS has always pushed back on these requests, but "trust us, we'll fight it" isn't the same as "technically impossible."

Then came Schrems II in 2020, when the EU Court of Justice invalidated Privacy Shield. Suddenly every European company using American cloud providers had to justify why their data transfers were legal. Standard contractual clauses helped, but the legal uncertainty never fully went away.

That's the gap ESC fills. Not just "data in Europe" but "everything in Europe" — operations, support, billing, leadership, legal jurisdiction.

What's Actually Different

The European Sovereign Cloud is a separate partition entirely. Not a region — a partition. Like how AWS GovCloud is separate from commercial AWS, or how China regions are isolated. Different domain (amazonaws.eu instead of amazonaws.com), different IAM system, different control plane.

The region code is eusc-de-east-1, sitting in Brandenburg, Germany. The partition identifier is aws-eusc. When you construct ARNs, it's arn:aws-eusc: not arn:aws:.
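Hard-coded arn:aws: strings are the most common thing to break when code meets a new partition. A small Python sketch of deriving the partition from the region instead is below; the helper names are made up for illustration, the aws-eusc entry comes from the paragraph above, and the China and GovCloud entries are the existing partitions.

```python
# Region prefix -> partition. Anything unlisted is treated as commercial AWS.
PARTITION_BY_PREFIX = [
    ("eusc-", "aws-eusc"),
    ("cn-", "aws-cn"),
    ("us-gov-", "aws-us-gov"),
]


def partition_for(region):
    """Return the partition identifier for a region code."""
    for prefix, partition in PARTITION_BY_PREFIX:
        if region.startswith(prefix):
            return partition
    return "aws"


def build_arn(service, region, account_id, resource):
    """Assemble an ARN without hard-coding the partition."""
    return f"arn:{partition_for(region)}:{service}:{region}:{account_id}:{resource}"


print(build_arn("sqs", "eusc-de-east-1", "123456789012", "orders"))
# arn:aws-eusc:sqs:eusc-de-east-1:123456789012:orders
```

In Terraform itself, the aws_partition data source exposes the same value, so modules can interpolate it rather than assuming arn:aws:.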

AWS set up a new German parent company to run it — AWS European Sovereign Cloud GmbH — with three subsidiaries handling infrastructure, certificates, and employment. The managing directors are Stéphane Israël (former CEO of Arianespace) and Stefan Hoechbauer (VP of AWS Germany), both EU citizens based in the EU. The board includes independent third-party representatives specifically for sovereignty oversight. Not Amazon employees — actual independent oversight.

Only EU residents work there. Not just "based in Europe" — actually residing in the EU with EU contracts. And going forward, they're only hiring EU citizens. The transition is gradual, but the end state is clear: EU citizens only, no exceptions. No "follow-the-sun" support routing your ticket to Virginia at 3am.

When AWS says the infrastructure has "no critical dependencies on non-EU infrastructure", they mean it literally. The system can keep running even if someone cuts the transatlantic cables. Billing systems, metering engines, security operations center — all contained within the EU. Metadata created in ESC stays in ESC. Your usage data doesn't flow to a US billing system.

The Security Foundation

This matters more than the org chart stuff, honestly. Legal structures can change. Technical architecture is harder to undo.

ESC runs on the Nitro System, same as regular AWS. But the Nitro architecture is what makes the sovereignty claims credible. It's not just policy — it's hardware design.

The Nitro System was built with zero operator access as a design goal. There's no SSH into the hypervisor. No console access. No mechanism for AWS employees — or anyone — to access EC2 instance memory or customer data on encrypted storage. When they say "no backdoors", it's not a policy promise, it's a constraint enforced by the silicon.

Administrative access happens through authenticated, authorized, and logged APIs that provide no path to customer data. You can audit operations without giving operators data access. These restrictions are built into the Nitro firmware itself. Not a software toggle someone can flip during an emergency or under legal pressure.

NCC Group, an independent security firm, validated these claims in an audit published May 2023. They specifically looked for gaps that would let someone access customer data or memory. Found none. That audit applies to Nitro everywhere, including ESC.

For ESC specifically, AWS added the Sovereignty Reference Framework (ESC-SRF). It's an independently validated framework with third-party auditor reports documenting the sovereignty controls. Your compliance team can hand these reports to regulators instead of trying to explain AWS architecture themselves.

The Catch (There's Always a Catch)

You can't just add ESC to your existing AWS Organization and call it a day. This is a separate cloud, and that separation creates friction.

Separate console, separate login. ESC has its own management console on the amazonaws.eu domain, separate from console.aws.amazon.com. Different URL, different accounts, different credentials. You can't switch between ESC and commercial AWS with the account dropdown — they're completely separate consoles. Bookmark both if you work in both.

No cross-partition IAM. Can't assume roles from your regular AWS account into ESC. If you have workloads in both places, you need separate identity management. Set up federation through a third-party IdP like Okta or Azure AD, maintain separate credentials, design your CI/CD to handle both partitions. Your developers need two sets of AWS credentials.

No VPC peering. Want to connect eu-central-1 to ESC? Treat it like connecting to on-premises infrastructure. VPN, Direct Connect, or application-level APIs. You're bridging two clouds, not two regions. Network architects used to multi-region deployments need to reset their mental model.

Separate accounts entirely. Different accounts, different Organizations, different invoices, different cost allocation tags. If your finance team tracks cloud spend by AWS account ID, they need new processes. Your existing FinOps dashboards won't see ESC spend.

ECR isolation. You can't pull container images from your existing ECR repos in eu-central-1. ESC's isolation means no cross-partition image pulls. Push your images to ECR in eusc-de-east-1, use a public registry, or set up replication through your CI/CD pipeline.

Terraform works, but check your version. Terraform 1.14+ and AWS provider 6.x support ESC natively — endpoints resolve correctly without manual configuration. Just set the region:

provider "aws" {
  region = "eusc-de-east-1"
}

If you're on an older version, you'll need to upgrade or configure endpoints manually. The S3 backend for state storage also requires Terraform 1.14+.

What Services Are Available

AWS didn't launch this with five services and a "coming soon" page. You get 90+ services from day one. That matters because previous sovereign cloud offerings often meant accepting a skeleton service catalog.

Containers: ECS, EKS, ECR. Full Fargate support. If you're running containers anywhere on AWS today, same capabilities.

Compute: EC2 with multiple instance families, Lambda for serverless. Enough instance types for most workloads.

AI/ML: Bedrock, SageMaker, Amazon Q. All available from day one.

Database: Aurora (MySQL and PostgreSQL compatible), DynamoDB, RDS for managed databases. All the usual engines.

Storage: S3 with full feature parity, EBS for block storage.

Networking: VPC, Direct Connect, Route 53 for private hosted zones. Transit Gateway for complex topologies.

Security: KMS for encryption keys, Secrets Manager, Private CA, IAM with all the normal features.

If you're running containers on Fargate in Frankfurt today, you can run the same workloads on ESC. Same task definitions, same service configs, just different region and endpoints.

What's Missing

90 services sounds good until you remember AWS has 240+. Some gaps matter more than others:

CloudFront — No CDN at launch. If your architecture relies on edge caching, you'll need alternatives. Expected end of 2026.

IAM Identity Center — The modern way to manage SSO across an Organization isn't there yet. You can still use IAM with external identity providers, but you'll configure it per-account instead of centrally. Expected Q1 2026.

Shield Advanced & Firewall Manager — DDoS protection and centralized firewall rules aren't available. Basic Shield is included, but advanced protections aren't.

Amazon Inspector — No automated vulnerability scanning for workloads yet.

GuardDuty — Available but limited. No Organization-level management, missing some newer detection capabilities.

IoT Services — IoT Core, Greengrass, and related services aren't included. If you're running IoT workloads, ESC isn't ready for them.

Organizations features — You get AWS Organizations, but delegated administration isn't supported. StackSets and other governance tools must run from the Management Account.

Also worth noting: S3 Block Public Access isn't enabled by default like it is in commercial AWS. Enable it manually.

Pricing: Expect a 10-15% premium over Frankfurt (eu-central-1) for comparable services.

Deploying Containers — The Practical Bits

The patterns are identical to regular AWS. I'm not going to paste hundreds of lines of Terraform — you know how to deploy ECS. The differences are configuration, not architecture:

Region: eusc-de-east-1
ARNs use aws-eusc partition: arn:aws-eusc:iam::aws:policy/...
ECR images must come from ESC or public registries
Tag resources with compliance markers for your auditors
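The partition difference is easy to miss in scripts that build ARNs by hand. Here's a minimal Python sketch of deriving the partition from the region before assembling the ARN. The helper names and the "eusc-" prefix check are my assumptions for illustration, not an AWS API, and other partitions (China, GovCloud) are deliberately out of scope:

```python
# Hypothetical helpers: derive the ARN partition from the region name.
# Assumption: ESC regions start with "eusc-"; everything else is treated
# as commercial AWS ("aws"). Other partitions are not handled here.

def partition_for_region(region: str) -> str:
    return "aws-eusc" if region.startswith("eusc-") else "aws"


def managed_policy_arn(region: str, policy_path: str) -> str:
    """Build the ARN of an AWS-managed IAM policy for the given region."""
    partition = partition_for_region(region)
    return f"arn:{partition}:iam::aws:policy/{policy_path}"


policy = "service-role/AmazonECSTaskExecutionRolePolicy"
print(managed_policy_arn("eusc-de-east-1", policy))
print(managed_policy_arn("eu-central-1", policy))
```

In Terraform itself you'd rather use the aws_partition data source than string matching; the sketch is for the glue scripts around it.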

A minimal ECS task definition:

resource "aws_ecs_task_definition" "app" {
  family                   = "my-app"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.execution.arn

  container_definitions = jsonencode([{
    name  = "app"
    image = "123456789012.dkr.ecr.eusc-de-east-1.amazonaws.eu/app:latest"
    portMappings = [{ containerPort = 80 }]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/my-app"
        "awslogs-region"        = "eusc-de-east-1"
        # Fargate requires a stream prefix for the awslogs driver
        "awslogs-stream-prefix" = "app"
      }
    }
  }])
}

VPC setup is standard — public subnets for load balancers, private subnets for tasks, NAT gateways for outbound traffic. Security groups, ALB config, service definitions — all identical to what you'd write for Frankfurt.

Infrastructure as Code: The Real Story

If you're managing infrastructure with code (and you should be), here's what actually works with ESC right now.

Terraform and OpenTofu

As mentioned, Terraform 1.14+ handles ESC out of the box. But there's more to it than just setting the region. The aws_partition data source correctly returns aws-eusc, which is useful when you're building partition-aware modules:

data "aws_partition" "current" {}

# Returns "aws-eusc" in ESC, "aws" in commercial
output "partition" {
  value = data.aws_partition.current.partition
}

For multi-partition deployments, use provider aliases:

provider "aws" {
  alias  = "esc"
  region = "eusc-de-east-1"
}

provider "aws" {
  alias  = "commercial"
  region = "eu-central-1"
}

# Deploy to ESC
resource "aws_s3_bucket" "sovereign_data" {
  provider = aws.esc
  bucket   = "my-sovereign-bucket"
}

# Deploy to commercial
resource "aws_s3_bucket" "public_assets" {
  provider = aws.commercial
  bucket   = "my-public-bucket"
}

OpenTofu 1.11+ also supports ESC natively, including the S3 backend in eusc-de-east-1. Confirmed working by community testing in December 2025. If you've switched to OpenTofu, same patterns apply.

AWS CDK

CDK has supported ESC since August 2025. Region registration for eusc-de-east-1 and VPC endpoint handling were added in PR #34860. No workarounds needed; just set the region:

const app = new cdk.App();
const stack = new cdk.Stack(app, 'EscStack', {
  env: {
    account: '123456789012',
    region: 'eusc-de-east-1',
  },
});

ARNs, service endpoints, and partition references resolve correctly out of the box.

CloudFormation

Works as expected. CloudFormation is partition-aware by design, so templates deploy without modification. The AWS::Partition pseudo parameter returns aws-eusc automatically.

Resources:
  MyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        # Automatically uses aws-eusc partition
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"

One exception: Landing Zone Accelerator doesn't work. LZA maps to a single AWS Organization and can't span partitions. You'll need separate LZA deployments for ESC and commercial, with duplicated configurations.

Multi-Partition Patterns

Running workloads in both ESC and commercial AWS? Here are patterns that work:

Shared modules with partition-aware variables:

variable "partition" {
  description = "AWS partition (aws or aws-eusc)"
  type        = string
}

variable "region" {
  description = "AWS region"
  type        = string
}

locals {
  is_sovereign = var.partition == "aws-eusc"

  # Adjust for service availability
  enable_cloudfront    = !local.is_sovereign  # Not available in ESC yet
  enable_guardduty_org = !local.is_sovereign  # Limited in ESC
}

Separate state files per partition:

# ESC backend
terraform {
  backend "s3" {
    bucket = "my-tfstate-esc"
    key    = "infrastructure/terraform.tfstate"
    region = "eusc-de-east-1"
  }
}

Don't try to share state across partitions. The isolation is the point.

CI/CD branching strategy:

Some teams run completely separate pipelines per partition. Others use a single pipeline with partition as a variable. The right choice depends on how different your ESC and commercial configurations are. If they're mostly identical, one pipeline with environment variables works. If they diverge significantly, separate pipelines prevent accidents.
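For the single-pipeline variant, the partition usually ends up as a lookup key. A rough Python sketch of that idea follows; the regions and the ESC bucket name come from the examples above, while "my-tfstate-commercial", the target names, and the helper itself are placeholders I made up:

```python
# Hypothetical per-partition settings for one shared pipeline.
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionConfig:
    partition: str
    region: str
    state_bucket: str


CONFIGS = {
    "esc": PartitionConfig("aws-eusc", "eusc-de-east-1", "my-tfstate-esc"),
    "commercial": PartitionConfig("aws", "eu-central-1", "my-tfstate-commercial"),
}


def terraform_init_args(target: str) -> list[str]:
    """Arguments for `terraform init` that select the target's own state bucket."""
    cfg = CONFIGS[target]  # KeyError on an unknown target is intentional
    return [
        "init",
        f"-backend-config=bucket={cfg.state_bucket}",
        f"-backend-config=region={cfg.region}",
    ]


print(terraform_init_args("esc"))
```

Failing hard on an unknown target is the point: a typo should stop the pipeline, not silently deploy to the wrong partition.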

Planning Your Architecture

If you're considering ESC, think about workload segmentation early. Not everything needs sovereignty guarantees, and putting everything in ESC when it doesn't need to be there adds cost and complexity.

Tier 0 — Sovereign (ESC): Sensitive data requiring sovereignty guarantees. Patient health records, citizen personal data, financial records under regulatory requirements, classified government workloads. This is your ESC tier.

Tier 1 — Standard (Commercial AWS or ESC): Business data without special regulatory requirements. Internal tools, development environments, public-facing websites, marketing systems.

The hard part is the boundary. Your sovereign tier probably needs data from the standard tier sometimes. Options:

API gateways at the boundary. ESC workloads call commercial AWS through a controlled API layer. Strict authentication, audit logging, minimal data exposure. The API becomes your compliance checkpoint.

Data diodes for one-way flow. ESC can pull data from commercial AWS on a schedule. Commercial can't push to ESC. Useful for reference data that needs to be in ESC but originates elsewhere.

Message queues with encryption. Async communication through something like SQS or external message brokers. Decouples the systems while maintaining the boundary.

Don't try to architect this like multi-region. It's multi-cloud, practically speaking. Your eu-central-1 workloads can't directly call your ESC workloads over private networking. Plan for that from day one, not as an afterthought.

Migration Path

If you're moving existing workloads to ESC, here's a rough sequence:

Phase 1: Assessment. Which workloads actually need sovereignty? Many teams discover only 20-30% of their infrastructure handles truly sensitive data. Don't move everything just because you can.

Phase 2: Identity setup. Get your IAM structure in ESC before anything else. Set up federation, create roles, establish your permission model. Test authentication flows.

Phase 3: Network foundation. VPC, subnets, NAT gateways, security groups. If you need connectivity back to commercial AWS, set up the VPN or Direct Connect tunnel.

Phase 4: Container registry. Push your images to ECR in ESC. Update your CI/CD to build and push to both registries if you're running in both partitions.

Phase 5: Workload deployment. Start with non-critical workloads to validate your Terraform and deployment pipelines. Work through the endpoint configuration issues before touching production.

Phase 6: Data migration. This is usually the hardest part. How do you move data without downtime? Often involves running parallel systems temporarily, with replication from source.

Phase 7: Cutover. Switch traffic to ESC workloads. Keep the old deployment running until you're confident, then decommission.
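For Phase 4, the CI/CD job needs one image URI per partition. A small Python sketch of generating them is below. It assumes ECR registry hostnames follow the usual account.dkr.ecr.region.suffix shape with amazonaws.eu as the ESC suffix; check the exact hostname in your own ESC account. The account IDs are dummies:

```python
# Assumption: registry hostname = <account>.dkr.ecr.<region>.<dns-suffix>,
# with "amazonaws.eu" for ESC regions and "amazonaws.com" for commercial AWS.

def ecr_image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    suffix = "amazonaws.eu" if region.startswith("eusc-") else "amazonaws.com"
    return f"{account_id}.dkr.ecr.{region}.{suffix}/{repo}:{tag}"


def push_targets(repo: str, tag: str, accounts_by_region: dict[str, str]) -> list[str]:
    """One image URI per region; each partition has its own account ID."""
    return [
        ecr_image_uri(account_id, region, repo, tag)
        for region, account_id in accounts_by_region.items()
    ]


# Dummy account IDs: ESC and commercial accounts are always different.
targets = push_targets("app", "v1.2.3", {
    "eu-central-1": "111111111111",
    "eusc-de-east-1": "222222222222",
})
for uri in targets:
    print(uri)
```

The pipeline then tags and pushes the same build once per URI, logging in to each registry with that partition's credentials.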

Cost Reality

ESC pricing follows standard AWS models — you pay for what you use. But the isolation adds costs:

NAT Gateways: ~€0.045/hour each plus data processing. High availability means two gateways, roughly €65/month before data charges. You're paying this in Frankfurt too, but now you're paying it twice if you have workloads in both partitions.

Data transfer between partitions: Not free internal transfer. Treat it like cross-region or internet egress. If your architecture involves heavy data movement between ESC and commercial AWS, model those costs.

Operational overhead: Managing two partitions means duplicated effort. Two sets of IAM policies, two CI/CD pipelines, two monitoring dashboards, two on-call rotations if you have partition-specific issues. That's engineering time.

Compliance tooling: You'll probably want separate security scanning, compliance monitoring, and audit tooling for ESC. Or tools that understand both partitions. Either way, cost.

AWS has confirmed a 10-15% pricing premium over Frankfurt for comparable services — what they call the "sovereignty premium." Combined with the hidden costs above, budget accordingly.
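If you want to plug your own numbers in, here's the arithmetic as a short Python sketch. The NAT rate and the premium range are the figures quoted above; the 730-hour month and the EUR 1,000 baseline bill are assumptions for illustration only:

```python
# Worked arithmetic for the cost figures in this section.
HOURS_PER_MONTH = 730  # assumption: average month length


def nat_gateway_monthly(hourly_rate: float, gateways: int) -> float:
    """Fixed monthly NAT gateway cost, before data processing charges."""
    return hourly_rate * HOURS_PER_MONTH * gateways


def with_premium(frankfurt_cost: float, premium: float) -> float:
    """Apply a sovereignty premium (0.10 means 10%) to a Frankfurt price."""
    return frankfurt_cost * (1 + premium)


nat = nat_gateway_monthly(0.045, gateways=2)  # two gateways for high availability
print(f"NAT gateways per partition: ~EUR {nat:.2f}/month")

baseline = 1000.0  # hypothetical Frankfurt bill
low, high = with_premium(baseline, 0.10), with_premium(baseline, 0.15)
print(f"Same services in ESC: EUR {low:.0f} to EUR {high:.0f}/month")
```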

Who Should Actually Use This

Move to ESC if:

You handle data under strict EU sovereignty requirements — not just GDPR, but sector-specific rules that mandate operational control
Regulators or auditors have specifically asked about US CLOUD Act exposure
You're in public sector, healthcare (especially in Germany with patient data), or finance with explicit data residency mandates
Your contracts require EU-only operations and personnel — government contracts often do
You need to demonstrate sovereignty compliance with third-party validated reports

Stick with regular EU regions if:

Standard GDPR compliance is sufficient for your use case
You need services that haven't launched in ESC yet
Cost optimization is a higher priority than sovereignty guarantees
You're already running multi-region and partition complexity doesn't fit your operating model
Your compliance requirements don't specifically call out operational sovereignty or personnel location

ESC isn't "better" than Frankfurt. It solves a specific problem. If you don't have that problem, you're adding complexity and cost for no benefit. Frankfurt with proper encryption and access controls is fine for most workloads.

The Competitive Landscape

AWS isn't alone here. Microsoft announced sovereign cloud offerings for EU customers. Google has Sovereign Controls for GCP. But the approaches differ.

Microsoft's approach involves partnerships with local operators — like T-Systems in Germany running Azure infrastructure. Google focuses on software controls and key management.

AWS went further with complete partition isolation. New legal entities, new domain, separate IAM, the whole stack. Whether that matters depends on what your regulators care about.

The 90+ service catalog at launch also sets AWS apart. Competitors often launch sovereign offerings with limited services and catch up over time. ESC starts nearly feature-complete.

What's Coming

AWS announced expansion plans. Local Zones in Belgium, Netherlands, and Portugal — same sovereignty model, lower latency for users in those countries. These extend ESC's footprint without requiring new full regions.

The workforce transition continues. Current staff are EU residents; future hires will be EU citizens only. Over time, the entire operation shifts to citizen-only. That's a commitment you can point to in RFPs.

More regions within ESC are likely but not announced. If demand justifies it, a second ESC region (France? Italy?) would add redundancy options.

The €7.8 billion investment through 2040 signals this isn't an experiment. Amazon is building parallel infrastructure for the next fifteen years.

Bottom Line

The European Sovereign Cloud answers three questions that every regulated European organization has been asking. Where exactly is my data? Who can access it? What happens when a foreign government asks for it?

For workloads where those questions have regulatory or contractual weight, ESC provides answers backed by legal structure, organizational isolation, and hardware-level security design. The ESC-SRF gives you auditor reports to prove it.

For everything else, eu-central-1 works fine and doesn't require rethinking your account structure, identity model, and network architecture.

Just remember: ESC is a different cloud, not a different region. The isolation that provides sovereignty guarantees also creates operational boundaries. That's the point — but it's also the cost.

ECS vs EKS: When You DON'T Need Kubernetes - A Practical Guide to Choosing AWS Container Services

Timur Galeev — Sun, 04 Jan 2026 16:01:37 GMT

Introduction

You know what? I see teams spinning up Kubernetes clusters for three microservices all the time. Then they spend two months figuring out pods, ingress controllers, and all that magic. And then they pay about $70 per month per cluster, so $210 for three clusters in different regions, not counting the actual servers.

Here's the honest truth: Kubernetes is a powerful tool, but you don't always need it. Amazon ECS is a simpler alternative that handles most tasks faster and cheaper.

In this article I'll show you:

When ECS beats EKS (and saves you tons of money)
Real scenarios with numbers and examples
Ready-to-use code snippets for deploying to both platforms
How to make the decision without headaches

Let's dive in!

Quick Comparison: ECS vs EKS

First let's look at the main differences in a simple table:

Feature	AWS ECS	AWS EKS
Cluster Cost	$0	$0.10/hour (~$70/month)
Setup Complexity	Low (2-4 hours)	High (1-2 days)
Learning Curve	Few days	Several weeks
Management	AWS Console/CLI	kubectl + AWS Console
Ecosystem	AWS services	Entire Kubernetes world
Portability	AWS only	Any cloud/on-prem
Updates	Automatic	Manual (control plane)
Best For	1-10 services	10-100+ services

Architecture: How It Works

ECS Architecture:

Your Application
    ↓
Docker Image (you need this!)
    ↓
Task Definition (container description)
    ↓
ECS Service (manages launch)
    ↓
EC2 or Fargate (where it runs)
    ↓
Container running

EKS Architecture:

Your Application
    ↓
Docker Image
    ↓
Kubernetes Pod specification
    ↓
Deployment/StatefulSet
    ↓
Kubernetes Control Plane ($$$)
    ↓
Worker Nodes
    ↓
Container in Pod

See the difference? ECS has one fewer step and each one is easier to understand.

When ECS is Your Best Choice

This is where it gets interesting. Many people think Kubernetes is always needed but that's not true. Let's break down real situations where ECS wins.

Scenario 1: Multi-Regional Deployment (3-5 Services)

Imagine: you have a simple API and a couple of supporting services. You need to deploy them in three regions - Europe, Asia, USA. For redundancy, you know.

With EKS you pay:

Europe cluster: $70/month
Asia cluster: $70/month
USA cluster: $70/month
Total: $210/month just for the right to run containers

With ECS you pay:

Cluster is free: $0
Total: $0 for management

In other words, you save $2,520 per year just on the control plane! You still gotta pay for the actual servers either way.
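The fee scales linearly with the number of regions, which is easy to see in a few lines of Python. A quick sketch, assuming one cluster per region and a 730-hour month; the $70 is the rounded figure used in this article, and the function name is mine:

```python
# EKS control plane fee with one cluster per region.
EKS_HOURLY_FEE = 0.10     # per cluster, from the comparison table
HOURS_PER_MONTH = 730     # assumption: average month length
ROUNDED_MONTHLY_FEE = 70  # the rounded per-cluster figure used in this article


def control_plane_yearly(regions: int, monthly_fee: float = ROUNDED_MONTHLY_FEE) -> float:
    """Yearly control plane fee, before any worker nodes."""
    return regions * monthly_fee * 12


# Unrounded monthly fee per cluster, for reference
print(f"Unrounded: ${EKS_HOURLY_FEE * HOURS_PER_MONTH:.2f}/month per cluster")

for regions in (1, 3, 5):
    print(f"{regions} region(s): ${control_plane_yearly(regions)}/year")
```

With ECS the same loop prints zeros, which is the whole argument of this scenario.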

Real Example

I had a project - e-commerce backend. Five services:

API Gateway (Node.js)
Order Service (Python)
Payment Service (Go)
Notification Service (Node.js)
Analytics Worker (Python)

Each service needed a Docker image. Here's a simple Dockerfile example for the Node.js API:

# Dockerfile for API Gateway
FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm install --omit=dev

COPY . .

EXPOSE 3000
CMD ["node", "server.js"]

We deployed across three regions using ECS Fargate. Setup time: 4 hours including Terraform code. If we'd done it with EKS, that's a week minimum, with Helm charts, ingress controllers and everything that comes with them.

Here's how we defined the task in ECS (simplified):

# ECS Task Definition - just the container part
# (a real Fargate task pulling from ECR also needs execution_role_arn)
resource "aws_ecs_task_definition" "api_gateway" {
  family                   = "api-gateway"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256"
  memory                   = "512"

  container_definitions = jsonencode([{
    name      = "api"
    image     = "123456789012.dkr.ecr.us-east-1.amazonaws.com/api-gateway:latest"
    essential = true

    portMappings = [{
      containerPort = 3000
      protocol      = "tcp"
    }]

    environment = [
      { name = "NODE_ENV", value = "production" },
      { name = "PORT", value = "3000" }
    ]
  }])
}

Compare this to Kubernetes - you'd need Deployment YAML, Service YAML, maybe Ingress, ConfigMaps... it adds up.

Scenario 2: Quick Start and Simplicity

You're a startup. You have an MVP that needs to ship yesterday. A team of three people, and nobody knows Kubernetes deeply.

ECS gives you:

Launch in a couple of hours (not days!)
AWS integration out of the box
No need to hire a Kubernetes expert
Fewer moving parts = fewer things to break

Look, I'm not saying Kubernetes is bad. It's awesome! But do you need it when you just wanna run a container? It's like buying a truck to get bread from the store.

Time to learn:

ECS: 2-3 days to work comfortably
EKS: 2-3 weeks minimum (or even a month)

Here's a complete minimal ECS setup with Terraform:

# Minimal ECS cluster
resource "aws_ecs_cluster" "main" {
  name = "my-app-cluster"
}

# ECS Service - runs 2 copies of your container
resource "aws_ecs_service" "app" {
  name            = "my-app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api_gateway.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = ["subnet-xxx", "subnet-yyy"]
    security_groups  = ["sg-xxx"]
    assign_public_ip = true
  }
}

That's it! No Helm, no kubectl, no YAML soup.

Scenario 3: AWS-Native Project

Your project is fully in AWS:

Database - RDS
Files - S3
Queues - SQS
Cache - ElastiCache
Logs - CloudWatch

Why Kubernetes here? ECS integrates with these services natively, with less setup.

Example - S3 access:

ECS Task Role (simple):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}

Attach the role to Task Definition - done.

In EKS you do the same through IRSA (IAM Roles for Service Accounts):

Set up an OIDC provider for the cluster
Create an IAM role that trusts it
Create a ServiceAccount in Kubernetes, annotated with the role ARN
Reference the ServiceAccount in the pod spec

More steps = more places to mess up.

When EKS Becomes Necessary

Alright, enough praising ECS. Let's be honest - there are situations where EKS is really better.

Scenario 1: Large Microservices Architecture (20+ Services)

When you have 20, 30, 50 microservices - that's different math.

Why EKS wins:

$70 per cluster is a fixed price (whether 5 services or 50)
Kubernetes scales complexity better
Ecosystem: Helm, Operators, service mesh (Istio, Linkerd)
Centralized management of all services

Cost example:

With 30 services in one region:

ECS: 30 separate ECS Services = lots of config, hard to manage
EKS: One cluster, all services in namespaces, managed through GitOps

Here $70/month pays for convenience.

A typical Kubernetes deployment:

# Kubernetes Deployment - simpler at scale
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
      - name: api
        image: my-registry/api-gateway:v1.2.3
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "250m"

With Kubernetes you get built-in health checks, rolling updates, easy rollbacks.

Scenario 2: Multi-Cloud or Hybrid Infrastructure

Your company wants:

Work in AWS and GCP simultaneously
Keep some workloads on-premises
Have the ability to migrate between clouds

EKS (Kubernetes) gives portability:

Same YAML manifests work everywhere
Can move applications between clouds
Standardization across all infra

ECS is AWS only. Can't move it anywhere. (ECS anywhere?! :) )

Scenario 3: Advanced Features

GPU workloads for ML/AI: EKS supports GPU nodes out of the box + all tooling like Kubeflow.

Complex networking policies: Network Policies in Kubernetes give precise traffic control between pods.

# Network Policy example
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
spec:
  podSelector:
    matchLabels:
      app: api-gateway
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 3000

Stateful applications: StatefulSets, Persistent Volumes - all this works better in Kubernetes.

Practical Deployment Examples

Enough theory, let's get our hands dirty. I'll show how to deploy a simple application to both ECS and EKS. Same application, to compare.

Our application: Nginx + simple Node.js API (both need Docker images)

Building Docker Images First

Before deploying anywhere you need Docker images. Here's our setup:

# Dockerfile for our Node.js app
FROM node:18-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

EXPOSE 3000
CMD ["node", "index.js"]

Build and push:

# Build image
docker build -t my-app:latest .

# Tag for ECR
docker tag my-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

# Log in and push to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

ECS Deployment with Terraform

Let's start with the simpler one - ECS.

Step 1: VPC Setup

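# Availability zones in the current region
data "aws_availability_zones" "available" {
  state = "available"
}

# Create VPC for containers
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# Public subnets
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Route public subnet traffic through the internet gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}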

Step 2: ECS Cluster and Service

# Create ECS cluster 
resource "aws_ecs_cluster" "main" {
  name = "my-app-cluster"
}

# Task Definition - describes your Docker container
resource "aws_ecs_task_definition" "app" {
  family                   = "my-app"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256"
  memory                   = "512"

  execution_role_arn = aws_iam_role.ecs_execution.arn
  task_role_arn      = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name      = "app"
    image     = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest"
    essential = true

    portMappings = [{
      containerPort = 3000
      protocol      = "tcp"
    }]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/my-app"
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "app"
      }
    }
  }])
}

# ECS Service - runs and maintains containers
resource "aws_ecs_service" "app" {
  name            = "my-app-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = aws_subnet.public[*].id
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = true
  }
}

Step 3: IAM Roles

# Role for ECS to pull Docker images and write logs
resource "aws_iam_role" "ecs_execution" {
  name = "ecs-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ecs-tasks.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_execution_policy" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Role for your application (e.g., S3 access)
resource "aws_iam_role" "ecs_task" {
  name = "ecs-task-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ecs-tasks.amazonaws.com"
      }
    }]
  })
}

Deploy It

terraform init
terraform plan
terraform apply

Done! Container is running.

EKS Deployment with Terraform

Now the same thing but in EKS.

Step 1: EKS Cluster

# EKS cluster 
resource "aws_eks_cluster" "main" {
  name     = "my-eks-cluster"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.33" # use a version that is currently in EKS standard support

  vpc_config {
    subnet_ids = concat(aws_subnet.public[*].id, aws_subnet.private[*].id)
  }
}

# Worker nodes
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "main-nodes"
  node_role_arn   = aws_iam_role.eks_node.arn
  subnet_ids      = aws_subnet.private[*].id

  scaling_config {
    desired_size = 2
    max_size     = 4
    min_size     = 1
  }

  instance_types = ["t3.medium"]
}

Step 2: IAM for EKS

# Cluster role
resource "aws_iam_role" "eks_cluster" {
  name = "eks-cluster-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "eks.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "eks_cluster_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.eks_cluster.name
}

# Node role
resource "aws_iam_role" "eks_node" {
  name = "eks-node-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "eks_worker_node" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.eks_node.name
}

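resource "aws_iam_role_policy_attachment" "eks_cni" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.eks_node.name
}

# Nodes also need read access to ECR to pull container images
resource "aws_iam_role_policy_attachment" "eks_ecr_read" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.eks_node.name
}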

Step 3: Kubernetes Manifests

After the cluster is created, deploy the application with kubectl:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "250m"
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 3000

Deploy It

# 1. Apply Terraform 
terraform init
terraform apply

# 2. Configure kubectl
aws eks update-kubeconfig --name my-eks-cluster --region us-east-1

# 3. Check nodes
kubectl get nodes

# 4. Deploy application
kubectl apply -f deployment.yaml

# 5. Check status
kubectl get pods
kubectl get svc

Difference:

ECS: one terraform apply and done
EKS: terraform apply + kubectl commands + wait for everything to come up

Complexity Comparison

Action	ECS	EKS
Config files	3-4 Terraform files	4-5 Terraform + YAML manifests
First deploy time	5-7 minutes	15-20 minutes
Commands to run	2 (init, apply)	5+ (terraform + kubectl)
Need to know	AWS, Terraform, Docker	AWS, Terraform, Kubernetes, kubectl, Docker

Real Cases and Economics

Let's calculate concrete numbers for typical scenarios.

Case 1: Startup with 5 Microservices in 3 Regions

Requirements:

5 services (API, Workers, Background Jobs)
3 regions: US, EU, Asia
2 instances of each service
All need Docker images built and stored in ECR

ECS Fargate:

Cluster cost: $0
ECR storage: ~$5/month (for Docker images)
Compute (Fargate):
  - 5 services × 2 instances × 3 regions = 30 tasks
  - Each task: 0.25 vCPU, 512 MB
  - $0.04048/hour per vCPU, $0.004445/hour per GB
  - (0.25 × $0.04048 + 0.5 × $0.004445) × 730 hours = ~$9/task/month
  - 30 tasks × $9 = $270/month

Total: ~$275/month

EKS:

Cluster cost: $70 × 3 regions = $210/month
ECR storage: ~$5/month (same Docker images)
Compute (EC2 nodes):
  - Minimum 2× t3.medium per region = 6 instances
  - t3.medium = $0.0416/hour × 730 = ~$30/month
  - 6 × $30 = $180/month

Total: $210 + $5 + $180 = $395/month

Savings with ECS: $120/month or $1,440/year

Plus, with ECS you don't pay a DevOps engineer to manage Kubernetes :)
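To rerun Case 1 with your own task sizes or region count, here's the same calculation as a Python sketch. The rates are the ones quoted above (they vary by region and change over time, so treat them as assumptions), and the variable names are mine. Without the intermediate rounding the totals land at roughly $275 and $397, so the gap is a little above the $120 quoted:

```python
# Case 1: 5 services x 2 instances x 3 regions, ECS Fargate vs EKS on EC2.
HOURS_PER_MONTH = 730
FARGATE_VCPU_HOUR = 0.04048
FARGATE_GB_HOUR = 0.004445
T3_MEDIUM_HOUR = 0.0416
EKS_CLUSTER_MONTH = 70  # rounded control plane fee per cluster
ECR_MONTH = 5           # rough image storage estimate


def fargate_task_monthly(vcpu: float, memory_gb: float) -> float:
    """Monthly cost of one always-on Fargate task."""
    hourly = vcpu * FARGATE_VCPU_HOUR + memory_gb * FARGATE_GB_HOUR
    return hourly * HOURS_PER_MONTH


services, instances, regions = 5, 2, 3
tasks = services * instances * regions

ecs_total = tasks * fargate_task_monthly(0.25, 0.5) + ECR_MONTH

nodes = 2 * regions  # minimum two t3.medium nodes per region
eks_total = regions * EKS_CLUSTER_MONTH + nodes * T3_MEDIUM_HOUR * HOURS_PER_MONTH + ECR_MONTH

print(f"ECS Fargate: ${ecs_total:.2f}/month")
print(f"EKS:         ${eks_total:.2f}/month")
print(f"Difference:  ${eks_total - ecs_total:.2f}/month")
```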

Case 2: Large Project with 30 Services in 1 Region

ECS:

Cluster: $0
Management: 30 separate ECS Services (hard to manage!)
Compute: depends on load

EKS:

Cluster: $70/month
Management: One cluster, namespaces, GitOps, Helm (easier!)
Compute: same + better resource utilization

Here EKS wins on management convenience. $70 pays for itself.

Time for Setup and Maintenance

Real numbers from my experience:

Initial setup:

ECS: 4 hours (Terraform + tests + Docker builds)
EKS: 2 days (cluster + addons + monitoring setup + Docker builds)

Weekly maintenance:

ECS: ~30 minutes (check logs, updates)
EKS: ~2 hours (updates, cluster checks, monitoring)

Platform updates:

ECS: automatic
EKS: need to update control plane once a year (takes half a day with tests)
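Added up over a year, the weekly numbers look like this. A tiny Python sketch, assuming 52 working weeks and counting "half a day" as 4 hours (both assumptions are mine):

```python
# Yearly maintenance time from the weekly estimates above.
WEEKS_PER_YEAR = 52


def yearly_hours(weekly_hours: float, upgrade_hours: float = 0.0) -> float:
    """Weekly upkeep over a year plus one planned platform upgrade."""
    return weekly_hours * WEEKS_PER_YEAR + upgrade_hours


ecs_hours = yearly_hours(0.5)                      # ~30 minutes a week, automatic updates
eks_hours = yearly_hours(2.0, upgrade_hours=4.0)   # ~2 hours a week + one half-day upgrade

print(f"ECS: {ecs_hours:.0f} hours/year")  # 26
print(f"EKS: {eks_hours:.0f} hours/year")  # 108
```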

Decision Checklist: What to Choose?

So here's a simple checklist for decision making:

Choose ECS if:

✅ You have fewer than 10-15 microservices
✅ Project is AWS only (no multi-cloud plans)
✅ Team doesn't know Kubernetes (and doesn't want to learn)
✅ Need to launch quickly (MVP, startup)
✅ Budget is limited
✅ Simple application without complex dependencies
✅ Multi-regional deploy (save on clusters)
✅ Comfortable with Docker basics

Choose EKS if:

✅ More than 20 microservices
✅ Need portability (multi-cloud, hybrid)
✅ Team knows Kubernetes
✅ Need advanced features (service mesh, operators)
✅ GPU workloads for ML/AI
✅ Already using Kubernetes elsewhere
✅ Complex microservices architecture
✅ Want access to Kubernetes ecosystem

Middle Ground

You can start with ECS and migrate later!

Many companies do this:

Start on ECS (fast and cheap)
Grow to 15-20 services
Kubernetes developers join the team
Gradually migrate to EKS

This is normal evolution. Don't go Kubernetes just "because it's cool".

Conclusions

Here's what's important to remember:

ECS is not a "second-rate" option. It's a full-fledged solution that handles many tasks excellently. Yes, EKS is more capable, but most projects simply don't need those capabilities.

Main ECS advantages:

Free control plane (save $70-210+ per month)
Simplicity and launch speed
Less operational overhead
Native AWS integration
Perfect for multi-regional deployment of small services

When EKS is really needed:

Large scale (20+ services)
Code portability
Advanced Kubernetes features
Kubernetes expertise already on the team

My advice: Don't chase the hype. Start with ECS if the task allows. Save time, money, and nerves. And when you really grow into Kubernetes, then migrate.

Kubernetes is like a Ferrari: a cool car, but for a trip to the store a regular Toyota works fine. And uses less gas 😄

Final Thoughts

The choice between ECS and EKS isn't about "better" or "worse" - it's about the right tool for the job.

Start simple. ECS lets you ship fast without the Kubernetes learning curve. Your Docker skills transfer directly. AWS handles the orchestration.

As you grow, reassess. When you hit 15-20 services or need multi-cloud, EKS makes sense. But many successful companies run production on ECS for years.

Remember: Complexity is a cost. Every abstraction layer you add costs time, money, and mental overhead. Sometimes the best architecture is the simplest one that works.

Both platforms use Docker. Both run containers. Both scale. The question is: how much complexity do you actually need?

Choose wisely!

AWS ECS Evolution: Managed Instances and Advanced Deployment Strategies

Timur Galeev — Mon, 13 Oct 2025 22:00:00 GMT

The container orchestration landscape on AWS recently received significant enhancements with two major updates to Amazon Elastic Container Service (ECS): the introduction of ECS Managed Instances and built-in support for Linear and Canary deployment strategies. These features address common operational challenges while providing more flexibility for teams running containerized workloads.

ECS Managed Instances: Bridging the Gap Between Control and Simplicity

Amazon ECS Managed Instances represents a new compute option that aims to combine the operational simplicity of managed infrastructure with the flexibility of EC2. This offering positions itself between AWS Fargate and self-managed EC2 instances in the ECS ecosystem.

What Makes It Different?

The key differentiator lies in how it handles infrastructure management. Unlike Fargate, which abstracts away the underlying compute entirely, ECS Managed Instances gives you visibility and control over instance types while AWS handles the operational burden. Unlike traditional EC2-backed ECS clusters, you don't need to manage instance provisioning, scaling, or patching.

Key capabilities include:

Instance Selection Flexibility: By default, AWS automatically selects cost-optimized instance types based on your workload requirements. However, you can specify particular instance attributes when needed, including GPU acceleration, specific CPU architectures (ARM/x86), or enhanced networking capabilities.
Task Bin-Packing: Unlike Fargate's one-task-per-instance model, Managed Instances supports multiple tasks per instance, optimizing resource utilization and potentially reducing costs through better instance consolidation.
Automated Maintenance: The service implements security patches every 14 days and handles instance lifecycle management. You can schedule maintenance windows using EC2 event windows to minimize application disruption during critical business hours.
Bottlerocket OS: Instances run on Bottlerocket, AWS's purpose-built container operating system, which provides a minimal attack surface and improved security posture.
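To see why bin-packing can cut the instance count, here is a toy sketch using first-fit decreasing, a generic packing heuristic. It is not a description of how ECS actually places tasks, and it only considers CPU; the function name and the sample sizes are made up.

```python
def pack_tasks(task_cpu_units, instance_cpu_units):
    """First-fit decreasing: place each task on the first instance with room."""
    free = []        # remaining CPU units per instance
    placement = []   # task sizes assigned to each instance
    for cpu in sorted(task_cpu_units, reverse=True):
        if cpu > instance_cpu_units:
            raise ValueError(f"task needing {cpu} units cannot fit on any instance")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                free[i] -= cpu
                placement[i].append(cpu)
                break
        else:
            # No existing instance has room, so start a new one.
            free.append(instance_cpu_units - cpu)
            placement.append([cpu])
    return placement

tasks = [256, 256, 512, 1024, 512, 256]  # CPU units, where 1024 = 1 vCPU
print(pack_tasks(tasks, instance_cpu_units=2048))
# [[1024, 512, 512], [256, 256, 256]]
```

Six tasks end up on two instances instead of six isolated environments, which is where the potential savings come from.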

Understanding the Cost Model

It's important to note that ECS Managed Instances adds a management fee on top of EC2 instance costs. This charge varies by instance class and size and is billed at on-demand pricing (per second with a one-minute minimum), even if you're using EC2 Savings Plans for the underlying instances. Teams should evaluate whether the operational savings justify the additional cost for their specific workloads.
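A rough way to reason about the fee is sketched below. Both hourly rates are placeholders, since the actual fee depends on instance class and size, and the one-minute minimum is applied to the whole runtime for simplicity.

```python
def billed_seconds(runtime_seconds):
    """Per-second billing with a one-minute minimum."""
    return max(60, runtime_seconds)

def instance_cost(runtime_seconds, ec2_hourly, mgmt_fee_hourly):
    """EC2 charge plus management fee for one instance's runtime.

    ec2_hourly may already reflect a Savings Plan discount;
    mgmt_fee_hourly stays at the on-demand rate regardless.
    """
    hours = billed_seconds(runtime_seconds) / 3600
    return hours * (ec2_hourly + mgmt_fee_hourly)

# Hypothetical rates, for illustration only
print(round(instance_cost(3600, ec2_hourly=0.10, mgmt_fee_hourly=0.012), 4))
```

Comparing the fee portion against the hours your team spends on patching and scaling today is usually the quickest way to decide whether it is worth paying.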

When to Choose ECS Managed Instances

This option makes sense when you need:

Access to specific instance types (bare metal, GPU instances, or specialized compute)
Better cost optimization through task bin-packing
EC2-level control without operational overhead
Integration with existing EC2 pricing commitments

Advanced Deployment Strategies: Linear and Canary Deployments

Alongside Managed Instances, AWS introduced native support for Linear and Canary deployment strategies in ECS, expanding beyond the existing blue/green deployment option. These strategies are available for services using Application Load Balancer (ALB) or ECS Service Connect.

Canary Deployments: Controlled Risk Exposure

Canary deployments allow you to validate new service revisions with minimal risk by routing a small percentage of production traffic to the new version first.

The deployment process follows a two-step traffic shift:

Initially shift a configured percentage (e.g., 10%) to the new revision
After the canary bake time completes successfully, shift the remaining traffic to the new revision

During the canary bake time, both versions run simultaneously, allowing you to monitor metrics, health checks, and application behavior. If issues are detected, you can quickly roll back by shifting traffic back to the original version.
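The two-step shift can be written down as a tiny schedule. This is only a model of the behaviour described above, not an ECS API call, and the function name is invented.

```python
def canary_schedule(canary_percent, canary_bake_minutes):
    """Return (minute, percent of traffic on the new revision) for a two-step shift."""
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 0 and 100 (exclusive)")
    return [(0, canary_percent), (canary_bake_minutes, 100)]

print(canary_schedule(10, 15))  # [(0, 10), (15, 100)]
```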

Linear Deployments: Gradual Traffic Migration

Linear deployments provide a more gradual approach, shifting traffic in equal percentage increments over a specified time period. You configure:

Step percentage: How much traffic shifts at each increment (e.g., 10%)
Step bake time: The wait period between each increment for monitoring

This strategy validates your application at multiple stages with progressively increasing production traffic, providing more data points for validation compared to canary deployments.
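As a sketch, the linear schedule for a given step percentage and step bake time looks like this. Again, it models the behaviour described here rather than calling any ECS API, and the function name is invented.

```python
def linear_schedule(step_percent, step_bake_minutes):
    """Return (minute, percent of traffic on the new revision) for each linear step."""
    if step_percent <= 0:
        raise ValueError("step_percent must be positive")
    steps = []
    shifted, minute = 0, 0
    while shifted < 100:
        shifted = min(100, shifted + step_percent)  # the last step may be smaller
        steps.append((minute, shifted))
        minute += step_bake_minutes
    return steps

print(linear_schedule(25, 10))
# [(0, 25), (10, 50), (20, 75), (30, 100)]
```

A 10% step with a 10-minute bake time therefore needs ten steps and about an hour and a half before all traffic is on the new revision, which is worth knowing before you pick the numbers.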

Deployment Lifecycle and Monitoring

Both strategies support several critical features:

Deployment Bake Time: After all traffic has shifted to the new revision, AWS waits a configurable period before terminating the old revision, enabling quick rollback without downtime if issues emerge.
Lifecycle Hooks: You can configure Lambda functions to execute at specific deployment stages for automated validation, custom health checks, or integration with external monitoring systems.
CloudWatch Alarm Integration: Configure automatic rollback triggers based on CloudWatch alarms, enabling automated failure detection and recovery.
Lifecycle Stages: Each deployment progresses through distinct stages (SCALE_UP, TEST_TRAFFIC_SHIFT, PRODUCTION_TRAFFIC_SHIFT, BAKE_TIME, CLEAN_UP), with each stage lasting up to 24 hours. For CloudFormation deployments, the entire process must complete within 36 hours.
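Before settling on long bake times, it can help to check a planned timeline against the limits mentioned above. The helper below is a hypothetical planning aid; the stage names and the 24-hour and 36-hour limits come from the text, everything else is made up.

```python
STAGE_LIMIT_HOURS = 24
CLOUDFORMATION_LIMIT_HOURS = 36

def check_plan(stage_hours, via_cloudformation=False):
    """Return a list of problems with planned per-stage durations (in hours)."""
    problems = [
        f"{stage} planned for {hours}h exceeds the {STAGE_LIMIT_HOURS}h stage limit"
        for stage, hours in stage_hours.items()
        if hours > STAGE_LIMIT_HOURS
    ]
    total = sum(stage_hours.values())
    if via_cloudformation and total > CLOUDFORMATION_LIMIT_HOURS:
        problems.append(
            f"total of {total}h exceeds the {CLOUDFORMATION_LIMIT_HOURS}h CloudFormation limit"
        )
    return problems

plan = {
    "SCALE_UP": 1,
    "TEST_TRAFFIC_SHIFT": 2,
    "PRODUCTION_TRAFFIC_SHIFT": 20,
    "BAKE_TIME": 16,
    "CLEAN_UP": 1,
}
print(check_plan(plan, via_cloudformation=True))
```

Each stage in this plan is within its own limit, but the 40-hour total would not fit a CloudFormation-driven deployment.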

Best Practices for Production Use

When implementing these deployment strategies, consider:

Start Conservative: Begin with smaller percentages (5-10% for canary) to minimize impact if issues occur
Sufficient Monitoring: Ensure your canary percentage generates enough traffic for meaningful validation
Appropriate Bake Times: Set evaluation periods long enough to capture meaningful performance data (typically 10-30 minutes)
Comprehensive Metrics: Monitor response time, error rates, throughput, and business-specific metrics
Automated Rollback: Configure CloudWatch alarms to automatically trigger rollback when metrics exceed thresholds
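A quick back-of-the-envelope check for the "Sufficient Monitoring" point: how many requests will the canary actually see during its bake time? The function name and the sample traffic numbers are illustrative.

```python
def canary_requests(requests_per_second, canary_percent, bake_minutes):
    """Approximate number of requests the canary serves during the bake time."""
    return requests_per_second * canary_percent * bake_minutes * 60 / 100

# 50 requests/second overall, 5% canary, 10-minute bake time
print(canary_requests(50, 5, 10))  # 1500.0
```

If that number is too small to reveal a realistic error rate, either the percentage or the bake time needs to go up.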

Regional Availability

ECS Managed Instances launched in six AWS Regions: US East (North Virginia), US West (Oregon), Europe (Ireland), Africa (Cape Town), Asia Pacific (Singapore), and Asia Pacific (Tokyo).

Linear and Canary deployment strategies are available in all commercial AWS Regions where Amazon ECS is available and can be configured through the Console, SDK, CLI, CloudFormation, CDK, and Terraform.

Conclusion

These enhancements demonstrate AWS's continued investment in making ECS more flexible and operationally efficient. ECS Managed Instances provides a middle ground between Fargate's simplicity and EC2's control, while the new deployment strategies offer production-grade deployment patterns that many organizations previously had to build themselves.

For teams running containerized workloads on AWS, these features warrant evaluation against existing deployment patterns and infrastructure management practices. The key is understanding your specific requirements around control, cost optimization, and operational complexity to determine which combination of ECS features best serves your needs.

Accelerating Infrastructure as Code Optimization with AI: A Practitioner's Journey with Amazon Q Developer

Timur Galeev — Thu, 28 Aug 2025 22:00:00 GMT

Introduction

I've been working with Infrastructure as Code for the better part of eight years—starting with CloudFormation, migrating teams to Terraform, and lately exploring AWS CDK. Over that time, I've seen platforms grow from a handful of templates to hundreds of modules scattered across dozens of repositories. I've also watched technical debt accumulate: legacy EC2 instance types chosen three years ago, untagged resources, container images piling up in ECR, and NAT gateways draining budgets while sitting mostly idle.

The traditional FinOps workflow—reactive hunting for idle resources using cost optimization hubs and billing alerts—works, but it's exhausting and slow. I wanted to shift left: catch inefficiencies before they hit production, bake cost and security best practices into the templates themselves, and help platform engineers understand inherited code without spending days spelunking through thousands of lines.

This article documents how I've integrated Amazon Q Developer into my IaC workflow—not as a replacement for human judgment, but as a force multiplier. I'll walk through real scenarios, concrete examples, workflow integration, limitations I've encountered, and a practical framework for measuring impact.

The Pain Points of Traditional IaC Management

Before introducing AI assistance, my team faced several recurring bottlenecks:

Legacy comprehension: Inheriting a 2,000-line Terraform module written by someone who left the company two years ago. No README. Cryptic variable names. Comments? Optional, apparently. Understanding what it does, how components interact, and where optimization opportunities exist consumed days of calendar time.

Migration friction: Translating a CloudFormation template to CDK or Terraform—or vice versa—is tedious and error-prone. Even straightforward resources involve syntax mapping, API differences, and validation loops. Multiply that by dozens of modules, and migration projects drag on for quarters.

Review latency: Pull requests with IaC changes sat in queues waiting for someone with enough context to spot that the new RDS instance lacks encryption, or that the NAT gateway could be replaced with VPC endpoints, or that the instance type is three generations old.

Standardization gaps: Every engineer writes modules slightly differently. Some include lifecycle policies; others don't. Tagging strategies diverge. IAM policies are either too permissive or so locked down they break deployments.

Security and cost blind spots: Static analysis tools (tfsec, Checkov) catch obvious mistakes, but they don't suggest improvements. They tell you what's wrong, not what could be better. Cost estimation tools (Infracost) show projected spend, but they don't recommend Graviton instances or Spot for batch workloads.

Onboarding friction: New hires need weeks to become productive with our IaC codebase. The learning curve is steep, and tribal knowledge is poorly documented.

How Amazon Q Developer Fits In

Amazon Q Developer is an AI-powered coding assistant built on over 17 years of AWS cloud experience. It integrates directly into VS Code and JetBrains IDEs, and it provides CLI capabilities for automated transformations. It generates deployment-ready infrastructure code for Terraform, AWS CDK, and CloudFormation.

I use it for:

Code comprehension: Summarizing what a template does, mapping resource dependencies, identifying entry points.
Optimization discovery: Scanning templates for cost, security, and performance improvements aligned with AWS Well-Architected Framework.
IaC transformation: Automated translation between IaC frameworks (Terraform ↔ CDK ↔ CloudFormation) using the four-step process: assess, translate, test and refine, deploy.
Module generation: Creating deployment-ready modules from natural language requirements with built-in AWS best practices.
Pull request reviews: Analyzing diffs, flagging risks, suggesting improvements based on AWS standards.
Custom rule enforcement: Using rule-based automation to encode team standards and ensure consistent, repeatable suggestions.

According to AWS internal testing, Amazon Q's agentic capabilities deliver 10x-50x time savings for legacy IaC remediation compared to manual processes. For VMware network migrations, AWS teams translated configurations for 500 VMs in 1 hour—80 times faster than the traditional 2-week manual approach.

I treat Q as a highly skilled junior engineer: fast, knowledgeable, but requiring validation and context.

End-to-End Workflow Integration

Here's how Amazon Q fits into my current IaC lifecycle:

1. Local Development (VS Code + Amazon Q)

Open a CDK stack or Terraform module
Prompt Q: "Review this file and identify opportunities to optimize for cost efficiency"
Q returns recommendations: instance type downsizing, ECR lifecycle policies, Graviton migration paths, NAT gateway elimination, subnet configuration changes
I validate recommendations against workload requirements, commitments, and architectural constraints
Implement approved changes with Q's assistance (it can write the code inline)

2. Static Analysis

Run Checkov, tfsec, or cfn-lint locally
If violations appear, I prompt Q: "Fix the security issues flagged by Checkov in this file"
Q suggests remediation (e.g., enable encryption, add bucket policies, restrict ingress rules)

3. Policy-as-Code Validation

Apply OPA/Conftest or CloudFormation Guard policies
For failures, I ask Q to explain the policy intent and adjust the template accordingly

Example policy (Rego):

  package terraform.tags

  import rego.v1

  # Reject EC2 instances that have no Environment tag.
  deny contains msg if {
    input.resource_type == "aws_instance"
    not input.tags.Environment
    msg := "Missing required Environment tag"
  }
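The Rego rule above assumes a simplified input with resource_type and tags at the top level. For comparison, here is the same check sketched in Python against the JSON produced by terraform show -json for a plan. The resource_changes / change.after layout is how I understand that format, so verify it against your Terraform version; the function name and sample data are made up.

```python
def missing_environment_tag(plan):
    """Return addresses of aws_instance resources that lack an Environment tag."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_instance":
            continue
        after = (rc.get("change") or {}).get("after")
        if after is None:
            continue  # resource is being destroyed, nothing to check
        tags = after.get("tags") or {}
        if not tags.get("Environment"):
            offenders.append(rc.get("address"))
    return offenders

sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"after": {"tags": {"Environment": "prod"}}}},
        {"address": "aws_instance.batch", "type": "aws_instance",
         "change": {"after": {"tags": {"Team": "data"}}}},
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"after": {"tags": None}}},
    ]
}
print(missing_environment_tag(sample_plan))  # ['aws_instance.batch']
```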

4. Cost Estimation

Run Infracost to project monthly spend
If costs are higher than expected, I prompt Q: "Suggest ways to reduce cost for this infrastructure while maintaining performance"
Q might recommend reserved capacity, Savings Plans eligibility, or Graviton alternatives

5. CI/CD Pipeline Gates

Pre-commit hooks run formatters (terraform fmt, prettier)
GitHub Actions execute tests, static analysis, policy checks, Infracost diff
If checks fail, the pipeline surfaces Q-generated suggestions in PR comments (I've scripted this using Q's API)

6. Pull Request Review

Reviewers use Q to summarize changes: "Explain what this PR changes and flag any cost, security, or operational risks"
Q highlights: "This PR introduces a new RDS instance without backup retention configured. Estimated monthly cost increase: $150. Consider enabling automated backups and Multi-AZ for production."

7. Post-Deployment

Monitor drift using AWS Config or Terraform state comparisons
Track cost variance via Cost Explorer
Feed learnings back into Q prompts for future iterations

Practical Example 1: Generating a Terraform Module from Requirements

Scenario: I need a reusable Terraform module for an S3 bucket with versioning, lifecycle rules, encryption, and least-privilege IAM access.

Prompt:

Create a Terraform module for an S3 bucket with:
- Versioning enabled
- Lifecycle policy: expire non-current versions after 90 days
- AES-256 encryption at rest
- Block public access
- IAM policy allowing read-only access for a Lambda execution role

Q Output (trimmed):

resource "aws_s3_bucket" "main" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_versioning" "main" {
  bucket = aws_s3_bucket.main.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "main" {
  bucket = aws_s3_bucket.main.id
  rule {
    id     = "expire-noncurrent"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket.
    filter {}

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "main" {
  bucket = aws_s3_bucket.main.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "main" {
  bucket                  = aws_s3_bucket.main.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Outcome: In 30 seconds, I have a production-ready module skeleton. I add variables, outputs, and documentation, then commit. Time saved: 45 minutes.
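The trimmed output above leaves out the read-only IAM policy from the prompt. The sketch below shows roughly what such a policy document looks like. It is my own illustration, not Q's output, and the bucket name is a placeholder.

```python
import json

def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document granting read-only access to one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],          # bucket-level action
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],   # object-level action
            },
        ],
    }

print(json.dumps(read_only_bucket_policy("example-bucket"), indent=2))
```

Because the bucket is versioned, a function that needs to read older versions would also need s3:GetObjectVersion.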

Practical Example 2: Refactoring Terraform to CDK

Scenario: A team wants to migrate a Terraform module to CDK (TypeScript) for better IDE support and type safety.

Original Terraform (simplified):

resource "aws_lambda_function" "processor" {
  function_name = "data-processor"
  runtime       = "python3.9"
  handler       = "index.handler"
  role          = aws_iam_role.lambda_role.arn
  filename      = "function.zip"
}

resource "aws_iam_role" "lambda_role" {
  name = "lambda-exec-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "lambda.amazonaws.com"
      }
    }]
  })
}

Prompt:

Translate this Terraform configuration to AWS CDK TypeScript.
Use best practices: managed policies, environment variables, and proper construct hierarchy.

Q Output (trimmed):

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as iam from 'aws-cdk-lib/aws-iam';

export class DataProcessorStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const lambdaRole = new iam.Role(this, 'LambdaRole', {
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSLambdaBasicExecutionRole')
      ]
    });

    const processor = new lambda.Function(this, 'Processor', {
      runtime: lambda.Runtime.PYTHON_3_9,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('function.zip'),
      role: lambdaRole,
      functionName: 'data-processor'
    });
  }
}

Outcome: Migration skeleton in under a minute. I adjust asset paths, add CloudWatch alarms, refine IAM permissions. Time saved per module: 1-2 hours.

Practical Example 3: Pull Request Review with Amazon Q

Scenario: A PR introduces a new ECS service with Fargate tasks. Reviewer uses Q to assess impact.

Prompt:

Review this CDK stack diff. Summarize changes, identify cost implications,
flag security or operational risks.

Q Analysis:

This PR creates:

ECS Fargate service with 2 tasks (2 vCPU, 4GB RAM each)

Application Load Balancer

CloudWatch log group (no retention configured)

Cost estimate: ~$95/month (Fargate) + ~$25/month (ALB) = $120/month

Risks:

Log group has no retention policy → unbounded storage costs

Tasks use x86_64 architecture; Graviton would reduce cost ~20%

No autoscaling configured; manual intervention required under load

ALB security group allows 0.0.0.0/0 ingress on port 443 (verify if intentional)

Recommendations:

Set log retention to 30 days

Evaluate ARM64/Graviton compatibility

Add autoscaling with target tracking

Restrict ALB ingress to known CIDR ranges or CloudFront

Outcome: Reviewer approves with conditions. Author updates the stack. Review cycle time: 15 minutes instead of 2 hours.

Security, Compliance, and Quality Integration

Amazon Q doesn't replace security tooling—it augments it.

IAM Least Privilege

I prompt Q: "Review this IAM policy and restrict to least privilege for a Lambda reading from S3 and writing to DynamoDB."

Q tightens wildcards, removes unnecessary actions, adds conditions for resource tagging.
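Since Q's suggestions still need validation, a mechanical pre-check for wildcards is a cheap companion. This is a minimal sketch with an invented function name. It flags every wildcard it sees, including object-level ones such as bucket/* that are often intentional, so treat the result as a list to triage rather than a verdict.

```python
def find_wildcards(policy):
    """Return (statement index, field, value) for each wildcard in Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]          # a single statement may be a bare object
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = stmt.get(field, [])
            if isinstance(values, str):
                values = [values]          # IAM allows a string or a list here
            findings.extend((index, field, v) for v in values if "*" in v)
    return findings

too_broad = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:*", "dynamodb:PutItem"], "Resource": "*"}
    ],
}
print(find_wildcards(too_broad))
# [(0, 'Action', 's3:*'), (0, 'Resource', '*')]
```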

Secrets Hygiene

Q flags hardcoded credentials or API keys during reviews. I pair this with git-secrets and AWS Secrets Manager integration.

Drift Detection

After deployments, I compare actual infrastructure (via AWS Config or Terraform state) against source templates. If drift occurs, I ask Q: "Why might this resource configuration differ from the template?" It helps hypothesize causes (manual changes, out-of-band automation, CloudFormation stack updates).

Policy-as-Code

I maintain Conftest policies (OPA/Rego) for tagging, encryption, and network segmentation. When policies fail, Q explains the rule intent and suggests compliant configurations.

Cost Guardrails

I integrate Infracost in CI and set thresholds (e.g., no PR increasing monthly cost by >$500 without approval). Q helps identify cost drivers and alternatives.
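A threshold like that can be enforced with a few lines in CI. The sketch assumes the Infracost JSON report exposes the monthly difference as a string field called diffTotalMonthlyCost; check that field name against the output of your Infracost version. The function name is hypothetical.

```python
def cost_gate(infracost_report, threshold=500.0):
    """Return (passed, monthly_diff) for a parsed Infracost JSON report."""
    # Assumed field name; Infracost reports amounts as strings.
    monthly_diff = float(infracost_report.get("diffTotalMonthlyCost") or 0)
    return monthly_diff <= threshold, monthly_diff

passed, diff = cost_gate({"diffTotalMonthlyCost": "620.5"})
if not passed:
    print(f"Monthly cost increases by ${diff:.2f}; approval required")
```

In a pipeline, the report would come from json.load on the file Infracost writes, and a failed gate would exit non-zero or request an extra reviewer.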

Repository Improvement Plan (Prioritized)

If I were assessing a typical IaC codebase today, here's what I'd prioritize:

Add (High Priority):
- Pre-commit hooks: terraform fmt, tflint, Checkov
- Infracost integration in CI
- Basic Conftest policies (tagging, encryption)
- ECR lifecycle policies across all container builds
- Automated README generation (Q can draft from code)
Refactor (Medium Priority):
- Consolidate duplicate modules
- Standardize naming conventions (use Q to generate renaming scripts)
- Migrate legacy instance types to Graviton where compatible
- Replace NAT gateways with VPC endpoints for AWS services
Harden (Medium Priority):
- IAM policy reviews (Q-assisted least privilege tightening)
- Enable Terraform state locking (DynamoDB + S3)
- Add drift detection automation (aws-config or Terraform Cloud)
- Implement environment-specific configurations (dev/stage/prod variants)
Automate (Lower Priority, High Impact):
- Q-generated PR comment summaries (cost/security/drift)
- Automated documentation updates on merge
- Ephemeral preview environments for PRs (using Terraform workspaces or CDK context)
Measure (Ongoing):
- Track PR review time and test coverage improvements
- Iterate on Q prompts based on false positives/negatives
- Refine policy-as-code rules based on team feedback

Skepticism and Limitations

I've hit real limitations with Q:

Hallucinations: Q occasionally invents AWS resource properties that don't exist (e.g., fictional CloudFormation parameters). Always validate against official docs.

Context window: For massive monorepo structures, Q loses context. I work around this by targeting specific files or summarizing first.

Organizational standards: Q doesn't know your company's naming conventions, approved instance families, or compliance requirements unless you explicitly provide them in prompts or customization files.

Noisy recommendations: Q sometimes suggests optimizations that conflict with architectural decisions (e.g., recommending smaller instances when you've standardized on m5.large for operational simplicity). Filtering signal from noise requires domain knowledge.

Overfitting to public examples: Q trained on public repos. If your IaC patterns are highly proprietary or unconventional, its suggestions may miss the mark.

Human validation is non-negotiable: I never merge Q-generated code without review, testing, and static analysis. Treat Q as a draft generator, not a replacement for engineering judgment.

Conclusion

Amazon Q Developer has changed how I work with infrastructure code. It doesn't replace engineering judgment, but it handles the tedious parts—reading legacy code, translating between IaC languages, spotting optimization opportunities, and catching security issues early.

The biggest wins for me have been:

Understanding inherited codebases in minutes instead of days
Generating module skeletons from plain English requirements
Cutting PR review time by helping reviewers quickly understand changes and impacts
Catching cost and security issues before they reach production

The key is treating Q as a tool, not a magic solution. I always validate its suggestions, test changes thoroughly, and integrate it with existing tooling like static analysis and policy checks.

If you're considering trying it, start small: pick one messy legacy file, ask Q to explain it, and see what optimization opportunities it finds. Install the VS Code extension (there's a free tier), experiment with prompts, and adjust based on what works for your workflow.

The goal isn't perfection—it's making IaC work less frustrating and more efficient, one template at a time.

Kiro: Bridging the Gap Between AI Prototyping and Production-Ready Code

Timur Galeev — Sun, 27 Jul 2025 22:00:00 GMT

As DevOps engineers and Cloud Architects, we have all experienced the excitement of using AI coding assistants to rapidly prototype applications. A few prompts later, you have working code. But then reality hits: deploying to production requires documentation, proper architecture decisions, testing strategies, and maintainability considerations that AI-generated prototypes often lack.

The Production Problem

AI coding tools excel at quick prototyping—what some call "vibe coding." However, getting these prototypes production-ready presents challenges:

Undocumented assumptions: The AI made decisions during development, but those choices aren't captured anywhere
Missing requirements clarity: You guided the agent throughout, but fuzzy requirements mean you can't verify if the application truly meets needs
Architecture blindspots: Understanding how the system design affects performance, scalability, and your infrastructure isn't immediately clear
Maintenance difficulties: Without proper documentation and structure, future changes become increasingly complex

Enter Kiro: Spec-Driven Development Meets AI

Kiro is a new agentic IDE :) that tackles these challenges through spec-driven development. Rather than jumping straight to code, Kiro helps you think through decisions systematically while maintaining the speed of AI-assisted development.

Key Features

1. Requirements Specification

Kiro transforms a simple prompt like "Add a review system for products" into detailed user stories with EARS (Easy Approach to Requirements Syntax) acceptance criteria. This makes implicit assumptions explicit, ensuring the AI builds what you actually need—not what it thinks you need.

2. Technical Design Documentation

After requirements approval, Kiro analyzes your codebase and generates comprehensive design documents including:

Data flow diagrams
TypeScript interfaces and type definitions
Database schemas
API endpoint specifications

This eliminates the typical back-and-forth on requirements clarity that slows down development cycles.

3. Task Decomposition and Sequencing

Kiro automatically generates implementation tasks with proper dependency ordering. Each task includes considerations often missed in quick prototypes:

Unit and integration tests
Loading states and error handling
Mobile responsiveness
Accessibility requirements (WCAG compliance)

4. Agent Hooks for Automation

Hooks are event-driven automations that act like an experienced team member catching issues in the background:

Update tests automatically when components change
Refresh API documentation when endpoints are modified
Scan for security issues before commits
Enforce coding standards across the entire team

These hooks are committed to Git alongside the code, so every developer on the team gets the same quality checks.

Why This Matters for DevOps and Cloud Architecture

For those of us managing infrastructure and deployment pipelines, Kiro addresses several pain points:

Infrastructure as Code Compatibility: Spec-driven development aligns naturally with IaC practices. Design documents provide the clarity needed for proper resource planning and cost optimization.

CI/CD Integration: Automated test generation and security scanning hooks integrate seamlessly into existing pipelines, reducing manual review overhead.

Documentation Drift Prevention: Kiro keeps specs synchronized with code changes—solving the eternal problem of outdated documentation that complicates infrastructure modifications.

Team Consistency: When managing multiple services or microservices architectures, enforcing standards through hooks ensures uniform code quality across repositories.

Technical Details

Built on Code OSS (VS Code compatible)
Supports Model Context Protocol (MCP) for specialized tool integration
Works with Open VSX plugins
Available for Mac, Windows, and Linux
Supports most popular programming languages
Free during preview period

The Bigger Picture

While Kiro isn't specifically an AWS or cloud tool, its approach to structured development, automated quality checks, and documentation maintenance addresses fundamental challenges in modern software delivery—challenges that become amplified when deploying to cloud environments where misconfigurations can have immediate cost and security implications.

For DevOps practitioners and cloud architects, Kiro represents a shift from treating AI coding assistants as simple code generators to treating them as collaborative partners in the entire development lifecycle—from requirements gathering through production deployment.

Leveraging AWS Lambda for Modern Business Applications

Timur Galeev — Sat, 08 Mar 2025 09:30:00 GMT

Every few years, a technology shift forces engineers to rethink assumptions they had stopped questioning. Serverless computing is that shift for this decade. And in 2025, AWS Lambda - the service that popularised the model - has matured into something genuinely enterprise-ready: not a weekend experiment, but a production foundation for organisations moving away from on-premises data centres, unwinding Kubernetes complexity, or simply building new products without the overhead that once felt inevitable.

This article is a guide for practitioners. It covers what Lambda is, how it fits into real business operations, how to migrate from on-premises environments and Kubernetes, what AWS has introduced in 2025, and what production deployments actually require. The architecture diagram below illustrates the system we'll build across these sections.

What AWS Lambda Is - and What It Isn't

AWS Lambda is a compute service. You upload code; Lambda runs it. Between invocations, nothing is running and nothing is costing you money. When a request arrives - from an API, a message queue, a database change, a file upload, a scheduled timer - Lambda allocates an execution environment, runs your function, and returns the result.

That description sounds deceptively simple, and it is simple in the best possible way. The complexity that normally lives in the infrastructure layer - operating system maintenance, capacity planning, auto-scaling rules, health checks, rolling deployments, load balancer configuration - disappears. It becomes AWS's problem, not yours.

What Lambda is not is a silver bullet. Functions have a maximum execution time of fifteen minutes. They are stateless by design. They impose cold-start latency when a new execution environment is initialised. These are real constraints, and any honest guide must acknowledge them upfront. But for the enormous class of workloads that fit within them - and in 2025 that class is significantly larger than it was three years ago - Lambda eliminates more operational surface area than any other approach on the market.

The financial model is equally important. Lambda charges per request and for execution duration, metered in one-millisecond increments and scaled by the memory you configure, with no charge for idle time. A business process that runs sporadically - a report that triggers once a night, a webhook that fires when a customer takes a specific action - costs essentially nothing outside of those moments. That changes how organisations think about building software.
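To make the billing model concrete, here is a rough cost sketch in Python. The two rates are assumptions for illustration (approximately the published x86 on-demand figures for us-east-1, with the free tier ignored); check the current AWS pricing page before relying on them. The two workloads are hypothetical.

```python
# Illustrative rates only (assumed): verify against the current AWS Lambda pricing page.
PRICE_PER_MILLION_REQUESTS = 0.20    # USD
PRICE_PER_GB_SECOND = 0.0000166667   # USD, x86


def monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Estimate a month's Lambda bill in USD, ignoring the free tier."""
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND


# Hypothetical nightly report: 30 runs a month, 45 seconds each, 1024 MB.
nightly_report = monthly_cost(invocations=30, avg_duration_ms=45_000, memory_mb=1024)

# Hypothetical webhook: 200,000 calls a month, 120 ms each, 256 MB.
webhook = monthly_cost(invocations=200_000, avg_duration_ms=120, memory_mb=256)

print(f"nightly report: ${nightly_report:.4f} per month")
print(f"webhook:        ${webhook:.4f} per month")
```

Under these assumed rates the nightly report costs a couple of cents a month and the webhook about fourteen cents, which is what "essentially nothing" looks like in practice.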

Lambda as a Business Platform

The most common mistake organisations make when evaluating serverless is treating it as a purely technical decision. It is not. The choice to adopt Lambda is a business decision, because it changes how quickly teams can move, how much they pay for compute, and how much of their engineering time goes toward infrastructure rather than product.

Consider the four fundamental patterns through which Lambda integrates with business operations.

Synchronous processing is the most familiar. An end user submits an order, an API call lands at API Gateway, Lambda runs the validation and persistence logic and returns a response within milliseconds. The user never knows - or cares - that there is no server behind the interaction.

Asynchronous processing handles work that does not need an immediate response. When that same order triggers an invoice, a warehouse notification, and a customer confirmation email, those tasks can be handed off to a queue and processed independently. Each concern becomes its own Lambda function, scaling and failing independently, without blocking the user who placed the order.

Stream processing allows Lambda to sit alongside continuously flowing data - from Kinesis, DynamoDB Streams, or Kafka - and react to it in near real time. Fraud detection, inventory synchronisation, and real-time analytics pipelines are built this way.

Scheduled execution replaces the cron jobs that accumulate on every server in every organisation. A Lambda function triggered by an EventBridge schedule runs exactly when needed and stops. There is no server sitting idle between runs.

These patterns do not depend on industry. A logistics company uses them to track shipments. A financial services firm uses them to process transactions. A healthcare provider uses them to route clinical data. Lambda does not care what the business does; it simply executes the code that represents what the business needs.
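As an illustration of the asynchronous pattern, the sketch below shows a Python handler for an SQS-triggered function. It assumes the event source mapping has partial batch responses (ReportBatchItemFailures) enabled, so that only the failed messages return to the queue. The `send_confirmation_email` helper and the `customer_email` field are hypothetical stand-ins for real business logic.

```python
import json


def send_confirmation_email(order: dict) -> None:
    """Hypothetical placeholder for the real side effect (email, invoice, and so on)."""
    if "customer_email" not in order:
        raise ValueError("order has no customer_email")


def lambda_handler(event, context):
    """Process an SQS batch and report only the failed messages for retry."""
    failures = []
    for record in event.get("Records", []):
        try:
            order = json.loads(record["body"])
            send_confirmation_email(order)
        except Exception as exc:  # one bad message must not fail the whole batch
            print(f"failed to process {record['messageId']}: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    # SQS retries only the listed messages; the rest are deleted from the queue.
    return {"batchItemFailures": failures}
```

Returning the failed message IDs, rather than raising, keeps one malformed order from forcing a retry of every message in the batch.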

The operational advantage compounds over time. When teams are freed from managing servers, they invest that time in the product. In organisations that have made this transition, the recurring observation is not that Lambda is cheaper - though it often is - but that teams ship faster and with more confidence, because the operational blast radius of any individual function is small and well-contained.

Migrating from On-Premises: The Decomposition Problem

Moving an on-premises application to Lambda is not a migration in the traditional sense. You cannot lift and shift a monolith into a Lambda function and expect it to work. Lambda imposes a fundamentally different execution model, and that is precisely where the value comes from - but it also means the migration requires deliberate architectural work.

The practical starting point is decomposition. A monolithic application bundles dozens of distinct capabilities - user authentication, order management, notification dispatch, reporting, data export - into a single deployable artefact. The first task is to draw boundaries around each capability and ask a straightforward question: what does this piece of code actually need to run?

In most cases, the answer reveals how much of the complexity in the original system was infrastructural rather than functional. The authentication module does not need a persistent server; it needs access to a user store and the ability to return a token. The notification service does not need to share a process with order management; it needs to receive an event and send an email. Once you see the capabilities separately, the Lambda model becomes obvious.

The migration strategy that works most reliably is the strangler fig pattern. Rather than rewriting everything at once, you route specific requests away from the on-premises system to new Lambda-based endpoints, while the monolith continues handling everything else. Over weeks or months, the monolith handles less and less, until it can be decommissioned. At no point does the business face a cutover risk.
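The routing decision at the heart of the strangler fig pattern can be sketched in a few lines of Python. This is a simplified illustration, not a production router: the hostnames and path prefixes are hypothetical, and in practice the table would usually live in a reverse proxy, load balancer rule set, or API Gateway configuration rather than in application code.

```python
# Hypothetical backends for illustration.
MONOLITH = "https://legacy.internal.example.com"
SERVERLESS = "https://api.example.com"

# Path prefixes that have already been migrated to Lambda-backed endpoints.
# The list grows as capabilities move; the monolith serves everything else.
MIGRATED_PREFIXES = ["/notifications", "/reports/export"]


def route(path: str) -> str:
    """Return the backend base URL that should serve this request path."""
    for prefix in MIGRATED_PREFIXES:
        if path == prefix or path.startswith(prefix + "/"):
            return SERVERLESS
    return MONOLITH


print(route("/notifications/42"))  # served by the new Lambda endpoint
print(route("/orders/7"))          # still served by the monolith
```

Migrating a capability then means adding one prefix to the list, and rolling it back means removing that prefix again.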

The critical infrastructure decisions during this migration concern state and connections. Lambda functions are stateless; anything that was stored in process memory needs a home. DynamoDB is the natural choice for operational data - it scales with Lambda's concurrency model and has no connection pool limitations. For relational workloads, RDS Proxy provides connection pooling so that Lambda functions do not exhaust database connections under load. Shared caching moves to ElastiCache. Files move to S3.

On the infrastructure side, Terraform makes the new serverless stack reproducible and auditable. A Lambda function, its IAM role, its API Gateway integration, its DynamoDB table, its SQS queue, and its CloudWatch alarms are all declared in code, version-controlled, and deployable to any environment with a single command. This matters for compliance and for team velocity equally.

resource "aws_lambda_function" "order_processor" {
  function_name = "order-processor-prod"
  runtime       = "python3.13"
  handler       = "order_processor.lambda_handler"
  architectures = ["arm64"]
  memory_size   = 512
  timeout       = 30
  role          = aws_iam_role.order_processor.arn
  s3_bucket     = aws_s3_bucket.artefacts.id
  s3_key        = "order-processor/v1.4.2.zip"
  kms_key_arn   = aws_kms_key.lambda.arn # encrypts environment variables at rest

  # SnapStart applies only to published versions, so publish one on every change
  publish = true

  snap_start {
    apply_on = "PublishedVersions"
  }

  environment {
    variables = {
      ORDERS_TABLE = aws_dynamodb_table.orders.name
      LOG_LEVEL    = "WARNING"
    }
  }

  tracing_config {
    mode = "Active"
  }
}

This single declaration defines the runtime, the compute configuration, the encryption posture, the observability mode, and the cold-start optimisation. It is the entire execution contract for this function, expressed in about thirty lines.

Migrating from Kubernetes: When Orchestration Becomes Overhead

Kubernetes solved a genuine problem: how to run containerised workloads reliably at scale. It solved that problem well. It also introduced a layer of operational complexity — cluster management, node pools, pod scheduling, Helm chart maintenance, certificate rotation, ingress configuration — that many teams now find disproportionate to the value it delivers for their specific workloads.

The migration from Kubernetes to Lambda is conceptually clean. A Kubernetes Deployment - a set of pods serving requests - becomes a Lambda function with an alias. The alias provides the same stability guarantee as a Kubernetes service name: a stable endpoint that points to a specific version of the function. A Kubernetes HorizontalPodAutoscaler becomes Lambda's built-in concurrency scaling, which needs no configuration beyond optional concurrency limits. A Kubernetes CronJob becomes an EventBridge Scheduler schedule. A Kubernetes ConfigMap becomes Lambda environment variables backed by SSM Parameter Store.

The more significant shift is architectural. Kubernetes encourages long-running services that handle many requests across their lifetime. Lambda encourages short, discrete handlers that do one thing and stop. This is not a downgrade - it is a clarification. When each function has a single responsibility, testing is simpler, failure is isolated, and deployment is independent. A bug in the notification service cannot affect the order processor, because they are entirely separate functions with separate IAM roles, separate deployment pipelines, and separate scaling characteristics.

The deployment model also improves. Rather than Helm rollouts and pod readiness probes, Lambda uses traffic-shifting aliases. You publish a new version of a function, configure the alias to route ten percent of traffic to it, watch the error metrics, and either promote it to one hundred percent or roll back instantly. This is canary deployment with instant rollback, without any of the infrastructure it normally requires.

resource "aws_lambda_alias" "notification_live" {
  name             = "live"
  function_name    = aws_lambda_function.notification_service.arn
  function_version = aws_lambda_function.notification_service.version

  routing_config {
    additional_version_weights = {
      # 10% to canary version; 90% stays on stable
      "${var.canary_version}" = 0.1
    }
  }
}
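The "watch the error metrics" step can be reduced to a small decision function. The Python sketch below is one possible rule under stated assumptions: the minimum sample size and the error-rate tolerance are arbitrary illustrative values, and the invocation and error counts are assumed to come from CloudWatch metrics for each version.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    invocations: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.invocations if self.invocations else 0.0


def canary_decision(stable: VersionStats, canary: VersionStats,
                    min_invocations: int = 500, tolerance: float = 0.005) -> str:
    """Return 'wait', 'promote', or 'rollback' for a weighted-alias canary."""
    if canary.invocations < min_invocations:
        return "wait"       # not enough canary traffic to judge yet
    if canary.error_rate > stable.error_rate + tolerance:
        return "rollback"   # set the canary weight back to zero
    return "promote"        # point the alias at the canary version


print(canary_decision(VersionStats(9000, 9), VersionStats(1000, 2)))
```

A deployment pipeline would evaluate a rule like this on a timer and then update the alias weights accordingly.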

The honest consideration for teams evaluating this migration is the workload profile. Lambda suits request-driven, event-driven, and scheduled workloads with predictable execution times. If a service runs a compute-heavy job that regularly takes twenty minutes, it belongs on ECS Fargate or a dedicated compute resource. If it handles API requests, processes queue messages, or reacts to events, Lambda is almost certainly a better fit than a Kubernetes deployment — and significantly cheaper to operate.

What AWS Built in 2025

Lambda has been available since 2014. For much of that time, it was an excellent tool for certain workloads with some meaningful limitations. In 2025, those limitations have been significantly reduced. Three developments stand out.

Lambda Managed Instances

The most consequential announcement of late 2025 is Lambda Managed Instances. This feature provides dedicated EC2 compute capacity for Lambda functions — instances that run in your own AWS account, managed entirely by the Lambda service but isolated to your workloads.

The significance is twofold. First, performance. Managed Instances gives you access to the latest processor generations, including Graviton4, with configurable memory-to-CPU ratios and high-bandwidth networking. For compute-intensive or I/O-heavy applications, this is a meaningful improvement over the shared infrastructure of standard Lambda. Second, concurrency. Standard Lambda execution environments handle one request at a time. Managed Instances supports multiple concurrent invocations per execution environment - a model that maps more closely to how traditional servers work and dramatically improves utilisation for workloads with high request rates and short durations.
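The utilisation argument is simple arithmetic. By Little's law, the number of requests in flight equals the arrival rate multiplied by the average duration; dividing by the concurrency each execution environment can sustain gives the number of environments required. The Python sketch below uses hypothetical traffic figures and an assumed per-environment concurrency of ten, and it ignores CPU contention between concurrent requests.

```python
import math


def environments_needed(requests_per_second: float, avg_duration_ms: float,
                        concurrency_per_environment: int = 1) -> int:
    """Estimate the execution environments needed to serve a steady load."""
    in_flight = requests_per_second * avg_duration_ms / 1000  # Little's law
    return math.ceil(in_flight / concurrency_per_environment)


# Hypothetical steady load: 2,000 requests per second at 50 ms each.
standard = environments_needed(2000, 50)  # one request per environment
multi = environments_needed(2000, 50, concurrency_per_environment=10)  # assumed value

print(f"one request per environment:  {standard} environments")
print(f"ten requests per environment: {multi} environments")
```

For short, I/O-bound requests most of each environment's time is spent waiting, which is why sharing an environment across several requests improves utilisation.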

Managed Instances is not for every workload. For sporadic or unpredictable traffic, standard Lambda remains superior because you pay nothing when idle. But for steady, high-volume applications that previously required dedicated servers or managed Kubernetes clusters, Managed Instances closes the gap entirely. A team that was running a microservice on three m6g.xlarge Kubernetes nodes can now run the equivalent Lambda function on a Managed Instances capacity provider and eliminate the cluster management entirely.

resource "aws_lambda_capacity_provider" "production" {
  name = "prod-capacity-provider"

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda.id]
  }

  instance_requirements {
    allowed_instance_types = ["m8g.*", "c8g.*"]

    memory_mib { min = 16384 }
    vcpu_count  { min = 4    }
  }

  scaling_config {
    min_instance_count = 3
    max_instance_count = 30
  }
}

SnapStart for Python and .NET

Cold starts - the latency incurred when Lambda initialises a new execution environment - have been the most persistent complaint about serverless compute. For Java applications, SnapStart has been available since 2022: Lambda takes a snapshot of an initialised execution environment and resumes from it rather than initialising from scratch, reducing cold-start latency from several seconds to, in favourable cases, well under a second.

In late 2024, SnapStart expanded to Python and .NET runtimes (Python 3.12 and later, .NET 8 and later). This is significant because Python is the dominant language for Lambda functions across the industry. Applications that import large libraries at initialisation - machine learning models, complex ORM configurations, SDK clients - now benefit from the same snapshot mechanism. The cold start that previously made latency-sensitive Python APIs difficult to deploy on Lambda is largely removed, provided that initialisation code is written with snapshots in mind.
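Writing initialisation code with snapshots in mind mostly means separating what is safe to freeze from what must be fresh in every environment. The Python sketch below illustrates the idea. It assumes the `snapshot_restore_py` module that the Lambda Python runtime provides for runtime hooks; the `ImportError` fallback is an addition so the file also runs locally, and the lookup table is a stand-in for genuinely expensive setup.

```python
import uuid

try:
    # Provided by the Lambda Python managed runtime; usually absent elsewhere.
    from snapshot_restore_py import register_after_restore
except ImportError:
    def register_after_restore(func):
        """Local stand-in so the module still imports outside Lambda."""
        return func

# Expensive, deterministic setup belongs at module level: with SnapStart it is
# captured in the snapshot and not repeated on restore.
LOOKUP_TABLE = {n: n * n for n in range(100_000)}  # stand-in for heavy imports or model loading

# State that must be unique per environment must not be frozen into the snapshot.
ENVIRONMENT_ID = None


def refresh_environment_id():
    """Regenerate per-environment state; registered to run after each restore."""
    global ENVIRONMENT_ID
    ENVIRONMENT_ID = uuid.uuid4().hex


register_after_restore(refresh_environment_id)


def lambda_handler(event, context):
    if ENVIRONMENT_ID is None:  # no restore hook ran (for example, a local run)
        refresh_environment_id()
    n = int(event.get("n", 0))
    return {"environment_id": ENVIRONMENT_ID, "square": LOOKUP_TABLE.get(n)}
```

The same reasoning applies to random seeds, temporary credentials, and open network connections: anything created before the snapshot is shared by every environment restored from it.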

New Runtimes on Amazon Linux 2023

Every recent Lambda runtime runs on Amazon Linux 2023, replacing the older Amazon Linux 2 baseline that earlier runtime versions still use. The security posture improves, the available system libraries are more current, and performance characteristics are generally better. The practical implication for teams is that upgrading to recent runtime versions - Python 3.13, Python 3.14, Node.js 22, Node.js 24, Java 21, Java 25, .NET 10 - is now straightforwardly worthwhile and should be part of any migration or modernisation plan.

Building for Production: What Actually Matters

There is a gap between a Lambda function that works and one that is ready for production. The gap is not about the code - it is about the surrounding decisions that determine how the function behaves when things go wrong, when load increases unexpectedly, and when an audit requires evidence of what ran and when.

Security starts with IAM

Every Lambda function should have its own IAM role with only the permissions it needs. This is not a theoretical principle; it is a practical defence. If a function's role is compromised, the blast radius is limited to the resources that function needs to access. A function that reads from one DynamoDB table should not have write access to S3. A notification function should not have access to financial records.

Secrets - database passwords, API keys, third-party tokens - belong in AWS Secrets Manager, not in environment variables as plaintext. The pattern is straightforward: fetch the secret once during the cold start, cache it in a module-level variable, and reuse it across invocations. This adds one API call per cold start and eliminates the risk of secrets appearing in logs or environment variable dumps.
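The fetch-once-and-cache pattern can be sketched as follows. This is a minimal Python illustration, assuming the secret is stored as a JSON string; the secret name `prod/orders/db` is hypothetical. The fetch function is injectable so the caching logic can be exercised without AWS access, and `boto3` is imported lazily because it is available in the Lambda Python runtime but not necessarily on a development machine.

```python
import json

_SECRET_CACHE = {}  # module-level: survives across invocations in a warm environment


def _fetch_from_secrets_manager(secret_id: str) -> str:
    """Fetch the raw secret string from AWS Secrets Manager."""
    import boto3  # lazy import: only needed when a real fetch happens
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def get_secret(secret_id: str, fetch=_fetch_from_secrets_manager) -> dict:
    """Return the parsed secret, fetching it at most once per execution environment."""
    if secret_id not in _SECRET_CACHE:
        _SECRET_CACHE[secret_id] = json.loads(fetch(secret_id))
    return _SECRET_CACHE[secret_id]


def lambda_handler(event, context):
    creds = get_secret("prod/orders/db")  # hypothetical secret name
    # Use creds["username"] and creds["password"] to connect; never log them.
    return {"statusCode": 200}
```

One trade-off to be aware of: a cached secret goes stale when it is rotated, so long-lived environments benefit from a time-to-live on the cache or a refetch when authentication fails.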

Customer-managed KMS keys should encrypt both environment variables and deployment packages. AWS has supported CMK encryption for environment variables for years; the ability to encrypt .zip deployment packages with a CMK is newer and closes a compliance gap that affected regulated industries.

resource "aws_iam_role" "order_processor" {
  name = "order-processor-role-prod"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

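resource "aws_iam_role_policy" "order_processor" {
  role = aws_iam_role.order_processor.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:Query"]
        Resource = [aws_dynamodb_table.orders.arn]
      },
      {
        Effect   = "Allow"
        Action   = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource = [aws_sqs_queue.orders.arn]
      },
      {
        # Write to the function's own log group (assumes the group already exists
        # under the default /aws/lambda/<function_name> naming)
        Effect   = "Allow"
        Action   = ["logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = ["arn:aws:logs:*:*:log-group:/aws/lambda/order-processor-prod:*"]
      },
      {
        # Needed because the function enables active X-Ray tracing
        Effect   = "Allow"
        Action   = ["xray:PutTraceSegments", "xray:PutTelemetryRecords"]
        Resource = ["*"]
      }
    ]
  })
}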

Performance is a configuration decision

Lambda allocates CPU proportionally to memory. A function configured with 128 MB receives a fraction of a vCPU. A function configured with 1,769 MB receives the equivalent of one full vCPU. For CPU-bound workloads, increasing memory is the only way to increase compute - and it often reduces total execution time enough to lower cost despite the higher per-millisecond rate.
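The trade-off is easy to check with arithmetic. The Python sketch below treats the CPU share as a linear function of memory, which is an approximation, and uses an assumed illustrative x86 rate. The two durations are hypothetical measurements of the same CPU-bound function at two memory settings, not benchmark results.

```python
MB_PER_VCPU = 1769                  # memory setting that yields one vCPU equivalent
PRICE_PER_GB_SECOND = 0.0000166667  # assumed illustrative x86 rate in USD


def approx_vcpus(memory_mb: int) -> float:
    """Rough vCPU share for a memory setting (linear approximation)."""
    return memory_mb / MB_PER_VCPU


def duration_cost(memory_mb: int, duration_ms: float) -> float:
    """Duration charge in USD for a single invocation."""
    return (memory_mb / 1024) * (duration_ms / 1000) * PRICE_PER_GB_SECOND


# Hypothetical measurements of one CPU-bound function.
small = duration_cost(memory_mb=512, duration_ms=1000)
large = duration_cost(memory_mb=1024, duration_ms=480)

print(f"512 MB:  {approx_vcpus(512):.2f} vCPU, ${small:.8f} per invocation")
print(f"1024 MB: {approx_vcpus(1024):.2f} vCPU, ${large:.8f} per invocation")
```

In this example, doubling the memory slightly more than halves the duration, so each invocation is both faster and marginally cheaper. When the speed-up is less than proportional, the larger setting buys latency at a higher cost, which is why the setting is worth measuring rather than guessing.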

The arm64 architecture - AWS Graviton - is priced roughly twenty percent lower per GB-second than x86, which translates into better price-performance for most workloads. It is available for all supported runtimes and requires only a one-line change to a Terraform resource, provided that any native dependencies in the deployment package are built for arm64. For any function that will run at meaningful volume, it is worth enabling.

Provisioned Concurrency removes cold starts for the requests it serves by keeping a specified number of execution environments initialised and ready; traffic beyond that number falls back to on-demand environments. It is billed continuously, so it makes sense for latency-sensitive endpoints that receive consistent traffic throughout the day - not for sporadic background functions.

Observability is not optional

AWS X-Ray provides distributed tracing for Lambda functions at the cost of two lines of configuration. It captures the time spent in each segment of a function's execution, traces calls to downstream services, and surfaces them in a visual map that makes debugging distributed systems tractable.

CloudWatch Logs Insights allows structured queries against function logs without any additional infrastructure. Alarms on error rate and p99 duration should be standard in any production deployment - they provide the earliest signal that something has changed in a function's behaviour.

The combination of X-Ray traces, structured logs, and CloudWatch metrics gives operations teams the visibility they need to diagnose problems without access to a server. This is often cited as a concern about serverless - that it is a black box. In practice, a well-instrumented Lambda function is more observable than most self-managed server deployments.
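Structured logs are what make those queries practical. The Python sketch below emits one JSON object per log line using only the standard library. The field names (`order_id`, `duration_ms`) and the `extra={"fields": ...}` convention are choices made for this illustration, not a Lambda requirement.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra fields are passed as logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("order_processor")
logger.setLevel(logging.INFO)
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(JsonFormatter())
logger.addHandler(stream_handler)
logger.propagate = False  # avoid duplicate lines from the root logger


def lambda_handler(event, context):
    started = time.perf_counter()
    order_id = event.get("order_id", "unknown")
    # ... business logic would run here ...
    duration_ms = round((time.perf_counter() - started) * 1000, 2)
    logger.info("order processed", extra={"fields": {"order_id": order_id, "duration_ms": duration_ms}})
    return {"order_id": order_id}
```

Because CloudWatch Logs Insights discovers JSON fields automatically, a query such as `filter level = "ERROR" | stats count(*) by bin(5m)` can then work on named fields instead of parsing free text.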

The Practical Path Forward

Lambda in 2025 is not a niche tool for simple workloads. It is a production compute platform that handles the API backends, event processing pipelines, scheduled jobs, and notification systems of organisations across industries. The operational model — no servers to manage, no idle costs, automatic scaling — has moved from aspiration to reliable reality.

For teams considering migration from on-premises, the strangler fig pattern provides a low-risk path. Begin with the functions that have the clearest boundaries and the most predictable execution times. Prove the model on a real workload. Then expand.

For teams running Kubernetes, the question is not whether Lambda can replace it but which workloads are genuinely better served by the serverless model. In most organisations, the majority of services are candidates. The ones that are not — long-running jobs, stateful streaming computations, workloads with very specific hardware requirements — can coexist on Fargate or dedicated compute alongside the serverless functions that surround them.

Lambda Managed Instances changes the calculus further. For the high-volume, steady-state workloads that previously justified Kubernetes clusters, there is now a serverless alternative that provides dedicated compute without the operational overhead. The threshold at which Kubernetes makes more sense than Lambda has moved upward significantly.

The teams that are moving fastest in 2025 are not the ones with the most sophisticated infrastructure. They are the ones that have reduced infrastructure complexity to the point where engineering effort goes almost entirely into the product. Lambda, used well, is how you get there.

How AWS, AI, and Bedrock Transformed Businesses in 2025

Timur Galeev — Sat, 15 Feb 2025 09:30:00 GMT

Introduction

The year 2025 marked a turning point for enterprise technology. Amazon Web Services, artificial intelligence tooling, and Amazon Bedrock — the fully managed service for building generative AI applications — converged into a force that fundamentally reshaped how businesses operate, compete, and innovate. From healthcare diagnostics to financial fraud detection, from supply-chain optimization to hyper-personalized retail, the combination of cloud scale, machine learning maturity, and accessible generative AI rewrote what is possible for organizations of every size.

This article examines that transformation: what changed, how businesses responded, and where the next wave is heading.

1. AWS Evolution: A Cloud That Thinks

Smarter Infrastructure

AWS entered 2025 with a portfolio that had grown well beyond raw compute and storage. The introduction of AWS Trainium2 chips and expanded Inferentia2 capacity slashed model training and inference costs by up to 40% compared to GPU equivalents, making AI workloads economically viable for mid-market companies that previously could not justify the investment.

Amazon Aurora Limitless Database matured from preview into general availability, letting transactional applications scale to millions of write transactions per second without manual sharding. Combined with Amazon MemoryDB's vector-search capabilities, businesses could now run real-time personalization engines directly inside their data tier.

Security and Governance at Scale

AWS re:Invent 2024 previews became 2025 production features: Amazon GuardDuty Extended Threat Detection added AI-driven behavioral baselines that reduced false-positive alerts by 60%, and AWS IAM Access Analyzer gained automated remediation — surfacing an over-permissive role and proposing a least-privilege replacement in the same workflow.

Edge and Hybrid Expansion

The newest generation of AWS Outposts hardware brought cloud-native APIs to factory floors and hospital data centers with sub-5ms latency to local systems, eliminating the last architectural excuse for keeping sensitive workloads disconnected from modern tooling.

2. AI Integration: Automation That Actually Delivers

Automating the Mundane, Augmenting the Complex

AI in 2025 stopped being a pilot project and became an operational assumption. Businesses across sectors embedded AI at every layer of the value chain:

Healthcare: Hospital networks deployed AI triage assistants trained on clinical notes and imaging data. These systems flagged high-risk patients in emergency queues with 94% accuracy, freeing physicians to focus on diagnosis. Drug discovery pipelines using AWS HealthLake and SageMaker reduced the pre-clinical research phase from years to months by predicting molecular binding affinity with foundation models.

Financial Services: Real-time fraud detection moved from rule-based systems to gradient-boosting and transformer models running on SageMaker endpoints. Transaction approval latency dropped to under 20 milliseconds while false-positive rates fell 35%, reducing the friction that drove cart abandonment in e-commerce. Investment banks used AI copilots built on Bedrock to summarize earnings calls, generate risk memos, and draft client communications - tasks that previously consumed analyst hours.

E-Commerce and Retail: Dynamic pricing engines, demand forecasting, and AI-generated product descriptions became table stakes. Retailers using Amazon Personalize with Bedrock-generated copy saw a 22% average uplift in conversion rates compared to static merchandising.

Manufacturing: Computer-vision models on AWS Panorama inspected production lines at speeds no human team could match, catching micro-defects in semiconductors, auto parts, and packaged goods. Predictive maintenance models trained on IoT sensor streams cut unplanned downtime by 30% at several large automotive plants.

The Rise of Agentic Workflows

Perhaps the most significant shift was the move from single-inference AI calls to agentic workflows - chains of AI actions that plan, execute, observe, and self-correct. Using AWS Step Functions to orchestrate Bedrock agents, companies built autonomous processes that could research a topic, draft a document, route it for approval, and publish it to internal knowledge bases without human intervention at any step.

3. Amazon Bedrock: The Generative AI Accelerant

What Bedrock Changed

When Amazon Bedrock launched, it solved the hardest enterprise problem in generative AI: how do you use frontier models without your proprietary data leaving your security perimeter? Bedrock's fully managed architecture, reachable privately through VPC endpoints, meant a bank could query Claude, Titan, Llama, or Mistral without its prompts or data being shared with model providers or used to train the underlying models. That single guarantee unlocked adoption in regulated industries that had been watching from the sidelines.

Model Choice as a Strategic Asset

By early 2025, Bedrock's model catalog spanned more than 30 models from Anthropic, Meta, Cohere, Stability AI, and Amazon itself. Businesses stopped thinking of "the AI model" as a monolith and started treating model selection as a cost-performance decision:

Use Case	Preferred Model Tier	Why
Complex reasoning, legal review	Claude 3 Opus / Sonnet	Highest accuracy
Customer-facing chat	Claude Haiku, Llama 3	Latency + cost
Embeddings and search	Amazon Titan Embeddings	Native AWS integration
Image generation	Stability AI SDXL	Brand-quality visuals

Knowledge Bases and RAG at Scale

Bedrock Knowledge Bases with native OpenSearch Serverless integration made retrieval-augmented generation (RAG) a two-hour setup rather than a multi-sprint engineering project. Companies ingested internal documentation, legal contracts, and product catalogs, then gave employees natural-language interfaces to query terabytes of institutional knowledge instantly.

Guardrails: Enterprise Trust Layer

Amazon Bedrock Guardrails — with configurable content filters, PII redaction, grounding checks, and topic denial lists — gave compliance teams the controls they needed to sign off on production AI deployments. By Q1 2025, over 60% of Fortune 500 companies had at least one Bedrock workload in production, up from 18% eighteen months earlier.

Agents for Bedrock

Agents for Bedrock enabled multi-step task execution: an agent could call an API, read a database, invoke a Lambda function, and compose a response — all from a single natural language instruction. Customer service organizations used these agents to handle refund requests end-to-end, pulling order records, evaluating return eligibility against policy, and issuing credits without a human in the loop.

4. Impact on Business Models

From Cost Center to Revenue Driver

IT departments that once justified cloud spend by eliminating data center CAPEX now justified it by directly attributing revenue to AI-powered features. Product teams and engineering organizations merged around shared AI platforms, dissolving the traditional boundary between "technology" and "business."

The Platform Economy Deepens

Software vendors rebuilt their platforms on Bedrock, offering AI capabilities as part of standard subscriptions rather than premium add-ons. CRMs, ERPs, HR systems, and analytics tools all embedded generative AI natively, raising the floor of what customers expected from enterprise software.

Workforce Transformation

Roles did not disappear en masse — but they mutated. Data entry, boilerplate coding, first-line customer support, and report generation shifted from human tasks to AI tasks. Workers redeployed to prompt engineering, AI output review, exception handling, and higher-judgment decision-making. Companies that invested in reskilling outperformed peers that treated workforce transformation as a cost-cutting exercise.

Speed as Competitive Moat

The defining advantage of 2025 was not data, models, or even talent — it was speed of iteration. Teams using Bedrock, SageMaker, and serverless infrastructure could go from idea to production feature in days. That cycle-time advantage compounded: businesses that shipped faster learned faster, and learning faster meant shipping better.

5. Real-World Examples

Pfizer: Accelerating Drug Discovery

Pfizer integrated Amazon Bedrock with its internal research data lake, enabling scientists to query clinical trial results in natural language. Bedrock agents summarized literature, flagged contradictory studies, and proposed experimental hypotheses. The company reported a 30% reduction in pre-clinical research timelines for two oncology programs.

Klarna: AI-First Customer Service

The Swedish fintech replaced a significant portion of its support queue with a Bedrock-powered agent integrated with its transaction systems. The agent resolved 70% of inquiries without escalation, handled 23 languages, and reduced average resolution time from 11 minutes to under 2 minutes — while maintaining higher customer satisfaction scores than the previous model.

Siemens Energy: Predictive Maintenance at Scale

Siemens deployed AWS IoT services and SageMaker to monitor 40,000+ turbine sensors across global wind farms. Bedrock-powered natural language dashboards let operations engineers query anomalies in plain English. Unplanned outages dropped 28% year-over-year, translating to tens of millions in avoided losses.

Duolingo: Hyper-Personalized Learning

Duolingo rebuilt its adaptive learning engine on Bedrock, generating custom lesson content, explanations, and conversational practice scenarios tailored to each learner's error patterns and progress. Retention at the 30-day mark improved 18%, and the team shipped the feature in six weeks — a project that would previously have taken two quarters.

6. Looking Forward: 2026 and Beyond

Multimodal Becomes Standard

Text-only AI is already a legacy constraint. By 2026, enterprise applications will routinely combine text, image, audio, and video understanding in single workflows. AWS is investing heavily in multimodal foundation models and Bedrock will serve as the unified interface.

Autonomous AI Agents Go Mainstream

The agentic patterns pioneered by early adopters in 2025 will become default architectures. AI agents will own entire business processes — monitoring, deciding, executing, and reporting — with humans setting goals and reviewing exceptions rather than approving each step.

Cost Optimization Pressure Creates New Patterns

As AI spend scales, FinOps for AI becomes a discipline in its own right. Expect model distillation, caching, and tiered inference routing (choosing the cheapest model that meets a quality threshold) to become standard engineering practices, enabled by tools built directly into Bedrock.
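Tiered inference routing is simple to express in code. The Python sketch below picks the cheapest model whose quality score clears a threshold. The model names, prices, and scores are placeholders invented for illustration, not real Bedrock pricing or benchmark results; in practice the scores would come from an evaluation set specific to the task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # assumed, illustrative numbers
    quality_score: float       # for example, pass rate on an internal evaluation set (0 to 1)


# Hypothetical catalogue: placeholder names and numbers only.
CATALOGUE = [
    ModelTier("small-fast", 0.0003, 0.78),
    ModelTier("mid-balanced", 0.003, 0.88),
    ModelTier("large-frontier", 0.015, 0.95),
]


def cheapest_model(min_quality: float, catalogue=CATALOGUE) -> ModelTier:
    """Pick the lowest-cost model whose quality score meets the threshold."""
    eligible = [m for m in catalogue if m.quality_score >= min_quality]
    if not eligible:
        raise ValueError(f"no model meets quality threshold {min_quality}")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


print(cheapest_model(0.85).name)
```

Each use case then declares the quality it needs, and the router keeps spend as low as that requirement allows.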

Regulatory Frameworks Mature

Enforcement of the EU AI Act, evolving US federal guidance, and sector-specific rules (FDA digital health, SEC AI disclosures) will reshape how businesses deploy AI. AWS compliance tooling - Bedrock Guardrails, audit logging, model cards - will become contractual requirements rather than optional features.

Vertical AI Clouds Emerge

AWS is building deep specializations: AWS for Healthcare, AWS for Financial Services, AWS for Manufacturing — each with pre-configured compliance postures, domain-specific foundation models, and reference architectures. Businesses in these sectors will adopt vertical cloud stacks as the fastest path to compliant, production-ready AI.

Conclusion

2025 was the year generative AI moved from experimentation to execution. AWS provided the infrastructure, Amazon Bedrock removed the integration friction, and AI models supplied the intelligence. Together, they gave businesses a toolkit to automate work that previously required human cognition, to personalize at scales previously impossible, and to iterate at speeds previously unimaginable.

The companies that thrived were not necessarily the largest or the most technically sophisticated — they were the most willing to redesign their processes around new capabilities rather than bolt AI onto old ones. That willingness to reimagine, backed by the scale and reliability of AWS, is the defining business advantage of our moment.

The next chapter begins now.

Building an AI-Optimized Platform on Amazon EKS with NVIDIA NIM and OpenAI Models

Timur Galeev — Wed, 18 Dec 2024 21:08:37 GMT

Introduction

The rise of artificial intelligence (AI) has brought about an unprecedented demand for infrastructure that can handle large-scale computations, support GPU acceleration, and provide scalable, flexible management of workloads. Kubernetes has emerged as a leading platform for orchestrating these workloads, and Amazon Elastic Kubernetes Service (EKS) extends Kubernetes’ capabilities by simplifying deployment and scaling in the cloud.

NVIDIA NIM (NVIDIA Inference Microservices) complements Kubernetes by packaging GPU-optimized model serving as containers, a critical need for running large language models (LLMs), computer vision, and other computationally intensive AI tasks. Additionally, OpenAI models can be integrated into this ecosystem to unlock cutting-edge AI capabilities, such as text generation, image recognition, and decision-making systems.

This article provides an in-depth guide to building a complete AI platform using EKS, NVIDIA NIM, and OpenAI models, with Terraform automating the deployment. Whether you are an AI researcher or a business looking to adopt AI, this guide outlines how to build a robust and scalable platform. The complete code for this setup is available on GitHub: https://github.com/timurgaleev/eks-nim-llm-openai.

Why Choose NVIDIA NIM and EKS for AI Workloads?

Challenges of AI Workloads

AI applications, especially those involving LLMs, have unique challenges:

GPU Resource Management: Training and inference rely on GPUs, which are scarce and expensive resources. Efficient allocation and monitoring are crucial.
Scalability: AI workloads often need to scale dynamically based on user demand or data processing requirements.
Storage for Large Datasets: AI models and datasets can require hundreds of gigabytes, necessitating persistent, shared, and scalable storage.
Observability: Monitoring system performance, especially GPU utilization and latency, is essential for optimizing workloads.

NVIDIA NIM: A Solution for GPU Workloads

NVIDIA NIM addresses these challenges by providing:

GPU Scheduling: Maximizes GPU usage across workloads.
Integration with Kubernetes: Leverages Kubernetes to manage pods, jobs, and resources efficiently.
AI Model Management: Simplifies deployment and scaling of AI models with Helm charts and Kubernetes CRDs (Custom Resource Definitions).
Support for Persistent Storage: Integrates with shared storage solutions like AWS EFS for storing datasets and models.

Amazon EKS: A Scalable Kubernetes Solution

Amazon EKS adds value by:

Managed Kubernetes: Reduces operational overhead by handling Kubernetes cluster setup, updates, and management.
Elastic Compute Integration: Dynamically provisions GPU-enabled instances, such as g4dn and p4d, to handle AI workloads. Ensure that your AWS account has sufficient quotas and availability for these instance types to avoid provisioning issues.
Built-in Security: Integrates with AWS IAM and VPC for secure access and network segmentation.

Together, NVIDIA NIM and Amazon EKS create a powerful platform for AI model training, inference, and experimentation.

Architecture Overview

The platform architecture integrates NVIDIA NIM and OpenAI models into an EKS cluster, combining compute, storage, and monitoring components.

Key Components

EKS Cluster: Manages Kubernetes workloads and scales GPU-enabled nodes.
Karpenter: Dynamically provisions and scales nodes (CPU and GPU) based on workload demands, optimizing resource utilization and cost.
GPU Node Groups: Nodes equipped with NVIDIA GPUs for ML and AI inference tasks.
NVIDIA NIM: Deploys GPU workloads, manages AI pipelines, and integrates with Kubernetes.
OpenAI Web UI: Provides a user-friendly interface for interacting with AI models.
Persistent Storage: AWS EFS supports shared storage for datasets and models.
Observability Tools: Prometheus and Grafana offer real-time monitoring of system metrics, including GPU utilization and pod performance.

Deployment Guide

This guide provides step-by-step instructions to deploy the architecture using Terraform. While the focus is on essential components like EKS, GPU workloads, and observability, we skip detailed VPC configuration to allow flexibility based on your specific requirements.

For a VPC example that fits this deployment, refer to the repository: https://github.com/timurgaleev/eks-nim-llm-openai.

Step 1: Provisioning the EKS Cluster

Provisioning an Amazon EKS cluster is the foundation for Kubernetes workloads. Below is the EKS cluster configuration, followed by the key points on scalability, system add-ons, and Karpenter integration.

EKS Cluster Configuration

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.15"

  cluster_name                   = local.name
  cluster_version                = var.eks_cluster_version
  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = compact([
    for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) :
    substr(cidr_block, 0, 4) == "100." ? subnet_id : null
  ])

  manage_aws_auth_configmap = true
  aws_auth_roles = [
    {
      rolearn  = module.eks_blueprints_addons.karpenter.node_iam_role_arn
      username = "system:node:{{EC2PrivateDNSName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes"
      ]
    }
  ]

  eks_managed_node_group_defaults = {
    iam_role_additional_policies = {
      AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    }
    ebs_optimized = true
    block_device_mappings = {
      xvda = {
        device_name = "/dev/xvda"
        ebs = {
          volume_size = 100
          volume_type = "gp3"
        }
      }
    }
  }

  eks_managed_node_groups = {
    core_node_group = {
      name            = "core-node-group"
      description     = "EKS Core node group for hosting system add-ons"
      subnet_ids      = compact([
        for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) :
        substr(cidr_block, 0, 4) == "100." ? subnet_id : null
      ])
      ami_type        = "AL2_x86_64"
      instance_types  = ["m5.xlarge"]
      capacity_type   = "SPOT"
      desired_size    = 2
      min_size        = 2
      max_size        = 4
      labels = {
        WorkerType    = "SPOT"
        NodeGroupType = "core"
      }
      tags = merge(local.tags, { Name = "core-node-grp" })
    }
  }
}

Key Highlights

Networking:
- Subnets are filtered to those whose CIDR block starts with "100.", so the cluster and its nodes are placed only in those subnets.
IAM and Auth:
- The aws_auth_roles block maps the Karpenter node IAM role to the system:bootstrappers and system:nodes groups, so that nodes launched by Karpenter are allowed to join the cluster.
Managed Node Groups:
- Core Node Group:
  - Optimized for system-level workloads.
  - Configured with m5.xlarge spot instances for cost efficiency.
  - Labels such as NodeGroupType: core and taints can be used to restrict workloads to this node group.
Storage:
- Nodes are configured with gp3 root volumes (100 GiB) for system usage. Additional storage for workloads should be configured separately.
Scaling:
- Use Karpenter for workload-based scaling instead of additional managed node groups. The eks_managed_node_groups block here is only for critical system workloads.
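
The subnet filter is easier to reason about outside Terraform. The sketch below mirrors the expression in Python. The subnet IDs and CIDRs are made-up values, not outputs of the real VPC module. It also shows a stricter variant that checks network membership instead of a string prefix, assuming the intended range is 100.64.0.0/10 (an assumption on my part; the configuration itself only tests the "100." prefix).

```python
import ipaddress

# Hypothetical subnet IDs and CIDRs; the real values come from module.vpc outputs.
PRIVATE_SUBNETS = {
    "subnet-aaa": "10.1.0.0/24",
    "subnet-bbb": "100.64.0.0/18",
    "subnet-ccc": "100.64.64.0/18",
    "subnet-ddd": "10.1.1.0/24",
}


def filter_by_prefix(subnets, prefix="100."):
    """Mirror of the Terraform expression: keep IDs whose CIDR string starts with the prefix."""
    return [sid for sid, cidr in subnets.items() if cidr[: len(prefix)] == prefix]


def filter_by_range(subnets, parent="100.64.0.0/10"):
    """Stricter variant: keep IDs whose CIDR lies inside the parent network."""
    parent_net = ipaddress.ip_network(parent)
    return [
        sid
        for sid, cidr in subnets.items()
        if ipaddress.ip_network(cidr).subnet_of(parent_net)
    ]


if __name__ == "__main__":
    print(filter_by_prefix(PRIVATE_SUBNETS))
    print(filter_by_range(PRIVATE_SUBNETS))
```

The two functions agree on the sample data. They differ for a CIDR such as 100.0.0.0/24, which passes the prefix test but is outside 100.64.0.0/10.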

Step 2: Deploying NVIDIA NIM for AI Workloads

Deploying NVIDIA NIM (NVIDIA Inference Microservices) requires configuring persistent storage for large datasets and allocating GPU resources for optimal performance. Here's an expanded guide breaking down the essential steps.

1. Persistent Storage with AWS EFS

AI workloads often require storage that exceeds local node capacity. AWS EFS (Elastic File System) provides a shared and scalable storage solution across multiple pods. Below is the configuration for creating a Persistent Volume Claim (PVC) backed by EFS:

Code: Persistent Volume Claim (PVC)

resource "kubernetes_persistent_volume_claim_v1" "efs_pvc" {
  metadata {
    name      = "efs-storage"
    namespace = "nim"
  }
  spec {
    access_modes       = ["ReadWriteMany"] # Enables sharing storage across multiple pods.
    storage_class_name = "efs"             # Links the PVC to an EFS storage class.
    resources {
      requests = {
        storage = "200Gi" # Reserves 200 GiB of scalable storage.
      }
    }
  }
}

Key Points:

Access Mode: "ReadWriteMany" allows simultaneous access by multiple pods, critical for parallel workloads.
Storage Class: Must correspond to an EFS provisioner configured in the Kubernetes cluster.
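Capacity: Kubernetes requires a storage request, so 200 GiB is declared as a starting point. EFS itself is elastic, and with the EFS CSI driver the requested size is not enforced as a hard limit, so treat the number as a planning figure and adjust it to your dataset requirements.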
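
As a planning aid, the "200Gi" request can be compared against the artifacts you expect to store. The sketch below is only illustrative: the model sizes in the usage note are hypothetical, and the parser handles just integer values with binary suffixes.

```python
# Binary suffixes used by Kubernetes resource quantities (powers of 1024).
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity):
    """Convert a quantity such as "200Gi" to bytes.

    Only integer values with the binary suffixes above are handled in this sketch.
    """
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


def fits_in_claim(artifact_sizes_gib, claim="200Gi"):
    """Check whether artifacts (sizes in GiB) fit within the declared claim size."""
    needed_bytes = sum(artifact_sizes_gib) * 2**30
    return needed_bytes <= quantity_to_bytes(claim)
```

For example, fits_in_claim([16, 140]) is true for the default claim, while fits_in_claim([150, 80]) is not.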

2. Deploying NVIDIA NIM Helm Chart

After configuring storage, deploy NVIDIA NIM using Helm. The Helm chart simplifies GPU allocation and links the persistent storage to NIM-managed workloads.

Configure the NGC API Key

Before deploying NVIDIA NIM, you need to retrieve your NGC API Key from NVIDIA’s cloud platform and set it as an environment variable. This key enables secure authentication with NVIDIA’s container registry and services.

Steps to Retrieve the NGC API Key:

Log in to your NGC account.
Navigate to Setup > API Keys.
Click Generate API Key if you don’t already have one.
Copy the generated key to use in your deployment process.

Set the NGC API Key as an Environment Variable:

Run the following command in your terminal to make the key accessible to Terraform during deployment:

export TF_VAR_ngc_api_key=<your-ngc-api-key>

Replace <your-ngc-api-key> with your actual API key. Terraform reads TF_VAR_ngc_api_key as the value of var.ngc_api_key, which is then passed to NVIDIA NIM to enable model deployment.
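
A forgotten export is a common cause of a failed or interactive terraform apply. The following is a minimal pre-flight sketch that relies on Terraform's TF_VAR_<name> convention; the function name and the list of required variables are my own, not part of the repository.

```python
import os


def missing_tf_vars(required, environ=None):
    """Return the Terraform variable names that have no non-empty TF_VAR_<name> value."""
    env = os.environ if environ is None else environ
    return [
        name for name in required if not env.get(f"TF_VAR_{name}", "").strip()
    ]


if __name__ == "__main__":
    missing = missing_tf_vars(["ngc_api_key"])
    if missing:
        print("Missing:", ", ".join(f"TF_VAR_{name}" for name in missing))
    else:
        print("All required TF_VAR_ variables are set.")
```

Running it before install.sh gives an early, readable message instead of a prompt or a late failure.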

Code: Helm Release for NVIDIA NIM

resource "helm_release" "nim_llm" {
  name      = "nim-llm"
  chart     = "./nim-llm"                # Points to the NIM Helm chart location.
  namespace = "nim"
  values = [
    templatefile("nim-llm-values.yaml", {
      model_id    = var.model_id            # Specifies the LLM model (e.g., GPT-like models).
      num_gpu     = var.num_gpu             # Allocates GPU resources for inference tasks.
      ngc_api_key = var.ngc_api_key
      pvc_name    = kubernetes_persistent_volume_claim_v1.efs_pvc.metadata[0].name
    })
  ]
}

Key Points:

model_id: The identifier of the model being deployed, as listed in the NVIDIA NGC catalog (for example, meta/llama3-8b-instruct).
num_gpu: Configures GPU resources for inference tasks. The value should align with the instance type used in your cluster (e.g., g4dn.xlarge for one GPU).
pvc_name: Links the EFS-backed PVC to the workload for storing large datasets or models.
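
To make the num_gpu point concrete, here is a small lookup sketch. The GPU counts are taken from the public EC2 instance specifications; the table is intentionally short, and the helper names are hypothetical.

```python
# GPUs per instance type, from the public EC2 instance specifications.
# Deliberately incomplete: extend it with the types your cluster allows.
GPUS_PER_INSTANCE = {
    "g4dn.xlarge": 1,
    "g4dn.12xlarge": 4,
    "g5.xlarge": 1,
    "g5.2xlarge": 1,
    "g5.12xlarge": 4,
    "g5.48xlarge": 8,
    "p4d.24xlarge": 8,
    "p5.48xlarge": 8,
}


def fits(instance_type, num_gpu):
    """True if a single node of this type can satisfy a pod requesting num_gpu GPUs."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"unknown instance type: {instance_type}")
    return GPUS_PER_INSTANCE[instance_type] >= num_gpu


def candidates(num_gpu):
    """Instance types from the table that can host num_gpu GPUs on one node."""
    return sorted(t for t, gpus in GPUS_PER_INSTANCE.items() if gpus >= num_gpu)
```

A value of num_gpu = 4, for instance, rules out every single-GPU type, so the node pool must allow at least one of the larger sizes or the pod stays pending.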

3. Configuration Highlights

Why Persistent Storage?

AI models and datasets are often larger than the node's local storage. Using EFS ensures:
- Scalability: Adjust storage as required without downtime.
- High Availability: Accessible across multiple Availability Zones.

GPU Allocation

NVIDIA NIM optimizes GPU usage for inference. Use the num_gpu variable to specify the number of GPUs for your workload, ensuring efficient resource utilization.

Summary

Storage Configuration: Use AWS EFS with Kubernetes PVC for shared, scalable storage across pods.
GPU Allocation: NVIDIA NIM enables efficient GPU resource management for AI inference tasks.
Helm Chart Deployment: Leverage Helm for streamlined deployment, linking GPU resources and persistent storage.

Step 3: Adding OpenAI Web UI

The OpenAI Web UI, installed here from the Open WebUI Helm chart, provides a browser interface for users to interact with deployed AI models.

resource "helm_release" "openai_webui" {
  name       = "openai-webui"
  chart      = "open-webui"
  repository = "https://helm.openwebui.com/"
  namespace  = "openai-webui"
  values = [
    jsonencode({
      replicaCount = 1,
      image = {
        repository = "ghcr.io/open-webui/open-webui"
        tag        = "main"
      }
    })
  ]
}

Step 4: Observability with Prometheus, Grafana, and Custom Metrics

Prometheus and Grafana are essential tools for monitoring AI workloads. Prometheus collects resource metrics, including GPU-specific data, while Grafana visualizes these metrics through tailored dashboards. These tools help ensure that AI operations are running smoothly and efficiently.

To extend observability, the Prometheus Adapter is configured with custom rules for tracking AI-specific metrics. Key configurations include:

Tracking Active Requests: Using the num_requests_running metric, Prometheus monitors the number of ongoing requests, providing insights into workload intensity.
Inference Queue Monitoring: The nv_inference_queue_duration_us metric tracks NVIDIA inference queue times, converted into milliseconds for enhanced readability.

Sample Configuration for Prometheus Adapter:

prometheus:
  url: http://kube-prometheus-stack-prometheus.${prometheus_namespace}
  port: 9090
rules:
  default: false
  custom:
  - seriesQuery: '{__name__=~"num_requests_running"}'
    resources:
      template: <<.Resource>>
    name:
      matches: "num_requests_running"
      as: ""
    metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="", pod!=""}'
    resources:
      overrides:
        namespace:
          resource: "namespace"
        pod:
          resource: "pod"
    name:
      matches: "nv_inference_queue_duration_us"
      as: "nv_inference_queue_duration_ms"
    metricsQuery: 'avg(rate(nv_inference_queue_duration_us{<<.LabelMatchers>>}[1m])/1000) by (<<.GroupBy>>)'

These configurations enable Prometheus to expose meaningful custom metrics that are critical for scaling and optimizing AI workloads. By integrating these metrics into Grafana dashboards, users gain actionable insights into system performance and bottlenecks.
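
It helps to see what the second metricsQuery actually computes. As far as I understand the Triton metric, nv_inference_queue_duration_us is a cumulative counter of queue time in microseconds, so rate(...)/1000 yields milliseconds of queue time accumulated per second, not the latency of a single request. The sketch below reproduces that arithmetic with a simplified stand-in for PromQL's rate(); the sample values are invented.

```python
def rate_per_second(value_start, value_end, seconds):
    """Simplified stand-in for PromQL rate(): per-second increase of a counter
    between two samples. Counter resets and extrapolation are ignored."""
    if seconds <= 0:
        raise ValueError("window must be positive")
    return (value_end - value_start) / seconds


def queue_ms_per_second(us_start, us_end, seconds=60):
    """Microsecond counter increase per second, converted to milliseconds."""
    return rate_per_second(us_start, us_end, seconds) / 1000


if __name__ == "__main__":
    # Counter grew by 3,000,000 us over a 60 s window.
    print(queue_ms_per_second(1_200_000, 4_200_000, 60))
```

If a per-request figure is what you want to scale on, one option is to divide this rate by the rate of completed requests over the same window.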

Step 5: Scaling and Optimization with Karpenter

In large-scale AI deployments, workload demands fluctuate significantly. Dynamic scaling is essential for managing these workloads effectively while minimizing costs. Karpenter, a Kubernetes-native cluster autoscaler, provides powerful mechanisms for optimizing resource utilization. It dynamically provisions nodes tailored to the specific demands of applications, including GPU-heavy AI workloads.

This section integrates Karpenter into the EKS Blueprints add-on framework, highlighting its configuration for both CPU and GPU workloads. The full implementation and configurations are available in the repository: https://github.com/timurgaleev/eks-nim-llm-openai.

Deploying Karpenter with EKS Blueprints

Karpenter is added to the EKS cluster as a Blueprint add-on. Below is an example of the configuration block for enabling Karpenter, focusing on both CPU and GPU workload optimization:

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.2"

  enable_karpenter                  = true
  karpenter_enable_spot_termination = true
  karpenter_node = {
    iam_role_additional_policies = {
      AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    }
  }
  karpenter = {
    chart_version = "0.37.0"
  }
}

This configuration enables Karpenter with Spot interruption handling and attaches the AmazonSSMManagedInstanceCore policy to the node IAM role, so nodes launched by Karpenter can be managed through AWS Systems Manager.

Configuring Karpenter for CPU and GPU Workloads

For effective scaling, Karpenter relies on NodePool and EC2NodeClass configurations tailored to workload requirements. The following examples showcase how Karpenter dynamically provisions CPU and GPU nodes.

CPU Workloads

name: cpu-karpenter
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
  karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
  subnetSelectorTerms:
    id: ${module.vpc.private_subnets[2]}
  securityGroupSelectorTerms:
    tags:
      Name: ${module.eks.cluster_name}-node
  instanceStorePolicy: RAID0

nodePool:
  labels:
    - type: karpenter
    - NodeGroupType: cpu-karpenter
  requirements:
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["m5"]
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values: ["xlarge", "2xlarge", "4xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 180s
    expireAfter: 720h
  weight: 100

GPU Workloads

name: gpu-workloads
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
  karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
  subnetSelectorTerms:
    id: ${module.vpc.private_subnets[1]}
  securityGroupSelectorTerms:
    tags:
      Name: ${module.eks.cluster_name}-node
  instanceStorePolicy: RAID0

nodePool:
  labels:
    - type: karpenter
    - NodeGroupType: gpu-workloads
  requirements:
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["g5", "p4", "p5"]  # GPU instances
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values: ["2xlarge", "4xlarge", "8xlarge", "12xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 180s
    expireAfter: 720h
  weight: 100
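
Before relying on a set of requirements, it is worth checking which instance types they actually admit. The sketch below is a simplified model, not Karpenter's implementation: it assumes the instance-family and instance-size labels are the parts of the instance type before and after the dot (for example "p4d" and "24xlarge"), and it only handles the In operator.

```python
# The family and size requirements from the GPU node pool shown in this section.
GPU_REQUIREMENTS = [
    {"key": "karpenter.k8s.aws/instance-family", "operator": "In",
     "values": ["g5", "p4", "p5"]},
    {"key": "karpenter.k8s.aws/instance-size", "operator": "In",
     "values": ["2xlarge", "4xlarge", "8xlarge", "12xlarge"]},
]


def labels_for(instance_type):
    """Derive the two labels from an instance type name (assumed naming rule)."""
    family, size = instance_type.split(".", 1)
    return {
        "karpenter.k8s.aws/instance-family": family,
        "karpenter.k8s.aws/instance-size": size,
    }


def matches(instance_type, requirements):
    """True if every In requirement is satisfied by the derived labels."""
    labels = labels_for(instance_type)
    for req in requirements:
        if req["operator"] != "In":
            raise ValueError("only the In operator is modelled in this sketch")
        if labels.get(req["key"]) not in req["values"]:
            return False
    return True


if __name__ == "__main__":
    for itype in ["g5.2xlarge", "g5.48xlarge", "p4d.24xlarge", "p5.48xlarge"]:
        print(itype, matches(itype, GPU_REQUIREMENTS))
```

Under these assumptions only the listed g5 sizes match: p4d.24xlarge fails on both family ("p4d" is not "p4") and size, and p5.48xlarge fails on size. If you expect P4 or P5 capacity, check the family and size values against the labels on real nodes.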

Terraform Automation Scripts

To streamline the deployment and teardown of resources, the project includes two utility scripts: install.sh and cleanup.sh.

install.sh: Automates the deployment process. It initializes Terraform, applies modules sequentially (e.g., VPC and EKS), and ensures all resources are provisioned successfully. A final Terraform apply captures any remaining dependencies.
cleanup.sh: Safely destroys the deployed infrastructure. It handles dependencies like Kubernetes services, Load Balancers, and Security Groups, ensuring proper teardown order. Each module is destroyed sequentially, with a final pass to catch residual resources.

These scripts enhance operational efficiency and minimize errors during deployment and cleanup phases, making the workflow more robust and reproducible.
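
The sequencing these scripts perform can be sketched as a list of Terraform commands. This is not the content of install.sh or cleanup.sh: the module order is an assumption based on the module names used in this article, and the helper names are hypothetical. The -target and -auto-approve flags are standard Terraform CLI options.

```python
import shlex

# Module addresses that appear in this article; the ordering is an assumption.
DEFAULT_TARGETS = ("module.vpc", "module.eks", "module.eks_blueprints_addons")


def apply_commands(targets=DEFAULT_TARGETS):
    """init, then one targeted apply per module, then a final full apply."""
    commands = [["terraform", "init"]]
    for target in targets:
        commands.append(["terraform", "apply", f"-target={target}", "-auto-approve"])
    commands.append(["terraform", "apply", "-auto-approve"])
    return commands


def destroy_commands(targets=DEFAULT_TARGETS):
    """Targeted destroys in reverse order, then a final pass for leftovers."""
    commands = [
        ["terraform", "destroy", f"-target={target}", "-auto-approve"]
        for target in reversed(targets)
    ]
    commands.append(["terraform", "destroy", "-auto-approve"])
    return commands


if __name__ == "__main__":
    for command in apply_commands():
        print(shlex.join(command))
```

Printing the commands first, rather than executing them, makes it easy to review the order before anything touches the account.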

Key Features of Karpenter in AI Ecosystems

Dynamic Node Provisioning: Automatically provisions CPU or GPU nodes based on real-time workload needs.
Cost Optimization: Leverages Spot instances while ensuring reliable on-demand scaling for critical workloads.
Enhanced Resource Utilization: Consolidates underutilized nodes and removes idle resources with disruption policies.
Tailored Scaling Policies: Supports node pools for diverse workload types, such as inference tasks or data preprocessing.

Karpenter’s integration with GPU-optimized workloads ensures that demanding AI models benefit from high-performance compute nodes while maintaining cost efficiency.

Use Cases

1. AI Model Training

NVIDIA NIM’s GPU optimizations allow for efficient training of models like BERT or GPT, reducing runtime and costs.

2. Real-Time Inference

Deploy models for real-time applications such as fraud detection, image recognition, or natural language understanding.

3. Experimentation and Research

With the OpenAI Web UI, data scientists can quickly test and iterate on models.

Conclusion

This platform enables the scalable and efficient deployment of AI workloads by integrating NVIDIA NIM with Amazon EKS. Terraform automates the process, ensuring repeatable and reliable setups. With GPU optimization, persistent storage, and observability tools, the platform is well-suited for businesses and researchers alike.

For detailed code and further exploration, visit the GitHub repository: https://github.com/timurgaleev/eks-nim-llm-openai.

Deploying AWS EKS with Terraform and Blueprints Addons

Timur Galeev — Thu, 07 Nov 2024 08:45:20 GMT

After a pause from covering AWS and infrastructure management, I’m back with insights for those looking to navigate the world of AWS containers and Kubernetes with ease. For anyone new to deploying Kubernetes in AWS, leveraging Terraform for setting up an EKS (Elastic Kubernetes Service) cluster can be a game-changer. By combining Terraform’s infrastructure-as-code capabilities with AWS’s EKS Blueprints Addons, users can create a scalable, production-ready Kubernetes environment without the usual complexity.

In this article, I'll guide you through using Terraform to deploy EKS with essential add-ons, which streamline the configuration and management of your Kubernetes clusters. With these modular add-ons, you can quickly incorporate features like CoreDNS, the AWS Load Balancer Controller, and other powerful tools to customize and enhance your setup. Whether you’re new to container orchestration or just seeking an efficient AWS solution, this guide will help you build a resilient EKS environment in a few straightforward steps.

So let’s start.

Setting Up the VPC for EKS

The VPC configuration is foundational for your EKS cluster, establishing a secure, isolated environment with both public and private subnets. Private subnets are typically used to host your Kubernetes nodes, keeping them inaccessible from the internet. Here’s the configuration provided in the vpc.tf file, which sets up both public and private subnets along with NAT and Internet Gateway options for flexible networking.

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name                 = local.name
  cidr                 = var.vpc_cidr
  azs                  = local.azs
  secondary_cidr_blocks = var.secondary_cidr_blocks
  private_subnets      = concat(local.private_subnets, local.secondary_ip_range_private_subnets)
  public_subnets       = local.public_subnets
  enable_nat_gateway   = true
  single_nat_gateway   = true
  public_subnet_tags   = {"kubernetes.io/role/elb" = 1}
  private_subnet_tags  = {
    "kubernetes.io/role/internal-elb" = 1
    "karpenter.sh/discovery" = local.name
  }
  tags = local.tags
}

This setup:

Creates private and public subnets across multiple availability zones.
Configures a secondary CIDR block for the EKS data plane, which is crucial for large-scale deployments.
Enables a NAT gateway for private subnets, ensuring secure internet access for internal resources.
Tags subnets for Kubernetes service and discovery, essential for integration with other AWS services like load balancers and Karpenter.
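
To illustrate how a secondary CIDR block can be carved into per-AZ private subnets, here is a small sketch using Python's ipaddress module. The CIDR 100.64.0.0/16, the /18 subnet size, and the AZ names are assumptions for the example; the article's actual values live in var.secondary_cidr_blocks and local.secondary_ip_range_private_subnets.

```python
import ipaddress


def split_per_az(cidr, azs, new_prefix):
    """Assign one subnet of size /new_prefix from cidr to each availability zone."""
    network = ipaddress.ip_network(cidr)
    pairs = list(zip(azs, network.subnets(new_prefix=new_prefix)))
    if len(pairs) < len(azs):
        raise ValueError("not enough subnets of that size for all AZs")
    return {az: str(subnet) for az, subnet in pairs}


if __name__ == "__main__":
    # Hypothetical AZ names and secondary range.
    layout = split_per_az("100.64.0.0/16", ["eu-west-1a", "eu-west-1b", "eu-west-1c"], 18)
    for az, subnet in layout.items():
        print(az, subnet, ipaddress.ip_network(subnet).num_addresses)
```

Each /18 provides 16,384 addresses, which is why a secondary range is attractive when every pod receives its own VPC IP address.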

Deploying EKS with Managed Node Groups

Now that the VPC is configured, let’s move on to deploying the EKS cluster with the eks.tf file configuration. This setup includes defining managed node groups within the EKS cluster, specifying node configurations, security rules, and IAM roles.

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.15"

  cluster_name                   = local.name
  cluster_version                = var.eks_cluster_version
  cluster_endpoint_public_access = true
  vpc_id                         = module.vpc.vpc_id
  subnet_ids                     = compact([for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) : substr(cidr_block, 0, 4) == "100." ? subnet_id : null])

  aws_auth_roles = [
    {
      rolearn  = module.eks_blueprints_addons.karpenter.node_iam_role_arn
      username = "system:node:{{EC2PrivateDNSName}}"
      groups   = ["system:bootstrappers", "system:nodes"]
    }
  ]

  eks_managed_node_groups = {
    core_node_group = {
      name             = "core-node-group"
      ami_type         = "AL2_x86_64"
      min_size         = 2
      max_size         = 8
      desired_size     = 2
      instance_types   = ["m5.xlarge"]
      capacity_type    = "SPOT"
      labels           = { WorkerType = "SPOT", NodeGroupType = "core" }
      tags             = merge(local.tags, { Name = "core-node-grp" })
    }
  }
}

Key components:

VPC and Subnets: The vpc_id and subnet_ids reference the private subnets, providing a secure foundation for EKS nodes.
Managed Node Groups: This setup defines a core node group with spot instances (capacity_type = "SPOT") to optimize cost, with configurable instance types, sizes, and labels for workload placement.
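Security Rules and IAM Roles: The aws_auth_roles entry maps the Karpenter node IAM role to the system:bootstrappers and system:nodes groups, so nodes launched by Karpenter can join the cluster. The additional security rules the setup uses to manage access between nodes and the cluster are not shown in this excerpt.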

This configuration gives you a scalable and cost-effective EKS environment that is ready for production workloads, with flexibility to adjust nodes and subnets as needed.

Configuring EKS Add-ons

Add-ons enhance your EKS cluster by integrating additional AWS services and open-source tools. With the EKS Blueprints, you can easily set up these add-ons, which range from storage solutions to observability and monitoring tools.

Setting Up the EBS CSI Driver for Persistent Storage

The Amazon EBS CSI Driver is essential for persistent storage on EKS. This module configures the necessary IAM roles for the driver, enabling it to provision and manage EBS volumes.

module "ebs_csi_driver_irsa" {
  source                = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version               = "~> 5.20"
  role_name_prefix      = format("%s-%s-", local.name, "ebs-csi-driver")
  attach_ebs_csi_policy = true
  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:ebs-csi-controller-sa"]
    }
  }
  tags = local.tags
}

This configuration creates an IAM role for the EBS CSI Driver using IAM Roles for Service Accounts (IRSA), which allows the driver to interact with EBS securely.
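
For readers new to IRSA, the key piece is the role's trust policy, which restricts who may assume the role to one Kubernetes service account. The sketch below builds the general shape of such a policy as documented by AWS; it is not the module's exact output, and the account ID and OIDC provider ID are placeholders.

```python
import json


def irsa_trust_policy(oidc_provider_arn, namespace, service_account):
    """Build a trust policy that lets one service account assume the role."""
    # The issuer is the part of the ARN after "oidc-provider/".
    issuer = oidc_provider_arn.split(":oidc-provider/", 1)[1]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": oidc_provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{issuer}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute module.eks.oidc_provider_arn from your own state.
    arn = (
        "arn:aws:iam::111122223333:oidc-provider/"
        "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
    )
    print(json.dumps(irsa_trust_policy(arn, "kube-system", "ebs-csi-controller-sa"), indent=2))
```

The namespace and service account in the sub condition correspond to the "kube-system:ebs-csi-controller-sa" entry in namespace_service_accounts.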

Enabling Amazon CloudWatch Observability

The amazon-cloudwatch-observability add-on integrates CloudWatch for monitoring and logging, providing insights into your cluster’s performance.

eks_addons = {
  amazon-cloudwatch-observability = {
    preserve                 = true
    service_account_role_arn = aws_iam_role.cloudwatch_observability_role.arn
  }
}

This snippet specifies the IAM role required for CloudWatch, enabling detailed observability for your workloads.

Integrating AWS Load Balancer Controller

The AWS Load Balancer Controller allows you to provision and manage Application Load Balancers (ALBs) for Kubernetes services. Here’s how it’s configured:

enable_aws_load_balancer_controller = true
aws_load_balancer_controller = {
  set = [{
    name  = "enableServiceMutatorWebhook"
    value = "false"
  }]
}

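Setting enableServiceMutatorWebhook to false turns off the controller's Service mutating webhook, so newly created Services of type LoadBalancer are not automatically modified to be handled by this controller. That is useful when you want to choose the load balancer implementation per Service.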

Adding Karpenter for Autoscaling

Karpenter is an open-source autoscaler designed for Kubernetes, enabling efficient and dynamic scaling of EC2 instances based on workload requirements. This configuration sets up Karpenter with support for spot instances, reducing costs for non-critical workloads.

enable_karpenter                  = true
karpenter_enable_spot_termination = true
karpenter_node = {
  iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }
}
karpenter = {
  chart_version       = "0.37.0"
  repository_username = data.aws_ecrpublic_authorization_token.token.user_name
  repository_password = data.aws_ecrpublic_authorization_token.token.password
}

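This configuration attaches the AmazonSSMManagedInstanceCore policy to Karpenter-launched nodes so they can be managed through AWS Systems Manager, and passes an ECR Public authorization token so the Karpenter chart can be pulled from the public registry.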

These add-ons, configured through the AWS EKS Blueprints and Terraform, help streamline Kubernetes management on AWS while offering enhanced storage, observability, and autoscaling.

To explore the complete configuration, you can find the full code in the GitHub repository https://github.com/timurgaleev/aws-eks-terraform-addons. The repository includes install.sh to deploy the EKS cluster and configure the add-ons seamlessly, along with cleanup.sh to tear down the environment when it’s no longer needed.

Conclusion

This Terraform setup provides a powerful framework for deploying EKS with essential add-ons, such as storage, observability, and autoscaling, to support scalable applications. Specifically, this configuration is designed to enable deployment of applications like OpenAI Chat, showcasing Kubernetes' flexibility for real-time, interactive workloads. With this setup, you’re ready to deploy and manage robust, production-grade EKS clusters in AWS.

Getting Started with Amazon EKS Anywhere: A Practical Guide for On-Premise Kubernetes Deployment

Timur Galeev — Wed, 25 Sep 2024 22:00:00 GMT

Introduction

As businesses increasingly move towards hybrid and multi-cloud environments, managing infrastructure across multiple platforms has become more complex. However, Amazon Web Services (AWS) has introduced a game-changer for organizations that want the power and flexibility of Kubernetes on their on-premise infrastructure. This is where Amazon EKS Anywhere comes into play. In this article, we’ll explore what EKS Anywhere is, its benefits, and how you can set up and manage Kubernetes clusters on your own on-prem servers using VMware vSphere.

Having recently tested EKS Anywhere with my on-prem servers, I can confidently say that it streamlines the process of deploying and managing Kubernetes clusters without the need for complicated third-party tools. Let's walk through the process, from setup to deployment, with some real-world examples.

What is Amazon EKS Anywhere?

Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service provided by AWS for running containerized applications. While it simplifies the Kubernetes management process, it traditionally required cloud infrastructure like AWS EC2 instances. However, with EKS Anywhere, AWS now offers a deployment option for customers to create and manage Kubernetes clusters on their on-premise hardware.

The key benefits of EKS Anywhere are:

Consistent Management Experience: It offers the same management tools and experience as Amazon EKS in the AWS Cloud.
Open Source: Built on Amazon EKS Distro, an open-source Kubernetes distribution, it allows users to deploy Kubernetes clusters with minimal effort.
Integration with AWS Tools: Seamlessly integrates with AWS services like AWS Systems Manager (SSM) for monitoring and operations.

In essence, EKS Anywhere allows you to run Kubernetes on your existing infrastructure, ensuring you still benefit from the rich ecosystem and features AWS provides.

Key Features of EKS Anywhere

Hardware Support: It runs on your own hardware or on VMware vSphere, making it ideal for on-premise deployments.
Control Plane: Unlike EKS, where the control plane is managed by AWS, with EKS Anywhere, you manage the control plane yourself.
Cluster Lifecycle Management: EKS Anywhere includes tooling for automating cluster creation, scaling, updates, and even the destruction of Kubernetes clusters.
AWS Integration: Easily view and manage your on-prem Kubernetes clusters using the EKS console, integrating seamlessly with AWS Cloud services.
Support for Third-party Tools: EKS Anywhere supports integrations with tools like Flux for GitOps, eksctl for cluster management, and Cilium for networking.

Setting Up EKS Anywhere on VMware vSphere

For this guide, I’ll walk you through setting up an EKS Anywhere cluster on your on-prem VMware vSphere infrastructure. While you can set up a test cluster on your desktop, here we focus on a more realistic production setup.

Prerequisites:

VMware vSphere version 7.0 or higher.
EKS Anywhere tools installed on your machine.
At least three control plane nodes and three worker nodes for high availability.

Step 1: Install EKS Anywhere CLI Tools

Start by installing the necessary CLI tools. On a Mac, you can do this via Homebrew.

$ brew install aws/tap/eks-anywhere
$ eksctl anywhere version
v0.5.0

Step 2: Generate Cluster Config and Create a Cluster

Let’s create a Kubernetes cluster using eksctl. First, you need to generate a cluster configuration file.

$ CLUSTER_NAME=my-eks-cluster
$ eksctl anywhere generate clusterconfig $CLUSTER_NAME --provider vsphere > $CLUSTER_NAME.yaml

Now that we have the configuration, we can create the cluster on vSphere.

$ eksctl anywhere create cluster -f $CLUSTER_NAME.yaml

The CLI will handle the setup of the control plane, the worker nodes, and the networking components for your cluster. Once the cluster is created, it will be fully operational, and you can use kubectl to interact with it.

Step 3: Export Kubeconfig and Deploy a Test App

Once the cluster is created, you'll have a kubeconfig file to connect to your Kubernetes cluster:

$ export KUBECONFIG=${PWD}/${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
$ kubectl get ns

You can now deploy a simple test application to verify everything is working:

$ kubectl apply -f "https://anywhere.eks.amazonaws.com/manifests/hello-eks-a.yaml"
$ kubectl get pods -l app=hello-eks-a

This will deploy a basic pod that you can access locally:

$ kubectl port-forward deploy/hello-eks-a 8000:80
$ curl localhost:8000

You should see a simple “Hello from EKS Anywhere” message, confirming that the cluster is up and running.

Managing Your Cluster: High Availability and Updates

In a production environment, you’ll want to ensure high availability and smooth updates for your clusters. EKS Anywhere allows you to scale your cluster as needed and manage rolling updates.

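For high availability, it's recommended to have at least three control plane nodes and three worker nodes. To scale the cluster, set the node counts in the cluster spec (count under controlPlaneConfiguration and under each entry in workerNodeGroupConfigurations) and apply the change with the upgrade command:

$ eksctl anywhere upgrade cluster -f $CLUSTER_NAME.yaml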

To update the cluster, use the built-in update tools provided by EKS Anywhere, which work much like the updates on AWS-managed EKS clusters. The update process ensures that your cluster remains stable during the upgrade, even with multiple nodes.
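
The high-availability recommendation above can be checked before a spec is applied. The sketch below is only an illustration: it assumes the cluster spec has already been loaded into a Python dict, and that node counts sit under controlPlaneConfiguration.count and workerNodeGroupConfigurations[].count. The function name check_high_availability is hypothetical.

```python
def check_high_availability(spec: dict, minimum: int = 3) -> list[str]:
    """Return a list of problems; an empty list means the node counts look fine."""
    problems = []

    # Control plane node count (treated as 0 when the field is missing).
    control_plane = spec.get("controlPlaneConfiguration", {}).get("count", 0)
    if control_plane < minimum:
        problems.append(
            f"control plane has {control_plane} node(s), want at least {minimum}"
        )

    # Worker nodes are summed across all worker node groups.
    groups = spec.get("workerNodeGroupConfigurations", [])
    workers = sum(group.get("count", 0) for group in groups)
    if workers < minimum:
        problems.append(f"workers total {workers} node(s), want at least {minimum}")

    return problems


if __name__ == "__main__":
    example = {
        "controlPlaneConfiguration": {"count": 1},
        "workerNodeGroupConfigurations": [{"count": 3}],
    }
    for problem in check_high_availability(example):
        print(problem)
```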

Using EKS Connector for Centralized Management

One of the standout features of EKS Anywhere is EKS Connector, which allows you to manage your on-prem clusters directly from the EKS console. This makes it easy to view and monitor all your Kubernetes clusters, whether they’re running on AWS or on-prem.

To connect your EKS Anywhere cluster to the EKS console:

Register the cluster through the EKS console.
Download and apply the necessary eks-connector.yaml configuration to your cluster.
Once applied, your cluster will be available in the AWS Management Console for monitoring and management.

$ kubectl apply -f eks-connector.yaml

This allows you to manage your on-prem clusters alongside your AWS-based clusters in a single interface.

Conclusion

Amazon EKS Anywhere has made managing on-prem Kubernetes clusters much simpler by bringing AWS-level tools and support to local infrastructures. Whether you're running on VMware vSphere or other compatible environments, EKS Anywhere allows you to benefit from a consistent, simplified management experience, without the need for complex, third-party tools. It also integrates seamlessly with AWS services, making it easy to monitor and scale your infrastructure.

If you're looking to bring Kubernetes to your on-prem servers, EKS Anywhere is an excellent choice that I would highly recommend based on my recent hands-on testing.

AWS EKS vs. AWS ECS: Choosing the Right Container Service for Your Needs

Timur Galeev — Tue, 06 Aug 2024 22:00:00 GMT

Introduction
In the world of cloud-based applications, containers have become a staple for deploying and scaling applications efficiently. AWS offers two primary container services: Elastic Kubernetes Service (EKS) and Elastic Container Service (ECS). Both help developers deploy and manage containers, but they have distinct architectures and best-use scenarios. Let’s explore the differences, pros, and cons of each to help you choose the best one for your needs.

1. What is AWS ECS?

AWS Elastic Container Service (ECS) is a fully managed container service built by Amazon to deploy and manage containers. ECS is optimized for simplicity and efficiency, especially for AWS environments, making it a great choice if you want a straightforward, managed solution for running containers without the complexity of Kubernetes.

Example Code for ECS (Using AWS CLI):

# Create an ECS cluster
aws ecs create-cluster --cluster-name my-ecs-cluster

# Register a task definition
aws ecs register-task-definition --family my-task \
  --container-definitions '[{"name":"my-container","image":"nginx","memory":512,"cpu":256}]'

# Run a task in the ECS cluster
aws ecs run-task --cluster my-ecs-cluster --task-definition my-task
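
Hand-writing the JSON for --container-definitions is error-prone once there is more than one field. As a rough sketch, the same string can be generated in Python; the helper names below are hypothetical, and the values mirror the CLI example above.

```python
import json
import shlex


def container_definitions(name: str, image: str, memory: int, cpu: int) -> str:
    """Serialize a single container definition as the JSON list the CLI expects."""
    definition = {"name": name, "image": image, "memory": memory, "cpu": cpu}
    return json.dumps([definition], separators=(",", ":"))


def register_command(family: str, definitions_json: str) -> list[str]:
    """Build the argument list for 'aws ecs register-task-definition'."""
    return [
        "aws", "ecs", "register-task-definition",
        "--family", family,
        "--container-definitions", definitions_json,
    ]


if __name__ == "__main__":
    defs = container_definitions("my-container", "nginx", 512, 256)
    # shlex.join quotes the JSON so the printed command is safe to paste into a shell.
    print(shlex.join(register_command("my-task", defs)))
```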

2. What is AWS EKS?

AWS Elastic Kubernetes Service (EKS) is a managed Kubernetes service. It allows you to deploy, scale, and operate Kubernetes on AWS, fully aligned with Kubernetes standards. EKS provides flexibility and compatibility with the Kubernetes ecosystem, ideal for teams with experience in Kubernetes or those wanting more control over container orchestration.

Example Code for EKS (Using kubectl and AWS CLI):

# Create a Kubernetes deployment
kubectl create deployment nginx-deployment --image=nginx

# Expose the deployment as a service
kubectl expose deployment nginx-deployment --port=80 --target-port=80 --type=LoadBalancer

3. Key Differences Between ECS and EKS

The primary difference between ECS and EKS lies in the underlying orchestration. ECS is AWS-native, offering tight integration and simplified operations but is exclusive to AWS. EKS, being Kubernetes-based, is portable and lets you run the same configurations across different cloud providers or on-premises if you use Kubernetes elsewhere.

4. Advantages of AWS ECS

Simplicity: ECS is straightforward to set up, especially if you’re already working with other AWS services.
Tight AWS Integration: ECS has deep integration with AWS IAM, CloudWatch, and other services, making security and monitoring seamless.
Lower Management Overhead: ECS manages most of the infrastructure, so you don’t have to worry about the underlying components like control planes or etcd clusters.

5. Advantages of AWS EKS

Kubernetes Compatibility: EKS is compatible with Kubernetes, making it ideal for teams familiar with Kubernetes and tools like Helm, kubectl, and Prometheus.
Hybrid and Multi-Cloud Flexibility: Since it’s based on Kubernetes, EKS allows applications to be portable, ideal for multi-cloud or hybrid environments.
Extensibility: EKS enables integration with a wide array of Kubernetes plugins and tools, giving developers more control and customization options.

6. When to Choose ECS Over EKS

If your team values simplicity and deep AWS integration, ECS can be an excellent choice. ECS is also ideal when running smaller applications or when your team prefers a managed service that takes care of infrastructure details. ECS may require less management and works well when you need to deploy on AWS alone without multi-cloud portability.

7. When to Choose EKS Over ECS

EKS is a powerful choice if your team has Kubernetes experience or needs hybrid cloud deployment. EKS enables portability, so if there’s a need to run parts of your app on other clouds or on-premises, EKS is better. Kubernetes allows more control over networking, storage, and plugins—ideal for complex applications.

8. Pros and Cons Summary

Feature	ECS	EKS
Ease of Use	Simplified, AWS-native	More complex, Kubernetes-native
Multi-Cloud	AWS-only	Multi-cloud flexibility
Integrations	Deeply integrated with AWS	Compatible with the Kubernetes ecosystem
Management	AWS handles most infrastructure details	More user control, but requires management
Scalability	Scalable within AWS environment	Scalable across clouds and on-premises

9. Which to Use: Practical Scenarios

For example, if you’re a small team running microservices exclusively on AWS, ECS will likely meet your needs with less management overhead. However, if you’re developing a complex, multi-tiered application that may need to scale across multiple clouds, EKS could be more suitable.
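
The guidance in this article can be condensed into a tiny decision helper. This is only a sketch of the rules of thumb above, not an official sizing tool; the function name and its two inputs are assumptions made for illustration.

```python
def recommend_service(needs_portability: bool, kubernetes_experience: bool) -> str:
    """Return 'EKS' or 'ECS' following the rules of thumb from this article."""
    # Multi-cloud or hybrid requirements point to Kubernetes.
    if needs_portability:
        return "EKS"
    # A team that already knows Kubernetes can reuse its tooling on EKS.
    if kubernetes_experience:
        return "EKS"
    # AWS-only workloads without Kubernetes experience: prefer the simpler service.
    return "ECS"


if __name__ == "__main__":
    print(recommend_service(needs_portability=False, kubernetes_experience=False))
```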

10. Conclusion

While both AWS ECS and EKS are strong options, the choice depends on your team’s needs, skill level, and deployment goals. ECS is straightforward and integrates deeply into AWS, making it perfect for teams focused on AWS-native applications. EKS, on the other hand, is ideal for those who want flexibility, Kubernetes compatibility, and multi-cloud options. For most straightforward applications, ECS is often the preferred choice, but EKS brings value for larger and more complex architectures. Choose wisely based on your priorities, but remember that both services are backed by AWS, ensuring scalability and reliability.

Leveraging GitLab and GitLab Self-Managed as Source Providers in AWS CodeBuild

Timur Galeev — Tue, 16 Apr 2024 14:47:57 GMT

In an exciting update, Amazon Web Services (AWS) has announced that GitLab and self-managed GitLab instances are now supported as source providers for AWS CodeBuild projects. This enhancement simplifies the continuous integration and continuous delivery (CI/CD) process, allowing users to initiate builds directly from changes in their source code hosted in GitLab repositories.

AWS CodeBuild is a fully managed, scalable, and flexible build service that compiles your source code, runs tests, and produces software artifacts. With the addition of GitLab and GitLab Self-Managed as source providers, developers can now seamlessly connect their projects to AWS CodeBuild and automate build processes.

To set up a connection between your GitLab repository and AWS CodeBuild, follow these steps:

Navigate to the AWS CodeBuild console in your AWS Management Console.
Create a new build project or select an existing one.
In the "Source" section, choose "GitLab" as your source provider.
Provide the necessary details about your GitLab repository, such as the project URL and branch name.
Create or select an existing IAM role with appropriate permissions for AWS CodeBuild to interact with your GitLab repository, AWS resources, and other required services.

Set up any necessary buildspec files or configurations for your project.

 version: 0.2

 phases:
   install:
     runtime-versions:
       nodejs: 14
   build:
     commands:
       - npm install
       - npm run build

Complete the setup process and start or schedule a new build.

The integration between GitLab and AWS CodeBuild enables developers to take advantage of the following benefits:

Streamlined CI/CD processes: With direct access to your GitLab repositories, AWS CodeBuild can automatically initiate builds when changes are detected in the source code. This automation reduces manual intervention and accelerates development cycles.
Enhanced security: By establishing a secure connection using an access token, AWS CodeBuild can interact with your GitLab repository and other related resources while maintaining the necessary security measures.
Scalability: AWS CodeBuild offers a highly scalable build service, allowing you to handle multiple builds concurrently and efficiently. This capability is particularly valuable for large projects or teams that require parallel processing.
Flexibility: The integration supports both GitLab and self-managed GitLab instances, providing developers with the flexibility to choose their preferred source code management solution.

In conclusion, the integration of GitLab and GitLab Self-Managed as source providers in AWS CodeBuild is a significant step forward for streamlined CI/CD processes. By enabling builds to be initiated directly from changes in GitLab repositories, developers can now enjoy an even more efficient and secure development experience when working with AWS services.

Beginning the Journey into ML, AI and GenAI on AWS

Timur Galeev — Mon, 22 Jan 2024 21:50:14 GMT

Machine Learning (ML), Artificial Intelligence (AI), and Generative Artificial Intelligence (GenAI) are transformative technologies that have the potential to revolutionize industries across the globe.

At the last AWS re:Invent, there were numerous updates related to ML/AI and everything associated with these technologies. I also decided to delve into these topics and immerse myself in this field.

I won't delve into explaining the meanings of ML, AI, DL (Deep Learning), and GenAI. However, I'd like to touch upon FMs and LLMs, as we will focus our attention there. I found myself asking the same question whenever I came across these terms in my reading or listening. :)

Foundation Models (FMs) within the AWS ecosystem are large pre-trained models that serve as the base for diverse AI applications. These models, often created by industry-leading AI companies, are integral to the development and functionality of AWS services, shaping the landscape of artificial intelligence on the platform. In the context of Amazon Bedrock, Large Language Models (LLMs) play a pivotal role. These LLMs provide the service's linguistic capabilities, facilitating advanced language understanding and content generation within the AWS environment.

AWS provides various services for Machine Learning and Artificial Intelligence, including Amazon SageMaker, AWS DeepLens, AWS DeepComposer, Amazon Forecast and more. Familiarize yourself with the services available to determine which ones suit your specific needs.

Generative Artificial Intelligence (GenAI) is a type of artificial intelligence that can generate text, images, or other media using generative models. AWS offers services for building and scaling generative AI applications, including Amazon Bedrock and Amazon SageMaker. AWS has also invested in developing foundation models (FMs), the very large machine learning models that generative AI relies on, and has launched the Generative AI Innovation Center, which connects AWS AI and ML experts with customers around the world to help them envision, design, and launch new generative AI products and services. Generative AI has the potential to revolutionize the way we create and consume media, but it is important to use it responsibly and ethically.

Some examples of GenAI: One of the most well-known is ChatGPT, launched by OpenAI, which became wildly popular overnight and galvanized public attention. Another model from OpenAI, text-embedding-ada-002, is designed to produce embeddings: numerical vector representations of text that are typically stored in a vector database and used to retrieve relevant data to feed into large language models (LLMs). However, it’s important to note that generative AI creates artifacts that can be inaccurate or biased, making human validation essential and potentially limiting the time it saves workers. Therefore, end users should be realistic about the value they are looking to achieve, especially when using a service as is.
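
To make the idea of embeddings a little more concrete: an embedding model turns a piece of text into a list of numbers, and texts with similar meaning end up with vectors that point in similar directions. The sketch below compares such vectors with cosine similarity. The three-dimensional vectors are made up for illustration and do not come from any real model; the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query: list[float], documents: dict[str, list[float]]) -> str:
    """Return the name of the stored vector closest to the query vector."""
    return max(documents, key=lambda name: cosine_similarity(query, documents[name]))


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings.
    store = {"doc-about-s3": [0.9, 0.1, 0.0], "doc-about-eks": [0.0, 0.2, 0.9]}
    print(most_similar([0.8, 0.2, 0.1], store))
```

A vector database does essentially this lookup at scale, and the matching documents are then passed to the LLM as context.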

I've also delved a bit deeper into Broad AI when learning GenAI and I'd like to show this in the form of the following picture as it explains a lot.

Broad AI includes task-specific algorithms, Machine Learning (ML), and Deep Learning. These layers enable AI to perform tasks like image recognition, natural language processing, and complex pattern modeling.

The transition to GenAI involves Transfer Learning, Reinforcement Learning, and Autonomous Learning. These layers allow AI to apply knowledge across contexts, learn from interactions, and independently gather and learn from information.

So, the journey from Broad AI to GenAI represents significant leaps in AI capabilities, moving towards AI systems that can truly understand, learn, and adapt like a human brain.

Let's explore a couple of AWS services that, from my perspective, are among the more popular today.

Amazon SageMaker:

Amazon SageMaker is a comprehensive platform that simplifies the machine learning workflow. It covers everything from data labeling and preparation to model training and deployment. Take advantage of SageMaker's Jupyter notebook integration for interactive data exploration and model development. The platform also supports popular ML frameworks like TensorFlow and PyTorch.

Amazon Q is a groundbreaking Generative AI assistant crafted with a focus on security and privacy. Its purpose is to unleash the transformative capabilities of this technology for employees within organizations of varying sizes and across diverse industries.

re:Invent also introduced robust enhancements to the generative AI service, Amazon Bedrock.

Amazon Bedrock, an entirely managed service on AWS, provides access to extensive language models and other foundational models (FMs) from prominent artificial intelligence (AI) companies such as AI21, Anthropic, Cohere, Meta, and Stability AI, all consolidated through a unified API.

I would also like to share more information about Amazon Bedrock here about the innovations that were announced at the latest AWS re:Invent.

Fine-tuning for Amazon Bedrock:
Now, there are increased opportunities for model customization in Amazon Bedrock, featuring fine-tuning support for Cohere Command Lite, Meta Llama 2, and Amazon Titan Text models, with Anthropic Claude's support expected soon.

These recent enhancements to Amazon Bedrock significantly reshape how organizations, regardless of their size or industry, can leverage generative AI to drive innovation and redefine customer experiences.

AWS is compatible with all the leading deep-learning frameworks, facilitating their deployment. The deep-learning Amazon Machine Image, accessible on both Amazon Linux and Ubuntu, allows for the creation of managed, auto-scalable GPU clusters. This enables training and inference processes to be conducted at any scale. Also, AWS offers a range of AI services that allow you to integrate pre-trained models into your applications without the need for deep expertise in machine learning. Services like Amazon Rekognition for image and video analysis, Amazon Comprehend for natural language processing, and Amazon Polly for text-to-speech can enhance your applications with AI capabilities.

The best way to solidify your understanding of ML, AI, and GenAI on AWS is through hands-on projects. Start with simple projects and gradually increase complexity as you gain confidence. Use datasets available on platforms like Kaggle or create your own to train and test models.

Conclusion:

Embarking on a journey into Machine Learning, Artificial Intelligence, and Generative Artificial Intelligence on AWS is an exciting endeavor. By following these steps, you can lay a solid foundation for your understanding and proficiency in leveraging AWS services for ML and AI applications. Remember, the key to success is a combination of hands-on experience, continuous learning, and active engagement with the AWS community. Happy training!

CloudFormation or Terraform or both :)

Timur Galeev — Sat, 04 Nov 2023 23:00:00 GMT

Both tools allow provisioning AWS infrastructure as code, but have key differences in approach and capabilities.

Infrastructure Modeling

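CloudFormation uses YAML/JSON templates that define resources declaratively.

CloudFormation uses JSON/YAML templates to define AWS resources and their properties. The order of resources in the template does not matter: CloudFormation works out the creation order from references between resources (such as !Ref and !GetAtt) and from explicit DependsOn attributes.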

Terraform uses declarative configuration files and references between resources.

Terraform uses declarative configuration files written in HCL to define resources. Resources can reference attributes of other resources to establish dependencies between them in a flexible way.

Example

# CloudFormation
Resources:
  VPC:
    Type: AWS::EC2::VPC

  Subnet:
    Type: AWS::EC2::Subnet 
    Properties: 
      VpcId: !Ref VPC

# Terraform
resource "aws_vpc" "main" {}

resource "aws_subnet" "example" {
  vpc_id = aws_vpc.main.id
}
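
In both snippets above the subnet refers to the VPC, and that reference is what tells the tool to create the VPC first. The ordering step can be sketched with a topological sort. This is a simplified illustration using Python's standard library, not how either tool is actually implemented; the function name creation_order is hypothetical.

```python
from graphlib import TopologicalSorter


def creation_order(references: dict[str, set[str]]) -> list[str]:
    """Order resources so that everything a resource refers to is created first.

    'references' maps each resource name to the set of resources it refers to.
    """
    return list(TopologicalSorter(references).static_order())


if __name__ == "__main__":
    # The subnet refers to the VPC, so the VPC has to exist first.
    print(creation_order({"Subnet": {"VPC"}, "VPC": set()}))
```

A circular reference makes graphlib raise CycleError, which matches the way both tools reject circular dependencies.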

State Management

CloudFormation manages state for you: the service records the resources that belong to each stack, so there is no state file to store or lock. Changes made outside CloudFormation are only surfaced when you run drift detection.

Terraform explicitly tracks the real-time state of all resources in a state file, usually stored locally or in remote storage like S3. This allows checking differences between the configuration and current state to maintain consistency.
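
The comparison between configuration and recorded state can be pictured as a simple diff. The sketch below is a toy model of that idea, with resources reduced to plain dicts; it is an assumption-laden illustration and not Terraform's actual planning logic. The function name plan is hypothetical.

```python
def plan(desired: dict[str, dict], state: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired configuration with recorded state and list the actions needed."""
    return {
        # In the configuration but not yet in the state: create.
        "create": sorted(name for name in desired if name not in state),
        # In both, but with different attributes: update.
        "update": sorted(
            name for name in desired if name in state and desired[name] != state[name]
        ),
        # In the state but no longer in the configuration: delete.
        "delete": sorted(name for name in state if name not in desired),
    }


if __name__ == "__main__":
    desired = {"vpc": {"cidr": "10.0.0.0/16"}, "subnet": {"cidr": "10.0.1.0/24"}}
    state = {"vpc": {"cidr": "10.0.0.0/16"}, "old_bucket": {"name": "legacy"}}
    print(plan(desired, state))
```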

Programming Interface

CloudFormation provides CLI and APIs.

CloudFormation provides a CLI and AWS APIs for managing templates and deployments. Custom logic can be added through custom resources.

Terraform offers rich plugins and SDK for custom providers.

In addition to the CLI and APIs, Terraform has a rich plugin ecosystem and supports programming infrastructure with its own API and SDK. This allows writing custom providers, provisioners and other automation tools.

Use Cases

Simple single AWS account deployments use CloudFormation
Complex multi-account infrastructure uses Terraform
Automating tasks beyond IaC requires Terraform

For example, a multi-tier app could use:

CloudFormation for per-account VPCs and load balancers
Terraform for cross-account databases/queues
Custom Terraform provider to deploy containers

Other Considerations

Version control
Stack policies
Change sets
Target types
Modules
Automation
IDE integration

In summary, while both serve IaC purposes, Terraform provides more flexibility, portability and automation capabilities - especially for multi-account, hybrid infrastructure deployments at scale.

Using EKS with Lambda on AWS

Timur Galeev — Sun, 08 Oct 2023 22:00:00 GMT

AWS EKS (Elastic Kubernetes Service) allows you to easily run Kubernetes clusters in the AWS cloud. Lambda is AWS' serverless compute service that allows you to run code without provisioning or managing servers. This article discusses how you can integrate EKS with Lambda to build serverless applications on Kubernetes.

Deploying Lambda functions to EKS

The main way to integrate EKS and Lambda is by deploying Lambda functions as Kubernetes deployments and services. This allows Kubernetes to manage and orchestrate the execution of Lambda code.

The steps to deploy a Lambda function to EKS are:

Create a Lambda function using the AWS CLI or SDK. This deploys the code and configuration to Lambda.
Create a Kubernetes Deployment and Service that points to the Lambda function ARN (Amazon Resource Name). The service exposes the function through a ClusterIP.
Kubernetes will trigger invocations of the Lambda function through the service endpoint. It handles load balancing, auto-scaling and orchestration of the function.

Benefits of using EKS and Lambda together

Leverage Kubernetes APIs and tools to deploy, manage and scale Lambda functions
Build serverless applications as Kubernetes workloads for portability across environments
Take advantage of Kubernetes features like auto-scaling, rolling updates, blue-green deployments etc. for Lambda code
Integrate Lambda functions into existing container-based applications on EKS

This allows building fully serverless applications that leverage the power of Kubernetes for orchestration along with AWS Lambda's ease of use.

Here is a sample Python code for deploying a Lambda function as a Kubernetes workload:

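import boto3
from kubernetes import client, config

# Clients used below. load_kube_config() reads your local kubeconfig.
config.load_kube_config()
lambda_client = boto3.client('lambda')
apps_api = client.AppsV1Api()   # Deployments live in the apps/v1 API group
core_api = client.CoreV1Api()   # Services live in the core v1 API group

# Read the zipped function code (function.zip is a placeholder file name)
with open('function.zip', 'rb') as f:
    bytecode = f.read()

# Create Lambda function
lambda_client.create_function(
   FunctionName='myfunction',
   Runtime='python3.8',
   # Role is required by create_function; this ARN is a placeholder
   Role='arn:aws:iam::123456789012:role/lambda-execution-role',
   Handler='index.handler',
   Code={
      'ZipFile': bytecode
   }
)

# Create Kubernetes Deployment
apps_api.create_namespaced_deployment(
   namespace='default',
   body={
      'apiVersion': 'apps/v1',
      'kind': 'Deployment',
      'metadata': {
         'name': 'lambda-deploy'
      },
      'spec': {
         'replicas': 1,
         'selector': {
            'matchLabels': {
               'app': 'lambda'
            }
         },
         'template': {
            'metadata': {
               'labels': {
                  'app': 'lambda'
               }
            },
            'spec': {
               'containers': [
                  {
                     'name': 'lambda-container',
                     'image': 'public.ecr.aws/lambda/python:3.8',
                     'env': [
                        {
                           'name': 'AWS_LAMBDA_FUNCTION_NAME',
                           'value': 'myfunction'
                        },
                        {
                           'name': 'AWS_REGION',
                           'value': 'us-east-1'
                        }
                     ]
                  }
               ]
            }
         }
      }
   }
)

# Expose Deployment as Kubernetes Service
core_api.create_namespaced_service(
   namespace='default',
   body={
      'apiVersion': 'v1',
      'kind': 'Service',
      'metadata': {
         'name': 'lambda-service'
      },
      'spec': {
         'ports': [
            {
               'port': 8080,
               'targetPort': 8080
            }
         ],
         'selector': {
            'app': 'lambda'
         }
      }
   }
)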

This demonstrates how to deploy a Lambda function as a Kubernetes workload and expose it through a service for invocation. EKS and Lambda provide a powerful way to build serverless applications on Kubernetes.
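
Once the service exists, other workloads in the cluster can call it over HTTP. The sketch below only builds the request and does not send it. It rests on an assumption that is not stated in this article: that the container runs the Lambda Runtime Interface Emulator, which accepts POST requests on the /2015-03-31/functions/function/invocations path on port 8080. The helper name build_invoke_request is hypothetical.

```python
import json
import urllib.request

# Assumption: invocation path exposed by the Lambda Runtime Interface Emulator.
RIE_PATH = "/2015-03-31/functions/function/invocations"


def build_invoke_request(host: str, port: int, event: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that passes 'event' to the function."""
    url = f"http://{host}:{port}{RIE_PATH}"
    body = json.dumps(event).encode("utf-8")
    # Supplying data makes urllib issue a POST request.
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


if __name__ == "__main__":
    request = build_invoke_request("lambda-service", 8080, {"ping": 1})
    print(request.get_method(), request.full_url)
    # From a pod in the same namespace you could send it with:
    # urllib.request.urlopen(request, timeout=5)
```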

Security Group AWS NLB (AWS new feature)

Timur Galeev — Wed, 09 Aug 2023 22:00:00 GMT

You can now attach security groups to an AWS Network Load Balancer (AWS NLB).

With this update, you can configure rules to ensure that your NLB only accepts traffic from trusted IP addresses, and centrally enforce access control policies.

If you are using EKS, just update your AWS Load Balancer Controller to version 2.6.0 and configure it 🫡

Please check out more information here:

https://aws.amazon.com/about-aws/whats-new/2023/08/network-load-balancer-supports-security-groups/

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html

Let's jump deep

A crucial aspect of configuring an NLB is setting up security groups to control inbound and outbound traffic to and from the load balancer. A security group acts as a virtual firewall, allowing only specific traffic to reach the NLB based on predefined rules. This article will discuss how to configure security groups for an AWS NLB.

Creating a Security Group for NLB

To create a security group for an NLB, follow these steps:

Log in to the AWS Management Console and navigate to the VPC dashboard.
Click on "Security Groups" in the left-hand menu, then click the "Create security group" button.
Enter a name and description for the security group, select the VPC in which to create it, and click "Create."

Configuring Inbound Rules

Once the security group is created, you need to configure inbound rules to allow traffic to reach the NLB. To do this, follow these steps:

Click on the security group you just created.
Click the "Edit inbound rules" button.
Add a rule for each type of traffic you want to allow, specifying the protocol, port range, and source. For example, if you want to allow HTTP traffic from anywhere, add a rule with the following settings:
- Type: Custom TCP Rule
- Protocol: TCP
- Port range: 80
- Source: 0.0.0.0/0 (or a specific IP address or range)
Click "Save rules" to apply the changes.

Configuring Outbound Rules

By default, outbound traffic is allowed from an NLB. However, you can configure outbound rules to restrict the types of traffic that can leave the NLB. To do this, follow these steps:

Click on the security group you just created.
Click the "Edit outbound rules" button.
Add a rule for each type of traffic you want to allow, specifying the protocol, port range, and destination. For example, if you want to allow all outbound traffic, add a rule with the following settings:
- Type: Allow All Traffic
- Protocol: All
- Port range: All
- Destination: 0.0.0.0/0 (or a specific IP address or range)
Click "Save rules" to apply the changes.

Best Practices for Configuring Security Groups

When configuring security groups for an AWS NLB, follow these best practices:

Allow only the minimum necessary traffic: Only allow the specific types of traffic that your application requires. This reduces the attack surface and helps prevent unauthorized access.
Use specific sources and destinations: Instead of allowing all traffic from anywhere, specify a specific IP address or range. This provides an additional layer of security.
Use security groups in combination with network ACLs: Security groups and network access control lists (ACLs) work together to provide an additional layer of security. While security groups are stateful, meaning that they track the state of connections and allow return traffic, network ACLs are stateless and do not track the state of connections.
Regularly review security group rules: Regularly review your security group rules to ensure that they still meet your needs and are up-to-date with any changes in your application requirements.
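
The first two practices lend themselves to a small automated check. The sketch below flags inbound rules that are open to the whole internet on ports you did not intend to publish. The rule format (a dict with port and source keys) is a simplification invented for this example and is not the shape returned by the AWS API; the function name overly_open_rules is hypothetical.

```python
import ipaddress


def overly_open_rules(rules: list[dict], public_ports: frozenset = frozenset({80, 443})) -> list[dict]:
    """Return inbound rules that allow any source on a port not meant to be public."""
    flagged = []
    for rule in rules:
        network = ipaddress.ip_network(rule["source"])
        # A prefix length of 0 (0.0.0.0/0 or ::/0) matches every address.
        open_to_world = network.prefixlen == 0
        if open_to_world and rule["port"] not in public_ports:
            flagged.append(rule)
    return flagged


if __name__ == "__main__":
    inbound = [
        {"port": 80, "source": "0.0.0.0/0"},    # intended public HTTP
        {"port": 22, "source": "0.0.0.0/0"},    # SSH from anywhere: flagged
        {"port": 5432, "source": "10.0.0.0/16"},
    ]
    for rule in overly_open_rules(inbound):
        print("review:", rule)
```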

Conclusion

Configuring security groups for an AWS NLB is a crucial aspect of setting up and securing your load balancer. By following the best practices outlined in this article, you can ensure that only the necessary traffic is allowed to reach your NLB and that your application remains secure.

🥳Mountpoint for AWS S3💥

Timur Galeev — Tue, 08 Aug 2023 22:00:00 GMT

This will make it easier for me.🫡🤗

With this update you can create a mount point and mount AWS S3 bucket (or a path within a bucket) at the mount point, and then access the bucket using shell commands (ls, cat, dd, find, and so forth), library functions (open, close, read, write, creat, opendir, and so forth) or equivalent commands and functions as supported in the tools and languages that you already use.

Until now, many AWS users have relied on the S3 APIs and the AWS SDKs to build applications that list, access, and process the contents of an S3 bucket. Now you can do more 🫡🥳
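
In practice this means ordinary file code works against a mounted bucket. Here is a minimal sketch in Python that lists a directory and reads each file with the standard library only. The function name list_and_read is hypothetical, and the demo uses a temporary directory as a stand-in for a real mount point so it can run anywhere.

```python
import os
import tempfile


def list_and_read(mount_dir: str) -> dict[str, bytes]:
    """Read every regular file directly under mount_dir using plain file I/O."""
    contents = {}
    for name in sorted(os.listdir(mount_dir)):
        path = os.path.join(mount_dir, name)
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                contents[name] = handle.read()
    return contents


if __name__ == "__main__":
    # Stand-in for a real mount point such as a directory where a bucket is mounted.
    with tempfile.TemporaryDirectory() as fake_mount:
        with open(os.path.join(fake_mount, "hello.txt"), "wb") as f:
            f.write(b"hello")
        print(list_and_read(fake_mount))
```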

Some information about this update:

Pricing – you pay only for the underlying S3 operations.
Performance – Mountpoint is able to take advantage of the elastic throughput offered by S3, including data transfer at up to 100 Gb/second between each EC2 instance and S3.
Credentials – Mountpoint accesses your S3 buckets using the AWS credentials that are in effect when you mount the bucket.
Storage Classes – You can use Mountpoint to access S3 objects in all storage classes except S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, S3 Intelligent-Tiering Archive Access Tier, and S3 Intelligent-Tiering Deep Archive Access Tier.
Open Source – Mountpoint is open source and has a public roadmap. Your contributions are welcome; be sure to read our Contributing Guidelines and our Code of Conduct first.

Some links:

https://aws.amazon.com/about-aws/whats-new/2023/08/mountpoint-amazon-s3-generally-available/

Let's jump deep

Mountpoint for AWS S3 is an open-source tool that enables you to mount an S3 bucket as a file system in your Linux environment, effectively bridging the gap between object storage and traditional file systems. Developed by AWS, Mountpoint for AWS S3 brings several benefits to the table, making it easier for businesses to manage their cloud storage and streamline operations.

Ease of Integration

Mountpoint for AWS S3 allows you to mount an S3 bucket as a local file system, enabling seamless integration with existing applications and tools that rely on traditional file I/O operations. This eliminates the need for additional programming or customization efforts when working with object storage, saving both time and resources.

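Note, however, that Mountpoint for AWS S3 is not a general-purpose Linux file system like XFS, ext4, or Btrfs. It is a file client optimized for reading existing objects and writing new ones sequentially, and it does not support every file operation; for example, existing files cannot be modified in place.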

Improved Performance

By mounting an S3 bucket as a file system, Mountpoint for AWS S3 enables you to take advantage of native Linux caching mechanisms, such as the page cache and the dentry cache. These caches help reduce latency and improve throughput by storing frequently accessed data in memory, resulting in faster access times and more efficient data transfers.

Additionally, Mountpoint for AWS S3 supports multi-threaded operations, allowing it to leverage multiple CPU cores for parallel data processing. This further enhances performance and enables you to handle large datasets more efficiently.

Data Durability and Security

Amazon S3 is designed for 99.999999999% durability and provides a range of security features, such as access control policies, encryption, and data integrity checks. Mountpoint for AWS S3 ensures that these benefits are passed on to the file system level, allowing you to maintain the same level of data protection and durability without additional configuration.

Furthermore, Mountpoint for AWS S3 supports object locking, a feature that provides an additional layer of protection against accidental or malicious data modifications. Object locking can be used to create write-once-read-many (WORM) workflows, ensuring that critical data remains immutable and cannot be altered or deleted for a specified retention period.

Cost-Effective Scalability

As your storage needs grow, so does the cost of managing and maintaining on-premises infrastructure. AWS S3 offers a pay-as-you-go pricing model, allowing you to scale your storage capacity without the need for upfront investments or complex capacity planning.

Mountpoint for AWS S3 enables you to tap into this scalability while maintaining a familiar file system interface, making it an attractive option for businesses looking to optimize their cloud storage costs.

Conclusion

Mountpoint for AWS S3 is a powerful tool that simplifies the integration of Amazon S3 with your Linux environment, improves performance, and maintains data durability and security. By bridging the gap between object storage and traditional file systems, Mountpoint for AWS S3 offers a cost-effective and scalable solution for managing your cloud storage needs. Whether you're working with big data, media assets, or backup and archival data, Mountpoint for AWS S3 can help streamline your operations and improve overall efficiency.

Deploying Kubernetes Clusters to AWS with k8s-cdk

Timur Galeev — Wed, 07 Jun 2023 22:00:00 GMT

The Kubernetes CDK (k8s-cdk) is an open-source project that makes it easy to define and provision Kubernetes infrastructure on AWS using the AWS CDK. It provides constructs for core Kubernetes resources like Clusters, Nodes, and Services that simplify the deployment of Kubernetes applications to AWS.

Let's get started. First, install the k8s-cdk library:

npm install --save @kubernetes-cdk/cdk-core

Then create a new CDK app:

cdk init app --language=typescript

This will set up a basic CDK app structure with TypeScript support.

Defining a Cluster

To define a Kubernetes cluster, import the Cluster construct and provide configuration:

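// Assumption: the original snippet uses `k8s` without showing its import;
// it is taken here to be a namespace import of the same package.
import * as k8s from '@kubernetes-cdk/cdk-core';
import { Cluster } from '@kubernetes-cdk/cdk-core';

//...

// Keep a reference to the cluster so later resources can be attached to it.
const cluster = new Cluster(this, 'MyCluster', {
  version: k8s.KubernetesVersion.V1_21,
  subnets: [subnet1, subnet2],
});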

This will provision a managed EKS cluster running Kubernetes across the specified subnets.

Deploying Resources

Additional Kubernetes resources such as Deployments, Services, etc. can then be defined and added to the cluster:

const nginxDeployment = new k8s.Deployment(this, 'NginxDeployment', {
  cluster: cluster,
  spec: {
    selector: {
      matchLabels: {
        app: 'nginx',
      },
    },
    //...
  },
});
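The rest of the spec is elided above, and I am not going to guess at the k8s-cdk props. For orientation only, here is a sketch of the shape a complete Kubernetes `apps/v1` Deployment takes, written as a plain TypeScript object instead of a construct. The image tag, replica count, and the helper name `selectorMatchesTemplate` are illustrative assumptions. The one rule it demonstrates comes from Kubernetes itself: the labels in `spec.selector.matchLabels` must be present on the pod template.

```typescript
// Minimal local types covering only the Deployment fields used below.
type Labels = Record<string, string>;

interface DeploymentManifest {
  apiVersion: "apps/v1";
  kind: "Deployment";
  metadata: { name: string };
  spec: {
    replicas: number;
    selector: { matchLabels: Labels };
    template: {
      metadata: { labels: Labels };
      spec: {
        containers: { name: string; image: string; ports: { containerPort: number }[] }[];
      };
    };
  };
}

// Illustrative values only; the image tag and replica count are assumptions.
const nginxManifest: DeploymentManifest = {
  apiVersion: "apps/v1",
  kind: "Deployment",
  metadata: { name: "nginx" },
  spec: {
    replicas: 2,
    selector: { matchLabels: { app: "nginx" } },
    template: {
      metadata: { labels: { app: "nginx" } },
      spec: {
        containers: [{ name: "nginx", image: "nginx:1.25", ports: [{ containerPort: 80 }] }],
      },
    },
  },
};

// True when every selector label is present, with the same value,
// on the pod template. Kubernetes rejects Deployments where it is not.
function selectorMatchesTemplate(d: DeploymentManifest): boolean {
  const podLabels = d.spec.template.metadata.labels;
  const selector = d.spec.selector.matchLabels;
  return Object.keys(selector).every((key) => podLabels[key] === selector[key]);
}
```

A check like this is cheap to run before synthesizing, and it catches the most common copy-paste mistake in Deployment definitions.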

Deploying

Finally, synthesize and deploy the CDK app to provision the Kubernetes infrastructure and deploy the resources:

cdk deploy

The k8s-cdk makes it simple to define Kubernetes clusters and applications using familiar AWS CDK patterns. This allows for infrastructure as code deployments of Kubernetes on AWS.

Domain-Driven Design (DDD) in AWS. Find Your Business Domains.

Timur Galeev — Wed, 29 Mar 2023 15:27:12 GMT

This article is an introduction to Domain-Driven Design and how it can be used with AWS. I will provide guidance on how to define business domains in legacy monolithic applications and decompose them into a set of microservices step by step. By starting with Domain-Driven Design for your microservices, you can get the benefits of cloud scaling in your new refactored application.

Is Domain-Driven Design useful for me?

The purpose of Domain-Driven Design is to free the domain code from technical details so that there is more room to deal with the complexity of the domain itself. It is well suited to very complex domains and to projects that are starting to turn into legacy.

Domain-Driven Design requires an understanding of the business idea, or of the final 'business product'. It requires time and commitment from both business experts and technical implementers, so it should not be used in situations where you need 'quick solutions'. Instead, use Domain-Driven Design for software that supports the core business area rather than the supporting areas. A good way to start with Domain-Driven Design is an event-storming session. As mentioned above, this is a commitment, but one worth making: it allows you to develop software that is more tailored to the needs of your end clients, and it helps create decoupled services that are more scalable and maintainable. The combination results in greater business agility.

Event Storming

Event Storming helps teams of business and technical people come to a consensus on what the solution should be, without being distracted by the details of how it will be implemented. This means that it may take longer for the teams to start producing source code. However, all teams will be better aligned on what each microservice should be responsible for. The event storming workshop is a brainstorming session in which all stakeholders in the solution work together to define the business events that correspond to the domains. Suppose we have a commerce development task where the business event is a customer applying for a new product. During the workshop, the group identifies the object that triggered the event, the processes that should occur as a result, and any subsequent events triggered by the original event.
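The output of such a workshop is usually sticky notes, but the structure can be sketched in code. The TypeScript below is a hypothetical model, not a standard notation: the `DomainEvent` shape, the event names, and the `eventChain` helper are all invented for illustration. It records, for each event, what triggered it, which processes follow, and which events those processes raise in turn.

```typescript
// One sticky note from the workshop board.
interface DomainEvent {
  name: string;        // business fact, in the past tense
  triggeredBy: string; // the actor or object that caused the event
  processes: string[]; // what should happen as a result
  followUps: string[]; // names of events raised by those processes
}

// Hypothetical board for the "customer applies for a new product" example.
const board: DomainEvent[] = [
  {
    name: "ProductApplicationSubmitted",
    triggeredBy: "Customer",
    processes: ["Check eligibility"],
    followUps: ["ApplicationApproved"],
  },
  {
    name: "ApplicationApproved",
    triggeredBy: "Eligibility check",
    processes: ["Open order contract"],
    followUps: ["OrderContractOpened"],
  },
  {
    name: "OrderContractOpened",
    triggeredBy: "Order contract process",
    processes: ["Notify customer"],
    followUps: [],
  },
];

// Walk the follow-up events breadth-first, starting from one event,
// and return the event names in the order they are reached.
function eventChain(events: DomainEvent[], start: string): string[] {
  const seen: string[] = [];
  const queue: string[] = [start];
  while (queue.length > 0) {
    const name = queue.shift() as string;
    if (seen.indexOf(name) !== -1) continue; // guard against cycles
    seen.push(name);
    const event = events.filter((e) => e.name === name)[0];
    if (event) queue.push(...event.followUps);
  }
  return seen;
}
```

Calling `eventChain(board, "ProductApplicationSubmitted")` walks from the customer's application to the opened contract, which is the kind of end-to-end story the workshop is meant to make visible.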

To do this, a team brainstorming session takes place where the groups can identify areas of their business and then the contexts in which they operate. These can be used to define the usage of each microservice and the relationship between each microservice and its context. Once the domains have been defined with the help of the business experts, the technical implementers can start designing the solution.

The result of the event-storming session is a domain model for development. The domain model can be used to define a number of bounded contexts.

Bounded Contexts

A bounded context is the boundary within which each domain model applies. The order contract opening example can be thought of as the 'Order Contract Opening Context' in a shop. In a complete system, there may be other contexts such as the product context, the description context and the manufacturing context. Identifying the business events that cause interactions between the different bounded contexts helps to determine how your microservices will interact with each other in the new architecture.
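One way to make those interactions explicit is to write down, for each context, which events it publishes and which it reacts to, and then derive the links. The sketch below does that in TypeScript. The context names follow the examples above, while the event names, the `BoundedContext` shape, and the `interactions` helper are assumptions made for illustration.

```typescript
// A bounded context described only by the events that cross its boundary.
interface BoundedContext {
  name: string;
  publishes: string[];
  subscribesTo: string[];
}

// Hypothetical context map for the shop example.
const contexts: BoundedContext[] = [
  { name: "OrderContractOpening", publishes: ["OrderContractOpened"], subscribesTo: [] },
  { name: "Product", publishes: [], subscribesTo: ["OrderContractOpened"] },
  { name: "Manufacturing", publishes: ["ProductManufactured"], subscribesTo: ["OrderContractOpened"] },
];

// Derive "publisher -> subscriber (event)" links. Each link is a place
// where two microservices will need to talk to each other.
function interactions(map: BoundedContext[]): string[] {
  const links: string[] = [];
  for (const publisher of map) {
    for (const event of publisher.publishes) {
      for (const subscriber of map) {
        if (subscriber !== publisher && subscriber.subscribesTo.indexOf(event) !== -1) {
          links.push(`${publisher.name} -> ${subscriber.name} (${event})`);
        }
      }
    }
  }
  return links;
}
```

Events with no subscriber, such as `ProductManufactured` here, simply produce no link, which is a useful prompt to ask whether the event is needed at all.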

The example context map is just a sample of the core domains. There are also a number of supporting and subsequent domains. Although it is necessary to have a service that manages these, this is not part of the core application domain for sending products.

Core Domains and refactoring

When you start defining "Core Domains and Subsequent Domains", the question that usually arises is how to manage requests between domains. To answer it, we'll look at the AWS services that can be used to implement domains. A containerised or serverless diagram of such a solution is closer to a 'modern architecture' than a good old-fashioned network and virtual machine deployment diagram. The advantage is that the diagram itself helps to outline what the logical functionality is, since we can be more expressive and fine-grained with resource usage.

AWS Lambda has been the undisputed king of serverless computing platforms for several years now. It satisfies all the aforementioned conditions and can be used with a number of languages, including TypeScript. Other viable options include the well-known container services, Amazon ECS or Amazon EKS, running on AWS Fargate. However, they require considerably more setup and configuration, and they require that the application is actually containerised. That doesn't mean containerisation is bad; whether it fits depends on whether you are refactoring into microservices or starting a new application.

If you need eventing, there is Amazon Simple Notification Service (SNS). It is a push-based service, i.e. it automatically handles the distribution of the event to the recipients. SNS uses a pay-per-use model and is essentially serverless, as the only infrastructure you need is the SNS topic.

The modern cloud is about using the provider's own API products to expose your applications, rather than building something of your own with Fastify, Kong or the like. The API gateway acts as the only public interface connected to any other infrastructure, in our case primarily our Lambda functions, which respond to the paths defined in the gateway. In the case of AWS, the service of interest is, unsurprisingly, called Amazon API Gateway.
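To make the gateway-plus-Lambda idea concrete, here is a minimal TypeScript sketch of a function behind an API Gateway REST API with Lambda proxy integration. The event and result types are declared locally and cover only the fields used here (real projects usually take them from the `@types/aws-lambda` package). The `/orders` path and the response payloads are hypothetical.

```typescript
// Local, simplified versions of the proxy event and result shapes.
interface ProxyEvent {
  httpMethod: string;
  path: string;
  body: string | null;
}

interface ProxyResult {
  statusCode: number;
  headers: Record<string, string>;
  body: string; // the proxy integration expects the body as a string
}

// Build a JSON response in the shape API Gateway expects.
function json(statusCode: number, payload: unknown): ProxyResult {
  return {
    statusCode,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Routing logic, kept synchronous and separate so it is easy to test.
function route(event: ProxyEvent): ProxyResult {
  if (event.httpMethod === "POST" && event.path === "/orders") {
    try {
      const order = JSON.parse(event.body ?? "{}");
      return json(201, { accepted: true, product: order.product ?? null });
    } catch {
      return json(400, { message: "Body must be valid JSON" });
    }
  }
  return json(404, { message: "Not found" });
}

// Entry point that Lambda would invoke for each request.
const handler = async (event: ProxyEvent): Promise<ProxyResult> => route(event);
```

Keeping `route` free of AWS specifics means the domain logic can be exercised locally, with the gateway and Lambda acting only as the thin public edge described above.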

The following diagram shows how the monolith still receives part of the traffic while new microservices are gradually added to the example application.
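This kind of incremental routing (often called the strangler fig pattern) can also be sketched in a few lines. In the TypeScript below, paths that have already been migrated go to a microservice and everything else falls through to the monolith. The path prefixes and service names are hypothetical, and in a real setup the routing would live in the API gateway configuration, not in application code.

```typescript
// Path prefixes that have already been carved out of the monolith,
// mapped to the (hypothetical) service that now owns them.
const migratedRoutes: Record<string, string> = {
  "/orders": "orders-service",
  "/products": "products-service",
};

// Decide which backend should receive a request path.
// Anything not yet migrated still goes to the monolith.
function routeRequest(path: string, routes: Record<string, string> = migratedRoutes): string {
  for (const prefix of Object.keys(routes)) {
    // Match the prefix itself or anything below it, but not "/ordersx".
    if (path === prefix || path.indexOf(prefix + "/") === 0) {
      return routes[prefix];
    }
  }
  return "monolith";
}
```

Migrating another domain then means adding one entry to the table, and the monolith's share of the traffic shrinks one prefix at a time.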

There is also AWS Migration Hub, which can help you find your domains and even suggest AWS services that you can use to implement them. This helps when planning a refactoring or a migration from your old on-premises solutions to AWS with all its modern services.

Conclusion

To summarise, people just don't tend to talk about 'domains' all day. Most employees do not pay attention to how domains are implemented in the organisation. It is also worth noting that dividing a system into domains after it has been fully designed is of little use: DDD should be done, at least approximately, at the initial design stage. In any case, DDD is worth the effort. The example of using DDD in AWS looks simple, but when you go deeper you find a lot of services that all have to be interconnected, and that is where methods and dependencies come in. That's why it's very important to create a structure at the beginning of the work.
