<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Vibe Coding Forem: AWS Community Builders </title>
    <description>The latest articles on Vibe Coding Forem by AWS Community Builders  (@aws-builders).</description>
    <link>https://vibe.forem.com/aws-builders</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png</url>
      <title>Vibe Coding Forem: AWS Community Builders </title>
      <link>https://vibe.forem.com/aws-builders</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://vibe.forem.com/feed/aws-builders"/>
    <language>en</language>
    <item>
      <title>The Model Is the Brain. The Harness Is the Body. Here's Why That Matters</title>
      <dc:creator>Ajit</dc:creator>
      <pubDate>Mon, 04 May 2026 05:18:31 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/the-model-is-the-brain-the-harness-is-the-body-heres-why-that-matters-2961</link>
      <guid>https://vibe.forem.com/aws-builders/the-model-is-the-brain-the-harness-is-the-body-heres-why-that-matters-2961</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I built the same browser agent twice — once with 500 lines of Python, once with 7 lines of JSON. The second one took 5 minutes. The agent harness layer is becoming the real competitive advantage, not the model.&lt;/p&gt;

&lt;p&gt;Last month, I built a browser automation agent. Playwright. Custom orchestration. Login handlers. Error retries. Session management. React-aware form filling. Anti-detection scripts. &lt;strong&gt;500+ lines of Python.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This week, I built the same thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bedrock"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"modelId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us.anthropic.claude-sonnet-4-6"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agentcore_browser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"browser"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"systemPrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a web browsing assistant."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy. Invoke. It browses websites, extracts data, fills forms. &lt;strong&gt;Seven lines. Zero orchestration code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But here's the thing most people miss: &lt;strong&gt;I kept both versions.&lt;/strong&gt; And that's the real insight.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed (and What Didn't)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;500-Line Script&lt;/th&gt;
&lt;th&gt;7-Line Harness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automates a specific multi-site workflow&lt;/td&gt;
&lt;td&gt;Browses any website, extracts info&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How it decides&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;I wrote every step&lt;/td&gt;
&lt;td&gt;AI decides the steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 (Playwright, local)&lt;/td&gt;
&lt;td&gt;~$0.10-0.50 (Bedrock tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;95%+ (deterministic)&lt;/td&gt;
&lt;td&gt;~80% (AI reasoning varies)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only does what I coded&lt;/td&gt;
&lt;td&gt;Handles any browsing task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 days of debugging&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The 500-line script is better for its specific job.&lt;/strong&gt; It runs faster, costs less, and is more reliable, because the steps are known in advance and it doesn't need AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 7-line harness is better for everything else.&lt;/strong&gt; Research tasks. Data extraction from unfamiliar sites. Competitive analysis. Anything where the steps aren't known in advance.&lt;/p&gt;

&lt;p&gt;This is my POV: &lt;strong&gt;deterministic + AI is the right architecture.&lt;/strong&gt; Don't use a $0.03/call model to click a button you can click with Playwright for free. But don't write 500 lines of Playwright when 7 lines of config can handle it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Harness Is the New Battleground
&lt;/h2&gt;

&lt;p&gt;Everyone's talking about which model is best. Claude vs GPT vs Gemini. Benchmarks. Context windows. Reasoning scores.&lt;/p&gt;

&lt;p&gt;That conversation is becoming irrelevant.&lt;/p&gt;

&lt;p&gt;Models are commoditizing. Claude Sonnet 4.6 and GPT-5.5 are both "good enough" for most agent tasks. The real question is: &lt;strong&gt;what wraps around the model to make it actually work in production?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the harness — the orchestration loop, tool execution, memory, security, compute isolation. And every cloud provider is racing to own it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Harness Product&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AgentCore Harness&lt;/td&gt;
&lt;td&gt;Preview (Apr 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bedrock Managed Agents (OpenAI-specific)&lt;/td&gt;
&lt;td&gt;Limited Preview&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini Enterprise Agent Platform&lt;/td&gt;
&lt;td&gt;GA (Apr 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Microsoft&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Azure AI Agent Service&lt;/td&gt;
&lt;td&gt;GA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Salesforce&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agentforce&lt;/td&gt;
&lt;td&gt;GA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the container orchestration war all over again. In 2015, everyone had containers. The question was who would manage running them. Kubernetes won, and whoever controlled K8s controlled where workloads ran.&lt;/p&gt;

&lt;p&gt;In 2026, everyone has models. The question is who manages running agents. &lt;strong&gt;Whoever controls the harness controls the next decade of cloud spend.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How AgentCore Harness Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You (prompt) → AgentCore Harness → Bedrock Model (reasoning)
                    ↓                      ↓
              Firecracker microVM    Tool selection
              (isolated per session)       ↓
                    ↓              AgentCore Browser / Shell / Code
              Persistent memory    
              (across sessions)    
                    ↓
              Streamed response → You
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What AWS handles: compute, orchestration loop, tool invocation, memory, auth, observability.&lt;br&gt;
What you handle: a JSON config and a prompt.&lt;/p&gt;

&lt;p&gt;Each session runs in its own &lt;strong&gt;Firecracker microVM&lt;/strong&gt; — the same isolation technology behind Lambda. Not a container. A VM. One session can't see another's data, cookies, or credentials.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started (I Actually Ran This)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install CLI&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @aws/agentcore@preview

&lt;span class="c"&gt;# Create project&lt;/span&gt;
agentcore create &lt;span class="nt"&gt;--name&lt;/span&gt; browseragent &lt;span class="nt"&gt;--model-provider&lt;/span&gt; bedrock
&lt;span class="nb"&gt;cd &lt;/span&gt;browseragent

&lt;span class="c"&gt;# Add browser tool&lt;/span&gt;
agentcore add tool &lt;span class="nt"&gt;--harness&lt;/span&gt; browseragent &lt;span class="nt"&gt;--type&lt;/span&gt; agentcore_browser &lt;span class="nt"&gt;--name&lt;/span&gt; browser

&lt;span class="c"&gt;# Set target account + region&lt;/span&gt;
&lt;span class="c"&gt;# Edit agentcore/aws-targets.json: [{"name":"default","region":"us-west-2","account":"YOUR_ACCOUNT"}]&lt;/span&gt;

&lt;span class="c"&gt;# Deploy (~3 min)&lt;/span&gt;
agentcore deploy &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# Use it&lt;/span&gt;
agentcore invoke &lt;span class="nt"&gt;--harness&lt;/span&gt; browseragent &lt;span class="nt"&gt;--stream&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"Go to example.com and describe what you see"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output from my actual run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔧 Tool: browser
⚡ 6005 in · 110 out · 2.2s
Here's what's on the page at example.com:
### Example Domain
The page contains: "Example Domain" heading, body text about documentation use,
and a "Learn more" link to IANA documentation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real. Not a demo. Not a screenshot from someone else's blog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;What I Found&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No harness charge. You pay for Bedrock tokens + Browser session time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;us-west-2, us-east-1, eu-central-1, ap-southeast-2 (preview)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Models&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any Bedrock model, plus OpenAI and Gemini. Switch mid-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Firecracker microVM isolation, IAM execution role, Cedar policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Limitation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Preview — not for production workloads yet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;⚠️ &lt;strong&gt;Gotcha I hit:&lt;/strong&gt; The harness execution role needs &lt;code&gt;bedrock:Converse&lt;/code&gt; and &lt;code&gt;bedrock:ConverseStream&lt;/code&gt; permissions, plus &lt;code&gt;aws-marketplace:ViewSubscriptions&lt;/code&gt; for 3P models. The default CDK policy only includes &lt;code&gt;bedrock:InvokeModel&lt;/code&gt;. I had to add permissions manually.&lt;/p&gt;
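
&lt;p&gt;For reference, this is roughly the statement I added to the execution role (resource scoping is simplified here; tighten it for your account):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Effect": "Allow",
  "Action": [
    "bedrock:Converse",
    "bedrock:ConverseStream",
    "aws-marketplace:ViewSubscriptions"
  ],
  "Resource": "*"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;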

&lt;h2&gt;
  
  
  When NOT to Use Harness
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic automation&lt;/strong&gt; (same steps every time) → Playwright. Cheaper, faster, more reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex multi-agent workflows&lt;/strong&gt; → Strands Agents SDK with AgentCore Runtime. More control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing framework investment&lt;/strong&gt; (LangChain/CrewAI) → Use AgentCore tools standalone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production workloads&lt;/strong&gt; → Wait for GA. It's preview.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;The model is the brain. The harness is the body. Most teams are spending all their time picking the brain and hand-building the body from scratch every time.&lt;/p&gt;

&lt;p&gt;AgentCore Harness lets you stop building bodies and start building solutions. For 80% of agent use cases, config beats code. For the other 20%, write code — but use the harness infrastructure underneath.&lt;/p&gt;

&lt;p&gt;The teams still hand-coding agent orchestration loops are building technical debt. The same way teams hand-coding REST APIs built technical debt before API Gateway existed.&lt;/p&gt;

&lt;p&gt;The question isn't whether to adopt managed agent infrastructure. It's whether you'll be building on it — or competing against someone who already is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ajit NK — AWS Community Builder, APN FasTrack Partner. Building AI agent solutions at CloudNestle.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;"The model is the brain. The harness is the body. I build the body."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;📚 &lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/harness.html" rel="noopener noreferrer"&gt;AgentCore Harness docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/04/agentcore-new-features-to-build-agents-faster/" rel="noopener noreferrer"&gt;What's New announcement (Apr 22, 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://strandsagents.com" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>aws</category>
      <category>ai</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Serverless Workflow Decomposition: When a Step Function Becomes a Monolith</title>
      <dc:creator>Renaldi</dc:creator>
      <pubDate>Sun, 03 May 2026 23:30:00 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/serverless-workflow-decomposition-when-a-step-function-becomes-a-monolith-1hch</link>
      <guid>https://vibe.forem.com/aws-builders/serverless-workflow-decomposition-when-a-step-function-becomes-a-monolith-1hch</guid>
      <description>&lt;p&gt;There is a point in many serverless platforms where a Step Functions workflow that once felt elegant starts to feel like a mini application platform of its own.&lt;/p&gt;

&lt;p&gt;I have seen this happen in teams that are doing many things correctly: they standardized orchestration, they improved visibility, and they moved fragile glue logic out of Lambdas. Then six months later, the workflow has 100+ states, a maze of &lt;code&gt;Choice&lt;/code&gt; branches, deeply nested payload transformations, and a deployment blast radius that makes everyone nervous.&lt;/p&gt;

&lt;p&gt;This post is about &lt;strong&gt;recognizing workflow sprawl early&lt;/strong&gt; and decomposing a Step Functions workflow into a more maintainable architecture without losing the benefits of orchestration.&lt;/p&gt;

&lt;p&gt;I will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Signs of workflow sprawl&lt;/li&gt;
&lt;li&gt;Splitting by domain and subprocess boundaries&lt;/li&gt;
&lt;li&gt;Parent-child workflow patterns&lt;/li&gt;
&lt;li&gt;Contracting inputs and outputs&lt;/li&gt;
&lt;li&gt;Versioning workflows safely&lt;/li&gt;
&lt;li&gt;An end-to-end walkthrough with architecture and code&lt;/li&gt;
&lt;li&gt;Implementation discussion and migration guidance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will use AWS Step Functions terminology throughout, but the architectural thinking applies broadly to workflow systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;A large workflow is not automatically a bad workflow.&lt;/p&gt;

&lt;p&gt;In fact, I often start with a single orchestration when I want to make the business process visible quickly. The problem is not “too many states” by itself. The problem is when a workflow stops reflecting a coherent business flow and instead becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a catch-all for multiple domains&lt;/li&gt;
&lt;li&gt;a deployment bottleneck&lt;/li&gt;
&lt;li&gt;a fragile contract hub&lt;/li&gt;
&lt;li&gt;a place where teams are afraid to change anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, I treat it like I would a code monolith that has outgrown its boundaries: &lt;strong&gt;decompose intentionally, not reactively&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I mean by a "Step Function monolith"
&lt;/h2&gt;

&lt;p&gt;For this post, a Step Function becomes a monolith when one state machine accumulates responsibilities that should be owned by separate domains or subprocesses.&lt;/p&gt;

&lt;p&gt;Typical symptoms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Order orchestration, payment rules, inventory logic, fraud checks, and notifications all embedded in one ASL definition&lt;/li&gt;
&lt;li&gt;Repeated transformation states to make one team's output fit another team's input&lt;/li&gt;
&lt;li&gt;Error handling branches duplicated across unrelated parts of the flow&lt;/li&gt;
&lt;li&gt;A single workflow release requiring coordination across multiple teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not just a readability issue. It affects operability, testing, and change safety.&lt;/p&gt;




&lt;h2&gt;
  
  
  Signs of workflow sprawl
&lt;/h2&gt;

&lt;p&gt;These are the patterns I look for during architecture reviews.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) One workflow owns too many domains
&lt;/h3&gt;

&lt;p&gt;If a single state machine is enforcing rules that belong to Payments, Inventory, Fraud, Fulfillment, and Notifications, it is likely doing too much.&lt;/p&gt;

&lt;p&gt;A good orchestrator should coordinate domains, not absorb their internal logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) The ASL definition becomes hard to reason about
&lt;/h3&gt;

&lt;p&gt;Signs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;many long &lt;code&gt;Choice&lt;/code&gt; chains&lt;/li&gt;
&lt;li&gt;repeated &lt;code&gt;Pass&lt;/code&gt;/transform states just to reshape data&lt;/li&gt;
&lt;li&gt;large &lt;code&gt;Catch&lt;/code&gt; and &lt;code&gt;Retry&lt;/code&gt; blocks copied across multiple branches&lt;/li&gt;
&lt;li&gt;difficulty tracing the happy path from start to finish&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I need a map just to explain the workflow in a design review, decomposition is usually overdue.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Payloads become "workflow-shaped" instead of domain-shaped
&lt;/h3&gt;

&lt;p&gt;A common smell is a giant state payload that keeps growing because every future step might need something.&lt;/p&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;many fields carried "just in case"&lt;/li&gt;
&lt;li&gt;internal step-specific fields leaking into later steps&lt;/li&gt;
&lt;li&gt;brittle JSONPath references across distant states&lt;/li&gt;
&lt;li&gt;accidental coupling to intermediate output shapes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is often the strongest signal that input/output contracts need to be tightened.&lt;/p&gt;
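
&lt;p&gt;An exaggerated, purely illustrative sketch of what that payload ends up looking like (every field name here is made up, but the shape will be familiar):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "order": { "orderId": "ORD-100045", "items": ["..."] },
  "fraudRawResponse": { "provider": "...", "score": 0.12, "debug": "..." },
  "pspHttpResponse": { "statusCode": 200, "headers": {}, "body": "..." },
  "inventoryTempHolds": ["..."],
  "notificationTemplateDraft": "...",
  "retryCountForStep7": 2,
  "carriedJustInCase": { "warehouseHints": "...", "legacyFlags": "..." }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;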

&lt;h3&gt;
  
  
  4) Change blast radius is too large
&lt;/h3&gt;

&lt;p&gt;If a small payment change forces re-testing the full order pipeline end-to-end, you are paying a monolith tax in a serverless system.&lt;/p&gt;

&lt;p&gt;I watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;frequent merge conflicts in the same workflow definition&lt;/li&gt;
&lt;li&gt;unrelated teams blocking each other&lt;/li&gt;
&lt;li&gt;release windows for “workflow changes”&lt;/li&gt;
&lt;li&gt;fear of touching central error paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5) Execution histories are huge and troubleshooting is slow
&lt;/h3&gt;

&lt;p&gt;When executions become long and noisy, step histories are harder to navigate. Even when the workflow is functionally correct, operator experience degrades.&lt;/p&gt;

&lt;p&gt;This matters during incidents. The fastest diagnosis usually comes from clear orchestration boundaries and localized subprocess execution histories.&lt;/p&gt;

&lt;h3&gt;
  
  
  6) Reuse pressure leads to copy/paste orchestration
&lt;/h3&gt;

&lt;p&gt;If teams are duplicating chunks of states for common subprocesses (for example, document validation, payment authorization, fraud scoring), that is a strong indicator those chunks should become child workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  7) Mixed execution profiles are forced into one workflow
&lt;/h3&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a mostly synchronous checkout path mixed with long-running fulfillment polling&lt;/li&gt;
&lt;li&gt;high-throughput lightweight paths mixed with complex human approval steps&lt;/li&gt;
&lt;li&gt;latency-sensitive branches mixed with eventual-consistency branches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These often want different execution patterns, retry policies, and operational ownership.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decomposition principles I use
&lt;/h2&gt;

&lt;p&gt;When I decompose a Step Functions workflow, I do not split it by "number of states." I split it by &lt;strong&gt;architectural responsibility&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 1: Keep the parent workflow focused on orchestration decisions
&lt;/h3&gt;

&lt;p&gt;The parent should answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which subprocess runs next?&lt;/li&gt;
&lt;li&gt;Should we continue or compensate?&lt;/li&gt;
&lt;li&gt;What is the overall status?&lt;/li&gt;
&lt;li&gt;Which events should be emitted?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It should not implement deep domain logic that belongs in a domain-owned subprocess.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 2: Split by domain or stable subprocess boundary
&lt;/h3&gt;

&lt;p&gt;Great candidates for child workflows are subprocesses that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;domain-owned (Payments, KYC, Inventory)&lt;/li&gt;
&lt;li&gt;reusable across multiple parent workflows&lt;/li&gt;
&lt;li&gt;likely to evolve independently&lt;/li&gt;
&lt;li&gt;complex enough to justify dedicated retries/error handling&lt;/li&gt;
&lt;li&gt;testable as a standalone business unit&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Principle 3: Define explicit input and output contracts
&lt;/h3&gt;

&lt;p&gt;Do not pass the entire parent state to every child.&lt;/p&gt;

&lt;p&gt;Instead, define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a minimal child input contract&lt;/li&gt;
&lt;li&gt;a stable child output contract&lt;/li&gt;
&lt;li&gt;an error/failure contract (where applicable)&lt;/li&gt;
&lt;li&gt;version metadata in the contract or state machine aliasing strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the workflow equivalent of well-designed service APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 4: Decompose to reduce blast radius, not to maximize nesting
&lt;/h3&gt;

&lt;p&gt;Nested workflows are powerful, but over-nesting can create its own complexity.&lt;/p&gt;

&lt;p&gt;I avoid decomposition that creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wrappers around trivial single-step tasks&lt;/li&gt;
&lt;li&gt;nested workflows with no clear ownership&lt;/li&gt;
&lt;li&gt;chains of parent -&amp;gt; child -&amp;gt; grandchild just for aesthetics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is better changeability and operability, not "micro-workflows everywhere."&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 5: Preserve the business narrative
&lt;/h3&gt;

&lt;p&gt;After decomposition, I still want to be able to explain the parent workflow in plain language.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Validate order -&amp;gt; Process payment -&amp;gt; Reserve inventory -&amp;gt; Create shipment -&amp;gt; Notify customer&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the parent becomes an opaque set of “InvokeChildX” states with no business story, the design needs refinement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parent-child workflow patterns
&lt;/h2&gt;

&lt;p&gt;There is no single nesting pattern that fits every case. I typically use a small set of patterns and choose deliberately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern A: Synchronous child workflow (request/response style orchestration)
&lt;/h3&gt;

&lt;p&gt;The parent waits for the child to finish and uses the output immediately.&lt;/p&gt;

&lt;p&gt;Use when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the next parent decision depends on child output&lt;/li&gt;
&lt;li&gt;the subprocess is part of the critical path&lt;/li&gt;
&lt;li&gt;you want localized retries inside the child workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payment authorization&lt;/li&gt;
&lt;li&gt;fraud decision&lt;/li&gt;
&lt;li&gt;document validation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pattern B: Asynchronous child workflow (fire and track)
&lt;/h3&gt;

&lt;p&gt;The parent starts a child workflow and continues later based on an event, callback, or polling strategy.&lt;/p&gt;

&lt;p&gt;Use when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the subprocess is long-running&lt;/li&gt;
&lt;li&gt;an external system controls timing&lt;/li&gt;
&lt;li&gt;human approval or batch windows are involved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fulfillment handoff&lt;/li&gt;
&lt;li&gt;partner settlement&lt;/li&gt;
&lt;li&gt;manual review&lt;/li&gt;
&lt;/ul&gt;
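
&lt;p&gt;A minimal sketch of the fire-and-track invocation from the parent, using the plain &lt;code&gt;startExecution&lt;/code&gt; integration (not &lt;code&gt;.sync&lt;/code&gt;) so the parent does not block on the child; the ARN and state names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "StartFulfillment": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution",
    "Parameters": {
      "StateMachineArn": "${FulfillmentWorkflowAliasArn}",
      "Input": {
        "meta.$": "$.meta",
        "request.$": "$.fulfillmentRequest"
      }
    },
    "ResultPath": "$.fulfillmentStart",
    "Next": "PersistPendingStatus"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;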

&lt;h3&gt;
  
  
  Pattern C: Parallel child workflows for independent branches
&lt;/h3&gt;

&lt;p&gt;The parent starts independent subprocesses in parallel and joins after they complete.&lt;/p&gt;

&lt;p&gt;Use when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tasks are independent and safe to run concurrently&lt;/li&gt;
&lt;li&gt;you want to reduce overall latency&lt;/li&gt;
&lt;li&gt;failures should be isolated per branch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fraud + tax calculation + personalization scoring (depending on domain semantics)&lt;/li&gt;
&lt;/ul&gt;
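
&lt;p&gt;A sketch of how the parent can wrap independent child invocations in a &lt;code&gt;Parallel&lt;/code&gt; state and join after both branches complete (ARNs and state names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "RunIndependentChecks": {
    "Type": "Parallel",
    "Branches": [
      {
        "StartAt": "InvokeFraudChild",
        "States": {
          "InvokeFraudChild": {
            "Type": "Task",
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Parameters": {
              "StateMachineArn": "${FraudWorkflowAliasArn}",
              "Input": { "meta.$": "$.meta", "request.$": "$.fraudRequest" }
            },
            "End": true
          }
        }
      },
      {
        "StartAt": "InvokeTaxChild",
        "States": {
          "InvokeTaxChild": {
            "Type": "Task",
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Parameters": {
              "StateMachineArn": "${TaxWorkflowAliasArn}",
              "Input": { "meta.$": "$.meta", "request.$": "$.taxRequest" }
            },
            "End": true
          }
        }
      }
    ],
    "ResultPath": "$.parallelResults",
    "Next": "JoinResults"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;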

&lt;h3&gt;
  
  
  Pattern D: Domain subprocess library
&lt;/h3&gt;

&lt;p&gt;Create reusable child workflows that multiple parents can call.&lt;/p&gt;

&lt;p&gt;Use when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you repeatedly implement the same orchestration chunk&lt;/li&gt;
&lt;li&gt;the subprocess is clearly owned by one team&lt;/li&gt;
&lt;li&gt;contract stability is good enough for reuse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identity verification&lt;/li&gt;
&lt;li&gt;payment capture&lt;/li&gt;
&lt;li&gt;notification fan-out preparation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Contracting inputs and outputs (the most important part)
&lt;/h2&gt;

&lt;p&gt;In my experience, decomposition succeeds or fails based on contract discipline.&lt;/p&gt;

&lt;p&gt;If I split a workflow but still pass the full parent payload into every child, I have only moved complexity around. I have not reduced coupling.&lt;/p&gt;

&lt;h3&gt;
  
  
  What a good child contract looks like
&lt;/h3&gt;

&lt;p&gt;A child workflow contract should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;minimal&lt;/strong&gt;: only fields the child needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;explicit&lt;/strong&gt;: named fields, stable structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;typed&lt;/strong&gt;: validated at boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;versionable&lt;/strong&gt;: compatible evolution plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auditable&lt;/strong&gt;: includes correlation metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I usually use an envelope like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"correlationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"corr-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"causationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exec-parent-abc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"orderId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ORD-100045"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"customerId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CUST-9001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;119.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AUD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"paymentMethodToken"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tok_123"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And I expect a child output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"correlationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"corr-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"authorized"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"authorizationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth_789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"processorReference"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"psp-456"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Contract boundaries I define explicitly
&lt;/h3&gt;

&lt;p&gt;For each child workflow, I define:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Input shape&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Success output shape&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business failure output shape&lt;/strong&gt; (if returned rather than thrown)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical failure behavior&lt;/strong&gt; (exception / failed execution)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeout expectations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotency expectations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ownership and support team&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This makes nested workflows composable, not just callable.&lt;/p&gt;
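
&lt;p&gt;One lightweight way I capture the input shape and its version is a small JSON Schema per contract version, validated at the child boundary. A sketch for the payment input contract shown earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "PaymentProcessingWorkflow input, contractVersion 1.0",
  "type": "object",
  "required": ["meta", "request"],
  "additionalProperties": false,
  "properties": {
    "meta": {
      "type": "object",
      "required": ["correlationId", "contractVersion"],
      "properties": {
        "correlationId": { "type": "string" },
        "causationId": { "type": "string" },
        "contractVersion": { "type": "string", "const": "1.0" }
      }
    },
    "request": {
      "type": "object",
      "required": ["orderId", "customerId", "amount", "currency", "paymentMethodToken"],
      "properties": {
        "orderId": { "type": "string" },
        "customerId": { "type": "string" },
        "amount": { "type": "number" },
        "currency": { "type": "string" },
        "paymentMethodToken": { "type": "string" }
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;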

&lt;h3&gt;
  
  
  Keep transformation logic close to the boundary
&lt;/h3&gt;

&lt;p&gt;If the parent needs to adapt a parent model into a child request, I do that immediately before the child call. I do not let “temporary shape conversion” leak across the rest of the workflow.&lt;/p&gt;

&lt;p&gt;Likewise, I normalize child output once after return, then continue with a clean parent-level model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Versioning workflows safely
&lt;/h2&gt;

&lt;p&gt;Workflow decomposition increases the number of deployable units. That is good for limiting blast radius, but it also means you need a safe versioning strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  My rule: version the workflow &lt;em&gt;and&lt;/em&gt; the contract
&lt;/h3&gt;

&lt;p&gt;I treat these as separate concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow version&lt;/strong&gt;: the ASL implementation/version/alias of the child state machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract version&lt;/strong&gt;: the input/output schema version the parent and child agree on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes a workflow changes without changing the contract. Sometimes a contract changes while the business purpose remains the same. I do not force those to be the same version number.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safe versioning practices I use
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1) Invoke child workflows through aliases
&lt;/h4&gt;

&lt;p&gt;The parent should usually call a &lt;strong&gt;child alias ARN&lt;/strong&gt; (for example, &lt;code&gt;:PROD&lt;/code&gt;) rather than the unqualified state machine ARN, which always resolves to the latest definition.&lt;/p&gt;

&lt;p&gt;This gives me a stable target I can move during deployment rollouts and rollbacks.&lt;/p&gt;
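
&lt;p&gt;Concretely, the task resource stays the same; only the target ARN switches to the alias form (region and account are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Parameters": {
    "StateMachineArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentProcessingWorkflow:PROD",
    "Input": {
      "meta.$": "$.paymentCall.meta",
      "request.$": "$.paymentCall.request"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;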

&lt;h4&gt;
  
  
  2) Use immutable workflow versions behind aliases
&lt;/h4&gt;

&lt;p&gt;For production workflows, I want immutable versions behind aliases so I can answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which version processed this execution?&lt;/li&gt;
&lt;li&gt;Can I roll back without redefining the workflow?&lt;/li&gt;
&lt;li&gt;Can I shift traffic gradually?&lt;/li&gt;
&lt;/ul&gt;
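
&lt;p&gt;Step Functions versions and aliases support this directly: the alias routing configuration is where gradual traffic shifting happens. A sketch of an &lt;code&gt;UpdateStateMachineAlias&lt;/code&gt; routing payload, with version ARNs as placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "stateMachineAliasArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentProcessingWorkflow:PROD",
  "routingConfiguration": [
    {
      "stateMachineVersionArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentProcessingWorkflow:10",
      "weight": 90
    },
    {
      "stateMachineVersionArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentProcessingWorkflow:11",
      "weight": 10
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;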

&lt;h4&gt;
  
  
  3) Keep contract compatibility during rollout windows
&lt;/h4&gt;

&lt;p&gt;If Parent v3 is rolling out while Child &lt;code&gt;Payments:PROD&lt;/code&gt; shifts from v10 to v11, I want a compatibility window where both versions honor the same contract or the parent chooses a matching alias (&lt;code&gt;PAYMENTS_V1&lt;/code&gt;, &lt;code&gt;PAYMENTS_V2&lt;/code&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  4) Prefer additive contract changes
&lt;/h4&gt;

&lt;p&gt;Safer changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add optional output fields&lt;/li&gt;
&lt;li&gt;add optional input fields&lt;/li&gt;
&lt;li&gt;add new reason codes without changing existing semantics (with care)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Riskier changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;renaming fields&lt;/li&gt;
&lt;li&gt;changing meaning of status codes&lt;/li&gt;
&lt;li&gt;changing failure behavior from “return business failure” to “throw”&lt;/li&gt;
&lt;li&gt;changing data types&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5) Test parent-child compatibility explicitly
&lt;/h4&gt;

&lt;p&gt;I maintain fixtures and contract tests for parent-child integration, especially around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing optional fields&lt;/li&gt;
&lt;li&gt;unexpected extra fields&lt;/li&gt;
&lt;li&gt;business failure responses&lt;/li&gt;
&lt;li&gt;timeout and retry behavior&lt;/li&gt;
&lt;/ul&gt;
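
&lt;p&gt;For example, a business-failure fixture I would keep alongside the payment contract (the failure fields are illustrative, matching the envelope shown earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "meta": { "correlationId": "corr-123", "contractVersion": "1.0" },
  "result": {
    "authorized": false,
    "failure": {
      "type": "BUSINESS",
      "reasonCode": "CARD_DECLINED",
      "retryable": false
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;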




&lt;h2&gt;
  
  
  Reference Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2hecouuo3am4miqjqqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2hecouuo3am4miqjqqa.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  End-to-end walkthrough: decomposing an Order Processing workflow
&lt;/h2&gt;

&lt;p&gt;I will use a realistic example because this is where the trade-offs become visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  The original monolithic workflow (before)
&lt;/h3&gt;

&lt;p&gt;We start with one large &lt;code&gt;OrderProcessing&lt;/code&gt; state machine that does all of this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate order&lt;/li&gt;
&lt;li&gt;fraud check&lt;/li&gt;
&lt;li&gt;authorize payment&lt;/li&gt;
&lt;li&gt;reserve inventory&lt;/li&gt;
&lt;li&gt;create shipment request&lt;/li&gt;
&lt;li&gt;send notifications&lt;/li&gt;
&lt;li&gt;persist status updates&lt;/li&gt;
&lt;li&gt;handle retries and compensation for multiple domains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works, but over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payments team changes create merge conflicts with Fulfillment changes&lt;/li&gt;
&lt;li&gt;The workflow definition is difficult to review&lt;/li&gt;
&lt;li&gt;Troubleshooting a failed shipment step requires scrolling through unrelated payment/fraud logic&lt;/li&gt;
&lt;li&gt;Reusable subprocesses (payments, notifications) are duplicated elsewhere&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The decomposed target architecture (after)
&lt;/h3&gt;

&lt;p&gt;I split the design into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parent workflow: &lt;code&gt;OrderOrchestrator&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coordinates the overall business flow&lt;/li&gt;
&lt;li&gt;invokes child workflows&lt;/li&gt;
&lt;li&gt;makes continuation/compensation decisions&lt;/li&gt;
&lt;li&gt;emits parent-level events/status transitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Child workflows&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PaymentProcessingWorkflow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;InventoryReservationWorkflow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FulfillmentSubmissionWorkflow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CustomerNotificationWorkflow&lt;/code&gt; (optional, often event-driven instead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each child workflow owns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local retries&lt;/li&gt;
&lt;li&gt;domain-specific branching&lt;/li&gt;
&lt;li&gt;domain telemetry&lt;/li&gt;
&lt;li&gt;domain-specific error normalization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why this split works
&lt;/h3&gt;

&lt;p&gt;This decomposition aligns with domain boundaries and independent change cadence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payments evolves frequently due to PSP integration and fraud strategy&lt;/li&gt;
&lt;li&gt;Inventory may change due to warehouse logic&lt;/li&gt;
&lt;li&gt;Fulfillment is often async and externally coupled&lt;/li&gt;
&lt;li&gt;Notifications are loosely coupled and may be event-driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The parent remains readable and focused on business progression.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture and flow (walkthrough narrative)
&lt;/h2&gt;

&lt;p&gt;Here is the end-to-end flow in the decomposed design.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) API receives &lt;code&gt;CreateOrder&lt;/code&gt; request
&lt;/h3&gt;

&lt;p&gt;The API layer validates basic request shape, stamps a correlation ID, and starts the parent &lt;code&gt;OrderOrchestrator&lt;/code&gt; workflow (or publishes a command that triggers it, depending on your system style).&lt;/p&gt;
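
&lt;p&gt;For illustration, the initial parent input can be as small as this; the field names line up with the JSONPath references used in the ASL later in this post (the items shape is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "meta": {
    "correlationId": "corr-123"
  },
  "order": {
    "orderId": "ORD-100045",
    "customerId": "CUST-9001",
    "totalAmount": 119.85,
    "currency": "AUD",
    "paymentMethodToken": "tok_123",
    "items": [
      { "sku": "SKU-1", "quantity": 2 }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;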

&lt;h3&gt;
  
  
  2) Parent workflow performs lightweight order validation
&lt;/h3&gt;

&lt;p&gt;The parent performs only orchestration-level checks (for example, required-field presence checks if they have not already been done), then constructs a &lt;strong&gt;contracted input&lt;/strong&gt; for the payment child workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Parent invokes &lt;code&gt;PaymentProcessingWorkflow&lt;/code&gt; as a synchronous child
&lt;/h3&gt;

&lt;p&gt;The parent waits for payment output because the next step depends on authorization success.&lt;/p&gt;

&lt;p&gt;The child workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;performs fraud/risk checks (if owned by Payments)&lt;/li&gt;
&lt;li&gt;authorizes payment with PSP&lt;/li&gt;
&lt;li&gt;normalizes provider-specific responses&lt;/li&gt;
&lt;li&gt;returns a stable result contract&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The parent receives only what it needs, not the child’s full internal state.&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Parent invokes &lt;code&gt;InventoryReservationWorkflow&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;If payment is authorized, the parent calls inventory reservation as another synchronous child and receives a normalized reservation result.&lt;/p&gt;

&lt;h3&gt;
  
  
  5) Parent branches based on combined business outcomes
&lt;/h3&gt;

&lt;p&gt;The parent now makes a high-level decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;continue to fulfillment&lt;/li&gt;
&lt;li&gt;compensate payment if inventory failed&lt;/li&gt;
&lt;li&gt;reject order&lt;/li&gt;
&lt;li&gt;send manual review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly where a parent orchestrator adds value.&lt;/p&gt;

&lt;h3&gt;
  
  
  6) Parent starts &lt;code&gt;FulfillmentSubmissionWorkflow&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This may be synchronous or asynchronous depending on downstream fulfillment systems.&lt;/p&gt;

&lt;p&gt;If asynchronous:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the parent may start the child and persist a pending status&lt;/li&gt;
&lt;li&gt;later completion may resume a follow-up workflow or emit events that drive downstream steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7) Notifications and analytics are triggered
&lt;/h3&gt;

&lt;p&gt;I often prefer event-driven notification/analytics fan-out instead of keeping them in the critical path. If they are kept as a child workflow, I keep the contract minimal and the failure policy explicit (for example, a notification failure should not fail order creation).&lt;/p&gt;
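
&lt;p&gt;If notifications stay in-line as a child workflow, this is roughly how I express that failure policy in the parent: catch everything, record the failure, and keep going (state names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "InvokeNotificationChild": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution.sync:2",
    "Parameters": {
      "StateMachineArn": "${NotificationWorkflowAliasArn}",
      "Input": { "meta.$": "$.meta", "request.$": "$.notificationRequest" }
    },
    "ResultPath": "$.notification",
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.notificationError",
        "Next": "PublishFinalOrderStatus"
      }
    ],
    "Next": "PublishFinalOrderStatus"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;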

&lt;h3&gt;
  
  
  8) Parent publishes final order status and completes
&lt;/h3&gt;

&lt;p&gt;The parent emits a domain event (for example, &lt;code&gt;OrderAccepted&lt;/code&gt;, &lt;code&gt;OrderPendingFulfillment&lt;/code&gt;, or &lt;code&gt;OrderRejected&lt;/code&gt;) and completes with a stable external result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation discussion
&lt;/h2&gt;

&lt;p&gt;Now I will show concrete examples of how I implement this pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parent workflow (ASL) using nested child workflows
&lt;/h2&gt;

&lt;p&gt;This example uses Step Functions service integration to start child workflows and wait for results. I use &lt;code&gt;startExecution.sync:2&lt;/code&gt; because it returns child output as JSON rather than a JSON-encoded string, which makes downstream data handling cleaner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Order orchestrator parent workflow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BuildPaymentRequest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"BuildPaymentRequest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"causationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"orderId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"customerId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.customerId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"amount.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.totalAmount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"currency.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.currency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"paymentMethodToken.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.paymentMethodToken"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentCall"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"InvokePaymentChild"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"InvokePaymentChild"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::states:startExecution.sync:2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StateMachineArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${PaymentWorkflowAliasArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"meta.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentCall.meta"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"request.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentCall.request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Retry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ErrorEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"StepFunctions.ExecutionLimitExceeded"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"IntervalSeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"BackoffRate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"MaxAttempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NormalizePaymentResult"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"NormalizePaymentResult"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"authorized.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentExecution.Output.result.authorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"authorizationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentExecution.Output.result.authorizationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"processorReference.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentExecution.Output.result.processorReference"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.payment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PaymentDecision"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PaymentDecision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Choice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.payment.authorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"BooleanEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BuildInventoryRequest"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RejectOrder"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"BuildInventoryRequest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"causationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"orderId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"items.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.items"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"warehousePreference.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.warehousePreference"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryCall"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"InvokeInventoryChild"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"InvokeInventoryChild"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::states:startExecution.sync:2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StateMachineArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${InventoryWorkflowAliasArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"meta.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryCall.meta"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"request.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryCall.request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"InventoryDecision"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"InventoryDecision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Choice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryExecution.Output.result.reserved"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"BooleanEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"StartFulfillmentChild"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CompensatePayment"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StartFulfillmentChild"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::states:startExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StateMachineArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${FulfillmentWorkflowAliasArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"causationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"orderId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"reservationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryExecution.Output.result.reservationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"deliveryAddress.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.deliveryAddress"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.fulfillmentStart"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CompleteAccepted"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"CompensatePayment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::states:startExecution.sync:2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StateMachineArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${PaymentCompensationWorkflowAliasArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"causationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"orderId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"authorizationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.payment.authorizationId"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentCompensation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RejectOrder"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RejectOrder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Succeed"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"CompleteAccepted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Succeed"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why this parent is easier to maintain
&lt;/h3&gt;

&lt;p&gt;The parent workflow now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;focuses on sequencing and business decisions&lt;/li&gt;
&lt;li&gt;calls domain-owned child workflows through aliases&lt;/li&gt;
&lt;li&gt;passes minimal, explicit contracts&lt;/li&gt;
&lt;li&gt;can evolve orchestration without rewriting domain subprocess internals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of decomposition I want.&lt;/p&gt;




&lt;h2&gt;
  
  
  Child workflow example: &lt;code&gt;PaymentProcessingWorkflow&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I keep the child focused and domain-owned. This example is simplified, but it shows the pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Payment processing child workflow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ValidateContract"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ValidateContract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Choice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"And"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.contractVersion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.request.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"IsPresent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.request.amount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"IsPresent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.request.paymentMethodToken"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"IsPresent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AuthorizePayment"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ContractError"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AuthorizePayment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::lambda:invoke"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"FunctionName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${AuthorizePaymentFnArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Payload.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultSelector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"result.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.Payload"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Retry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ErrorEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Lambda.ServiceException"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Lambda.SdkClientException"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.TaskFailed"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"IntervalSeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"BackoffRate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"MaxAttempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BuildSuccessResponse"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"BuildSuccessResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"authorized.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.auth.result.authorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"authorizationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.auth.result.authorizationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"processorReference.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.auth.result.processorReference"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ContractError"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fail"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ContractValidationError"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Cause"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Invalid child workflow input contract"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Design choice I recommend
&lt;/h3&gt;

&lt;p&gt;Notice that the child returns a &lt;strong&gt;normalized result contract&lt;/strong&gt;, not raw PSP payloads. This prevents the parent from becoming coupled to provider-specific fields and keeps domain ownership intact.&lt;/p&gt;
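
&lt;p&gt;A minimal sketch of that normalization at the edge of the child's &lt;code&gt;AuthorizePayment&lt;/code&gt; Lambda; the PSP response field names here are invented for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Invented PSP response shape, for illustration only.
interface PspAuthResponse {
  status: string; // e.g. "APPROVED" or "DECLINED"
  auth_code: string;
  psp_txn_ref: string;
}

// The normalized result the child promises to its parent (see the ASL above).
interface NormalizedAuthResult {
  authorized: boolean;
  authorizationId: string;
  processorReference: string;
}

// Translate provider-specific fields into the stable contract at the boundary,
// so PSP details never leak into the parent workflow.
export function toNormalizedResult(psp: PspAuthResponse): NormalizedAuthResult {
  return {
    authorized: psp.status === "APPROVED",
    authorizationId: psp.auth_code,
    processorReference: psp.psp_txn_ref,
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;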




&lt;h2&gt;
  
  
  TypeScript contract definitions (shared library)
&lt;/h2&gt;

&lt;p&gt;I typically create a small shared library for workflow contracts (or generate types from JSON Schema/OpenAPI where appropriate).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// packages/workflow-contracts/src/payment.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;WorkflowMeta&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;causationId&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;contractVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PaymentChildRequestV1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;WorkflowMeta&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;contractVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;paymentMethodToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PaymentChildSuccessV1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;contractVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;authorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;authorizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;processorReference&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PaymentChildBusinessFailureV1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;contractVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;authorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;reasonCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;RISK_REJECTED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INSUFFICIENT_FUNDS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PROCESSOR_DECLINED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;processorReference&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This type layer does not replace runtime validation, but it dramatically improves correctness in parent-child integration code and tests.&lt;/p&gt;
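
&lt;p&gt;Since Step Functions passes plain JSON at runtime, I pair these types with a small guard that the child's Lambda handlers call before doing any work. The sketch below complements the &lt;code&gt;ValidateContract&lt;/code&gt; state above, which only checks field presence, by also checking types; a JSON Schema validator is a drop-in replacement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// packages/workflow-contracts/src/payment.guard.ts (illustrative path)
import { PaymentChildRequestV1 } from "./payment";

// Minimal structural guard for the v1 request contract.
export function isPaymentChildRequestV1(input: unknown): input is PaymentChildRequestV1 {
  if (typeof input !== "object" || input === null) {
    return false;
  }
  const candidate = input as { meta?: any; request?: any };
  if (candidate.meta === undefined || candidate.meta.contractVersion !== "1.0") {
    return false;
  }
  if (candidate.request === undefined) {
    return false;
  }
  if (typeof candidate.request.orderId !== "string") {
    return false;
  }
  if (typeof candidate.request.amount !== "number") {
    return false;
  }
  if (typeof candidate.request.paymentMethodToken !== "string") {
    return false;
  }
  return true;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;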




&lt;h2&gt;
  
  
  CDK wiring example (parent and child aliases)
&lt;/h2&gt;

&lt;p&gt;This example shows the shape of how I wire aliases and pass alias ARNs to the parent workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;aws-cdk-lib/aws-stepfunctions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderWorkflowsStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Assume these are already defined with actual definitions&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;paymentChild&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StateMachine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PaymentWorkflow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;definitionBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DefinitionBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{"StartAt":"Done","States":{"Done":{"Type":"Succeed"}}}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inventoryChild&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StateMachine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;InventoryWorkflow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;definitionBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DefinitionBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{"StartAt":"Done","States":{"Done":{"Type":"Succeed"}}}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Publish immutable versions (illustrative)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;paymentVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnStateMachineVersion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PaymentWorkflowVersion&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;stateMachineArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;paymentChild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stateMachineArn&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnStateMachineAlias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PaymentWorkflowProdAlias&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PROD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;routingConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;stateMachineVersionArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;paymentVersion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attrStateMachineVersionArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inventoryVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnStateMachineVersion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;InventoryWorkflowVersion&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;stateMachineArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inventoryChild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stateMachineArn&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnStateMachineAlias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;InventoryWorkflowProdAlias&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PROD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;routingConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;stateMachineVersionArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inventoryVersion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attrStateMachineVersionArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Parent definition would consume these alias ARNs (via substitutions/templating)&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PaymentWorkflowAliasArn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;paymentChild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stateMachineArn&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:PROD`&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;InventoryWorkflowAliasArn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;inventoryChild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stateMachineArn&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:PROD`&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// In production, ensure the parent role has least-privilege for nested calls.&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What I pay attention to in deployment pipelines
&lt;/h3&gt;

&lt;p&gt;For child workflows, I want CI/CD to support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;running contract tests&lt;/li&gt;
&lt;li&gt;running workflow unit/integration tests&lt;/li&gt;
&lt;li&gt;publishing a new immutable version&lt;/li&gt;
&lt;li&gt;shifting the alias gradually (canary/linear where appropriate)&lt;/li&gt;
&lt;li&gt;rolling the alias back quickly if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where decomposition pays off operationally. I can deploy a Payment child workflow change without touching the Inventory child or the parent orchestrator if the contract remains stable.&lt;/p&gt;
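
&lt;p&gt;As a sketch, the publish-and-shift step can be two calls with &lt;code&gt;@aws-sdk/client-sfn&lt;/code&gt;, assuming the &lt;code&gt;PublishStateMachineVersion&lt;/code&gt; and &lt;code&gt;UpdateStateMachineAlias&lt;/code&gt; APIs; canary bake time, alarms, and rollback triggers are omitted here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import {
  SFNClient,
  PublishStateMachineVersionCommand,
  UpdateStateMachineAliasCommand,
} from "@aws-sdk/client-sfn";

const client = new SFNClient({});

// Publish the just-deployed revision as an immutable version, then point the
// alias at it. A canary would instead keep two routingConfiguration entries
// whose weights sum to 100 and shift the weights over time.
export async function promoteChildWorkflow(stateMachineArn: string, aliasArn: string) {
  const version = await client.send(
    new PublishStateMachineVersionCommand({ stateMachineArn })
  );

  await client.send(
    new UpdateStateMachineAliasCommand({
      stateMachineAliasArn: aliasArn,
      routingConfiguration: [
        { stateMachineVersionArn: version.stateMachineVersionArn, weight: 100 },
      ],
    })
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;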




&lt;h2&gt;
  
  
  IAM and permissions for nested workflows (important operational detail)
&lt;/h2&gt;

&lt;p&gt;Nested workflows are straightforward conceptually, but the IAM details matter.&lt;/p&gt;

&lt;p&gt;When the parent waits synchronously for a child (the &lt;code&gt;.sync&lt;/code&gt; integration), the parent execution role needs more than &lt;code&gt;states:StartExecution&lt;/code&gt;: it also needs &lt;code&gt;states:DescribeExecution&lt;/code&gt; and &lt;code&gt;states:StopExecution&lt;/code&gt; on the child's executions, plus &lt;code&gt;events:PutTargets&lt;/code&gt;, &lt;code&gt;events:PutRule&lt;/code&gt;, and &lt;code&gt;events:DescribeRule&lt;/code&gt; on the managed EventBridge rule Step Functions uses to track completion. I always validate the parent execution role permissions for nested patterns during deployment and in pre-prod tests, because a missing permission tends to surface as confusing delays or stuck executions rather than a clear error.&lt;/p&gt;

&lt;p&gt;I also scope permissions narrowly to the child workflows the parent is actually allowed to call. Decomposition should improve boundaries, not weaken them.&lt;/p&gt;
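
&lt;p&gt;A sketch of how I scope that in CDK, assuming the parent state machine and child constructs from the wiring example above. The EventBridge rule at the end is the managed rule the &lt;code&gt;.sync&lt;/code&gt; integration relies on to learn about child completion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import * as cdk from "aws-cdk-lib";
import * as iam from "aws-cdk-lib/aws-iam";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";

// Grant a parent state machine just enough to run one child via
// startExecution.sync, scoped to that child only.
export function grantNestedSyncCall(
  stack: cdk.Stack,
  parent: sfn.StateMachine,
  child: sfn.StateMachine
): void {
  // Start the child (unqualified and alias-qualified ARNs).
  parent.addToRolePolicy(new iam.PolicyStatement({
    actions: ["states:StartExecution"],
    resources: [child.stateMachineArn, `${child.stateMachineArn}:*`],
  }));

  // The .sync pattern also describes and can stop the child's executions.
  parent.addToRolePolicy(new iam.PolicyStatement({
    actions: ["states:DescribeExecution", "states:StopExecution"],
    resources: [
      `arn:aws:states:${stack.region}:${stack.account}:execution:${child.stateMachineName}:*`,
    ],
  }));

  // Completion of the child is tracked through a managed EventBridge rule.
  parent.addToRolePolicy(new iam.PolicyStatement({
    actions: ["events:PutTargets", "events:PutRule", "events:DescribeRule"],
    resources: [
      `arn:aws:events:${stack.region}:${stack.account}:rule/StepFunctionsGetEventsForStepFunctionsExecutionRule`,
    ],
  }));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;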




&lt;h2&gt;
  
  
  Observability after decomposition
&lt;/h2&gt;

&lt;p&gt;A common concern is that decomposition makes tracing harder because the work is spread across multiple executions.&lt;/p&gt;

&lt;p&gt;In practice, I have found the opposite to be true &lt;strong&gt;when I propagate correlation metadata correctly&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I propagate into every child
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;correlationId&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;causationId&lt;/code&gt; (usually the parent execution ID)&lt;/li&gt;
&lt;li&gt;contract version&lt;/li&gt;
&lt;li&gt;domain entity ID (for example, &lt;code&gt;orderId&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I log in each child
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;child workflow name and alias/version (where possible)&lt;/li&gt;
&lt;li&gt;start/end timestamps&lt;/li&gt;
&lt;li&gt;business outcome&lt;/li&gt;
&lt;li&gt;retry counts / terminal error classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it much easier to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which child failed?&lt;/li&gt;
&lt;li&gt;Was it a contract issue or domain issue?&lt;/li&gt;
&lt;li&gt;Which version of the child handled the request?&lt;/li&gt;
&lt;li&gt;Did rollback change the outcome?&lt;/li&gt;
&lt;/ul&gt;
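
&lt;p&gt;To make those questions answerable from logs alone, I have each child Lambda emit one structured log line per invocation. A minimal sketch; the field names are illustrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Child Lambda sketch: one structured line per invocation, so parent and child
// executions can be joined on correlationId in CloudWatch Logs Insights.
interface ChildEvent {
  meta: { correlationId: string; causationId?: string; contractVersion: string };
  request: { orderId: string };
}

export async function handler(event: ChildEvent) {
  const startedAt = Date.now();

  // ... domain work happens here ...
  const outcome = { authorized: true }; // illustrative business outcome

  console.log(JSON.stringify({
    workflow: "PaymentProcessingWorkflow", // plus alias/version when available
    correlationId: event.meta.correlationId,
    causationId: event.meta.causationId,
    contractVersion: event.meta.contractVersion,
    orderId: event.request.orderId,
    outcome,
    durationMs: Date.now() - startedAt,
  }));

  return outcome;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;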




&lt;h2&gt;
  
  
  How to split by domain and subprocess in practice
&lt;/h2&gt;

&lt;p&gt;When teams ask me “where exactly should we split?”, I usually run a quick decomposition workshop with these prompts:&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 1: Which parts change for different business reasons?
&lt;/h3&gt;

&lt;p&gt;If payment changes because of PSP behavior and inventory changes because of warehouse logic, those belong in different subprocesses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 2: Which parts require different failure semantics?
&lt;/h3&gt;

&lt;p&gt;If notification failure should not fail order acceptance, that is a strong candidate for decoupling from the parent critical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 3: Which parts are reusable?
&lt;/h3&gt;

&lt;p&gt;If onboarding, checkout, and subscription renewal all need the same payment authorization flow, that is a candidate child workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 4: Which parts have different owners/on-call teams?
&lt;/h3&gt;

&lt;p&gt;Team boundaries are not the only factor, but they matter operationally. A child workflow with clear ownership improves support and release confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 5: Which parts make the parent harder to read than the business process itself?
&lt;/h3&gt;

&lt;p&gt;That is usually the part I extract first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Migration strategy: from one monolith workflow to decomposed workflows safely
&lt;/h2&gt;

&lt;p&gt;I do not recommend a big-bang rewrite. I prefer incremental extraction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Identify one extraction candidate
&lt;/h3&gt;

&lt;p&gt;Pick a subprocess with clear boundaries (for example, Payments).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Define the contract before extracting
&lt;/h3&gt;

&lt;p&gt;Write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;child input schema/type&lt;/li&gt;
&lt;li&gt;child output schema/type&lt;/li&gt;
&lt;li&gt;failure behavior&lt;/li&gt;
&lt;li&gt;timeouts and retries&lt;/li&gt;
&lt;/ul&gt;
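
&lt;p&gt;The input and output types can live in the shared contracts library shown earlier. For failure behavior, timeouts, and retries, I like a small constant that the child's tests and the parent's Task configuration can both reference. A sketch with illustrative values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// packages/workflow-contracts/src/payment.contract.ts (illustrative)
export const PaymentChildContractV1 = {
  contractVersion: "1.0",
  // How long the parent should be willing to wait for the child.
  parentTaskTimeoutSeconds: 300,
  // Infrastructure errors the caller may retry.
  retryableErrors: [
    "Lambda.ServiceException",
    "Lambda.SdkClientException",
    "States.TaskFailed",
  ],
  // Business failures the child reports in-band, not as workflow errors.
  businessFailureReasonCodes: [
    "RISK_REJECTED",
    "INSUFFICIENT_FUNDS",
    "PROCESSOR_DECLINED",
  ],
} as const;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;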

&lt;h3&gt;
  
  
  Step 3: Extract the logic into a child workflow
&lt;/h3&gt;

&lt;p&gt;Keep behavior equivalent first. Avoid redesigning everything in the same change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Update parent to call child via alias
&lt;/h3&gt;

&lt;p&gt;Use a stable alias (for example, &lt;code&gt;PROD&lt;/code&gt;) so future child changes do not require parent definition changes.&lt;/p&gt;
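
&lt;p&gt;The alias itself is where the deployment flexibility lives: it maps a stable name to one or two immutable versions with weights, so the child team can shift traffic without touching the parent. Roughly, an alias definition looks like this (ARNs, version numbers, and weights are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "name": "PROD",
  "description": "Stable alias that parent workflows target",
  "routingConfiguration": [
    { "stateMachineVersionArn": "arn:aws:states:eu-west-1:123456789012:stateMachine:PaymentWorkflow:11", "weight": 90 },
    { "stateMachineVersionArn": "arn:aws:states:eu-west-1:123456789012:stateMachine:PaymentWorkflow:12", "weight": 10 }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;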

&lt;h3&gt;
  
  
  Step 5: Add compatibility and regression tests
&lt;/h3&gt;

&lt;p&gt;Test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;happy path&lt;/li&gt;
&lt;li&gt;business failure path&lt;/li&gt;
&lt;li&gt;timeout/retry path&lt;/li&gt;
&lt;li&gt;malformed contract path&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6: Repeat for the next extraction
&lt;/h3&gt;

&lt;p&gt;After 1-2 successful extractions, teams usually become much more comfortable with the pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  What not to do
&lt;/h2&gt;

&lt;p&gt;I have seen a few anti-patterns appear during decomposition efforts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 1: "Micro-workflow everything"
&lt;/h3&gt;

&lt;p&gt;Creating a child workflow for every tiny step adds ceremony without improving maintainability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 2: Passing the entire parent payload into every child
&lt;/h3&gt;

&lt;p&gt;This preserves hidden coupling and makes contracts meaningless.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 3: Parent depends on child internals
&lt;/h3&gt;

&lt;p&gt;If the parent reads deeply nested provider-specific details returned by a child, you have recreated coupling through outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 4: No versioning strategy
&lt;/h3&gt;

&lt;p&gt;Without aliases/versions and contract discipline, decomposition can increase operational risk instead of reducing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 5: Decomposition without ownership
&lt;/h3&gt;

&lt;p&gt;If nobody owns a child workflow end-to-end, incidents become harder, not easier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;A Step Functions workflow becoming “too large” is not the real problem. The real problem is when &lt;strong&gt;workflow boundaries stop matching business and domain boundaries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When that happens, decomposition is not about making the diagram prettier. It is about restoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;change safety&lt;/li&gt;
&lt;li&gt;testability&lt;/li&gt;
&lt;li&gt;ownership&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;architectural clarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern I keep coming back to is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parent workflow&lt;/strong&gt; for orchestration decisions and business progression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Child workflows&lt;/strong&gt; for domain-owned subprocesses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit contracts&lt;/strong&gt; for inputs/outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioned deployments&lt;/strong&gt; via immutable versions + aliases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong observability metadata&lt;/strong&gt; across execution boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is how I keep Step Functions as an orchestration asset, rather than letting it become a serverless monolith.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Step Functions Developer Guide (nested workflows, service integrations)&lt;/li&gt;
&lt;li&gt;AWS Step Functions Developer Guide (starting workflows from a task state / &lt;code&gt;StartExecution&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;AWS Step Functions Developer Guide (versions and aliases)&lt;/li&gt;
&lt;li&gt;AWS Step Functions Developer Guide (continuous deployments with versions and aliases)&lt;/li&gt;
&lt;li&gt;AWS Step Functions Developer Guide (best practices)&lt;/li&gt;
&lt;li&gt;AWS Step Functions service quotas documentation&lt;/li&gt;
&lt;li&gt;AWS IAM documentation (least privilege for service integrations)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>KiroHub: Generate a Kiro Skill in 60 Seconds Built With Bedrock Registry and AgentCore Harness</title>
      <dc:creator>Alvaro Llamojha</dc:creator>
      <pubDate>Sun, 03 May 2026 20:41:28 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/kirohub-generate-a-kiro-skill-in-60-seconds-built-with-bedrock-registry-and-agentcore-harness-35bf</link>
      <guid>https://vibe.forem.com/aws-builders/kirohub-generate-a-kiro-skill-in-60-seconds-built-with-bedrock-registry-and-agentcore-harness-35bf</guid>
      <description>&lt;p&gt;I used two Amazon Bedrock AgentCore capabilities, Amazon Bedrock Registry for hybrid search over 10k+ Kiro resources, and AgentCore Harness for testing generated skills against a real agent, to build an AI-powered skill generator for Kiro Hub. Try it at &lt;a href="https://kirohub.dev/generate" rel="noopener noreferrer"&gt;kirohub.dev/generate&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The blank file problem
&lt;/h2&gt;

&lt;p&gt;I've been building &lt;a href="https://kirohub.dev" rel="noopener noreferrer"&gt;Kiro Hub&lt;/a&gt; for a few months now. The hub has over 10,000 community resources, including steering files, hooks, agents, and skills. You can browse, search, and install any of them with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx kirohub add &amp;lt;slug&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wanted to expand Kiro Hub, and the next logical step was creating new resources based on the existing dataset of 10k+. I decided to start with Agent Skills. That meant I needed a better, more secure way to ingest custom-made skills. But there was another problem: how do you test a skill? &lt;/p&gt;

&lt;p&gt;So I decided to adopt Bedrock Registry to evolve Kiro Hub into a proper AI context registry, with a status and steps to move a resource from draft to available. Bedrock AgentCore Harness is a solid, secure way to run agents, and it also supports Skills, which matches my requirement of testing agent skills in a sandbox. Why not connect those pieces?&lt;/p&gt;

&lt;h2&gt;
  
  
  Create meaningful Skills
&lt;/h2&gt;

&lt;p&gt;The feature lives at &lt;a href="https://kirohub.dev/generate" rel="noopener noreferrer"&gt;kirohub.dev/generate&lt;/a&gt;. You describe what you need in plain language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a skill for AWS Lambda error handling best practices&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;or:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I need a skill that helps me write Haiku poems and explains them &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7tfw20klyrxjhhrlswl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7tfw20klyrxjhhrlswl.png" alt="Haiku skill generation" width="800" height="764"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system generates a complete, structured &lt;code&gt;SKILL.md&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;It is a chat-based interface. You can refine the skill with follow-ups, test it against a real agent to see whether the instructions actually work, and publish it to the hub with one click. From prompt to published, installable skill, the normal path takes under a minute.&lt;/p&gt;

&lt;p&gt;The interesting part is not the editor or the Lambda functions. The interesting part is the combination of retrieval and testing. Registry makes the generated skill more specific. Harness makes the test more realistic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval and storage with Amazon Bedrock Registry
&lt;/h2&gt;

&lt;p&gt;The obvious approach to skill generation is simple but naive: give a model a prompt, explain the &lt;code&gt;SKILL.md&lt;/code&gt; format, and ask it to generate something.&lt;/p&gt;

&lt;p&gt;What makes a skill useful is specificity. Concrete patterns, opinionated guidance, real-world trade-offs, and a structure that an agent can follow. That kind of content already exists across the 10,000+ resources in Kiro Hub. The question was how to get the right examples in front of the model at generation time. And for this I had to evolve Kiro Hub into a proper registry. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/registry.html" rel="noopener noreferrer"&gt;AWS Agent Registry&lt;/a&gt; solves that part. Kiro Hub resources are synced to the Registry as descriptors with names, descriptions, content references, and metadata. Kiro Hub can then resolve matched records back to the full source content used as generation context.&lt;/p&gt;

&lt;p&gt;The Registry exposes a built-in MCP endpoint. The &lt;code&gt;generate-skill&lt;/code&gt; Lambda calls it server-side with JSON-RPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"jsonrpc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tools/call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_registry_records"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"searchQuery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS Lambda error handling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Registry search uses both semantic and keyword matching, so the query does not need to match exact words. A search for “Lambda error handling” can surface related resources about serverless observability, retry strategies, operational debugging, and CloudWatch logging.&lt;/p&gt;

&lt;p&gt;On the generation side, the Lambda exposes this as a &lt;code&gt;search_skills&lt;/code&gt; tool to the model. The model decides what to search for and when. For a PostgreSQL migration skill, it might search for “database migration patterns,” “PostgreSQL best practices,” and “schema versioning” separately, then synthesize the useful parts into a new skill.&lt;/p&gt;
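
&lt;p&gt;As a rough sketch, a &lt;code&gt;search_skills&lt;/code&gt; tool definition in the Bedrock Converse tool format could look like the following; the schema fields here are illustrative, not the exact ones used in the Lambda:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "toolSpec": {
    "name": "search_skills",
    "description": "Search existing Kiro Hub resources for patterns and inspiration",
    "inputSchema": {
      "json": {
        "type": "object",
        "properties": {
          "query": { "type": "string", "description": "what to search the registry for" },
          "maxResults": { "type": "integer" }
        },
        "required": ["query"]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;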

&lt;p&gt;That changes the output. Without retrieval, the model writes from general knowledge. With retrieval, it has seen how other skill authors structured similar guidance, what sections they included, what tools they referenced, and how specific they were.&lt;/p&gt;

&lt;p&gt;Personally, I find transparency very important, so the inspiration sources also show up in the UI. You can see which existing resources influenced the generated skill and click through to the originals on Kiro Hub. That is useful during refinement. If the model pulled in something that is not quite relevant, you can steer it in another direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiizwljq5ndk3rhxyw6x9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiizwljq5ndk3rhxyw6x9.png" alt="Inspiration sources from Registry" width="349" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing skills with Amazon Bedrock AgentCore Harness
&lt;/h2&gt;

&lt;p&gt;We now have a skill based on other working skills. But how do we trust that the newly generated skill works as expected? &lt;/p&gt;

&lt;p&gt;A skill is not just markdown. It is a set of instructions that an agent has to discover, load, and follow. You cannot properly evaluate that by reading the file. You need to run it in something close to the environment where it will actually be used.&lt;/p&gt;

&lt;p&gt;That is where Amazon Bedrock &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/harness.html" rel="noopener noreferrer"&gt;AgentCore Harness&lt;/a&gt; fits in. &lt;/p&gt;

&lt;p&gt;A Harness is a managed, config-based agent environment. You configure the model, system prompt, skills, tools, memory, limits, and runtime environment. Each session runs in an isolated environment, and reusing the same session ID lets you continue the conversation for follow-up tests. This allows me to test 'risky' skills without having to compromise my environments. &lt;/p&gt;

&lt;p&gt;When a user tests a generated skill, the system does three things:&lt;/p&gt;

&lt;p&gt;First, the &lt;code&gt;test-skill&lt;/code&gt; Lambda writes the generated &lt;code&gt;SKILL.md&lt;/code&gt; into the session filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/workspace/skills/test-skill/SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then it invokes the Harness with the skill path and the user’s test scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/workspace/skills/test-skill"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I need help setting up error handling for my Node.js Lambda"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, it streams the agent response back to the UI.&lt;/p&gt;

&lt;p&gt;The important detail is that this is not just “put the skill in the system prompt and call a model.” The skill is loaded from a path, discovered through its frontmatter, and activated when the scenario is relevant. That is the Agent Skill behavior I care about testing.&lt;/p&gt;

&lt;p&gt;If the frontmatter description is vague, the agent may not activate the skill. If the instructions are too broad, the response will show it. If the examples are weak, that becomes obvious quickly. &lt;/p&gt;

&lt;p&gt;This is a feature I wanted to have across Kiro Hub: being able to test whether a resource works as expected and has no side effects (like prompt injection). It is the difference between checking whether the markdown looks good and checking whether an agent can actually use it. &lt;/p&gt;

&lt;p&gt;Harness gives me session isolation, filesystem access, stateful follow-up testing, and standard skill activation. One Harness can serve many test requests safely because isolation comes from the session. If the user wants to keep probing, the same session can continue the conversation with the skill still available.&lt;/p&gt;

&lt;p&gt;That matters for the product experience. You can generate a skill, run a realistic scenario, ask a follow-up, see what breaks, then go back and refine the instructions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxft2nsqt7mpbjtnpkfon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxft2nsqt7mpbjtnpkfon.png" alt="Testing a skill with AgentCore Harness" width="800" height="706"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The full flow
&lt;/h2&gt;

&lt;p&gt;You describe what you need in the side panel chat. The model searches the Registry for relevant resources and generates a &lt;code&gt;SKILL.md&lt;/code&gt;. You refine it in chat if needed. Then you switch to the Test tab, run it against the AgentCore Harness, inspect the response, and make changes if something is unclear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqd2fwjkg1ixmu6u6hdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqd2fwjkg1ixmu6u6hdd.png" alt="Haiku Skill Resource" width="800" height="863"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you publish, the skill is written to DynamoDB and S3, then registered in AWS Agent Registry as an &lt;code&gt;AGENT_SKILLS&lt;/code&gt; descriptor. An EventBridge rule triggers auto-validation. A Lambda function scores the skill with Bedrock across documentation quality, reusability, completeness, clarity, and specificity, then approves or rejects it based on the result. &lt;/p&gt;

&lt;p&gt;Once approved, the skill is live on Kiro Hub and installable with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx kirohub add &amp;lt;slug&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;p&gt;The next piece is Agent Builders: a guided form for creating full Kiro agent configurations in &lt;code&gt;.kiro/agents/*.json&lt;/code&gt;, not just skills. The spec is written; implementation is next. After that, the plan is to generate and test steering files, hooks, and prompts following the same approach. &lt;/p&gt;

&lt;p&gt;I am also working on Stacks: curated bundles of resources, agents, skills, and steering files, installable with one command. Think starter kits for common project types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Head to &lt;a href="https://kirohub.dev/generate" rel="noopener noreferrer"&gt;kirohub.dev/generate&lt;/a&gt;, describe what you need, and see what comes out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kiro Hub: &lt;a href="https://kirohub.dev" rel="noopener noreferrer"&gt;kirohub.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Generate a Skill: &lt;a href="https://kirohub.dev/generate" rel="noopener noreferrer"&gt;kirohub.dev/generate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS Agent Registry: &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/registry.html" rel="noopener noreferrer"&gt;docs.aws.amazon.com/bedrock-agentcore, Registry&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AgentCore Harness: &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/harness.html" rel="noopener noreferrer"&gt;docs.aws.amazon.com/bedrock-agentcore, Harness&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kiro Skills docs: &lt;a href="https://kiro.dev/docs/skills/" rel="noopener noreferrer"&gt;kiro.dev/docs/skills&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kiro</category>
      <category>ai</category>
      <category>aws</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>AWS Bedrock KB with Glue data catalog</title>
      <dc:creator>Shakir</dc:creator>
      <pubDate>Sun, 03 May 2026 17:57:36 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/aws-bedrock-kb-with-glue-data-catalog-1j9g</link>
      <guid>https://vibe.forem.com/aws-builders/aws-bedrock-kb-with-glue-data-catalog-1j9g</guid>
      <description>&lt;p&gt;Hi 👋, In this post we shall explore Bedrock's structured KB with this architecture: &lt;code&gt;Upload CSVs to S3 &amp;gt; SNS Queue &amp;gt; Crawl data with Glue &amp;gt; Query with Redshift &amp;gt; Bedrock KB &amp;gt; Query with LLM&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Let's do some of this with code. Let's get started.&lt;/p&gt;

&lt;p&gt;Clone the repo and switch to the project directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone git@github.com:networkandcode/networkandcode.github.io.git
&lt;span class="nb"&gt;cd &lt;/span&gt;structured-kb-demo/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do a &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; sync.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up the &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/.env.example" rel="noopener noreferrer"&gt;environment&lt;/a&gt; variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .env
&lt;span class="nv"&gt;AWS_ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="nv"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap-south-1
&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;

&lt;span class="nv"&gt;BEDROCK_KB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKb
&lt;span class="nv"&gt;BEDROCK_KB_IAM_POLICY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbIamPolicy
&lt;span class="nv"&gt;BEDROCK_KB_IAM_ROLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbIamRole

&lt;span class="nv"&gt;GLUE_CRAWLER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-glue-crawler
&lt;span class="nv"&gt;GLUE_CRAWLER_IAM_POLICY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbGlueCrawlerIamPolicy
&lt;span class="nv"&gt;GLUE_CRAWLER_IAM_ROLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbGlueCrawlerIamRole
&lt;span class="nv"&gt;GLUE_DB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-glue-db

&lt;span class="nv"&gt;REDSHIFT_IAM_ROLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbRedshiftIamRole
&lt;span class="nv"&gt;REDSHIFT_NAMESPACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-rs-ns
&lt;span class="nv"&gt;REDSHIFT_WORKGROUP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-rs-wg

&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-bucket
&lt;span class="nv"&gt;S3_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;inventory

&lt;span class="nv"&gt;SQS_QUEUE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-queue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common files
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/vars.py" rel="noopener noreferrer"&gt;vars&lt;/a&gt; file loads all the env vars once. The &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/arns.py" rel="noopener noreferrer"&gt;arns&lt;/a&gt; file is used to form some of the ARNs we need. And the logger file is used to set up a common &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/logger.py" rel="noopener noreferrer"&gt;logger&lt;/a&gt; for the rest of the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bucket
&lt;/h2&gt;

&lt;p&gt;Set up an S3 &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_s3_bucket.py" rel="noopener noreferrer"&gt;bucket&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_s3_bucket.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Bucket struct-kb-s3-bucket created successfully
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Queue
&lt;/h2&gt;

&lt;p&gt;Set up an SQS &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_sqs_queue.py" rel="noopener noreferrer"&gt;queue&lt;/a&gt; with an access policy that allows the S3 bucket to send messages to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_sqs_queue.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Queue created successfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
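
&lt;p&gt;The exact policy is defined in the linked &lt;code&gt;setup_sqs_queue.py&lt;/code&gt; script; for this S3-to-SQS pattern it typically looks something like the following (the account ID below is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "s3.amazonaws.com" },
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:ap-south-1:123456789012:struct-kb-queue",
      "Condition": {
        "ArnLike": { "aws:SourceArn": "arn:aws:s3:::struct-kb-bucket" }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;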



&lt;h2&gt;
  
  
  Event notification
&lt;/h2&gt;

&lt;p&gt;Update the S3 bucket to &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_s3_event_notification.py" rel="noopener noreferrer"&gt;notify&lt;/a&gt; the SQS queue on events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_s3_event_notification.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO:logger:Successfully added event notifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
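
&lt;p&gt;Under the hood this is a bucket notification configuration pointing at the queue. The script's version may differ slightly, but the shape is roughly the following (filtering on the &lt;code&gt;inventory/&lt;/code&gt; prefix is an assumption based on the &lt;code&gt;S3_FOLDER&lt;/code&gt; env var):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "QueueConfigurations": [
    {
      "QueueArn": "arn:aws:sqs:ap-south-1:123456789012:struct-kb-queue",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": { "FilterRules": [{ "Name": "prefix", "Value": "inventory/" }] }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;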



&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;Set up a Glue &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_glue_db.py" rel="noopener noreferrer"&gt;database&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_glue_db.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Glue database created successfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Crawler
&lt;/h2&gt;

&lt;p&gt;Set up an IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_glue_crawler_iam_policy.py" rel="noopener noreferrer"&gt;policy&lt;/a&gt; that allows access to the S3 bucket and SQS queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_glue_crawler_iam_policy.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Policy created successfully!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up an IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_glue_crawler_iam_role.py" rel="noopener noreferrer"&gt;role&lt;/a&gt; that attaches the policy we just defined as well as the AWS managed Glue policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_glue_crawler_iam_role.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Created role
INFO:logger:AWS Glue Service Role policy attached.
INFO:logger:Custom Glue Crawler policy attached.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now provision a Glue &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_glue_crawler.py" rel="noopener noreferrer"&gt;crawler&lt;/a&gt; and attach the role above to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_glue_crawler.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Crawler created successfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Redshift
&lt;/h2&gt;

&lt;p&gt;We shall set up a Redshift IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_redshift_iam_role.py" rel="noopener noreferrer"&gt;role&lt;/a&gt; by attaching the AWS managed policy to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_redshift_iam_role.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Created role: StructKbRedshiftIamRole
INFO:logger:Attached AmazonRedshiftAllCommandsFullAccess to StructKbRedshiftIamRole
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Provision a namespace, attach the role above to it, and also provision a &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_redshift_workgroup.py" rel="noopener noreferrer"&gt;workgroup&lt;/a&gt; to run the namespace workloads on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_redshift_workgroup.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Namespace creation initiated.
INFO:logger:Workgroup creation initiated.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  See the data
&lt;/h2&gt;

&lt;p&gt;There are two small files with sample inventory data: &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/inventory_day_1.csv" rel="noopener noreferrer"&gt;inventory1&lt;/a&gt;, &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/inventory_day_2.csv" rel="noopener noreferrer"&gt;inventory2&lt;/a&gt;.&lt;br&gt;
Let's &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/upload_csv_to_s3.py" rel="noopener noreferrer"&gt;upload&lt;/a&gt; the first one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run upload_csv_to_s3.py inventory_day_1.csv 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Upload Successful: inventory/inventory_day_1.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/run_glue_crawler.py" rel="noopener noreferrer"&gt;Run&lt;/a&gt; the crawler so that it fetches data from S3 and adds a table to the Glue database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run run_glue_crawler.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Crawler started.
INFO:logger:Crawler is still running...
INFO:logger:Crawler is still running...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler finished. Final State: READY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We did a lot with the CLI; let's do some verification from the GUI, on the web console. We can see the table in the Glue DB in the hierarchy &lt;code&gt;AWS Glue &amp;gt; Data Catalog &amp;gt; Tables&lt;/code&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnahmbdtn7si7ub8qbug1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnahmbdtn7si7ub8qbug1.png" alt="Table on glue db" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, go to &lt;code&gt;Amazon Redshift &amp;gt; Serverless &amp;gt; Query editor v2&lt;/code&gt;. Click on the workgroup, and use the default settings to connect. Run this query in the editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"awsdatacatalog"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"struct-kb-glue-db"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"inventory"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case the table name is inventory, which is the same as the S3 folder name. I got results like the ones below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9d9514msscvnwdt2mod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9d9514msscvnwdt2mod.png" alt="Redshift query result for 1 day" width="800" height="336"&gt;&lt;/a&gt;&lt;br&gt;
Note that there are 10 records.&lt;/p&gt;
&lt;h2&gt;
  
  
  Incremental data
&lt;/h2&gt;

&lt;p&gt;Now, let's add another csv file for &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/inventory_day_2.csv" rel="noopener noreferrer"&gt;day 2&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run upload_csv_to_s3.py inventory_day_2.csv 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQS queue should show there is one message available.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n0yebymrq0ehmxb6wr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n0yebymrq0ehmxb6wr2.png" alt="SQS queue status before crawler run" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can run the crawler to fetch the change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run run_glue_crawler.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQS messages available should become 0.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrz227ew4asna6tys4cd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrz227ew4asna6tys4cd.png" alt="SQS status after crawler run" width="800" height="336"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The same query in Redshift should now give 20 records.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1afel59aym0y0hqjg25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1afel59aym0y0hqjg25.png" alt="Redshift query result for 2 days" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Bedrock KB
&lt;/h2&gt;

&lt;p&gt;We got results in the Redshift editor through SQL. Now we can try to retrieve results via the Bedrock KB using natural language.&lt;/p&gt;

&lt;p&gt;Set up an IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_bedrock_kb_iam_policy.py" rel="noopener noreferrer"&gt;policy&lt;/a&gt; for the Bedrock KB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_bedrock_kb_iam_policy.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up an IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_bedrock_kb_iam_role.py" rel="noopener noreferrer"&gt;role&lt;/a&gt; and attach this policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_bedrock_kb_iam_role.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Created role: StructKbBedrockKbIamRole
INFO:logger:Attached IAM policy to BedrockKB IAM role.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create and sync the &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_bedrock_kb.py" rel="noopener noreferrer"&gt;knowledge base&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_bedrock_kb.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can go to &lt;code&gt;Amazon Bedrock &amp;gt; Knowledge Bases&lt;/code&gt; on the web console, click on the knowledge base that was created, and test it. I've used the following settings with a test prompt.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55e1y3zwcxjl84gpnych.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55e1y3zwcxjl84gpnych.png" alt="Test knowledge base" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;
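
&lt;p&gt;If you want to ask the same question programmatically rather than in the console, the &lt;code&gt;RetrieveAndGenerate&lt;/code&gt; API takes a request along these lines; the knowledge base ID and model ARN below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "input": { "text": "How many inventory records do we have in total?" },
  "retrieveAndGenerateConfiguration": {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
      "knowledgeBaseId": "KBID1234",
      "modelArn": "arn:aws:bedrock:ap-south-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;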

&lt;p&gt;Alright, that's it for this post. It was a somewhat heavy exercise overall, but I think it pays off far more with large datasets than with the simple sample data we used here. So far we have only tested with the test prompt option in the Bedrock KB; we could expand this logic and use the KB with agents built using frameworks like Strands or LangGraph. Thank you for reading!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sql</category>
      <category>claude</category>
      <category>ai</category>
    </item>
    <item>
      <title>It's All About That Memory - Using Long and Short Term Memory with Agents</title>
      <dc:creator>Darryl Ruggles</dc:creator>
      <pubDate>Sun, 03 May 2026 17:57:19 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/its-all-about-that-memory-using-long-and-short-term-memory-with-agents-2m21</link>
      <guid>https://vibe.forem.com/aws-builders/its-all-about-that-memory-using-long-and-short-term-memory-with-agents-2m21</guid>
      <description>&lt;p&gt;&lt;em&gt;Building a multi-session detective game with AgentCore Memory's 4 built-in strategies&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every AI memory demo starts the same way. "Hi, my name is Bob." Close the session, open a new one. "What's my name?" "Your name is Bob!" Confetti. Blog post done.&lt;/p&gt;

&lt;p&gt;That's not interesting.&lt;/p&gt;

&lt;p&gt;What if memory wasn't a feature bolted onto an agent - what if it was the entire product? What if the agent couldn't function without it? I wanted to build something where forgetting wasn't a minor inconvenience but a catastrophic failure. A detective who forgets the alibi they just disproved. A narrator who can't recall which suspects have been interviewed. A case file that resets to blank every time you close your browser.&lt;/p&gt;

&lt;p&gt;That's the project: a noir detective mystery game called "The Blackwell Murder," built on Amazon Bedrock AgentCore, where all 4 long-term memory strategies plus short-term memory work together to make the investigation feel continuous across sessions. The detective arrives at a crime scene, interviews suspects, examines evidence, and builds a case - and when they come back the next day, the narrator picks up exactly where they left off.&lt;/p&gt;

&lt;p&gt;The source code is on GitHub: &lt;a href="https://github.com/RDarrylR/agentcore-memory-murder-mystery" rel="noopener noreferrer"&gt;agentcore-memory-murder-mystery&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbkwkmqqv566s2qo2qfr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbkwkmqqv566s2qo2qfr.png" alt="The Blackwell Murder - Architecture" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key mental model: STM is working memory within a conversation, LTM is the case file that persists between sessions. The agent needs both.&lt;/p&gt;

&lt;p&gt;The architecture is deliberately simple. It uses a local FastAPI proxy that sits between the React frontend and AgentCore. The example doesn't include CloudFront, Lambda, or API Gateway. The point of this project is memory rather than AWS networking. If you have AWS credentials and Terraform installed, you can clone the repo and be playing in 15 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This demo uses PUBLIC network mode with no authentication on the proxy for simplicity. Production deployments should use VPC mode with private subnets, authentication on the proxy layer, and VPC endpoints for Bedrock and AgentCore services.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a local proxy?&lt;/strong&gt; The browser can't call AgentCore directly - it requires IAM SigV4 signing. The FastAPI server handles that, plus it gives us a clean place to filter out model artifacts like &lt;code&gt;\&lt;/code&gt; tags before they reach the UI. In production, this proxy would need authentication (Cognito, API keys, or similar) - the demo version accepts any request from localhost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not the &lt;code&gt;agentcore invoke&lt;/code&gt; CLI?&lt;/strong&gt; The Python SDK (&lt;code&gt;bedrock-agentcore&lt;/code&gt;) supports streaming and integrates cleanly with FastAPI's &lt;code&gt;StreamingResponse&lt;/code&gt;. No subprocess overhead, no output parsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Memory - The 4 Strategies
&lt;/h2&gt;

&lt;p&gt;AgentCore Memory has two layers: short-term memory (STM) that handles turn-by-turn conversation within a session, and long-term memory (LTM) with four built-in strategies that extract, organize, and recall information across sessions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqrlcbexjzyykewt96wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqrlcbexjzyykewt96wk.png" alt="AgentCore Memory Strategies" width="800" height="1106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this interesting for a detective game is that every strategy maps naturally to how real investigators work. Detectives maintain fact files, write case summaries, track interrogation patterns, and adapt their approach based on what is working. The four LTM strategies do exactly this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Short-Term Memory (STM)
&lt;/h3&gt;

&lt;p&gt;STM captures the raw conversation - detective actions, narrator descriptions, tool calls and results - within a single session. The agent reads back the last few turns automatically so it knows what just happened.&lt;/p&gt;

&lt;p&gt;When the detective says "examine the broken window" and then follows up with "check for fingerprints on the frame," STM is why the agent knows which window you're talking about without you having to repeat the context. STM events in this project expire after 30 days (configurable from 7 to 365 days via the &lt;code&gt;event_expiry_duration&lt;/code&gt; parameter).&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Strategy - "CaseFiles"
&lt;/h3&gt;

&lt;p&gt;Extracts and indexes factual information from conversations for retrieval by meaning, not keywords.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: &lt;code&gt;/cases/{actorId}/facts/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the detective's fact file. Every time the agent learns something concrete - a suspect's alibi, a piece of evidence, a relationship between characters - the semantic strategy extracts it and stores it as a retrievable fact.&lt;/p&gt;

&lt;p&gt;When the detective returns 3 sessions later and asks "what do we know about Helena's alibi?", the agent retrieves everything related to Helena: she claims she was at the Grand Hotel bar until midnight, the bartender says she left at 11:30 PM, there's a 17-minute gap, and the hotel security cameras had a convenient "glitch" during that window. No contradictions slip through. No established facts get lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary Strategy - "CaseNotes"
&lt;/h3&gt;

&lt;p&gt;Creates condensed summaries of each session - the detective's case notes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: &lt;code&gt;/cases/{actorId}/{sessionId}/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;At the end of each session, the summary strategy distills the conversation into a concise case update: what evidence was discovered, which suspects were interviewed, what leads are open, and where the investigation stands.&lt;/p&gt;

&lt;p&gt;When the player starts a new session, the agent retrieves the last summary and opens with a case file briefing: "Case #1247 - The Blackwell Murder. Day 3. Last session you discovered the staged break-in and the 17-minute gap in Helena's alibi. Two leads remain open..." This is how real detectives work. They write case notes so they can pick up where they left off.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Preferences Strategy - "DetectiveStyle"
&lt;/h3&gt;

&lt;p&gt;Automatically identifies and tracks the player's investigation approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: &lt;code&gt;/detectives/{actorId}/preferences/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This strategy watches how the player investigates and adapts the experience. If the player consistently chooses indirect questioning over confrontation, the narrator starts offering more subtle conversation options. If they prefer forensic evidence over witness interviews, crime scenes get richer physical detail.&lt;/p&gt;

&lt;p&gt;It picks up on investigation style (methodical vs. intuitive), interrogation preference (confrontational, sympathetic, indirect), detail level (forensic deep-dives vs. big-picture summaries), and pacing preference (slow reveals vs. rapid progress).&lt;/p&gt;

&lt;p&gt;The preference strategy is subtle. You don't notice it working until the third or fourth session, when the narrator's suggestions start feeling tailored to exactly how you like to play.&lt;/p&gt;

&lt;h3&gt;
  
  
  Episodic Strategy - "Interrogations"
&lt;/h3&gt;

&lt;p&gt;Captures key interactions as structured episodes, then generates cross-session reflections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: &lt;code&gt;/episodes/{actorId}/{sessionId}/&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Reflection namespace&lt;/strong&gt;: &lt;code&gt;/episodes/{actorId}/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the most compelling strategy, and the one that makes the detective game feel genuinely intelligent. Episodic memory doesn't just store what happened - it reflects on patterns across interactions.&lt;/p&gt;

&lt;p&gt;An episode captures structured fields - the AWS docs define these as situation, intent, assessment, justification, and episode-level reflection. In practice, the output for this project looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Situation&lt;/strong&gt;: Interrogation of Helena Voss regarding her whereabouts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent&lt;/strong&gt;: Catch Helena in a lie about the hotel bar timeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assessment&lt;/strong&gt;: Presented the bartender's statement showing she left at 11:30, not midnight&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Justification&lt;/strong&gt;: Helena became defensive, changed story to "went for a walk," refused further questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reflections synthesize across episodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"This detective excels at catching timeline inconsistencies - present evidence contradictions early"&lt;/li&gt;
&lt;li&gt;"Direct confrontation causes suspects to shut down - this player gets better results with patience"&lt;/li&gt;
&lt;li&gt;"Helena's changing story pattern matches classic alibi fabrication - flag for cross-reference"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical result is detective intuition. The narrator can say things like "You have noticed Helena's story shifts every time you press on the timeline. Your instinct says the 17 minutes matter." The player didn't ask for that observation - the episodic reflection surfaced it automatically.&lt;/p&gt;
&lt;h3&gt;
  
  
  How the Strategies Map to the Investigation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Detective Equivalent&lt;/th&gt;
&lt;th&gt;What Gets Stored&lt;/th&gt;
&lt;th&gt;When It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STM&lt;/td&gt;
&lt;td&gt;Working memory&lt;/td&gt;
&lt;td&gt;Current conversation&lt;/td&gt;
&lt;td&gt;Within a session - "which window?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;Fact file&lt;/td&gt;
&lt;td&gt;Suspects, alibis, evidence, relationships&lt;/td&gt;
&lt;td&gt;Re-interviewing a suspect 3 sessions later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summary&lt;/td&gt;
&lt;td&gt;Case notes&lt;/td&gt;
&lt;td&gt;Per-session investigation summary&lt;/td&gt;
&lt;td&gt;Opening a new session - "where were we?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User Preference&lt;/td&gt;
&lt;td&gt;Detective instinct&lt;/td&gt;
&lt;td&gt;Play style, interrogation approach&lt;/td&gt;
&lt;td&gt;Narrator adapts tone and suggestions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Episodic&lt;/td&gt;
&lt;td&gt;Interrogation log + intuition&lt;/td&gt;
&lt;td&gt;Key interactions + cross-session reflections&lt;/td&gt;
&lt;td&gt;"Helena's story keeps changing..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  A Note on Namespace Design
&lt;/h3&gt;

&lt;p&gt;The AWS docs recommend a default namespace pattern like &lt;code&gt;/strategy/{memoryStrategyId}/actor/{actorId}/session/{sessionId}/&lt;/code&gt;. This project uses custom descriptive namespaces instead - &lt;code&gt;/cases/{actorId}/facts/&lt;/code&gt;, &lt;code&gt;/detectives/{actorId}/preferences/&lt;/code&gt;, etc. - because they map directly to how a detective organizes information. When you're debugging and see a record in &lt;code&gt;/cases/sloane/facts/&lt;/code&gt;, you immediately know what it is.&lt;/p&gt;

&lt;p&gt;The tradeoff is that without &lt;code&gt;{memoryStrategyId}&lt;/code&gt; in the path, multiple strategies could theoretically write to overlapping namespaces if you configure them carelessly. In practice, each strategy in this project has a distinct namespace root (&lt;code&gt;/cases/&lt;/code&gt;, &lt;code&gt;/detectives/&lt;/code&gt;, &lt;code&gt;/episodes/&lt;/code&gt;), so there's no overlap. If you're building a system with many strategies, the AWS-recommended pattern with strategy IDs in the path is safer.&lt;/p&gt;
&lt;h3&gt;
  
  
  Where Does the Extraction Logic Live?
&lt;/h3&gt;

&lt;p&gt;This is the thing that took me the longest to internalize: you don't write extraction logic. There's no code in this project that says "pull out facts for semantic memory" or "summarize this session." The platform does all of it.&lt;/p&gt;

&lt;p&gt;When you define a strategy, you provide a type, a name, a description, and namespaces. That's it. The extraction pipeline reads your STM events - the raw conversation messages - and applies each strategy's built-in logic to decide what to extract. You never see the extraction prompt. For the built-in strategies used in this project, customization is limited to the strategy description field. AWS also offers built-in overrides (custom prompts, custom model selection) and self-managed strategies (full pipeline control) for deeper customization - see the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/memory-strategies.html" rel="noopener noreferrer"&gt;AgentCore Memory documentation&lt;/a&gt; for details.&lt;/p&gt;
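
&lt;p&gt;As a sketch of how little you actually declare, the strategy definitions for this project boil down to entries like the following. The names, descriptions, and namespaces are the ones described above; the exact top-level field names belong to the AgentCore control plane API and may differ from this shorthand:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "memoryStrategies": [
    {
      "semanticMemoryStrategy": {
        "name": "CaseFiles",
        "description": "Extracts and indexes case facts for semantic retrieval",
        "namespaces": ["/cases/{actorId}/facts/"]
      }
    },
    {
      "userPreferenceMemoryStrategy": {
        "name": "DetectiveStyle",
        "description": "Tracks detective communication style and investigation preferences",
        "namespaces": ["/detectives/{actorId}/preferences/"]
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;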

&lt;p&gt;For built-in strategies, your actual levers for influencing LTM quality are indirect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Strategy descriptions&lt;/strong&gt; - the only direct hint you give the extraction model. "Extracts and indexes case facts for semantic retrieval" tells it to focus on facts. "Tracks detective communication style and investigation preferences" tells it to watch for behavioral patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your system prompt&lt;/strong&gt; - shapes how the agent talks, which shapes what the extraction pipeline has to work with. A system prompt that produces atmospheric noir prose gives the summarization strategy rich material. A prompt that produces terse responses gives it less.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your tools&lt;/strong&gt; - return structured data that becomes part of the conversation. When &lt;code&gt;examine_evidence&lt;/code&gt; returns forensic details about tool marks on a window frame, that structured output gives the semantic strategy concrete facts to extract. When &lt;code&gt;interrogate_witness&lt;/code&gt; returns a suspect's shifting alibi, the episodic strategy captures it as a meaningful interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The conversation itself&lt;/strong&gt; - longer, richer conversations produce more extraction material. A single-turn "look at the window" produces less than a multi-turn investigation where the detective examines evidence, cross-references alibis, and confronts a suspect with contradictions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The practical implication is that designing good prompts and tools is indirectly designing your memory. I didn't set out to optimize for LTM quality, but the debug watch tool showed me that conversations where the detective digs deeper - following up on inconsistencies, asking witnesses about specific details, comparing evidence across locations - produce significantly richer LTM records than surface-level interactions. The extraction pipeline rewards conversational depth.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building the Agent
&lt;/h2&gt;

&lt;p&gt;The agent runs on AgentCore via the Strands SDK. Three things matter: the system prompt, the tools, and the memory integration.&lt;/p&gt;
&lt;h3&gt;
  
  
  System Prompt - Noir Narrator Persona
&lt;/h3&gt;

&lt;p&gt;The agent isn't the detective. It's the narrator - the voice in the dark that describes what the detective sees, hears, and feels. The system prompt establishes this firmly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are the narrator for "The Blackwell Murder," a noir detective mystery.
You speak in the style of classic noir fiction - rain-slicked streets, long
shadows, moral ambiguity, and the kind of truth that cuts deeper than any blade.
You are not the detective. You are the voice in the dark that describes what
the detective sees, hears, and feels.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt also includes the full case briefing (locations, suspects, the solution), narrator rules, and memory integration instructions. The two rules that matter most:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never break character.&lt;/strong&gt; The model must never mention tools, functions, errors, or its own reasoning. If a tool fails, the narrator says "the trail goes cold" - not "there was an error in the category specified."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory integration on session start.&lt;/strong&gt; On the first message in a new session with no prior history, set the scene. On returning sessions where memory context is available, open with a case file briefing.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Custom Tools
&lt;/h3&gt;

&lt;p&gt;Four tools drive the investigation. Each is a &lt;code&gt;@tool&lt;/code&gt;-decorated function that returns narrative text and silently tracks state in a case file:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;examine_evidence(item, method)&lt;/code&gt;&lt;/strong&gt; - Three examination methods (visual, forensic, compare) reveal different details about the same evidence. The broken window looks suspicious on visual inspection, reveals tool marks under forensic analysis, and confirms the staged break-in on comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;interrogate_witness(witness, approach, topic)&lt;/code&gt;&lt;/strong&gt; - Four interview approaches (neutral, sympathetic, confrontational, indirect) produce different responses from the same witness. Confrontation shuts Marcus down. Sympathy gets Clara to reveal the shadow she saw. Indirect questioning catches Marcus mentioning the service passage he claims was sealed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;search_location(location, area)&lt;/code&gt;&lt;/strong&gt; - Five locations with multiple searchable areas. The study alone has the desk, window, bookcase, safe, and floor - each hiding different clues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;check_case_file(query, category)&lt;/code&gt;&lt;/strong&gt; - The detective's notebook. Reviews all discovered evidence, suspect information, alibis, timeline events, and connections between suspects. Supports free-text search across all categories.&lt;/p&gt;

&lt;p&gt;Every tool call that discovers something new pushes a notification to the frontend, which updates the Case Board and Persons of Interest panels in real time.&lt;/p&gt;
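&lt;p&gt;For a sense of the shape, here's a minimal sketch of what one of these tools might look like with the Strands &lt;code&gt;@tool&lt;/code&gt; decorator. The findings lookup and case-file bookkeeping are illustrative rather than the project's actual implementation, and the import path assumes the SDK's &lt;code&gt;from strands import tool&lt;/code&gt; convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from strands import tool

# Illustrative in-memory case file - the real project tracks much richer state.
CASE_FILE = {"evidence": []}

@tool
def examine_evidence(item: str, method: str = "visual"):
    """Examine a piece of evidence using a visual, forensic, or compare method."""
    # Hypothetical findings lookup - the real tool has per-item, per-method detail.
    findings = {
        ("broken window", "forensic"): "The frame has been wiped clean. Tool marks run from the inside.",
    }
    result = findings.get(
        (item.lower(), method),
        f"Nothing new about the {item} under {method} inspection.",
    )

    # Silently track the discovery; a notification to the frontend would be pushed here.
    CASE_FILE["evidence"].append({"item": item, "method": method, "finding": result})
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;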

&lt;h3&gt;
  
  
  Memory Integration with Strands
&lt;/h3&gt;

&lt;p&gt;The Strands SDK's &lt;code&gt;AgentCoreMemorySessionManager&lt;/code&gt; handles the memory lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentCoreMemoryConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;memory_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MEMORY_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;detective_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieval_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/cases/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detective_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/facts/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RetrievalConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/detectives/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detective_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/preferences/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RetrievalConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/episodes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detective_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RetrievalConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AgentCoreMemorySessionManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;REGION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;examine_evidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interrogate_witness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_case_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_location&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;retrieval_config&lt;/code&gt; tells the session manager which LTM namespaces to query when loading context for a new request. Without it, the agent only gets STM conversation history - it wouldn't recall facts, preferences, or episode patterns from prior sessions.&lt;/p&gt;

&lt;p&gt;The session manager does two things: on entry, it loads relevant memories (STM conversation history, LTM strategy results) into the agent's context. On exit, it persists the current conversation as new memory events. The &lt;code&gt;actor_id&lt;/code&gt; is the detective's name, which namespaces all memory operations so multiple detectives could theoretically investigate the same case without cross-contamination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Configuration
&lt;/h3&gt;

&lt;p&gt;Nova Pro is the default because it offers a good balance of narrative quality and cost for iterative development. The model is switchable at deploy time via the &lt;code&gt;ACTIVE_LLM&lt;/code&gt; environment variable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Model ID&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nova Pro&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-pro-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default - good balance of quality and cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nova 2 Lite&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-2-lite-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1M context, optional extended thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nova Lite&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-lite-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fastest, lowest cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.anthropic.claude-sonnet-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Best narrative quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.anthropic.claude-haiku-4-5-20251001-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast and affordable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference in narrative quality between Nova Pro and Claude Sonnet is noticeable. Claude produces more atmospheric prose and stays in character more consistently. Nova Pro occasionally breaks the fourth wall by mentioning tool names or its own reasoning process - something I had to filter out in the proxy server. For a polished demo, Claude Sonnet is the better choice. For development and iteration, Nova Pro keeps costs low. A typical 15-20 minute play session (10-15 turns, 4 tool calls per session) costs roughly $0.02-0.05 in model inference alone with Nova Pro. Claude Sonnet runs about 5-10x that. Memory operations and KMS add negligible cost on top.&lt;/p&gt;
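&lt;p&gt;The switch itself is small. A minimal sketch of how &lt;code&gt;ACTIVE_LLM&lt;/code&gt; might map onto a Bedrock model, assuming the Strands &lt;code&gt;BedrockModel&lt;/code&gt; wrapper - the mapping keys are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

from strands.models import BedrockModel

# Illustrative mapping from ACTIVE_LLM values to the Bedrock model IDs in the table above.
MODEL_IDS = {
    "nova-pro": "us.amazon.nova-pro-v1:0",
    "nova-2-lite": "us.amazon.nova-2-lite-v1:0",
    "nova-lite": "us.amazon.nova-lite-v1:0",
    "claude-sonnet": "us.anthropic.claude-sonnet-4-6",
    "claude-haiku": "us.anthropic.claude-haiku-4-5-20251001-v1:0",
}

active = os.environ.get("ACTIVE_LLM", "nova-pro")
model = BedrockModel(model_id=MODEL_IDS[active])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;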

&lt;h2&gt;
  
  
  Infrastructure as Code
&lt;/h2&gt;

&lt;p&gt;All durable infrastructure is managed by Terraform using the AWS provider (~&amp;gt; 6.35). The agent itself is deployed via the &lt;code&gt;agentcore&lt;/code&gt; CLI, which handles the zip packaging and runtime provisioning. This is a clean separation: Terraform manages what persists (Memory, IAM, KMS, S3), the CLI manages what deploys (agent code, runtime configuration).&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory and KMS
&lt;/h3&gt;

&lt;p&gt;AgentCore Memory requires a KMS key for encryption. The memory resource itself is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_bedrockagentcore_memory"&lt;/span&gt; &lt;span class="s2"&gt;"detective"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.memory_name}_${var.name_suffix}"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Persistent memory for the AI detective agent"&lt;/span&gt;
  &lt;span class="nx"&gt;event_expiry_duration&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event_expiry_duration_days&lt;/span&gt;
  &lt;span class="nx"&gt;encryption_key_arn&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_kms_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;memory_execution_role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;memory_execution_role_arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the variable name &lt;code&gt;event_expiry_duration_days&lt;/code&gt; - the Terraform attribute is &lt;code&gt;event_expiry_duration&lt;/code&gt; (which takes a value in days), and the variable adds the &lt;code&gt;_days&lt;/code&gt; suffix for clarity so readers don't have to guess the unit.&lt;/p&gt;

&lt;p&gt;The KMS key policy grants three principals access: the root account for administration (full &lt;code&gt;kms:*&lt;/code&gt;), the AgentCore service for memory encryption operations (&lt;code&gt;kms:Encrypt&lt;/code&gt;, &lt;code&gt;kms:Decrypt&lt;/code&gt;, &lt;code&gt;kms:GenerateDataKey&lt;/code&gt;, &lt;code&gt;kms:DescribeKey&lt;/code&gt;), and the memory execution role for runtime access (same encryption actions). All policies use &lt;code&gt;aws_iam_policy_document&lt;/code&gt; data sources - never inline JSON strings. This gives you compile-time validation and readable diffs. Note: &lt;code&gt;resources = ["*"]&lt;/code&gt; in a KMS key policy means "this key" - it's not a wildcard across all keys.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;random_id&lt;/code&gt; suffix is appended to all AWS resources (S3 buckets, KMS aliases, memory names) to ensure global uniqueness. The suffix is generated once and shared across all modules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Strategies via Terraform, One via CLI
&lt;/h3&gt;

&lt;p&gt;Here's the real-world gotcha. The &lt;code&gt;aws_bedrockagentcore_memory_strategy&lt;/code&gt; resource supports three of the four strategy types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_bedrockagentcore_memory_strategy"&lt;/span&gt; &lt;span class="s2"&gt;"case_files"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CaseFiles"&lt;/span&gt;
  &lt;span class="nx"&gt;memory_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_bedrockagentcore_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SEMANTIC"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Extracts and indexes case facts for semantic retrieval"&lt;/span&gt;
  &lt;span class="nx"&gt;namespaces&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/cases/{actorId}/facts/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_bedrockagentcore_memory_strategy"&lt;/span&gt; &lt;span class="s2"&gt;"case_notes"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CaseNotes"&lt;/span&gt;
  &lt;span class="nx"&gt;memory_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_bedrockagentcore_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SUMMARIZATION"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Summarizes investigation sessions into concise case notes"&lt;/span&gt;
  &lt;span class="nx"&gt;namespaces&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/cases/{actorId}/{sessionId}/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_bedrockagentcore_memory_strategy"&lt;/span&gt; &lt;span class="s2"&gt;"detective_style"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DetectiveStyle"&lt;/span&gt;
  &lt;span class="nx"&gt;memory_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_bedrockagentcore_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"USER_PREFERENCE"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Tracks detective communication style and investigation preferences"&lt;/span&gt;
  &lt;span class="nx"&gt;namespaces&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/detectives/{actorId}/preferences/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The EPISODIC type isn't yet supported in the Terraform provider as of March 2026. This is tracked in &lt;a href="https://github.com/hashicorp/terraform-provider-aws/issues/45599" rel="noopener noreferrer"&gt;terraform-provider-aws #45599&lt;/a&gt;. The workaround is a &lt;code&gt;make&lt;/code&gt; target that calls the AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws bedrock-agentcore-control update-memory &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory-id&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;MEMORY_ID&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory-strategies&lt;/span&gt; &lt;span class="s1"&gt;'{
    "addMemoryStrategies": [{
      "episodicMemoryStrategy": {
        "name": "Interrogations",
        "description": "Key interrogation episodes with cross-case reflections",
        "namespaces": ["/episodes/{actorId}/{sessionId}/"],
        "reflectionConfiguration": {
          "namespaces": ["/episodes/{actorId}/"]
        }
      }
    }]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to note about the episodic strategy. First, it requires a &lt;code&gt;reflectionConfiguration&lt;/code&gt; with its own namespace - this is where cross-session reflections are stored. Second, the reflection namespace must sit at or above the episode namespace in the hierarchy - reflections are stored less deeply nested than the episodes they summarize. In practice, this means the reflection namespace must be a prefix of the episode namespace (e.g., &lt;code&gt;/episodes/{actorId}/&lt;/code&gt; works as a reflection namespace for episodes stored in &lt;code&gt;/episodes/{actorId}/{sessionId}/&lt;/code&gt;). Get this wrong and the API returns a validation error that doesn't clearly explain the constraint.&lt;/p&gt;

&lt;p&gt;Third, because the episodic strategy lives outside Terraform, &lt;code&gt;terraform destroy&lt;/code&gt; won't clean it up. If you destroy and recreate the infrastructure, you'll get a naming collision or an orphaned strategy. The project includes a corresponding &lt;code&gt;make remove-episodic-strategy&lt;/code&gt; target for teardown. On the Terraform side, the memory resource's attributes don't reflect CLI-managed strategy state, so &lt;code&gt;terraform plan&lt;/code&gt; won't show unexpected diffs after you add the episodic strategy via the CLI - no &lt;code&gt;ignore_changes&lt;/code&gt; block is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Note on IAM Permissions
&lt;/h3&gt;

&lt;p&gt;When you deploy an agent with the &lt;code&gt;agentcore&lt;/code&gt; CLI, it auto-creates an IAM role (&lt;code&gt;AmazonBedrockAgentCoreSDKRuntime-*&lt;/code&gt;) with a baseline policy. This policy covers what the agent needs to run - model invocation, memory read/write, and the basics. The agent works fine out of the box.&lt;/p&gt;

&lt;p&gt;You will need extra IAM permissions, however, if you build debug tools that call the boto3 memory APIs directly - like the watch script in this project. Those tools run under your own IAM identity, not the agent's runtime role, and need explicit permissions for &lt;code&gt;ListMemoryRecords&lt;/code&gt;, &lt;code&gt;RetrieveMemoryRecords&lt;/code&gt;, &lt;code&gt;ListEvents&lt;/code&gt;, and KMS decrypt on the memory encryption key. In production, create a separate narrowly-scoped IAM role for debug tools rather than granting these permissions to developer identities. Budget 15 minutes to set this up if you plan to inspect memory outside the agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on the auto-created runtime role.&lt;/strong&gt; The &lt;code&gt;agentcore&lt;/code&gt; CLI generates a role with broad permissions - for example, &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; with &lt;code&gt;Resource: *&lt;/code&gt; rather than scoped to specific model ARNs. This is fine for a demo, but for production deployments, create a custom IAM role with explicitly scoped permissions. At minimum, scope &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; to the specific model ARNs your agent uses and ensure memory access policies reference only the memory resources that agent needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Naming Constraints
&lt;/h3&gt;

&lt;p&gt;AgentCore resource names must match &lt;code&gt;[a-zA-Z][a-zA-Z0-9_]{0,47}&lt;/code&gt; - letters, numbers, and underscores only, starting with a letter. No hyphens. This tripped me up repeatedly: &lt;code&gt;case-files&lt;/code&gt; fails, &lt;code&gt;CaseFiles&lt;/code&gt; works. &lt;code&gt;detective-memory-abc123&lt;/code&gt; fails, &lt;code&gt;detective_memory_abc123&lt;/code&gt; works. KMS aliases are fine with hyphens, but everything else in AgentCore isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Frontend
&lt;/h2&gt;

&lt;p&gt;A noir-themed React 19 SPA with four components: the narrative log (the main detective story), the detective input, the Case Board (discovered evidence), and the Persons of Interest panel (suspect information with alibi status).&lt;/p&gt;

&lt;p&gt;The narrative log displays the agent's noir prose as it streams in via SSE. Tool use events show as gold italic indicators - "Examining evidence...", "Interrogating witness..." - so the player knows the agent is working.&lt;/p&gt;

&lt;p&gt;The Case Board and Persons of Interest panels update in real time as the investigation progresses. When the agent examines evidence or interviews a suspect, the tools push structured notifications through the SSE stream. New evidence items appear with an amber highlight that fades after a few seconds. Suspects show their interview count and alibi verification status (verified, contradicted, or unverified).&lt;/p&gt;

&lt;p&gt;SSE streaming deserves a note. AgentCore returns the response as a &lt;code&gt;StreamingBody&lt;/code&gt; - but when accessed through the &lt;code&gt;invoke_agent_runtime&lt;/code&gt; API, the entire response arrives as a single read. The SSE events are concatenated inside it, sometimes without newline separators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data: {"chunk": "The rain"}data: {"chunk": " hasn't stopped"}data: {"chunk": " for three days."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy server splits on &lt;code&gt;data:&lt;/code&gt; boundaries using a regex, reassembles all chunk text, strips the stray reasoning tags and &lt;code&gt;&amp;lt;/tool&amp;gt;&lt;/code&gt; XML artifacts that occasionally leak through, and re-emits clean SSE events to the browser. This filtering is model-agnostic - Nova Pro produces these artifacts most often, but any model with reasoning capabilities (including Nova 2 Lite with extended thinking enabled) can emit them. This two-pass approach - collect everything, clean it, then emit - is more reliable than trying to filter partial tags from individual chunks.&lt;/p&gt;
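&lt;p&gt;A rough sketch of that cleanup pass - the artifact tags stripped here are representative examples, not the project's exact filter list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re

def clean_sse_payload(raw):
    """Collect every chunk from a concatenated SSE payload, then strip leaked artifacts."""
    # Pass 1: events can arrive back-to-back with no newlines, so split on the marker itself.
    text = ""
    for piece in re.split(r"data:\s*", raw):
        piece = piece.strip()
        if not piece:
            continue
        try:
            text += json.loads(piece).get("chunk", "")
        except json.JSONDecodeError:
            continue  # partial or non-JSON fragment - skip it
    # Pass 2: strip XML artifacts that occasionally leak through (representative tags only).
    for artifact in ("&amp;lt;thinking&amp;gt;", "&amp;lt;/thinking&amp;gt;", "&amp;lt;/tool&amp;gt;"):
        text = text.replace(artifact, "")
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;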

&lt;h2&gt;
  
  
  Demo Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Session 1: The Crime Scene
&lt;/h3&gt;

&lt;p&gt;The player opens the app. The narrator sets the scene:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The rain has not stopped for three days. You step under the police tape at 42 Ashford Lane, the collar of your coat turned up against the damp. The study door is open now - the locksmith saw to that this morning - but inside, everything is exactly as Edward Blackwell left it. Or as someone left it for him.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The player types "search the study." The agent calls &lt;code&gt;search_location&lt;/code&gt; and describes the room - the scattered desk, the broken window, the bookcase. The Case Board lights up: "Broken Window - glass fragments outside, break pattern radiates outward."&lt;/p&gt;

&lt;p&gt;They examine the broken window forensically. The narrator reveals the wiped frame, the tool marks. The Case Board updates: "Broken Window (forensic) - frame wiped clean, tool marks from inside."&lt;/p&gt;

&lt;p&gt;They interview Clara Webb sympathetically. She breaks down and admits she heard arguing at 11:15 PM. The Persons of Interest panel shows Clara with her first interview note.&lt;/p&gt;

&lt;p&gt;All of this - the evidence, the suspect information, the timeline - flows into AgentCore Memory. STM keeps the turn-by-turn conversation. The semantic strategy extracts the facts. The episodic strategy logs Clara's interrogation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session 2: The Case Continues
&lt;/h3&gt;

&lt;p&gt;The player closes the browser, has lunch, and comes back. They start a new session with the same detective ID. The narrator opens differently now:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Case #1247 - The Blackwell Murder. Your notebook is open on the desk, the pages curling at the edges from the rain. Last time, you found the staged break-in - glass broken outward, frame wiped clean, tool marks from inside. Clara Webb heard arguing at 11:15 PM. Two voices. One was Blackwell. The other was a man she could not identify, but she said the shadow was tall, broad-shouldered. Like Marcus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The summary strategy provided the session recap. The semantic strategy filled in the specific details. The player picks up where they left off and starts pressing on Marcus's alibi. Three sessions in, when the player consistently uses indirect questioning instead of confrontation, the narrator starts offering subtler options - the preference strategy at work.&lt;/p&gt;

&lt;p&gt;And when the player catches Helena in another timeline inconsistency, the narrator adds: "Her story shifts every time you push on the timeline. Your instinct says the 17 minutes matter more than she is letting on." That is the episodic reflection - pattern recognition across sessions that makes the detective feel like they are building real intuition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observing Memory in Real Time
&lt;/h2&gt;

&lt;p&gt;Understanding LTM is abstract until you watch it happen. The project includes a debug watch command that polls AgentCore Memory every 5 seconds and prints new STM events and LTM records as they appear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make debug-memory-watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs &lt;code&gt;python server/debug_memory.py --watch 5&lt;/code&gt;, which seeds with the current state (so you only see new additions) and then streams changes. A typical session looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Seeding current state... done (652 STM events, 156 LTM records)
  Watching for new additions...

[20:11:01] [STM] [fe400085] [user] Use a firm and aggressive approach with Clara
[20:11:11] [STM] [fe400085] [assistant] A confrontational approach with Clara Webb proves
  ineffective. She flinches at the sharp tone and retreats into monosyllables...

[20:11:59] [LTM] [USER_PREFERENCE (DetectiveStyle)]
{"context":"The user initially requested a softer approach when interrogating Clara Webb
but later explicitly requested to use a firm and aggressive approach, indicating a shift
toward more confrontational interrogation tactics with witnesses.",
"preference":"Prefers firm and aggressive interrogation approach with witnesses"}

[20:12:38] [LTM] [SUMMARIZATION (CaseNotes)]
&amp;lt;topic name="Witness Interview - Clara Webb (Confrontational Approach - Failed)"&amp;gt;
Detective Sloane attempted a firm and aggressive approach with Clara Webb. The
confrontational strategy proved completely ineffective. Clara flinched at the sharp
tone and retreated into monosyllables. This failed interrogation confirms Clara's
fear is a significant barrier and indicates a gentler approach is necessary.
&amp;lt;/topic&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;STM events appear immediately as the conversation flows. LTM records follow 30-60 seconds later as the platform's extraction pipeline processes the events. You can see exactly what each strategy produces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SEMANTIC&lt;/strong&gt; records are plain factual statements - "Helena Blackwell was found dead in the study at 10:42 PM"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SUMMARIZATION&lt;/strong&gt; records are topic-tagged XML with detailed session notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;USER_PREFERENCE&lt;/strong&gt; records are structured JSON with context, preference, and categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EPISODIC&lt;/strong&gt; records come in two flavors: situation recaps (&lt;code&gt;"situation": "A detective begins investigating..."&lt;/code&gt;) and cross-session strategy patterns (&lt;code&gt;"title": "Escalating Interrogation Pressure with Evidence Leverage"&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing these raw values is what made the strategies click for me. Reading the documentation, I understood that "semantic extracts facts" and "episodic captures patterns." But watching the actual output - seeing the platform independently decide that a failed interrogation was worth logging as an episode, or that a shift from soft to aggressive questioning counted as a preference change - made the system feel real. The extraction isn't just summarizing what happened. It's interpreting the conversation through each strategy's lens and producing genuinely different representations of the same events.&lt;/p&gt;

&lt;p&gt;The watch also exposed a debugging gotcha. As of &lt;code&gt;bedrock-agentcore&lt;/code&gt; SDK version 1.4.4, the AgentCore &lt;code&gt;list_memory_records&lt;/code&gt; and &lt;code&gt;retrieve_memory_records&lt;/code&gt; APIs return results under the key &lt;code&gt;memoryRecordSummaries&lt;/code&gt;, not &lt;code&gt;memoryRecords&lt;/code&gt;. The SDK's &lt;code&gt;retrieve_memories()&lt;/code&gt; method handles this correctly, so the agent works fine - but if you write your own debug scripts using boto3 directly, you'll get empty results and spend hours investigating an extraction pipeline that was working all along. The watch script in this repo has the correct key. Check the latest SDK docs if you're reading this in the future - response key names can change between versions.&lt;/p&gt;
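&lt;p&gt;If you do roll your own inspection script, the read side looks roughly like this - a minimal boto3 sketch where the memory ID and namespace are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

# Data-plane client for AgentCore Memory (the control plane is "bedrock-agentcore-control").
client = boto3.client("bedrock-agentcore", region_name="us-east-1")

response = client.list_memory_records(
    memoryId="detective_memory_abc123",   # placeholder
    namespace="/cases/sloane/facts/",     # placeholder
)

# Records come back under "memoryRecordSummaries" (as of SDK 1.4.4), not "memoryRecords".
for record in response.get("memoryRecordSummaries", []):
    print(record.get("content", {}).get("text", ""))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;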

&lt;p&gt;Other debug modes are available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dump everything - strategies, STM events, and LTM records&lt;/span&gt;
uv run python server/debug_memory.py

&lt;span class="c"&gt;# Only LTM records (skip raw conversation events)&lt;/span&gt;
uv run python server/debug_memory.py &lt;span class="nt"&gt;--ltm-only&lt;/span&gt;

&lt;span class="c"&gt;# Only STM events&lt;/span&gt;
uv run python server/debug_memory.py &lt;span class="nt"&gt;--stm-only&lt;/span&gt;

&lt;span class="c"&gt;# Show all sessions (default: most recent only)&lt;/span&gt;
uv run python server/debug_memory.py &lt;span class="nt"&gt;--all-sessions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;STM vs LTM isn't either/or - they serve completely different functions.&lt;/strong&gt; STM is working memory within a conversation. LTM is the case file that persists between sessions. You need both, and trying to use one for the other's job leads to problems. STM without LTM means the detective forgets everything between sessions. LTM without STM means the agent can't follow a multi-turn investigation within a single session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic reflections are the most compelling strategy.&lt;/strong&gt; The semantic strategy is the workhorse - it stores facts and retrieves them reliably. But the episodic strategy's cross-session reflections are what make the agent feel genuinely intelligent. When the narrator surfaces a pattern the player didn't explicitly ask about, it creates a moment that feels like the detective is actually thinking. This is the strategy I would lead with in any demo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model choice matters more than I expected for character consistency.&lt;/strong&gt; Nova Pro occasionally breaks character - mentioning tool names, exposing its reasoning process, or dropping the noir tone mid-paragraph. Claude Sonnet stays in character almost perfectly. For a narrative application where immersion matters, the model's ability to maintain a persona is as important as its raw capability. I ended up adding server-side filtering to strip the stray reasoning tags and &lt;code&gt;&amp;lt;/tool&amp;gt;&lt;/code&gt; XML artifacts that leaked through from Nova Pro.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering is still the job - the prompt is the product.&lt;/strong&gt; The system prompt went through more revisions than any other file in this project. The first version let the model call six tools in a single turn, drowning the player in information before they had asked a single question. Another version produced beautiful prose but kept breaking character to mention tool names. Getting the narrator to call exactly one tool per turn, stay in character when tools error, and set the scene without immediately investigating required specific, firm language - "do not chain multiple tool calls" works where "one action per turn" didn't. If you're building an agent-based application, expect to spend as much time tuning the system prompt as you do writing the code around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Terraform provider gap is a real-world pattern.&lt;/strong&gt; Three of four strategies are supported in Terraform. The fourth requires a CLI workaround. This is a common pattern with new AWS services - Terraform support lags behind the API by weeks or months. The pragmatic approach is to manage what you can in Terraform and script the rest in your Makefile, documenting the gap clearly so your future self (or your team) knows what to update when provider support arrives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build a memory watch tool early.&lt;/strong&gt; The single most useful debugging aid was a script that polls memory and prints new STM events and LTM records in real time. Without it, memory's a black box - events go in, and you hope the right things come out. With it, you can see exactly what the platform extracts, how long extraction takes (30-60 seconds typically), and whether your namespace configuration is producing records where you expect them. I would build this before writing any agent code on my next project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Going to production would add several layers.&lt;/strong&gt; This demo runs in PUBLIC network mode with an unauthenticated local proxy. A production deployment would need: VPC mode with private subnets, VPC endpoints for Bedrock and AgentCore services (avoiding public internet for API calls), CloudFront distribution with WAF, Cognito or API key authentication on the proxy, a custom IAM role with least-privilege permissions (scoped &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; to specific model ARNs, scoped memory access to specific resources), an S3 backend for Terraform state, and Bedrock Guardrails for input validation. The architecture section of this post shows the demo setup. The production architecture is a different article.&lt;/p&gt;




&lt;p&gt;The full source code, Terraform configurations, and Makefile workflow are available on GitHub &lt;a href="https://github.com/RDarrylR/agentcore-memory-murder-mystery" rel="noopener noreferrer"&gt;agentcore-memory-murder-mystery&lt;/a&gt;. Clone the repo, run &lt;code&gt;make init &amp;amp;&amp;amp; make apply &amp;amp;&amp;amp; make deploy-agent &amp;amp;&amp;amp; make serve&lt;/code&gt;, and start investigating. The rain is still falling on Ashford Lane.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Connect with me on&lt;/em&gt; &lt;a href="https://x.com/RDarrylR" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://bsky.app/profile/darrylruggles.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://www.linkedin.com/in/darryl-ruggles/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://github.com/RDarrylR" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://medium.com/@RDarrylR" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://dev.to/rdarrylr"&gt;Dev.to&lt;/a&gt;&lt;em&gt;, or the&lt;/em&gt; &lt;a href="https://community.aws/@darrylr" rel="noopener noreferrer"&gt;AWS Community&lt;/a&gt;&lt;em&gt;. Check out more of my projects at&lt;/em&gt; &lt;a href="https://darryl-ruggles.cloud" rel="noopener noreferrer"&gt;darryl-ruggles.cloud&lt;/a&gt; &lt;em&gt;and join the&lt;/em&gt; &lt;a href="https://www.believeinserverless.com/" rel="noopener noreferrer"&gt;Believe In Serverless&lt;/a&gt; &lt;em&gt;community.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>agents</category>
      <category>agentcore</category>
      <category>serverless</category>
    </item>
    <item>
      <title>I Injected Three Faults. The Agent Found All of Them.</title>
      <dc:creator>Romar Cablao</dc:creator>
      <pubDate>Sun, 03 May 2026 14:24:37 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/i-injected-three-faults-the-agent-found-all-of-them-5pi</link>
      <guid>https://vibe.forem.com/aws-builders/i-injected-three-faults-the-agent-found-all-of-them-5pi</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Let's get our hands dirty. This part covers the full setup and the actual demo: deploy PayLedger to both regions, wire up Route 53 failover, configure the Agent Space, inject three simultaneous faults, and walk through exactly what the agent found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick recap from Part 1:&lt;/strong&gt; PayLedger is a demo payment ledger deployed to ap-southeast-1 (primary) and ap-northeast-1 (secondary) with Route 53 failover, DynamoDB Global Tables, and a Next.js frontend showing which region is serving. DevOps Agent sits in ap-southeast-2 monitoring both. If you haven't read the first part, you can check it out here:&lt;/p&gt;


&lt;p&gt;&lt;a href="https://dev.to/aws-builders/runbooks-dont-investigate-aws-devops-agent-does-44p8" rel="noopener noreferrer"&gt;Runbooks Don't Investigate. AWS DevOps Agent Does.&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;
  
  
  Before You Start
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS account&lt;/td&gt;
&lt;td&gt;IAM admin permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain in Route 53&lt;/td&gt;
&lt;td&gt;Hosted zone for custom domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless Framework v4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npm install -g serverless&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python 3.12&lt;/td&gt;
&lt;td&gt;Lambda runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACM certificates&lt;/td&gt;
&lt;td&gt;In both apse1 and apne1 for the API subdomain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;New customers get a 2-month free trial for AWS DevOps Agent. After that, billing is per second when the agent is active. Support credits vary by tier.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://aws.amazon.com/devops-agent/pricing/" rel="noopener noreferrer"&gt;AWS DevOps Agent Pricing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 1: Create the Agent Space
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7umud8kjgu2tzwbm2rx3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7umud8kjgu2tzwbm2rx3.png" alt="Create an Agent Space" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before deploying anything in your workload regions, set up the Agent Space first. The webhook credentials produced here are needed later when you wire up alarm forwarding.&lt;/p&gt;

&lt;p&gt;Switch to &lt;strong&gt;ap-southeast-2&lt;/strong&gt; in the AWS Console. Navigate to AWS DevOps Agent and create a new Agent Space. AWS creates the required IAM roles automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DevOpsAgentRole-AgentSpace&lt;/strong&gt; uses &lt;code&gt;AIDevOpsAgentAccessPolicy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOpsAgentRole-WebappAdmin&lt;/strong&gt; uses &lt;code&gt;AIDevOpsOperatorAppAccessPolicy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Link your AWS account. Both workload regions (apse1 and apne1) are in the same account, so a single association gives the agent visibility into both.&lt;/p&gt;

&lt;p&gt;Once the Agent Space is up, grab the webhook URL and HMAC key from the integrations page. You'll use them in Step 5.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-what-are-devops-agent-spaces.html" rel="noopener noreferrer"&gt;What are DevOps Agent Spaces?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 2: Deploy to Both Regions
&lt;/h2&gt;

&lt;p&gt;Copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt; and fill in your values, then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; deploy-backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This deploys to ap-southeast-1 first (which creates the DynamoDB table), then ap-northeast-1 (which skips table creation via a CloudFormation Condition). API Gateway IDs are auto-discovered from CloudFormation and written back to &lt;code&gt;.env&lt;/code&gt;. No manual copy-pasting.&lt;/p&gt;

&lt;p&gt;If you prefer to run the deploys individually:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Primary (creates the DynamoDB table)&lt;/span&gt;
npx serverless deploy &lt;span class="nt"&gt;--stage&lt;/span&gt; dev &lt;span class="nt"&gt;--region&lt;/span&gt; ap-southeast-1

&lt;span class="c"&gt;# Secondary (skips DynamoDB creation via CloudFormation Condition)&lt;/span&gt;
npx serverless deploy &lt;span class="nt"&gt;--stage&lt;/span&gt; dev &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Verify both health endpoints are up:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://&amp;lt;APSE1_ID&amp;gt;.execute-api.ap-southeast-1.amazonaws.com/health
&lt;span class="c"&gt;# {"status": "healthy", "region": "ap-southeast-1", "service": "payledger", "timestamp": "..."}&lt;/span&gt;

curl https://&amp;lt;APNE1_ID&amp;gt;.execute-api.ap-northeast-1.amazonaws.com/health
&lt;span class="c"&gt;# {"status": "healthy", "region": "ap-northeast-1", "service": "payledger", "timestamp": "..."}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Enable DynamoDB Global Table
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; setup-global-table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This adds the ap-northeast-1 replica and polls until it reaches &lt;code&gt;ACTIVE&lt;/code&gt; status (typically 2-5 minutes). Under the hood it runs &lt;code&gt;update-table&lt;/code&gt; with &lt;code&gt;replica-updates Create={RegionName=ap-northeast-1}&lt;/code&gt; and waits.&lt;/p&gt;
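&lt;p&gt;The equivalent boto3 call, if you'd rather script it yourself, is roughly this - the table name is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import boto3

dynamodb = boto3.client("dynamodb", region_name="ap-southeast-1")
TABLE_NAME = "payledger-transactions"  # placeholder - use the table the primary stack created

# Add the Tokyo replica to the existing table.
dynamodb.update_table(
    TableName=TABLE_NAME,
    ReplicaUpdates=[{"Create": {"RegionName": "ap-northeast-1"}}],
)

# Poll until the replica reports ACTIVE, mirroring what the setup script does.
while True:
    replicas = dynamodb.describe_table(TableName=TABLE_NAME)["Table"].get("Replicas", [])
    if any(r.get("ReplicaStatus") == "ACTIVE" for r in replicas):
        break
    time.sleep(15)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;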

&lt;p&gt;Seed some transactions so the UI has data to show:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/seed_transactions.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;Amazon DynamoDB Global Tables&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Step 4: Configure Custom Domains and Route 53 Failover
&lt;/h2&gt;

&lt;p&gt;Two sub-steps here. Before running them, make sure ACM certificates exist in both regions covering the API subdomain and the failover domain.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create API GW custom domains + Alias A records in Route 53&lt;/span&gt;
bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; setup-custom-domains

&lt;span class="c"&gt;# Create Route 53 health checks + PRIMARY/SECONDARY failover CNAME records&lt;/span&gt;
bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; setup-route53
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;setup-custom-domains&lt;/code&gt; creates the regional custom domains (&lt;code&gt;apse1-api-payledger.yourdomain.com&lt;/code&gt;, &lt;code&gt;apne1-api-payledger.yourdomain.com&lt;/code&gt;) and registers both with the failover domain (&lt;code&gt;api-payledger.yourdomain.com&lt;/code&gt;) so API Gateway accepts the Host header from either path.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setup-route53&lt;/code&gt; creates health checks (10s interval, FailureThreshold 2) and the PRIMARY/SECONDARY CNAME failover pair. It polls until both health checks pass before returning.&lt;/p&gt;
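&lt;p&gt;For reference, the PRIMARY/SECONDARY pair the script creates corresponds roughly to this boto3 call - the hosted zone ID, record values, and health check IDs below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

route53 = boto3.client("route53")

def upsert_failover_record(role, target, health_check_id):
    # One half of the failover pair: Route 53 serves PRIMARY while its health check passes.
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",  # placeholder
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api-payledger.yourdomain.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": f"payledger-{role.lower()}",
                "Failover": role,               # "PRIMARY" or "SECONDARY"
                "HealthCheckId": health_check_id,
                "ResourceRecords": [{"Value": target}],
            },
        }]},
    )

upsert_failover_record("PRIMARY", "apse1-api-payledger.yourdomain.com", "primary-health-check-id")
upsert_failover_record("SECONDARY", "apne1-api-payledger.yourdomain.com", "secondary-health-check-id")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;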

&lt;p&gt;After setup, all traffic to &lt;code&gt;api-payledger.yourdomain.com&lt;/code&gt; goes to Singapore. If the health check fails twice (around 20 seconds), Route 53 fails over to Tokyo automatically.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify, should hit primary&lt;/span&gt;
curl https://api-payledger.yourdomain.com/health
&lt;span class="c"&gt;# {"status": "healthy", "region": "ap-southeast-1", "service": "payledger", "timestamp": "..."}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html" rel="noopener noreferrer"&gt;Amazon Route 53 Failover Routing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Step 5: Store the DevOps Agent Webhook Credentials
&lt;/h2&gt;

&lt;p&gt;The alarm notification flow uses a webhook: CloudWatch Alarm → SNS Topic → &lt;code&gt;devopsAgentTrigger&lt;/code&gt; Lambda → DevOps Agent webhook. The &lt;code&gt;setup.sh&lt;/code&gt; script handles this via the &lt;code&gt;setup-webhook&lt;/code&gt; step, which stores the webhook URL and HMAC key from the DevOps Agent console in Secrets Manager.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; setup-webhook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Before running that step, grab the webhook URL and HMAC key from your Agent Space in the DevOps Agent console and set them in your &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;DEVOPS_AGENT_WEBHOOK_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://event-ai.ap-southeast-2.api.aws/webhook/generic/your-webhook-id&lt;/span&gt;
&lt;span class="py"&gt;DEVOPS_AGENT_HMAC_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your-hmac-key-here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
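
&lt;p&gt;Under the hood, the step essentially pushes those two values into Secrets Manager so the &lt;code&gt;devopsAgentTrigger&lt;/code&gt; Lambda can read them at runtime. A minimal boto3 sketch (the secret name is an assumption):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os
import boto3

sm = boto3.client("secretsmanager", region_name="ap-southeast-1")

secret_value = json.dumps({
    "webhook_url": os.environ["DEVOPS_AGENT_WEBHOOK_URL"],
    "hmac_key": os.environ["DEVOPS_AGENT_HMAC_KEY"],
})

SECRET_NAME = "payledger/devops-agent-webhook"  # assumed secret name

# Create the secret, or update it if it already exists
try:
    sm.create_secret(Name=SECRET_NAME, SecretString=secret_value)
except sm.exceptions.ResourceExistsException:
    sm.put_secret_value(SecretId=SECRET_NAME, SecretString=secret_value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
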



&lt;h2&gt;
  
  
  Step 6: Deploy the Frontend
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; deploy-frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This provisions the S3 bucket and CloudFront distribution if they don't exist, registers &lt;code&gt;FRONTEND_DOMAIN&lt;/code&gt; in Route 53, builds the Next.js app, syncs the output to S3, and invalidates the CloudFront cache. If you just want to run it locally without the cloud provisioning:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; deploy-frontend &lt;span class="nt"&gt;--local&lt;/span&gt;
&lt;span class="c"&gt;# Writes frontend/.env.local only. Run with: npm run dev --prefix frontend&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
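
&lt;p&gt;The cloud path boils down to a build, an upload, and a cache invalidation. A rough boto3 sketch of the last two steps (bucket name and distribution ID are placeholders from the setup output):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from pathlib import Path
import boto3

BUCKET = "payledger-frontend-bucket"   # placeholder
DISTRIBUTION_ID = "E0000000000000"     # placeholder
BUILD_DIR = Path("frontend/out")       # static build output

# Upload the built assets (a production sync would also set Content-Type per file)
s3 = boto3.client("s3")
for path in BUILD_DIR.rglob("*"):
    if path.is_file():
        s3.upload_file(str(path), BUCKET, str(path.relative_to(BUILD_DIR)))

# Invalidate the cache so CloudFront serves the new build immediately
boto3.client("cloudfront").create_invalidation(
    DistributionId=DISTRIBUTION_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": str(time.time()),
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
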


&lt;p&gt;The UI polls &lt;code&gt;/health&lt;/code&gt; every 5 seconds. Green banner = Singapore (PRIMARY). Amber banner = Tokyo (FAILOVER). When the region changes, a "Failover detected" banner appears automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39pzzkwok2ftcgic4x1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39pzzkwok2ftcgic4x1s.png" alt="Topology - Healthy State" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 7: Verify Topology
&lt;/h2&gt;

&lt;p&gt;After linking the account, DevOps Agent builds the topology automatically from CloudFormation stacks. Serverless Framework deploys via CloudFormation, so all resources in both regions are discovered without manual setup.&lt;/p&gt;

&lt;p&gt;Three views in the web app: System view (account/region boundaries), Container view (CloudFormation stacks), Resource view (full resource graph with cross-region DynamoDB relationship).&lt;/p&gt;

&lt;p&gt;The topology view is driven by the &lt;strong&gt;Agent Space Understanding&lt;/strong&gt; learned skill, which is auto-generated once the account integrations are configured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fiospcds1hoflq4bku4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fiospcds1hoflq4bku4.png" alt="AWS DevOps Agent - PayLedger Topology" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-what-is-a-devops-agent-topology.html" rel="noopener noreferrer"&gt;What is a DevOps Agent Topology?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Step 8: Verify the Full Stack
&lt;/h2&gt;

&lt;p&gt;Run the verify step to confirm all endpoints are reachable through the failover URL before injecting any faults:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This runs health checks against both regional endpoints directly, then tests all four endpoints through the Route 53 failover URL including a POST to &lt;code&gt;/transactions&lt;/code&gt;. All checks should pass and return 2xx before you continue.&lt;/p&gt;
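
&lt;p&gt;If you prefer to run the same checks by hand, here is a quick Python sketch (the &lt;code&gt;/balance&lt;/code&gt; path and the POST body fields are assumptions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

BASE = "https://api-payledger.yourdomain.com"

# Health through the failover URL should report the primary region
health = requests.get(f"{BASE}/health", timeout=10)
print(health.status_code, health.json().get("region"))

# Read endpoints
assert requests.get(f"{BASE}/transactions", timeout=10).ok
assert requests.get(f"{BASE}/balance", timeout=10).ok

# Write path: record a synthetic transaction
resp = requests.post(
    f"{BASE}/transactions",
    json={"amount": 10.0, "description": "verify-check"},
    timeout=10,
)
assert resp.ok, resp.text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
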


&lt;h2&gt;
  
  
  Optional Integrations
&lt;/h2&gt;

&lt;p&gt;The Agent Space works without these, but they make findings easier to consume.&lt;/p&gt;
&lt;h3&gt;
  
  
  Slack
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;AWS DevOps Agent console -&amp;gt; Settings -&amp;gt; Communications -&amp;gt; Slack -&amp;gt; Register (OAuth)&lt;/li&gt;
&lt;li&gt;Agent Space -&amp;gt; Capabilities -&amp;gt; Communications -&amp;gt; Slack -&amp;gt; select channel -&amp;gt; Create&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Agent Space web app shows all investigation findings regardless. Slack is useful if you want findings posted to a channel without keeping the web app open.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/configuring-capabilities-connecting-ticketing-and-chat-slack.html" rel="noopener noreferrer"&gt;Connecting Slack&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  GitHub
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Agent Space -&amp;gt; Capabilities -&amp;gt; Pipeline -&amp;gt; Connect -&amp;gt; GitHub&lt;/li&gt;
&lt;li&gt;Install the AWS DevOps Agent GitHub App on your account&lt;/li&gt;
&lt;li&gt;Grant access to the &lt;code&gt;payledger-aws-devops-agent&lt;/code&gt; repository&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent investigates all three faults without GitHub. What GitHub adds is deployment correlation: for config-related faults, the agent can tie errors back to recent config changes and deployment history.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/configuring-capabilities-connecting-ci-cd-pipelines-github.html" rel="noopener noreferrer"&gt;Connecting GitHub&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The Demo: Three Faults at Once
&lt;/h2&gt;

&lt;p&gt;With everything set up, I ran &lt;code&gt;python scripts/fault.py inject&lt;/code&gt;. The default mode assigns one distinct fault per service simultaneously:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/fault.py inject
&lt;span class="c"&gt;# health       -&amp;gt; throttle   (reserved concurrency = 0)&lt;/span&gt;
&lt;span class="c"&gt;# transactions -&amp;gt; envvar     (TABLE_NAME removed)&lt;/span&gt;
&lt;span class="c"&gt;# balance      -&amp;gt; iam        (role swapped to fault-iam, no DynamoDB access)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The CloudWatch 5xx alarm for ap-southeast-1 fired at 21:30:02. Route 53 detected the failing health checks and routed traffic to ap-northeast-1. PayLedger continued serving from Tokyo. DevOps Agent started investigating automatically.&lt;/p&gt;

&lt;p&gt;Here is the full failover in action. You can see the region indicator shift from Singapore to Tokyo in real time:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xtiF5KeZdSs"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;


&lt;h2&gt;
  
  
  The Investigation
&lt;/h2&gt;

&lt;p&gt;The alarm triggered at 21:30:02. The investigation completed at 21:37:05. Total time: &lt;strong&gt;7 minutes and 3 seconds.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Investigation Timeline
&lt;/h3&gt;

&lt;p&gt;The agent opened by reading two things before making a single AWS API call: the Agent Space Understanding skill and the PayLedger component reference file, both auto-generated learned skills from the connected account. Before any CloudWatch or CloudTrail queries had returned, the agent already had context about the service architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq590s0g02h121dxx55wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq590s0g02h121dxx55wt.png" alt="Screenshot: Investigation timeline: start, skill reads, first observations" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From there it split into three parallel tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda logs:&lt;/strong&gt; 11 tool calls over 1 minute, comparing a baseline window (13:00-13:05 UTC) against the incident window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail changes:&lt;/strong&gt; 19 tool calls over 2 minutes 4 seconds, pulling config change events for the account and region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda metrics:&lt;/strong&gt; 7 tool calls over 1 minute 43 seconds, error counts, throttle counts, duration, and invocation counts per function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14oang107pquu6ptxsh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14oang107pquu6ptxsh9.png" alt="Screenshot: Investigation timeline: logs, metrics, audit trail" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By +2m16s, findings were coming back from all three tracks simultaneously.&lt;/p&gt;


&lt;h3&gt;
  
  
  Findings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Finding 1: listTransactions Lambda missing TABLE_NAME causing init crash&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every invocation of &lt;code&gt;payledger-dev-listTransactions&lt;/code&gt; failed during module initialization. The agent pulled the actual log entry from CloudWatch:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[2026-05-02T13:28:06.250Z] [ERROR] KeyError: 'TABLE_NAME'
Traceback (most recent call last):
&lt;/span&gt;&lt;span class="gp"&gt;  File "/var/task/functions/list_transactions.py", line 29, in &amp;lt;module&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;    TABLE_NAME = os.environ["TABLE_NAME"]
INIT_REPORT Phase: init  Status: error  Error Type: Runtime.Unknown
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There were 26 error records in the incident window and zero in the baseline. The agent confirmed the missing variable by inspecting the live function configuration directly: &lt;code&gt;ALLOWED_ORIGINS&lt;/code&gt;, &lt;code&gt;POWERTOOLS_SERVICE_NAME&lt;/code&gt;, &lt;code&gt;LOG_LEVEL&lt;/code&gt;, and &lt;code&gt;REGION&lt;/code&gt; were all present. No &lt;code&gt;TABLE_NAME&lt;/code&gt;. The function never got past initialization; every cold start failed before the handler could run.&lt;/p&gt;
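
&lt;p&gt;The traceback shows why the handler never got a chance: the variable is read at module scope, so the failure happens during Lambda's init phase. A minimal illustration of the pattern (not the actual function):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# Runs at import time, during Lambda init. If the variable is missing,
# the module never loads and every cold start reports an init error.
TABLE_NAME = os.environ["TABLE_NAME"]   # KeyError: 'TABLE_NAME'

def handler(event, context):
    # Never reached when init fails
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
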

&lt;p&gt;&lt;strong&gt;Finding 2: getBalance Lambda using fault-iam role with no DynamoDB permissions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The function was assigned &lt;code&gt;payledger-dev-fault-iam&lt;/code&gt;, which only has &lt;code&gt;AWSLambdaBasicExecutionRole&lt;/code&gt;. Every DynamoDB query returned &lt;code&gt;AccessDeniedException&lt;/code&gt;. The function handled the exception gracefully, so the Lambda Errors metric showed 0. API Gateway still recorded the 500s. The agent caught this by looking at both metrics separately rather than relying on either one alone.&lt;/p&gt;
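
&lt;p&gt;That metric split is easy to reproduce: if the handler catches the DynamoDB error and returns a 500 itself, the invocation completes cleanly from Lambda's point of view, so only API Gateway's 5xx count moves. An illustration, assuming a handler shaped roughly like this (table name and key schema are assumptions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

# Assumed table name; the real function resolves it from configuration
table = boto3.resource("dynamodb").Table("payledger-transactions")

def handler(event, context):
    try:
        result = table.query(KeyConditionExpression=Key("pk").eq("BALANCE"))
    except ClientError:
        # AccessDeniedException is handled here, so the Lambda Errors metric
        # stays at 0 while API Gateway records the 500 returned below.
        return {"statusCode": 500, "body": json.dumps({"error": "internal error"})}
    return {"statusCode": 200, "body": json.dumps(result["Items"], default=str)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
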

&lt;p&gt;&lt;strong&gt;Finding 3: health function throttled to zero&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reserved concurrency had been set to 0, blocking all invocations before execution. 11 throttles at 13:27, 79 throttles at 13:28. Invocation count at 13:28 dropped to only 20 from the normal 90-100 per minute. The function had zero errors when it did execute, confirming it was a concurrency limit, not a code problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The accounting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent reconciled the numbers before writing the final report:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Errors&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;health&lt;/code&gt; (reserved concurrency = 0)&lt;/td&gt;
&lt;td&gt;90 (11 + 79)&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;listTransactions&lt;/code&gt; (missing &lt;code&gt;TABLE_NAME&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;getBalance&lt;/code&gt; (wrong IAM role)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;100 5xx errors, all accounted for.&lt;/p&gt;


&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;CloudTrail confirmed the trigger. All three configuration changes happened within a 2-second window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;PutFunctionConcurrency&lt;/code&gt; on &lt;code&gt;payledger-dev-health&lt;/code&gt;. Reserved concurrency set to 0 (13:27:54Z)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UpdateFunctionConfiguration&lt;/code&gt; on &lt;code&gt;payledger-dev-listTransactions&lt;/code&gt;. All environment variables cleared (13:27:55Z)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UpdateFunctionConfiguration&lt;/code&gt; on &lt;code&gt;payledger-dev-getBalance&lt;/code&gt;. Execution role changed to &lt;code&gt;payledger-dev-fault-iam&lt;/code&gt;, env vars cleared (13:27:56Z)&lt;/li&gt;
&lt;/ol&gt;
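
&lt;p&gt;Reconstructed as boto3 calls, that burst looks roughly like this (a sketch of what &lt;code&gt;fault.py&lt;/code&gt; presumably runs; the actual script may differ):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

lam = boto3.client("lambda", region_name="ap-southeast-1")
FAULT_ROLE_ARN = "arn:aws:iam::123456789012:role/payledger-dev-fault-iam"  # placeholder account ID

# 1. Throttle: reserved concurrency 0 blocks every invocation of health
lam.put_function_concurrency(
    FunctionName="payledger-dev-health",
    ReservedConcurrentExecutions=0,
)

# 2. Clear the environment on listTransactions, dropping TABLE_NAME
lam.update_function_configuration(
    FunctionName="payledger-dev-listTransactions",
    Environment={"Variables": {}},
)

# 3. Swap getBalance onto the fault role with no DynamoDB access
lam.update_function_configuration(
    FunctionName="payledger-dev-getBalance",
    Role=FAULT_ROLE_ARN,
    Environment={"Variables": {}},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
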

&lt;p&gt;The root cause statement from the agent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The role name 'payledger-dev-fault-iam', the use of Boto3 scripting, and the rapid self-recovery at 13:29:00Z strongly indicate this was a deliberate chaos engineering / fault injection exercise rather than an accidental misconfiguration."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last line is the notable part: the agent identified the &lt;code&gt;devopsAgentTrigger&lt;/code&gt; Lambda in the stack and flagged the fault as intentional. It was right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67k5zc9vr90jrkca6nk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67k5zc9vr90jrkca6nk8.png" alt="Screenshot: Root Cause" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Mitigation Plan
&lt;/h3&gt;

&lt;p&gt;The agent returned: &lt;strong&gt;no mitigation action required.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two things happened in parallel during this incident. Route 53 detected the failing health checks and automatically failed over to ap-northeast-1 within 20 seconds, so the service kept running throughout. That part required no intervention. On the primary region side, the faults were reversed at 13:29:00 UTC when &lt;code&gt;fault.py restore&lt;/code&gt; ran, 2 minutes after injection. The agent saw the 5xx errors drop to 0, matched it against the CloudTrail restore events, and concluded there was nothing left to fix.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This was a controlled chaos engineering exercise to test system resilience. The incident self-recovered at 13:29:00 UTC, indicating the configurations were reverted as part of the planned test. Since this was intentional testing and the system has already recovered, no immediate operational mitigation is required."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A system that generates restore commands for changes that have already been reverted would be wrong. The agent recognized self-recovery and didn't produce output that didn't apply.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymicm5cbl18szzfffjj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymicm5cbl18szzfffjj7.png" alt="Screenshot: Mitigation plan tab" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;Here is the full AWS DevOps Agent investigation in action:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/4qBFwdP4gNQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;


&lt;h2&gt;
  
  
  Observations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The agent built its own context before touching a single API.&lt;/strong&gt; It started by reading the Agent Space Understanding skill, which auto-generates from your connected account and maps resources, request paths, and service relationships. Before any CloudWatch or CloudTrail queries had returned, it already had the architecture context to make sense of what it was about to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three root causes from one alarm.&lt;/strong&gt; A single 5xx alarm triggered. The agent identified three distinct failure mechanisms, attributed the exact error count to each (90 throttles, 5 init crashes, 5 IAM errors), and traced all three to the same 2-second injection window in CloudTrail. That correlation is not obvious when a throttle, a &lt;code&gt;KeyError&lt;/code&gt;, and an &lt;code&gt;AccessDeniedException&lt;/code&gt; don't look like they came from the same event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The empty mitigation plan was the correct answer.&lt;/strong&gt; My expectation was restore commands. Instead the agent returned "no mitigation action required." Route 53 had already kept the service running via automatic failover. The primary region faults were reversed by &lt;code&gt;fault.py restore&lt;/code&gt;. The agent recognized both facts in the metrics and CloudTrail, and declined to produce output that didn't apply. Knowing when not to act is more useful than generating work that doesn't need doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It identified the test as intentional.&lt;/strong&gt; Not just "three things broke." The agent concluded this was fault injection, named the evidence (role name, Boto3 scripting, 2-minute self-recovery), and assessed it correctly. That was not something I scripted or hinted at.&lt;/p&gt;


&lt;h2&gt;
  
  
  Restoring the Stack
&lt;/h2&gt;

&lt;p&gt;After the demo, restore all faults:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Restore all faults at once&lt;/span&gt;
python scripts/fault.py restore

&lt;span class="c"&gt;# Or restore individually&lt;/span&gt;
python scripts/restore_fault_iam.py &lt;span class="nt"&gt;--stage&lt;/span&gt; dev
python scripts/restore_fault_throttle.py &lt;span class="nt"&gt;--stage&lt;/span&gt; dev
python scripts/restore_fault_envvar.py &lt;span class="nt"&gt;--stage&lt;/span&gt; dev

&lt;span class="c"&gt;# Wait around 60s for health checks to pass&lt;/span&gt;
curl https://api-payledger.yourdomain.com/health
&lt;span class="c"&gt;# {"status": "healthy", "region": "ap-southeast-1"}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Once the health checks recover, Route 53 routes traffic back to ap-southeast-1. The primary region is restored.&lt;/p&gt;


&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;DR Toolkit&lt;/strong&gt; series covered Prepare. This series covered the middle: a multi-region demo app with real failover, three simultaneous faults, and &lt;strong&gt;AWS DevOps Agent&lt;/strong&gt; investigating all of them from a single alarm trigger. The agent identified the root cause, recognized the service had already recovered, and correctly concluded no action was needed, because the evidence from logs, metrics, and CloudTrail told it this was an injected fault, not a real incident.&lt;/p&gt;

&lt;p&gt;Route 53 kept the service running by routing to the healthy region. DevOps Agent used that time to find exactly what broke in the primary region. That is the relationship between the two: one buys you time, the other uses it.&lt;/p&gt;

&lt;p&gt;The Agent Space Understanding skill was the most visible differentiator in this investigation. It auto-generated from the connected account and gave the agent architecture context before the first API call. No manual input required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS DevOps Agent&lt;/strong&gt; handles the full investigation loop on its own: topology discovery, root cause analysis, and Slack notification. If you have a previous DR Toolkit runbook, you can optionally load it as a Custom Skill to give the agent extra context. If you haven't seen the DR Toolkit series: &lt;a href="https://dev.to/romarcablao/series/38086"&gt;BuildWithAI: DR Toolkit on AWS&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Try it / Fork it:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PayLedger Repo:&lt;/strong&gt; &lt;a href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;github.com/romarcablao/payledger-aws-devops-agent&lt;/a&gt; &lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/romarcablao" rel="noopener noreferrer"&gt;
        romarcablao
      &lt;/a&gt; / &lt;a href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;
        payledger-aws-devops-agent
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      DevOpsAgent: Beyond the Runbook
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;PayLedger — Multi-Region Serverless Payment Ledger&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/romarcablao/payledger-aws-devops-agent/docs/assets/aws-devops-agent-topology.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fromarcablao%2Fpayledger-aws-devops-agent%2FHEAD%2Fdocs%2Fassets%2Faws-devops-agent-topology.png" alt="AWS DevOps Agent Topology"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multi-region serverless payment ledger for recording transactions and viewing balances with active-passive failover. Deployed across &lt;strong&gt;ap-southeast-1&lt;/strong&gt; (Singapore, primary) and &lt;strong&gt;ap-northeast-1&lt;/strong&gt; (Tokyo, secondary) using AWS Lambda, DynamoDB Global Tables, and Route 53 failover routing.&lt;/p&gt;

&lt;p&gt;Built as a demonstration platform for disaster recovery testing with AWS DevOps Agent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; PayLedger is a demo project. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://kiro.dev" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3696d1e6677c4f16e33e8c23c69699d94c48d7d0a78a7627118a47c2a9e2fd7f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4b69726f2d4944452d626c75653f6c6f676f3d646174613a696d6167652f7376672b786d6c3b6261736536342c50484e325a79423361575230614430694d6a51694947686c6157646f644430694d6a516949485a705a58644362336739496a41674d4341794e4341794e4349675a6d6c7362443069626d39755a53496765473173626e4d39496d6830644841364c79393364336375647a4d7562334a6e4c7a49774d44417663335a6e496a3438634746306143426b50534a4e4d5449674d6b7730494464574d54644d4d5449674d6a4a4d4d6a41674d5464574e3077784d694179576949675a6d6c736244306964326870644755694c7a34384c334e325a7a343d267374796c653d666f722d7468652d6261646765" alt="Kiro"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/devops-agent/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2cccc7fc811a2c85bb42de7adb48f816cc220c1cf8ab2dd894cbddb938c96ab1/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532532304465764f70732532304167656e742d4175746f6e6f6d6f75732532304f70732d4646393930303f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="AWS DevOps Agent"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/lambda/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/27ec8ce949c39eca034ccd1684eb245e35b3642da7bbd83463606d6ccd5750f1/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532532304c616d6264612d5365727665726c6573732d4646393930303f6c6f676f3d6177736c616d626461266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="AWS Lambda"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/dynamodb/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f1930ecbfe81f1c17aef44b24e89c80a2c64f358d93584fe4a36d8340cc168db/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f416d617a6f6e25323044796e616d6f44422d476c6f62616c2532305461626c65732d3430353344363f6c6f676f3d616d617a6f6e64796e616d6f6462266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon DynamoDB"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/route53/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8cdb0d62d6a60fe2fc48c87a4ad1af01db17d63f7d07b6d040a035a8adc1fe5e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f526f75746525323035332d4661696c6f766572253230444e532d3843344646463f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon Route 53"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/cloudfront/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d0e0694e3b1ad9971a43bc03cc671f6a2c3035a8d713f412ec34e968c1b4f7d7/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c6f756446726f6e742d43444e2d3843344646463f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon CloudFront"&gt;&lt;/a&gt;
&lt;a href="https://nextjs.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e4987d8ec5523bda97f9a5862a7f29156a391f89d7fad452858e051a64179762/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e6578742e6a732d46726f6e74656e642d3030303030303f6c6f676f3d6e657874646f746a73266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Next.js"&gt;&lt;/a&gt;
&lt;a href="https://www.python.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/344c953e0c4edc545a7acd96ef5e5f28277afd590b1f140ea99144b12de64f31/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e253230332e31322d52756e74696d652d3337373641423f6c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Python"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;                    payledger.yourdomain.com (CloudFront + S3)
                              │
                         Next.js static UI (balance, transactions, region indicator)
                              │
                              ▼
                    api-payledger.yourdomain.com
                              │
                    Route 53 failover routing
                    ├── PRIMARY   ──▶ apse1-api-payledger.yourdomain.com  ← health check
                    └── SECONDARY ──▶ apne1-api-payledger.yourdomain.com  ← health check
                    TTL: 60s | health check: 10s interval, 2 failures to trip
                              │
               ┌──────────────┴──────────────┐
               │                             │
    ap-southeast-1 (Singapore)     ap-northeast-1 (Tokyo)
    ├── API Gateway (regional)     ├── API Gateway (regional)
    ├── Lambda: createTransaction  ├── Lambda: createTransaction
    ├── Lambda: listTransactions   ├── Lambda:&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-agent/features/" rel="noopener noreferrer"&gt;AWS DevOps Agent features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-agent/pricing/" rel="noopener noreferrer"&gt;AWS DevOps Agent Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-devops-agent-skills.html" rel="noopener noreferrer"&gt;DevOps Agent Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-learned-skills.html" rel="noopener noreferrer"&gt;Learned Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;Amazon DynamoDB Global Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html" rel="noopener noreferrer"&gt;Amazon Route 53 Failover Routing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>aiops</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Runbooks Don't Investigate. AWS DevOps Agent Does.</title>
      <dc:creator>Romar Cablao</dc:creator>
      <pubDate>Sun, 03 May 2026 13:14:15 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/runbooks-dont-investigate-aws-devops-agent-does-44p8</link>
      <guid>https://vibe.forem.com/aws-builders/runbooks-dont-investigate-aws-devops-agent-does-44p8</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;I finished the DR Toolkit thinking I had covered the important parts of disaster recovery: runbooks, RTO/RPO targets, post-mortems. Then I mapped out the actual incident lifecycle and realized everything I built sits at the edges. The middle part (detecting the incident, correlating signals across regions, finding the root cause while the primary region is actively failing) was not covered. That gap is what this series is about.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/romarcablao/series/38086"&gt;BuildWithAI: DR Toolkit on AWS&lt;/a&gt; series, I ran through how you can build six AI-powered tools that automate the tedious parts of DR planning, all running on serverless AWS in ap-southeast-1. Those tools handle what you do before an incident and what you do after. But the part in between, the actual incident response, none of them touch.&lt;/p&gt;

&lt;p&gt;This series covers that middle phase using &lt;strong&gt;AWS DevOps Agent&lt;/strong&gt;. The demo app is &lt;strong&gt;PayLedger&lt;/strong&gt;, a multi-region serverless payment ledger built specifically for this blog. It is not a real product and contains no real user data. Part 1 maps out the gap, introduces DevOps Agent, and walks through the architecture. Part 2 covers the full setup and the actual demo, including what the agent's investigation looked like when I ran three real faults against it.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xtiF5KeZdSs"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  The DR Lifecycle, Mapped Out
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;th&gt;Covered by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prepare&lt;/td&gt;
&lt;td&gt;Runbooks, RTO/RPO targets, DR strategy, checklists&lt;/td&gt;
&lt;td&gt;DR Toolkit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detect&lt;/td&gt;
&lt;td&gt;Alarm fires, SNS notifies DevOps Agent, health check fails, DNS fails over&lt;/td&gt;
&lt;td&gt;CloudWatch + Route 53 + SNS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Investigate&lt;/td&gt;
&lt;td&gt;Root cause analysis, cross-region signal correlation&lt;/td&gt;
&lt;td&gt;AWS DevOps Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recover&lt;/td&gt;
&lt;td&gt;Apply fix, bring the unhealthy region back up, validate failback&lt;/td&gt;
&lt;td&gt;Human + runbook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learn&lt;/td&gt;
&lt;td&gt;Prevention recommendations, operational improvements&lt;/td&gt;
&lt;td&gt;DevOps Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The DR Toolkit is solid for Prepare. CloudWatch and Route 53 handle Detect: alarms fire and Route 53 failover routes traffic to the healthy region automatically. But Investigate is the phase with no real tooling unless someone built it themselves: figuring out why a service running in the primary region is down, correlating signals across services, and giving the team the information needed to bring that region back up.&lt;/p&gt;

&lt;p&gt;That is what AWS DevOps Agent targets.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is AWS DevOps Agent?
&lt;/h2&gt;

&lt;p&gt;AWS DevOps Agent is a frontier agent for cloud operations. "Frontier agent" is AWS's term for autonomous systems that work independently, scale across concurrent tasks, and run persistently without constant human oversight. It starts working the moment an alarm fires, no manual trigger needed.&lt;/p&gt;

&lt;p&gt;Three capabilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous incident response.&lt;/strong&gt; When an alert comes in, the agent starts investigating immediately. It correlates signals across services and regions. If multiple alarms fire from the same root cause, it identifies them as related rather than treating each one separately. Root cause categories it investigates: system changes, input anomalies, resource limits, component failures, and dependency issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive incident prevention.&lt;/strong&gt; After an investigation, the agent recommends improvements in four areas: observability, infrastructure optimization, deployment pipeline, and application resilience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-demand SRE tasks.&lt;/strong&gt; Conversational chat against your actual infrastructure. You can ask about resource state, alarm status, or deployment history without switching consoles.&lt;/p&gt;

&lt;p&gt;The service uses a dual-console architecture. The AWS Console is for admin setup (Agent Space creation, integrations). A separate Agent Space web app is for day-to-day work (investigations, topology, prevention, chat).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;More on features: &lt;a href="https://aws.amazon.com/devops-agent/features/" rel="noopener noreferrer"&gt;AWS DevOps Agent features&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent.html" rel="noopener noreferrer"&gt;About AWS DevOps Agent&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A Note on Region Availability
&lt;/h2&gt;

&lt;p&gt;As of this writing, AWS DevOps Agent is not available in ap-southeast-1 (Singapore) at GA. Supported regions are: us-east-1, us-west-2, eu-central-1, eu-west-1, ap-southeast-2, ap-northeast-1. AWS may add support for more regions in the future, so it is worth checking the &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-supported-regions.html" rel="noopener noreferrer"&gt;supported regions page&lt;/a&gt; before you start.&lt;/p&gt;

&lt;p&gt;The two closest for SEA builders are &lt;strong&gt;ap-southeast-2 (Sydney)&lt;/strong&gt; and &lt;strong&gt;ap-northeast-1 (Tokyo)&lt;/strong&gt;. For this demo I used ap-southeast-2, but you can use any supported region you prefer. The Agent Space and its investigation data live there. Your workload stays wherever it is. Cross-region monitoring means the agent discovers and monitors resources across any linked AWS account regardless of region.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Agent Space region is where your investigation data is stored, not where your app runs. For this demo, a single Agent Space in ap-southeast-2 monitors resources in both ap-southeast-1 and ap-northeast-1.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-supported-regions.html" rel="noopener noreferrer"&gt;AWS DevOps Agent Supported Regions&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Demo App: PayLedger
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5tmo5fofi3k3tddbn8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5tmo5fofi3k3tddbn8u.png" alt="PayLedger Topology" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; PayLedger is a demo project built solely for this blog series. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information. All data is synthetic and generated by a seed script.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A payment ledger is a practical choice for a DR demo because the requirements are clear. Any outage means transactions fail and balances go stale. The multi-region setup is the right response to that, not over-engineering.&lt;/p&gt;

&lt;p&gt;PayLedger has four endpoints: record a transaction, list recent transactions, get the current balance, and a health check. Deployed to two regions with Route 53 active-passive failover and DynamoDB Global Tables for data replication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    payledger.yourdomain.com (CloudFront + S3)
                              |
                         Next.js UI
                         (balance, transactions, region indicator)
                              | calls
                              v
                    api-payledger.yourdomain.com
                              |
                         Route 53 (failover routing)
                         |-- PRIMARY  -&amp;gt; ap-southeast-1 (Singapore)
                         +-- SECONDARY -&amp;gt; ap-northeast-1 (Tokyo)

    ap-southeast-1                         ap-northeast-1
    +-- API Gateway                        +-- API Gateway
    +-- Lambda: createTransaction          +-- Lambda: createTransaction
    +-- Lambda: listTransactions           +-- Lambda: listTransactions
    +-- Lambda: getBalance                 +-- Lambda: getBalance
    +-- Lambda: health                     +-- Lambda: health
    +-- Lambda: devopsAgentTrigger         +-- Lambda: devopsAgentTrigger
    +-- DynamoDB &amp;lt;-- Global Table --&amp;gt;      +-- DynamoDB (replica)
    +-- SNS Topic (alarm notifications)    +-- SNS Topic (alarm notifications)
    +-- CloudWatch alarms                  +-- CloudWatch alarms

                    ap-southeast-2 (Sydney)
                    +-- AWS DevOps Agent
                        +-- Agent Space
                        +-- Slack (optional)
                        +-- GitHub (optional)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Next.js (static) + S3 + CloudFront&lt;/td&gt;
&lt;td&gt;payledger.yourdomain.com&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS&lt;/td&gt;
&lt;td&gt;Route 53&lt;/td&gt;
&lt;td&gt;Failover routing + health checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;Lambda (Python 3.12)&lt;/td&gt;
&lt;td&gt;5 functions per region&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;API Gateway (HTTP API, regional)&lt;/td&gt;
&lt;td&gt;Custom domain per region&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;DynamoDB Global Tables&lt;/td&gt;
&lt;td&gt;Multi-region replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;CloudWatch&lt;/td&gt;
&lt;td&gt;Alarms in both regions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Route 53 checks &lt;code&gt;/health&lt;/code&gt; every 10 seconds. If the health check fails twice (around 20 seconds), DNS fails over to Tokyo automatically. Traffic routes to the healthy region while the team investigates and works to restore the primary. The frontend polls &lt;code&gt;/health&lt;/code&gt; every 5 seconds and shows which region is serving: green for Singapore (PRIMARY), amber for Tokyo (FAILOVER).&lt;/p&gt;

&lt;p&gt;DynamoDB Global Tables replicate data between both regions. After failover, the balance and transaction history are intact in Tokyo. Same data, just a different region serving it. That is the whole point of the architecture.&lt;/p&gt;
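
&lt;p&gt;You can verify the replication yourself with two boto3 clients: write in Singapore, then read the same item back from Tokyo a moment later (table name, key schema, and the short sleep are assumptions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import boto3

TABLE = "payledger-transactions"  # assumed table name

sin = boto3.resource("dynamodb", region_name="ap-southeast-1").Table(TABLE)
tyo = boto3.resource("dynamodb", region_name="ap-northeast-1").Table(TABLE)

# Write a synthetic transaction to the primary region
sin.put_item(Item={"pk": "TXN#demo", "sk": "2026-05-02T13:00:00Z", "amount": 42})

# Global Tables replication is typically sub-second; give it a moment
time.sleep(2)
item = tyo.get_item(Key={"pk": "TXN#demo", "sk": "2026-05-02T13:00:00Z"}).get("Item")
print(item)  # the same record, served from the Tokyo replica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
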


&lt;h2&gt;
  
  
  How the Demo Works
&lt;/h2&gt;

&lt;p&gt;When faults are injected into ap-southeast-1, the health check starts failing. Route 53 detects the failure and routes traffic to ap-northeast-1 within around 20 seconds. Users continue to be served from Tokyo while DevOps Agent investigates in the background. Once the agent identifies the root causes and the team applies the fixes, the primary region recovers and Route 53 fails back.&lt;/p&gt;

&lt;p&gt;This is the core of the DR story: &lt;strong&gt;failover keeps the service running; the investigation tells you what broke so you can fix it.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Three Fault Scenarios
&lt;/h2&gt;

&lt;p&gt;In Part 2, I inject three faults against the primary region using &lt;code&gt;fault.py&lt;/code&gt;, a Python script for fault injection and restoration. Each represents a common real-world serverless incident.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Fault&lt;/th&gt;
&lt;th&gt;How it breaks&lt;/th&gt;
&lt;th&gt;Root cause category&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;IAM permission denied&lt;/td&gt;
&lt;td&gt;Role swapped to fault role with no DynamoDB access&lt;/td&gt;
&lt;td&gt;System change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Lambda throttling&lt;/td&gt;
&lt;td&gt;Reserved concurrency = 0, 429 before function runs&lt;/td&gt;
&lt;td&gt;Resource limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Missing environment variable&lt;/td&gt;
&lt;td&gt;TABLE_NAME removed, KeyError at module load&lt;/td&gt;
&lt;td&gt;Code/config change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What makes this interesting: all three run simultaneously using &lt;code&gt;python scripts/fault.py inject&lt;/code&gt; (the default mode assigns one distinct fault per service). One alarm fires in ap-southeast-1, three different root causes show up in the investigation, and DevOps Agent has to untangle all of them in a single run. That is a harder test than running each fault separately.&lt;/p&gt;


&lt;h2&gt;
  
  
  Where This Fits in the DR Lifecycle
&lt;/h2&gt;

&lt;p&gt;The DR Toolkit covered the Prepare phase. This series covers Investigate and Recover. The part that happens after the alarm fires.&lt;/p&gt;

&lt;p&gt;DevOps Agent does not need the DR Toolkit to investigate. It reads your topology, correlates signals across services, identifies root causes, and posts findings to Slack on its own. AWS DevOps Agent is capable enough to detect, investigate, root cause, and even generate post-mortem inputs without any external tool.&lt;/p&gt;

&lt;p&gt;The connection here is context: if you want to give the agent extra architecture knowledge upfront, you can optionally load a runbook generated by the DR Toolkit as a Custom Skill.&lt;/p&gt;


&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;In Part 2, we'll get our hands dirty with the full setup and the demo: deploying PayLedger to both regions, configuring Route 53 failover, setting up the Agent Space, and then running the faults. I'll walk through the actual investigation the agent ran: the timeline, the findings, the root cause, and what it concluded about mitigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhuojko7zuk1f63supvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhuojko7zuk1f63supvo.png" alt="Up Next" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Try it / Fork it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PayLedger Repo:&lt;/strong&gt; &lt;a href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;github.com/romarcablao/payledger-aws-devops-agent&lt;/a&gt; &lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/romarcablao" rel="noopener noreferrer"&gt;
        romarcablao
      &lt;/a&gt; / &lt;a href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;
        payledger-aws-devops-agent
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      DevOpsAgent: Beyond the Runbook
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;PayLedger — Multi-Region Serverless Payment Ledger&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/romarcablao/payledger-aws-devops-agent/docs/assets/aws-devops-agent-topology.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fromarcablao%2Fpayledger-aws-devops-agent%2FHEAD%2Fdocs%2Fassets%2Faws-devops-agent-topology.png" alt="AWS DevOps Agent Topology"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multi-region serverless payment ledger for recording transactions and viewing balances with active-passive failover. Deployed across &lt;strong&gt;ap-southeast-1&lt;/strong&gt; (Singapore, primary) and &lt;strong&gt;ap-northeast-1&lt;/strong&gt; (Tokyo, secondary) using AWS Lambda, DynamoDB Global Tables, and Route 53 failover routing.&lt;/p&gt;

&lt;p&gt;Built as a demonstration platform for disaster recovery testing with AWS DevOps Agent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; PayLedger is a demo project. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://kiro.dev" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3696d1e6677c4f16e33e8c23c69699d94c48d7d0a78a7627118a47c2a9e2fd7f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4b69726f2d4944452d626c75653f6c6f676f3d646174613a696d6167652f7376672b786d6c3b6261736536342c50484e325a79423361575230614430694d6a51694947686c6157646f644430694d6a516949485a705a58644362336739496a41674d4341794e4341794e4349675a6d6c7362443069626d39755a53496765473173626e4d39496d6830644841364c79393364336375647a4d7562334a6e4c7a49774d44417663335a6e496a3438634746306143426b50534a4e4d5449674d6b7730494464574d54644d4d5449674d6a4a4d4d6a41674d5464574e3077784d694179576949675a6d6c736244306964326870644755694c7a34384c334e325a7a343d267374796c653d666f722d7468652d6261646765" alt="Kiro"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/devops-agent/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2cccc7fc811a2c85bb42de7adb48f816cc220c1cf8ab2dd894cbddb938c96ab1/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532532304465764f70732532304167656e742d4175746f6e6f6d6f75732532304f70732d4646393930303f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="AWS DevOps Agent"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/lambda/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/27ec8ce949c39eca034ccd1684eb245e35b3642da7bbd83463606d6ccd5750f1/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532532304c616d6264612d5365727665726c6573732d4646393930303f6c6f676f3d6177736c616d626461266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="AWS Lambda"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/dynamodb/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f1930ecbfe81f1c17aef44b24e89c80a2c64f358d93584fe4a36d8340cc168db/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f416d617a6f6e25323044796e616d6f44422d476c6f62616c2532305461626c65732d3430353344363f6c6f676f3d616d617a6f6e64796e616d6f6462266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon DynamoDB"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/route53/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8cdb0d62d6a60fe2fc48c87a4ad1af01db17d63f7d07b6d040a035a8adc1fe5e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f526f75746525323035332d4661696c6f766572253230444e532d3843344646463f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon Route 53"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/cloudfront/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d0e0694e3b1ad9971a43bc03cc671f6a2c3035a8d713f412ec34e968c1b4f7d7/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c6f756446726f6e742d43444e2d3843344646463f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon CloudFront"&gt;&lt;/a&gt;
&lt;a href="https://nextjs.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e4987d8ec5523bda97f9a5862a7f29156a391f89d7fad452858e051a64179762/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e6578742e6a732d46726f6e74656e642d3030303030303f6c6f676f3d6e657874646f746a73266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Next.js"&gt;&lt;/a&gt;
&lt;a href="https://www.python.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/344c953e0c4edc545a7acd96ef5e5f28277afd590b1f140ea99144b12de64f31/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e253230332e31322d52756e74696d652d3337373641423f6c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Python"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;                    payledger.yourdomain.com (CloudFront + S3)
                              │
                         Next.js static UI (balance, transactions, region indicator)
                              │
                              ▼
                    api-payledger.yourdomain.com
                              │
                    Route 53 failover routing
                    ├── PRIMARY   ──▶ apse1-api-payledger.yourdomain.com  ← health check
                    └── SECONDARY ──▶ apne1-api-payledger.yourdomain.com  ← health check
                    TTL: 60s | health check: 10s interval, 2 failures to trip
                              │
               ┌──────────────┴──────────────┐
               │                             │
    ap-southeast-1 (Singapore)     ap-northeast-1 (Tokyo)
    ├── API Gateway (regional)     ├── API Gateway (regional)
    ├── Lambda: createTransaction  ├── Lambda: createTransaction
    ├── Lambda: listTransactions   ├── Lambda:&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-agent/features/" rel="noopener noreferrer"&gt;AWS DevOps Agent features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent.html" rel="noopener noreferrer"&gt;About AWS DevOps Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-supported-regions.html" rel="noopener noreferrer"&gt;AWS DevOps Agent Supported Regions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;Amazon DynamoDB Global Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html" rel="noopener noreferrer"&gt;Amazon Route 53 Failover Routing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html" rel="noopener noreferrer"&gt;Disaster Recovery of Workloads on AWS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>aiops</category>
      <category>disasterrecovery</category>
    </item>
    <item>
      <title>Turn WebSockets into Async/Await Requests (AWS WebSocket API Gateway + Lambda)</title>
      <dc:creator>Rishi</dc:creator>
      <pubDate>Sun, 03 May 2026 09:46:46 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/turn-websockets-into-asyncawait-requests-aws-api-websocket-gateway-lambda-1cpl</link>
      <guid>https://vibe.forem.com/aws-builders/turn-websockets-into-asyncawait-requests-aws-api-websocket-gateway-lambda-1cpl</guid>
      <description>&lt;p&gt;Some time ago, I was building a chat application using AWS Websocket API gateway. Things were going smoothly. I created a WebSocket API Gateway, added $connect, $disconnect, and sendMessage/addGroup routes. From the frontend (React) side, everything was fire-and-forget. You send a message, and the onMessageHandler takes care of it 💪🏼&lt;/p&gt;

&lt;p&gt;But then a new requirement of uploading files using S3 signed URLs came up. That's where I needed the Async/Await promise pattern. Now, one option was to create an HTTP API gateway and use it. But that meant a new connection, a new authorizer, and more setup. At that moment, I wished there was a way to use this existing WebSocket connection to get the signed URL ⭐&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwo0x4c82wxa6vfx76we.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwo0x4c82wxa6vfx76we.gif" alt="wish-movie-refernce" width="400" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s how this library "&lt;a href="https://www.npmjs.com/package/@tricksumo/ws-await" rel="noopener noreferrer"&gt;ws-await&lt;/a&gt;" was born!&lt;/p&gt;

&lt;p&gt;It lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;establish a WebSocket connection&lt;/li&gt;
&lt;li&gt;send normal fire-and-forget messages&lt;/li&gt;
&lt;li&gt;send messages and wait for the response using async/await&lt;/li&gt;
&lt;li&gt;handle reconnection with exponential backoff&lt;/li&gt;
&lt;li&gt;auto-send heartbeat messages to keep the connection alive&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;For the async/await pattern, every message from the frontend is sent with a unique &lt;code&gt;requestId&lt;/code&gt;. The client keeps a map of pending promises keyed by that id. On the Lambda side, the backend reads the &lt;code&gt;requestId&lt;/code&gt; and sends it back in the response.&lt;/p&gt;

&lt;p&gt;Each received message is checked for a &lt;code&gt;requestId&lt;/code&gt;; if it matches the id of a pending promise in the map, that promise is resolved. If a promise sits idle in the map for more than ~30 seconds, it is rejected.&lt;/p&gt;
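&lt;p&gt;To make the correlation concrete, here is a minimal conceptual sketch of that idea (written in Python purely for illustration; the library itself is JavaScript, and these names are made up — &lt;code&gt;ws&lt;/code&gt; stands in for any object with an async &lt;code&gt;send&lt;/code&gt; method):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio
import json
import uuid

# Pending map: requestId mapped to the Future that will resolve with the reply.
pending = {}

async def send_and_wait(ws, action, payload, timeout=30.0):
    """Send a message tagged with a unique requestId and wait for the matching reply."""
    request_id = str(uuid.uuid4())
    future = asyncio.get_running_loop().create_future()
    pending[request_id] = future
    await ws.send(json.dumps({"action": action, "requestId": request_id, **payload}))
    try:
        # Rejected (asyncio.TimeoutError) if the reply never arrives within the timeout.
        return await asyncio.wait_for(future, timeout)
    finally:
        pending.pop(request_id, None)

def on_message(raw):
    """Resolve the pending Future whose requestId matches the incoming message."""
    message = json.loads(raw)
    future = pending.get(message.get("requestId"))
    if future is not None and not future.done():
        future.set_result(message)
&lt;/code&gt;&lt;/pre&gt;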

&lt;h2&gt;
  
  
  Steps to use:
&lt;/h2&gt;

&lt;p&gt;Step 1: Install the library in your React project&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm install @tricksumo/ws-await zustand&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Step 2: Import &lt;code&gt;createSocket()&lt;/code&gt; and establish the connection. Then call &lt;code&gt;ws.send("action")&lt;/code&gt; for fire-and-forget messages and &lt;code&gt;await ws.request("action")&lt;/code&gt; for the async/await pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createSocket&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@tricksumo/ws-await&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useEffect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createSocket&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wss://id.execute-api.us-east-1.amazonaws.com/prod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;App&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disconnect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleGetSignedURL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;getSignedURL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;fileType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image/png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Signed URL:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Request failed:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleGetSignedURL&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="nx"&gt;Click&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="kd"&gt;get&lt;/span&gt; &lt;span class="nx"&gt;signed&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;
        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;export default App&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Step 3: Your Lambda must echo the &lt;code&gt;requestId&lt;/code&gt; back in its response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileType&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;signedUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getPresignedUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="nx"&gt;signedUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// ← REQUIRED: echo it back or the Promise never resolves&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To read the connection state in the frontend, use the &lt;code&gt;useSocket&lt;/code&gt; hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useSocket&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@tricksumo/ws-await&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;StatusBar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;isConnected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isConnecting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useSocket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isConnecting&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Connecting&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;isConnected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Disconnected&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Connected&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initially, this logic was part of my chat application named &lt;a href="https://tricksumo.com/serverless-chat-app/" rel="noopener noreferrer"&gt;Chatlings&lt;/a&gt;. But I thought it might help others, so I extracted it to create my first ever library 🙌🏼&lt;/p&gt;

</description>
      <category>aws</category>
      <category>websocket</category>
      <category>lambda</category>
      <category>javascriptlibraries</category>
    </item>
    <item>
      <title>Stateful MCP Servers on ECS Fargate: What Happens When You Deploy</title>
      <dc:creator>Avinash Dalvi</dc:creator>
      <pubDate>Sun, 03 May 2026 02:46:00 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/stateful-mcp-servers-on-ecs-fargate-what-happens-when-you-deploy-12l9</link>
      <guid>https://vibe.forem.com/aws-builders/stateful-mcp-servers-on-ecs-fargate-what-happens-when-you-deploy-12l9</guid>
      <description>&lt;p&gt;A few weeks back I was working on a PoC with Bedrock AgentCore Runtime. While doing that I came across multiple blogs and discussions around MCP server hosting on AWS. Most of them were pointing to either Bedrock AgentCore or Lambda. Very few talked about ECS Fargate.&lt;/p&gt;

&lt;p&gt;That got me thinking. I have been using Fargate for containerised workloads for a while now. It is my go-to when a team needs containers without managing the underlying infrastructure. So the question came naturally — can Fargate host a stateful MCP server? And more importantly, what happens when you actually deploy it in a real scenario?&lt;/p&gt;

&lt;p&gt;As an architect I believe you should know all the options before recommending one. Not just what the docs say — what actually happens when you run it. So I decided to test it myself.&lt;/p&gt;

&lt;p&gt;This blog is what I found. Specifically what happens when you run a stateful MCP server on ECS Fargate and then do a rolling deployment while a session is active. The results were not what I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP hosting on AWS — what are your options?
&lt;/h2&gt;

&lt;p&gt;Before jumping into the experiment, let me give some context on why Fargate and not the other options.&lt;/p&gt;

&lt;p&gt;When it comes to hosting MCP servers on AWS you have three realistic paths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bedrock AgentCore Runtime&lt;/strong&gt; is AWS's managed MCP hosting service. You write your MCP server, deploy it, and AgentCore handles session isolation at the platform level. It supports both stateless and stateful MCP servers. By default stateless mode is recommended — AgentCore automatically adds an &lt;code&gt;Mcp-Session-Id&lt;/code&gt; header and manages connection continuity at the platform level. For multi-turn interactions that need session state preserved across requests, stateful mode (&lt;code&gt;stateless_http=False&lt;/code&gt;) is available and the runtime handles session preservation within the same invocation. The key difference from running stateful MCP on Fargate yourself: AgentCore manages the session layer for you regardless of mode. You are not responsible for sticky sessions, deregistration delays, or what happens to your session during a platform update. That operational burden stays with AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt; comes in two modes now and the difference matters for MCP.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Standard Lambda is stateless by nature. Cold starts are a latency concern — and since August 2025 also a cost concern, as AWS now bills the INIT phase the same as invocation duration. For lightweight or infrequent MCP tool calls this is still simple and cost-effective. But for agent workloads where a session expects low latency tool calls, standard Lambda cold starts can be disruptive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda Managed Instances (LMI)&lt;/strong&gt; changes the picture. LMI runs your Lambda functions on EC2 instances in your own account — AWS still manages the instance lifecycle, patching and scaling, but your functions run on longer-lived compute. The result: no cold starts at all, multi-concurrency support where each execution environment handles multiple invocations simultaneously, and EC2-based pricing which can be significantly cheaper for steady-state workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For MCP specifically, LMI is an interesting option for lighter workloads that need low latency tool calls without cold start risk, while keeping the serverless programming model. The constraint is the same as standard Lambda — stateless by nature, so session context still has to live somewhere else. But the cold start objection largely disappears with LMI.&lt;/p&gt;

&lt;p&gt;LMI is designed for steady-state predictable workloads — it scales more gradually than standard Lambda and does not burst instantly. If your MCP workload has very spiky or unpredictable traffic, standard Lambda or Fargate may still be better suited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ECS Fargate&lt;/strong&gt; gives you your own container, your own session model, your own trade-offs. Fits teams already running Fargate workloads, teams with compliance or data residency requirements, or teams building something the managed service does not support yet. More control, more responsibility.&lt;/p&gt;

&lt;p&gt;I chose Fargate because I already use it and wanted to understand what it actually does with stateful MCP under real conditions — not a happy path demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the experiment — with help from Kiro
&lt;/h2&gt;

&lt;p&gt;When I started looking at the AWS sample repository for stateful MCP on ECS — &lt;a href="https://github.com/aws-samples/sample-serverless-mcp-servers/tree/main/stateful-mcp-on-ecs-python" rel="noopener noreferrer"&gt;aws-samples/sample-serverless-mcp-servers&lt;/a&gt; — I found it was SAM based. It also expected VPC, CIDR, ALB and other networking prerequisites to be in place before running &lt;code&gt;sam deploy&lt;/code&gt;. That meant doing a lot of manual setup before I could even start the experiment.&lt;/p&gt;

&lt;p&gt;I did not want to spend my weekend debugging SAM prerequisites. I wanted to get to the actual experiment.&lt;/p&gt;

&lt;p&gt;So I decided to build the infrastructure from scratch. And this is where &lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; helped. I used Kiro — AWS's agentic IDE — to scaffold the entire experiment setup: the FastMCP server, the CDK infrastructure including VPC, ALB, ECS cluster and Fargate task definition, and the test client.&lt;/p&gt;

&lt;p&gt;Here is what I built:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A stateful FastMCP server in Python holding session state in memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ALB with sticky sessions enabled — &lt;code&gt;lb_cookie&lt;/code&gt; type, 1 day duration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ECS Fargate service with 2 tasks and rolling deployment configured&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A test client using &lt;code&gt;httpx&lt;/code&gt; with a persistent cookie jar, making continuous tool calls every 5 seconds&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Task ID instrumented in every tool response by fetching from the ECS container metadata endpoint&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ECS_CONTAINER_METADATA_URI_V4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;TASK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TaskARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# fetched once at server startup
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I deliberately did not add a &lt;code&gt;SIGTERM&lt;/code&gt; handler, did not externalise session state, and did not add any retry logic. I wanted to observe the default — what the pattern actually does out of the box before any hardening. The test client ran two operations per cycle: &lt;code&gt;set_session_value&lt;/code&gt; to write state, followed immediately by &lt;code&gt;get_session_state&lt;/code&gt; to read it back and confirm. Session state accumulated across calls — &lt;code&gt;seq_1&lt;/code&gt;, &lt;code&gt;seq_2&lt;/code&gt;, &lt;code&gt;seq_3&lt;/code&gt; and so on — so any loss of state would be immediately visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I observed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The setup confirmed
&lt;/h3&gt;

&lt;p&gt;Before triggering any deployment I confirmed the ALB configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetiwinmdpbexyjso9j8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetiwinmdpbexyjso9j8m.png" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything correctly configured as the documentation recommends. If you are new to sticky sessions and why they matter for stateful workloads, the &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/load-balancer-stickiness/welcome.html" rel="noopener noreferrer"&gt;AWS Prescriptive Guidance on load balancer stickiness&lt;/a&gt; is a good starting point.&lt;/p&gt;

&lt;p&gt;I started the test client and let it run. Mid-session, I triggered a forced rolling deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecs update-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; YOUR_CLUSTER &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service&lt;/span&gt; YOUR_MCP_SERVICE &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--force-new-deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what the logs showed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding 1: The cookie rotation red herring
&lt;/h3&gt;

&lt;p&gt;The first thing I noticed in the logs was the &lt;code&gt;AWSALB&lt;/code&gt; cookie changing on every single response — from call 1, before any deployment was triggered.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"call_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"set_session_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"507bf31b..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"session_cookie_changed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cookie_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWSALB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"old_cookie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OWXI55yd..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"new_cookie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"XItyHvSg..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"call_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"set_session_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"507bf31b..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"session_cookie_changed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cookie_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWSALB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"old_cookie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"XItyHvSg..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"new_cookie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"md8WEI7W..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cookie changing every call. Naturally the first instinct is — stickiness is broken. Requests are bouncing between tasks.&lt;/p&gt;

&lt;p&gt;But look at the task ID. &lt;code&gt;507bf31b&lt;/code&gt; — same on every single successful call across all 39 calls before failure. The ALB was routing to the same task the entire time despite the cookie changing.&lt;/p&gt;

&lt;p&gt;What is actually happening: the ALB re-encrypts the sticky cookie token on every response even when routing to the same target. The cookie value rotates but the target it encodes stays the same. This is normal ALB behaviour — it is not routing instability.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineering judgment:&lt;/strong&gt; If you see cookie rotation in your logs and start debugging stickiness, you will spend days on the wrong problem. The cookie value is irrelevant. The target it encodes is what matters. Verify using task ID in your responses, not by watching the cookie.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This also has an important implication: any MCP client that captures the sticky cookie once at session initialisation and reuses it without updating — which is the natural implementation — will break stickiness the moment it sends a stale cookie value. The ALB will treat it as a new session and route via round-robin. With 2 tasks running that means a 50% chance of landing on the wrong task on every call.&lt;/p&gt;

&lt;p&gt;My test client used &lt;code&gt;httpx.Client&lt;/code&gt; with a persistent cookie jar that automatically updates on every response. That is what kept the session alive across 39 calls. The aws-samples repo mentions patching for cookie handling — but does not explain why updating on every response is critical, not just at session init.&lt;/p&gt;
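&lt;p&gt;To make that concrete, here is a stripped-down sketch of the cookie behaviour (placeholder URL, simplified request body, and none of the MCP session handshake): &lt;code&gt;httpx.Client&lt;/code&gt; keeps one cookie jar for its whole lifetime, so the rotated &lt;code&gt;AWSALB&lt;/code&gt; value from each response is stored and sent automatically on the next call.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import httpx

MCP_URL = "https://YOUR_ALB_DNS/mcp"  # placeholder, not the real deployment

# One client = one shared cookie jar. Every Set-Cookie from the ALB replaces the
# stored AWSALB value, so the next request always carries the freshest token.
# Capturing the cookie once and replaying it by hand is what breaks stickiness.
with httpx.Client(timeout=10.0) as client:
    for call_number in range(1, 6):
        resp = client.post(MCP_URL, json={"probe": call_number})  # simplified body
        print(call_number, resp.status_code, resp.cookies.get("AWSALB"))
&lt;/code&gt;&lt;/pre&gt;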

&lt;h3&gt;
  
  
  Finding 2: The atomic failure
&lt;/h3&gt;

&lt;p&gt;This is the central finding.&lt;/p&gt;

&lt;p&gt;At &lt;code&gt;15:20:12 UTC&lt;/code&gt;, call number 39's &lt;code&gt;set_session_value&lt;/code&gt; succeeded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-26T15:20:12.067393+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"call_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"set_session_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"507bf31beb2f41abae593f5cfd023b5e"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"seq_1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"seq_39"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_39"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five seconds later, call number 39's &lt;code&gt;get_session_state&lt;/code&gt; failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-26T15:20:17.134593+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"call_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_session_state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-32600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Session not found"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same call number. Same MCP session ID. Same logical operation — write then read. The write succeeded on task &lt;code&gt;507bf31b&lt;/code&gt;. The read landed on the new task. The new task had no knowledge of that session. 404.&lt;/p&gt;

&lt;p&gt;The gap was 5 seconds. In those 5 seconds the deregistration delay expired, the old task was removed from the ALB target group, and the next request was routed to the replacement task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not an eventual consistency problem. This is an atomic operation split across a task boundary.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An AI agent that writes state and immediately reads it back to confirm — which is the natural pattern for any tool that modifies and verifies — cannot do so safely across a deployment boundary. The write may have landed on a task that no longer exists by the time the read arrives. The agent cannot tell the difference between "session not found because I have a bug" and "session not found because my task was replaced 5 seconds ago." It cannot retry safely. It cannot roll back. The state is in an unknown condition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding 3: Your monitoring will show nothing
&lt;/h3&gt;

&lt;p&gt;This is what makes this failure mode operationally dangerous.&lt;/p&gt;

&lt;p&gt;During the entire failure sequence — calls 39 through 50, all returning 404 — here is what your monitoring shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;ALB: healthy targets, no 5xx errors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ECS service: desired count met, tasks running&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CloudWatch alarms: nothing triggered&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ECS service events: deployment completed successfully&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The failure is &lt;code&gt;code: -32600 "Session not found"&lt;/code&gt; — a JSON-RPC application error, not an HTTP error. Your ALB access logs show 404 responses but 404 is not typically alarmed in most setups. And even if it is, the error message is indistinguishable from a bug in your tool implementation.&lt;/p&gt;

&lt;p&gt;Your on-call engineer will look at the infrastructure dashboard and see green. Your application engineer will look at the error and check their code. Both will find nothing wrong. The failure lives in the gap between the deployment event and the application layer — and nothing connects them automatically.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineering judgment:&lt;/strong&gt; If you are running stateful MCP on Fargate you need an application-level alarm specifically on &lt;code&gt;-32600&lt;/code&gt; errors correlated with deployment events. Infrastructure health checks will not catch this.&lt;/p&gt;
&lt;/blockquote&gt;
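&lt;p&gt;One way to wire that up, sketched with boto3 and assuming your tasks log the JSON-RPC error body to a CloudWatch log group (the log group, namespace, and metric names below are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

LOG_GROUP = "/ecs/mcp-server"  # placeholder: wherever your task logs land

# Count occurrences of the JSON-RPC "Session not found" code in application logs.
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="mcp-session-not-found",
    filterPattern='"-32600"',
    metricTransformations=[{
        "metricName": "McpSessionNotFound",
        "metricNamespace": "Custom/MCP",
        "metricValue": "1",
        "defaultValue": 0,
    }],
)

# Alarm on any occurrence; correlate alarm timestamps with ECS deployment events.
cloudwatch.put_metric_alarm(
    AlarmName="mcp-session-loss",
    Namespace="Custom/MCP",
    MetricName="McpSessionNotFound",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
&lt;/code&gt;&lt;/pre&gt;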

&lt;p&gt;One more safety net that will not help here: the &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-circuit-breaker.html" rel="noopener noreferrer"&gt;ECS deployment circuit breaker&lt;/a&gt;. The circuit breaker triggers on tasks that fail to reach RUNNING state or fail health checks. In this failure mode your new task is RUNNING, your health check passes, and ECS considers the deployment successful. The circuit breaker has no visibility into whether active MCP sessions were lost during the transition. The failure passes every gate AWS provides automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding 4: The deregistration delay is your session cliff timer
&lt;/h3&gt;

&lt;p&gt;AWS documents the deregistration delay as a connection draining setting. For stateful MCP on Fargate it is actually your session survival window — the countdown timer between when a deployment starts and when your session dies.&lt;/p&gt;

&lt;p&gt;Across my runs with different configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Tasks&lt;/th&gt;
&lt;th&gt;Deregistration delay&lt;/th&gt;
&lt;th&gt;Session survived until&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;300s (default)&lt;/td&gt;
&lt;td&gt;Call 47 — ~61s after trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Changed&lt;/td&gt;
&lt;td&gt;Call 48 — ~4 min after trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;300s&lt;/td&gt;
&lt;td&gt;Call 39 — ~3.5 min after trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;300s&lt;/td&gt;
&lt;td&gt;Call 39 — atomic failure at 5s gap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The deregistration delay controlled the survival window in every run. Not the stickiness duration (86400 seconds — that number is fiction during deployments). Not the task count. The deregistration delay alone.&lt;/p&gt;

&lt;p&gt;But here is the honest conclusion: no value of deregistration delay removes the failure. It only changes when the cliff arrives. A 30 second delay means your session cliff is 30 seconds after deployment. A 900 second delay means your session survives longer but your old tasks linger for 15 minutes, slowing rollbacks and increasing cost. You are not solving the problem — you are choosing when to accept the loss.&lt;/p&gt;

&lt;p&gt;One more thing worth noting here: Fargate's default &lt;code&gt;stopTimeout&lt;/code&gt; is 30 seconds (&lt;a href="https://aws.amazon.com/blogs/compute/deep-dive-into-fargate-spot-to-run-your-ecs-tasks-for-up-to-70-less/" rel="noopener noreferrer"&gt;AWS reference&lt;/a&gt;). If you do not set a SIGTERM handler and raise this value, ECS will SIGKILL the container within 30 seconds of sending SIGTERM — regardless of your deregistration delay. So even if you set a 300 second deregistration delay, an unhandled SIGTERM means your session gets a hard kill within 30 seconds. The deregistration delay and stopTimeout work together — both need to be tuned, not just one.&lt;/p&gt;
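&lt;p&gt;For orientation, here is roughly where those two knobs live in CDK. This is a sketch with placeholder names, ports, and a public base image, not the exact stack from the repo:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from aws_cdk import App, Stack, Duration
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_elasticloadbalancingv2 as elbv2

app = App()
stack = Stack(app, "McpDeploySketch")

vpc = ec2.Vpc(stack, "Vpc", max_azs=2)
cluster = ecs.Cluster(stack, "Cluster", vpc=vpc)

task_definition = ecs.FargateTaskDefinition(stack, "TaskDef", cpu=256, memory_limit_mib=512)
task_definition.add_container(
    "mcp",
    image=ecs.ContainerImage.from_registry("public.ecr.aws/docker/library/python:3.12-slim"),
    port_mappings=[ecs.PortMapping(container_port=8000)],
    stop_timeout=Duration.seconds(120),  # how long ECS waits after SIGTERM before SIGKILL
    logging=ecs.LogDrivers.aws_logs(stream_prefix="mcp"),
)

service = ecs.FargateService(
    stack, "Service", cluster=cluster, task_definition=task_definition, desired_count=2
)

alb = elbv2.ApplicationLoadBalancer(stack, "Alb", vpc=vpc, internet_facing=True)
listener = alb.add_listener("Http", port=80)
listener.add_targets(
    "McpTargets",
    port=8000,
    protocol=elbv2.ApplicationProtocol.HTTP,
    targets=[service],
    stickiness_cookie_duration=Duration.days(1),  # ALB lb_cookie stickiness
    deregistration_delay=Duration.seconds(300),   # the session cliff timer discussed above
    health_check=elbv2.HealthCheck(path="/health"),
)

app.synth()
&lt;/code&gt;&lt;/pre&gt;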

&lt;p&gt;A minimal SIGTERM handler in FastMCP looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_sigterm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SIGTERM received — draining active sessions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# stay alive within stopTimeout window before exit
&lt;/span&gt;    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGTERM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_sigterm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sleep value must be less than your &lt;code&gt;stopTimeout&lt;/code&gt; setting. If &lt;code&gt;stopTimeout&lt;/code&gt; is 30 seconds (default) and you sleep 25, the handler completes cleanly. If you forget to raise &lt;code&gt;stopTimeout&lt;/code&gt; above 30 seconds and sleep longer, SIGKILL fires before the handler finishes.&lt;/p&gt;

&lt;p&gt;One related consideration worth flagging: if your health check endpoint and MCP handler run in separate processes or on different ports, a new task can pass the ALB health check before the MCP handler is fully initialised — ECS has no native readiness probe separation the way Kubernetes does. In my implementation both run in the same uvicorn process on port 8000, so if the health check passes the MCP handler is already up. But if your setup is different, design for this explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means architecturally
&lt;/h2&gt;

&lt;p&gt;You have three honest options. I will be clear about which ones I have tested and which are architectural paths for a follow-up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — Design for the failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Make your MCP tools idempotent. If a write-then-read pair fails, the client can retry the full operation safely without risk of duplicate side effects. This works for tools that are naturally idempotent — read-heavy tools, query tools, lookup tools. It fails for tools that modify external state once — sending a message, creating a record, triggering a payment. If your agent workflow has side effects, idempotency alone is not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B — Externalise session state&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Move session storage to ElastiCache Redis or DynamoDB. The session is no longer tied to a specific task — any task can serve any session. Rolling deployments become safe because the new task can find the session in the external store. This eliminates the failure mode entirely.&lt;/p&gt;

&lt;p&gt;The cost: the MCP SDK does not support external session persistence natively. You need to patch the session layer. Every tool call now has an external store read/write on the hot path — latency increases. Operational complexity increases. This is the right answer for multi-turn agent workflows that genuinely cannot tolerate session loss. I have not built this yet — it is the subject of a follow-up experiment.&lt;/p&gt;
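&lt;p&gt;As a rough sketch of the storage side only (the session-layer patch in the MCP SDK is the harder part and out of scope here), with a hypothetical DynamoDB table keyed on the MCP session id:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

# Hypothetical table "mcp-session-state" with partition key "session_id".
# Any task can load or persist a session, so a rolling deployment no longer
# destroys state that lived only in one container's memory.
table = boto3.resource("dynamodb").Table("mcp-session-state")

def load_state(session_id):
    item = table.get_item(Key={"session_id": session_id}, ConsistentRead=True).get("Item")
    return item.get("state", {}) if item else {}

def save_state(session_id, state):
    table.put_item(Item={"session_id": session_id, "state": state})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A TTL attribute on that table is the natural way to expire abandoned sessions, and the extra read/write per tool call is the latency cost mentioned above.&lt;/p&gt;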

&lt;p&gt;&lt;strong&gt;Option C — Go stateless, let the platform handle sessions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is what Bedrock AgentCore chose. Stateless MCP server, session isolation managed at the platform layer. The application never owns session state — the infrastructure does. Zero risk of the failure mode I described above.&lt;/p&gt;

&lt;p&gt;The cost: you give up control over the session model. You take on the constraints of the managed service. If you have compliance requirements around data residency or need session behaviour the platform does not support, this path is not available to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  So is Fargate a good fit for stateful MCP?
&lt;/h2&gt;

&lt;p&gt;It depends — but not in the vague way that phrase usually means. Here is a more specific answer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fargate is a good fit if&lt;/strong&gt; your MCP tools are idempotent and session loss during deployments is acceptable or recoverable in your workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fargate with externalised session state is a good fit if&lt;/strong&gt; you need stateful multi-turn sessions, have compliance or control requirements that rule out managed services, and are willing to own the additional complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fargate with in-memory stateful sessions and the default configuration is not production-ready&lt;/strong&gt; for agent workloads that cannot tolerate session loss. The AWS sample pattern works. Until you deploy. And in production, you deploy all the time.&lt;/p&gt;

&lt;p&gt;If you are building something lighter — a few tools, mostly stateless, occasional multi-turn — Fargate is capable and operationally straightforward. If you are building something larger — long-running agent sessions, complex state, frequent deployments — you need to solve the session persistence problem before you go to production.&lt;/p&gt;

&lt;p&gt;That is the answer I was looking for when I started this. Now I have it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;p&gt;The experiment is not finished. The next step is to actually build Option B — externalise session state to Redis, run the same deployment experiment, and show whether the atomic failure disappears. That blog will have the same structure: real logs, real task IDs, real failure or real fix.&lt;/p&gt;

&lt;p&gt;If you are trying to make this decision for a real workload and want to talk through it, find me on X or LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All experiment code is available on&lt;/em&gt; &lt;a href="https://github.com/AvinashDalvi89/stateful-mcp-on-ecs-fargate-example" rel="noopener noreferrer"&gt;Stateful MCP Server on ECS Fargate - GitHub&lt;/a&gt;&lt;em&gt;. The test client, CDK infrastructure, and FastMCP server with task ID instrumentation are all in that repository.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ecs</category>
      <category>aws</category>
      <category>fargate</category>
    </item>
    <item>
      <title>The Council has Decided</title>
      <dc:creator>mgbec</dc:creator>
      <pubDate>Sat, 02 May 2026 23:14:11 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/the-council-has-decided-11jh</link>
      <guid>https://vibe.forem.com/aws-builders/the-council-has-decided-11jh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp0anzjdy4xsrr0o0w3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp0anzjdy4xsrr0o0w3c.png" width="596" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of the most interesting developments in recent generative AI work are the different ways we can ask models and agents to work together to solve our tasks. We have orchestration, choreography, and every permutation we can think of.&lt;/p&gt;

&lt;p&gt;One of the concepts that many of us have experimented with is the LLM Council pattern from Andrej Karpathy at &lt;a href="https://github.com/karpathy/llm-council" rel="noopener noreferrer"&gt;https://github.com/karpathy/llm-council&lt;/a&gt;. This project sets up three configurable models and asks each of them the user’s question. The answers from each model go through peer review and ranking. Finally, the chairman of the LLM Council compiles the responses into a final judgement.&lt;/p&gt;

&lt;p&gt;Why would we choose this framework? Each model has its own combination of strengths and weaknesses. We can produce more accurate, more diverse, and more complete answers by combining the best of each.&lt;/p&gt;

&lt;p&gt;I built a variant of this using AWS AgentCore &lt;a href="https://github.com/mgbec/Council-agents" rel="noopener noreferrer"&gt;https://github.com/mgbec/Council-agents&lt;/a&gt;. I substituted a few of Andrej Karpathy’s components with AgentCore elements:&lt;/p&gt;

&lt;p&gt;Instead of OpenRouter + FastAPI + JSON files, this version uses:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Bedrock for multi-model access (Claude, Llama, Mistral, etc.)
&lt;/li&gt;
&lt;li&gt;AgentCore Runtime for serverless hosting with session management
&lt;/li&gt;
&lt;li&gt;AgentCore Memory for conversation persistence across sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I did substitute some of the models with different versions (easy to change in &lt;code&gt;config.py&lt;/code&gt;):&lt;br&gt;&lt;br&gt;
 COUNCIL_MODELS =&lt;br&gt;&lt;br&gt;
&lt;code&gt;us.anthropic.claude-sonnet-4-20250514-v1:0&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;us.meta.llama4-maverick-17b-instruct-v1:0&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;mistral.mistral-large-2411-v1:0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;CHAIRMAN_MODEL = &lt;code&gt;us.anthropic.claude-sonnet-4-20250514-v1:0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The basic functions are still the same, however:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ask question and receive individual responses:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn6ceqe0eect2yure0ix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn6ceqe0eect2yure0ix.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Peer ranking:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap0ounicv5e2dk2w4hmj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap0ounicv5e2dk2w4hmj.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd5kw9opoy5akjov5mzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd5kw9opoy5akjov5mzf.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Then a final Council decision is made:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfx6d41ixwjrbf1lvgf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfx6d41ixwjrbf1lvgf1.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is all hosted on AWS with a React frontend. The workflow keeps credentials server-side, authenticates users via Cognito, and serves the React app from CloudFront.&lt;/p&gt;

&lt;p&gt;Some of the learning opportunities I had:&lt;/p&gt;

&lt;p&gt;* API Gateway REST APIs have a hard 29-second timeout, but the council takes 30–90 seconds. To work around this, the system uses an async pattern: the frontend submits a request (instant response with a request ID), then polls for the result every 5 seconds. The heavy work runs in a separate SQS-triggered Lambda that is not bound by the 29-second limit.&lt;/p&gt;
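&lt;p&gt;A rough sketch of the submit side of that pattern (table, queue, and field names here are placeholders, not the exact project code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import os
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

def handler(event, context):
    """API Gateway-backed Lambda: record the request, queue the heavy work,
    and return a request ID immediately so the frontend can start polling."""
    request_id = str(uuid.uuid4())
    body = json.loads(event.get("body") or "{}")

    dynamodb.Table(os.environ["REQUESTS_TABLE"]).put_item(
        Item={"request_id": request_id, "status": "PENDING", "question": body.get("question", "")}
    )
    sqs.send_message(
        QueueUrl=os.environ["COUNCIL_QUEUE_URL"],
        MessageBody=json.dumps({"request_id": request_id, "question": body.get("question", "")}),
    )

    # The SQS-triggered worker runs the council (30-90 seconds) and writes the
    # result back to the table; a separate GET endpoint serves it once status is DONE.
    return {"statusCode": 202, "body": json.dumps({"requestId": request_id})}
&lt;/code&gt;&lt;/pre&gt;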

&lt;p&gt;* I originally tried a Lambda Function URL to work around the API Gateway timeout. It would have worked, but the way I had it implemented was not very secure. First, the Lambda function was set up as public, which was not safe at all. My second attempt was having the Lambda itself validate the Cognito JWT on every request. Validation checked token structure, expiration, issuer, app client ID, and that the key ID (kid) exists in the Cognito JSON Web Key Set. It did not do RSA signature verification, however, so I scrapped that plan for an async pattern with API Gateway, Lambdas, DynamoDB, and SQS. The full architecture is here: &lt;a href="https://github.com/mgbec/Council-agents/blob/master/architecture.md" rel="noopener noreferrer"&gt;https://github.com/mgbec/Council-agents/blob/master/architecture.md&lt;/a&gt;, but here is a quick synopsis of the part in question:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd1e63xmgu9s0ct96ql3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd1e63xmgu9s0ct96ql3.png" width="739" height="696"&gt;&lt;/a&gt;&lt;/p&gt;
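&lt;p&gt;For reference, the RSA signature check is the piece my Function URL attempt skipped. With PyJWT it looks roughly like this (the region, user pool ID, and app client ID below are placeholders; this validates a Cognito ID token, whose &lt;code&gt;aud&lt;/code&gt; claim is the app client ID):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import jwt  # PyJWT
from jwt import PyJWKClient

REGION = "us-east-1"                     # placeholder
USER_POOL_ID = "us-east-1_EXAMPLE"       # placeholder
APP_CLIENT_ID = "example-app-client-id"  # placeholder

ISSUER = f"https://cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}"
jwks_client = PyJWKClient(f"{ISSUER}/.well-known/jwks.json")


def verify_id_token(token):
    """Verify a Cognito ID token including its RS256 signature, not just its claims."""
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=APP_CLIENT_ID,
        issuer=ISSUER,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;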

&lt;p&gt;* For the AgentCore deployment, we can either use CodeZip and upload it to S3, or build a Docker image and push it to ECR. In the past I have used the Docker/ECR method, but Kiro told me that the best option for this project is the CodeZip method. “For this project, CodeZip (S3) is the right choice — it’s pure Python with pip-installable dependencies, nothing exotic in the runtime. Container mode is more useful when you need system-level packages, custom binaries, or a specific OS setup.”&lt;/p&gt;

&lt;p&gt;* The Lambda is used as a thin proxy that calls InvokeAgentRuntime, keeping the AgentCore ARN and AWS credentials server-side, never exposed to the browser. The Lambda then uses the Cognito sub claim to namespace AgentCore sessions, so each user’s memory stays isolated.&lt;/p&gt;
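&lt;p&gt;A hedged sketch of that proxy is below; the environment variable, the request body fields, and the InvokeAgentRuntime parameter names are my assumptions from the boto3 &lt;code&gt;bedrock-agentcore&lt;/code&gt; client and may need adjusting:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os

import boto3

agentcore = boto3.client("bedrock-agentcore")


def proxy_handler(event, context):
    """Thin proxy: the browser never sees the AgentCore ARN or AWS credentials."""
    claims = event["requestContext"]["authorizer"]["claims"]
    user_sub = claims["sub"]
    body = json.loads(event["body"])

    # Namespace the AgentCore session by the Cognito sub so each user's
    # conversation memory stays isolated from everyone else's.
    session_id = f"user-{user_sub}-session-{body['conversationId']}"

    response = agentcore.invoke_agent_runtime(
        agentRuntimeArn=os.environ["AGENT_RUNTIME_ARN"],
        runtimeSessionId=session_id,
        payload=json.dumps({"prompt": body["prompt"]}),
    )
    # The response payload is typically a streaming body; adjust the read for your case.
    return {"statusCode": 200, "body": response["response"].read().decode("utf-8")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;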

&lt;p&gt;* I really enjoy AgentCore Observability in all of its detail. For this project I didn’t see any sessions captured in the Observability Dashboard. I saw plenty of traces but no sessions at all. I asked Kiro about that, and the answer was “The issue is that our agent code (main.py) uses raw boto3 calls via bedrock_client.py rather than the Strands Agent framework. When you use a Strands Agent() with the BedrockAgentCoreApp, the framework automatically propagates session context into the OTEL spans. Our code bypasses that — it just calls boto3.client(“bedrock-runtime”).converse() directly, so the traces show the Bedrock calls but don’t associate them with the AgentCore session.”&lt;/p&gt;

&lt;p&gt;Kiro suggested two possible fixes to see the sessions in AgentCore Observability. The agent code would need to use one of these options:&lt;br&gt;&lt;br&gt;
 - Use a Strands Agent with session management (the framework handles OTEL context automatically)&lt;br&gt;&lt;br&gt;
 - Manually inject the session ID into the OTEL span attributes&lt;/p&gt;
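&lt;p&gt;For context, the second option amounts to something like the snippet below with the OpenTelemetry SDK. The open question (and the part I never got working) is which attribute or baggage key the Observability dashboard actually groups sessions on:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("council-agent")


def converse_with_session(session_id, prompt):
    # Attach the AgentCore session ID both as OTEL baggage and as a span attribute.
    ctx = baggage.set_baggage("session.id", session_id)
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("council.converse") as span:
            span.set_attribute("session.id", session_id)
            # ... the raw boto3 bedrock-runtime converse() call goes here ...
            pass
    finally:
        context.detach(token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;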

&lt;p&gt;I did attempt to refactor to use the Strands Agent session management, but this created a metastasizing string of errors. I also tried manually injecting the session ID into the span attributes, which did not work. Finally, I tried something I saw in this document about OTEL baggage:&lt;br&gt;&lt;br&gt;
 &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;
 I had no luck with that either, so sessions in the Observability Dashboard are a problem for another day.&lt;/p&gt;

&lt;p&gt;* Kiro was great at fixing the Dependabot vulnerabilities when asked to:&lt;br&gt;&lt;br&gt;
 “All 9 vulnerabilities fixed — npm audit fix updated 13 packages and now shows 0 vulnerabilities. Let me verify the build still works, then commit:”&lt;/p&gt;

&lt;p&gt;This was a fun way to implement Andrej Karpathy’s LLM Council idea. The next steps for me might be fixing the session observability, speeding up the responses, or trying a cheaper model. I asked my council to recommend a cost-effective model for the chairman role, and the response was actually quite snappy. It recommended Claude 3 Haiku for the reasons shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzwvlxvz2kdm9ecx6xx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzwvlxvz2kdm9ecx6xx4.png" width="634" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftharpme6r2d3lnnbvqva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftharpme6r2d3lnnbvqva.png" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m looking forward to all the creativity, new arrangements and workflows we will see in the future. Thanks for reading!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>awscognito</category>
      <category>agents</category>
      <category>bedrockagentcore</category>
    </item>
    <item>
      <title>AGENTS.md, SKILL.md, DESIGN.md: How AI Instructions Split into Three Layers</title>
      <dc:creator>Kento IKEDA</dc:creator>
      <pubDate>Sat, 02 May 2026 21:35:11 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/agentsmd-skillmd-designmd-how-ai-instructions-split-into-three-layers-d0g</link>
      <guid>https://vibe.forem.com/aws-builders/agentsmd-skillmd-designmd-how-ai-instructions-split-into-three-layers-d0g</guid>
      <description>&lt;p&gt;In April 2026, Google Labs released a spec called &lt;code&gt;DESIGN.md&lt;/code&gt;. It's a design system specification readable by AI agents, packaged with a CLI validator: &lt;code&gt;npx @google/design.md lint&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;DESIGN.md&lt;/code&gt; in the picture, we now have three different file types for instructing AI agents. &lt;code&gt;AGENTS.md&lt;/code&gt; has been spreading as an industry standard since 2025 (jointly developed by OpenAI, Google, Sourcegraph, Cursor, and Factory; donated to the Linux Foundation in December 2025). &lt;code&gt;SKILL.md&lt;/code&gt; sits at the core of Anthropic's Claude Skills. And now &lt;code&gt;DESIGN.md&lt;/code&gt;. The three handle different concerns and don't overlap.&lt;/p&gt;

&lt;p&gt;This article is for developers using coding agents like Claude Code, Cursor, or Codex in their work, and for tech leads operating natural-language instruction files like CLAUDE.md and style guides. If your team is doing Spec-Driven Development (SDD), this should be relevant to you as well.&lt;/p&gt;

&lt;p&gt;What I want to lay out is two things: how AI instructions are starting to split across three layers — behavior, individual tasks, and visual appearance — and how that connects with SDD as a parallel movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Pattern: Natural-Language Documents
&lt;/h2&gt;

&lt;p&gt;A few years into the ChatGPT era, most engineers have written some form of "rules I want the AI to follow" in a Markdown file. CLAUDE.md, styleguide.md, CONTRIBUTING.md, internal coding conventions. The locations vary, but the format is roughly the same: unstructured natural language.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;writing-style-guide.md&lt;/code&gt; file I've been building over the past few months is a typical example. It's a style guide I use when writing technical articles with Claude — a list of patterns common in AI-generated text, written down as forbidden phrases. By making Claude Desktop read it every session, the tone of my output stays consistent. It's part of a personal repository (&lt;code&gt;ikenyal-ai-agents&lt;/code&gt;) I use as the harness for my business automation agents — the one I covered in my previous post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/harness-engineering-with-nothing-but-markdown-g6b"&gt;https://dev.to/aws-builders/harness-engineering-with-nothing-but-markdown-g6b&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The file contains roughly 150 lines: rules like "don't use em dashes," "avoid invitations like 'let's try…!'," "drop AI-style preambles like 'what's interesting is…'." The same repository has 15 instruction files under &lt;code&gt;agents/&lt;/code&gt;, organized by team and role: &lt;code&gt;executive-assistant.md&lt;/code&gt;, &lt;code&gt;sre-support.md&lt;/code&gt;, &lt;code&gt;qa-support.md&lt;/code&gt;, &lt;code&gt;accounting.md&lt;/code&gt;. Each describes "the assumptions to operate under as this role" in plain natural language.&lt;/p&gt;

&lt;p&gt;This approach has clear benefits. You can articulate tone, stance, and implicit rules. New team members can read the files and pick up the expectations. With CLAUDE.md, Claude Code reads it every session, so persona-level instructions land consistently.&lt;/p&gt;

&lt;p&gt;There are limits, too. First, validation falls on humans. Whether a rule was followed or not gets decided by a human reading the output. Second, individual judgment leaks in. "Write politely" means different things to different reviewers.&lt;/p&gt;

&lt;p&gt;The third limit is the actual subject of this article. Rules that are formally verifiable (forbidden phrases, em-dash usage, specific pattern matches) and rules that require judgment (tone, structural choices, how to open with empathy) sit in the same file. So even the verifiable parts end up depending on human review. That's the problem the three new file types are addressing.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Type 1: How DESIGN.md (Google Labs) Specifies Visual Appearance
&lt;/h2&gt;

&lt;p&gt;On April 10, 2026, Google Labs published the &lt;code&gt;DESIGN.md&lt;/code&gt; specification at &lt;code&gt;google-labs-code/design.md&lt;/code&gt;. As of early May, the repo has over 11,000 stars. It's the reference implementation for Google Stitch (&lt;code&gt;stitch.withgoogle.com&lt;/code&gt;), an AI-driven UI generation product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/google-labs-code/design.md" rel="noopener noreferrer"&gt;https://github.com/google-labs-code/design.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The specification doc lives on the Stitch side.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stitch.withgoogle.com/docs/design-md/specification" rel="noopener noreferrer"&gt;https://stitch.withgoogle.com/docs/design-md/specification&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What &lt;code&gt;DESIGN.md&lt;/code&gt; covers is the design system specification. You write machine-readable design tokens in YAML at the top of the file (colors, typography, spacing, components), and human-readable design intent in the Markdown body underneath. Both live in the same file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Heritage&lt;/span&gt;
&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#1A1C1E"&lt;/span&gt;
  &lt;span class="na"&gt;tertiary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#B8422E"&lt;/span&gt;
&lt;span class="na"&gt;typography&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;h1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;fontFamily&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Public Sans&lt;/span&gt;
    &lt;span class="na"&gt;fontSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3rem&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## Overview&lt;/span&gt;

Architectural Minimalism meets Journalistic Gravitas.

&lt;span class="gu"&gt;## Colors&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Primary (#1A1C1E): Deep ink for headlines and core text.
&lt;span class="p"&gt;-&lt;/span&gt; Tertiary (#B8422E): "Boston Clay", the sole driver for interaction.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The headline feature of this format is the CLI validator that ships with it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @google/design.md lint DESIGN.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This checks token reference integrity, WCAG contrast ratios, and structural rule compliance, returning the result as JSON. Wire it into CI and you can verify design system consistency on every pull request. There's also a &lt;code&gt;diff&lt;/code&gt; command that compares two &lt;code&gt;DESIGN.md&lt;/code&gt; files and returns token-level changes in a structured form. Design system version control — historically a manual process — gains a verifiable layer.&lt;/p&gt;
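&lt;p&gt;A CI hook for this can stay very small. Here is a hedged sketch in Python that just shells out to the validator and fails the build on a non-zero exit code; I have not relied on any particular shape for the JSON it prints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import sys


def lint_design_md(path="DESIGN.md"):
    """Run the DESIGN.md validator and fail the build on violations."""
    result = subprocess.run(
        ["npx", "@google/design.md", "lint", path],
        capture_output=True,
        text=True,
    )
    # Surface whatever the validator reported (JSON on success or failure).
    print(result.stdout or result.stderr)
    return result.returncode


if __name__ == "__main__":
    sys.exit(lint_design_md())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;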

&lt;p&gt;For Japanese UIs, the Google Labs spec alone falls short. It doesn't define the typography requirements specific to Japanese (CJK font fallback chains, line height, letter-spacing, kinsoku shori line-breaking rules, mixed Japanese and Western typesetting). The gap is filled by &lt;code&gt;kzhrknt/awesome-design-md-jp&lt;/code&gt;, which publishes Japan-localized &lt;code&gt;DESIGN.md&lt;/code&gt; files for over 10 services including Apple Japan, SmartHR, freee, note, MUJI, Mercari, LINE, and Toyota. For Japanese products, using both the Google Labs spec and the Japan edition together is the practical approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kzhrknt/awesome-design-md-jp" rel="noopener noreferrer"&gt;https://github.com/kzhrknt/awesome-design-md-jp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What &lt;code&gt;DESIGN.md&lt;/code&gt; carries is the design system that used to be scattered across Figma files and style guide PDFs, now consolidated into a single file with both machine-readable and human-readable parts. Think of it as the spec foundation that lets AI agents generate UIs with a consistent look every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Type 2: How SKILL.md (Anthropic) and AGENTS.md Specify Behavior
&lt;/h2&gt;

&lt;p&gt;While &lt;code&gt;DESIGN.md&lt;/code&gt; covers "appearance," &lt;code&gt;SKILL.md&lt;/code&gt; and &lt;code&gt;AGENTS.md&lt;/code&gt; cover "behavior" — defining what the agent is trying to do, how it should proceed, and what it must not do.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SKILL.md&lt;/code&gt; is the file format standardized by agentskills.io as part of the Agent Skills open standard. Anthropic's Claude Skills is one implementation of this standard; the same &lt;code&gt;SKILL.md&lt;/code&gt; works across Claude Code, Claude.ai, and the Agent SDK. Because it's standards-compliant, the same file is also readable by other agents like OpenClaw and Hermes. The structure: declare metadata (skill name, description, allowed tools) in the YAML at the top of the file, and write the task procedure or domain knowledge in the Markdown body below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://agentskills.io/home" rel="noopener noreferrer"&gt;https://agentskills.io/home&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A clear example of &lt;code&gt;SKILL.md&lt;/code&gt; is &lt;code&gt;conorbronsdon/avoid-ai-writing&lt;/code&gt;. It's an English-only skill that detects and rewrites AI patterns in English text — transition phrases like "Moreover," significance inflation like "watershed moment," and roundabout verb constructions like "serves as." It uses a 100+ word replacement table organized into 3 tiers (Tier 1 always replaces, Tier 2 flags when 2+ words appear in the same paragraph, Tier 3 flags only at high density), and audits 36 pattern categories. Two modes: &lt;code&gt;detect&lt;/code&gt; and &lt;code&gt;rewrite&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/conorbronsdon/avoid-ai-writing" rel="noopener noreferrer"&gt;https://github.com/conorbronsdon/avoid-ai-writing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What sets it apart from a one-shot prompt is the structured audit it returns. In &lt;code&gt;rewrite&lt;/code&gt; mode, you get four discrete sections: identified issues, the rewritten text, a summary of changes, and a second-pass audit. What changed and why becomes transparent.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; covers the agent's overall behavior. Project assumptions, roles, prohibitions, escalation rules. As I mentioned at the top, it started with the Amp team at Sourcegraph; today OpenAI, Google, Cursor, and Factory jointly drive it, and it was donated to the Linux Foundation in December 2025. Think of &lt;code&gt;CLAUDE.md&lt;/code&gt; as the Claude-specific version of &lt;code&gt;AGENTS.md&lt;/code&gt;. Claude Code reads &lt;code&gt;CLAUDE.md&lt;/code&gt; rather than &lt;code&gt;AGENTS.md&lt;/code&gt; in its spec, but the pattern recommended by &lt;code&gt;agents.md&lt;/code&gt; is to make &lt;code&gt;AGENTS.md&lt;/code&gt; the actual file and symlink &lt;code&gt;CLAUDE.md&lt;/code&gt; to it. In the personal repository I introduced earlier, the files under &lt;code&gt;agents/&lt;/code&gt; belong to this layer.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SKILL.md&lt;/code&gt; and &lt;code&gt;AGENTS.md&lt;/code&gt; cover different ranges. &lt;code&gt;AGENTS.md&lt;/code&gt; handles "overall context and boundaries." &lt;code&gt;SKILL.md&lt;/code&gt; handles "an executable unit for a specific task."&lt;/p&gt;

&lt;p&gt;The avoid-ai-writing English style auditor I mentioned is a specific task, so it ships as &lt;code&gt;SKILL.md&lt;/code&gt;. A file like &lt;code&gt;agents/genda/qa-support.md&lt;/code&gt;, which describes the assumptions and engagement style of a QA role, defines the agent's boundary — that goes on the &lt;code&gt;AGENTS.md&lt;/code&gt; side.&lt;/p&gt;

&lt;p&gt;The shared concern of these formats is "behavior and procedure," not visual appearance. What the agent knows, what it's tasked with, what it must avoid. That's a movement to fix these in a verifiable form.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Split
&lt;/h2&gt;

&lt;p&gt;Lining up the three file types, the layers each one handles become clear.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;What it carries&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Behavior&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AGENTS.md&lt;/code&gt; / &lt;code&gt;CLAUDE.md&lt;/code&gt; (natural language + rules)&lt;/td&gt;
&lt;td&gt;Overall context, roles, prohibitions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt;, role-specific files like &lt;code&gt;agents/genda/qa-support.md&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual task&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SKILL.md&lt;/code&gt; (YAML at top + Markdown body)&lt;/td&gt;
&lt;td&gt;Reusable tasks, procedures, domain knowledge&lt;/td&gt;
&lt;td&gt;avoid-ai-writing, in-house procedure skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Appearance&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DESIGN.md&lt;/code&gt; (YAML at top + Markdown body)&lt;/td&gt;
&lt;td&gt;Design system spec, verifiable visual rules&lt;/td&gt;
&lt;td&gt;The Google Labs reference, individual service files in &lt;code&gt;kzhrknt/awesome-design-md-jp&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The three are complementary, not competing. CLIs like &lt;code&gt;bergside/typeui&lt;/code&gt; are emerging as tools that can generate or update either &lt;code&gt;SKILL.md&lt;/code&gt; or &lt;code&gt;DESIGN.md&lt;/code&gt;, depending on what you choose — a sign of tooling that assumes the division of labor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/bergside/typeui" rel="noopener noreferrer"&gt;https://github.com/bergside/typeui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's actually different across the layers is "where to place the balance between machine-readable and human-readable." &lt;code&gt;AGENTS.md&lt;/code&gt; skews almost entirely human-readable; over-structuring it would block the contextual judgment and nuance it needs to convey. &lt;code&gt;SKILL.md&lt;/code&gt; is partially structured by the YAML at the top, but the body stays human-readable — task granularity has to be readable by humans before it can be instructed. &lt;code&gt;DESIGN.md&lt;/code&gt; puts machine-readable design tokens in the top YAML and human-readable design intent in the body, with the two cleanly separated.&lt;/p&gt;

&lt;p&gt;The center of gravity between "machine-readable" and "human-readable" sits in different places per layer. That's just the standard structuring principle — "manage things at different layers in different files" — applied to AI agents. The file names themselves spell out the division: &lt;code&gt;AGENTS.md&lt;/code&gt; ("instructions to the agent"), &lt;code&gt;SKILL.md&lt;/code&gt; ("a reusable skill"), &lt;code&gt;DESIGN.md&lt;/code&gt; ("the design system"). The names match what each one carries.&lt;/p&gt;

&lt;p&gt;Teams that have been packing all their "AI rules" into a single &lt;code&gt;CLAUDE.md&lt;/code&gt; now face a split decision. Open up your &lt;code&gt;CLAUDE.md&lt;/code&gt; and run these questions against it — splits start to surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is there a section writing design system rules? → If yes, that goes to &lt;code&gt;DESIGN.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Are specific task procedures in there (monthly aggregation, test review, contract review)? → If yes, those go to &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;What's left is overall agent context and boundaries (roles, prohibitions, escalation criteria) → that's the &lt;code&gt;AGENTS.md&lt;/code&gt; equivalent that stays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The three-layer split works as a framework for splitting your file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting with SDD
&lt;/h2&gt;

&lt;p&gt;Stepping back to look at the bigger picture: how does the three-layer split relate to the broader movement of "specs for AI"?&lt;/p&gt;

&lt;p&gt;SDD is a development style where you write the spec — requirements, design, tasks — before generating the implementation. The underlying idea: "specs aren't disposable scaffolding, they're executable artifacts that produce code." AWS's Kiro provides a workflow that generates &lt;code&gt;requirements.md&lt;/code&gt;, &lt;code&gt;design.md&lt;/code&gt;, and &lt;code&gt;tasks.md&lt;/code&gt; in order under &lt;code&gt;.kiro/specs/{feature}/&lt;/code&gt;. GitHub's Spec Kit (over 90,000 stars) supports the same flow with slash commands like &lt;code&gt;/specify&lt;/code&gt;, &lt;code&gt;/plan&lt;/code&gt;, &lt;code&gt;/tasks&lt;/code&gt;, &lt;code&gt;/implement&lt;/code&gt;. The EARS notation (Easy Approach to Requirements Syntax) used by Kiro reduces ambiguity by formatting requirements into 5 fixed templates. SDD has spread quickly between 2025 and 2026.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;https://kiro.dev/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;https://github.com/github/spec-kit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The three-layer split (&lt;code&gt;AGENTS.md&lt;/code&gt; / &lt;code&gt;SKILL.md&lt;/code&gt; / &lt;code&gt;DESIGN.md&lt;/code&gt;) and SDD look like separate movements on the surface. The SDD community concentrates on Kiro and spec-kit usage; the &lt;code&gt;DESIGN.md&lt;/code&gt; side concentrates on formal specs and validation tooling. You don't see many articles bridging the two.&lt;/p&gt;

&lt;p&gt;But put their philosophies side by side and the overlap is striking.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Shared philosophy&lt;/th&gt;
&lt;th&gt;SDD (Kiro etc.)&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;DESIGN.md&lt;/code&gt; / &lt;code&gt;SKILL.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Specify before implementing&lt;/td&gt;
&lt;td&gt;requirements → design → tasks → implementation&lt;/td&gt;
&lt;td&gt;behavior → implementation, appearance → implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Mix machine-readable + human-readable&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;requirements.md&lt;/code&gt; (EARS notation) + natural language&lt;/td&gt;
&lt;td&gt;YAML at top + Markdown body&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Persistent context for the AI&lt;/td&gt;
&lt;td&gt;reference &lt;code&gt;.kiro/specs/{feature}/&lt;/code&gt; every time&lt;/td&gt;
&lt;td&gt;reference &lt;code&gt;DESIGN.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt; every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Reduce ambiguity through structured syntax&lt;/td&gt;
&lt;td&gt;EARS notation structures requirements (5 templates)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;lint&lt;/code&gt; validates WCAG contrast ratios and structural rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Fix "decisions made" as a place&lt;/td&gt;
&lt;td&gt;spec files are where decisions live&lt;/td&gt;
&lt;td&gt;spec files are where decisions live&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both sit inside the larger "specs for AI" movement and share the same underlying philosophy.&lt;/p&gt;

&lt;p&gt;That said, they're not the same thing. The biggest difference, in one phrase: time horizon.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;SDD&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;DESIGN.md&lt;/code&gt; / &lt;code&gt;SKILL.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Time horizon&lt;/td&gt;
&lt;td&gt;Describes "what to build next"&lt;/td&gt;
&lt;td&gt;Describes "rules that already exist"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Single feature / project lifecycle&lt;/td&gt;
&lt;td&gt;Persistent rules and styles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Update rhythm&lt;/td&gt;
&lt;td&gt;New per feature → consume → archive&lt;/td&gt;
&lt;td&gt;Long-term maintenance, gradual growth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Subject&lt;/td&gt;
&lt;td&gt;Requirements, design, tasks (procedure for action)&lt;/td&gt;
&lt;td&gt;Rules for behavior, individual tasks, appearance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SDD specs describe "what we're going to build." &lt;code&gt;requirements.md&lt;/code&gt; is "what this feature needs to satisfy"; &lt;code&gt;design.md&lt;/code&gt; is "how to implement this feature"; &lt;code&gt;tasks.md&lt;/code&gt; is "how to break the feature into work." Once the feature ships, they finish their job and get archived.&lt;/p&gt;

&lt;p&gt;The three-layer specs describe "what should always hold." &lt;code&gt;DESIGN.md&lt;/code&gt; provides the color and typography rules every time you generate a UI; &lt;code&gt;AGENTS.md&lt;/code&gt; provides the agent's assumptions across every session. They get maintained long-term and grow incrementally.&lt;/p&gt;

&lt;p&gt;This time-horizon difference is why the two don't compete. Transient specs and persistent specs coexist in the same project. They can also reference each other. Imagine writing "use &lt;code&gt;{colors.tertiary}&lt;/code&gt; for the button" inside &lt;code&gt;.kiro/specs/checkout-feature/design.md&lt;/code&gt; — that lets a transient feature spec reference a color token from a persistent &lt;code&gt;DESIGN.md&lt;/code&gt;. The pattern isn't widely established yet, but the structure fits cleanly.&lt;/p&gt;

&lt;p&gt;One thing worth noting: as of May 2026, the active areas of SDD (the Kiro community and similar) and the active areas of &lt;code&gt;DESIGN.md&lt;/code&gt; / &lt;code&gt;SKILL.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt; haven't really crossed paths. The SDD side concentrates on "how to build a feature"; the three-layer side concentrates on "how to deliver the rules."&lt;/p&gt;

&lt;p&gt;You don't have to be doing SDD to start with the three-layer split — the split alone gets you to the door of "specs for AI." If your team is already on SDD, start referencing &lt;code&gt;DESIGN.md&lt;/code&gt; tokens from inside your feature specs and you avoid maintaining the same rules in two places. The two movements look set to converge in the next phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not Everything Becomes a Spec
&lt;/h2&gt;

&lt;p&gt;The discussion of the three-layer split tends to drift toward "shouldn't we just spec everything," but in practice, that doesn't happen.&lt;/p&gt;

&lt;p&gt;Rules that can't be formally verified stay as natural-language documents. Tone, structural choices, cultural nuance. Things like "how to open an article with empathy" or "how to give an ending the right amount of resonance" — judgment-based qualities. The cost of speccing them isn't the issue; the essence gets lost when you try.&lt;/p&gt;

&lt;p&gt;The judgment is straightforward: "is this formally verifiable?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Color contrast ratios (verifiable) → &lt;code&gt;DESIGN.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Word substitutions like "leverage → use" (verifiable) → &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tone (soft assertions, not textbook-sounding), overall stance (not teaching, just organizing) and similar (not verifiable) → stays in &lt;code&gt;AGENTS.md&lt;/code&gt; / &lt;code&gt;CLAUDE.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For small teams, "one natural-language file" is often enough. If &lt;code&gt;CLAUDE.md&lt;/code&gt; alone is keeping things running, there's no need to force a split. The trade-off between the cost of speccing and the load of operating it depends on team size and how long the operation has to last.&lt;/p&gt;

&lt;p&gt;The three-layer split is something you adopt incrementally, just like SDD — you don't need to spec everything at once. Start with the complex areas, the areas where verification helps most.&lt;/p&gt;

&lt;p&gt;In other words, the three-layer split isn't a goal. It's an option you adopt when the situation calls for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;A few options come into view from this overview.&lt;/p&gt;

&lt;p&gt;A reasonable first move is to open your &lt;code&gt;CLAUDE.md&lt;/code&gt; or style guide and sort it into "formally verifiable" and "judgment-based" sections. Color and typography rules, word substitution lists, structural rules. If a useful amount of verifiable content sits there, pick one to break out into either &lt;code&gt;DESIGN.md&lt;/code&gt; (appearance) or &lt;code&gt;SKILL.md&lt;/code&gt; (task). Don't try to split everything at once — start with the most independent piece.&lt;/p&gt;

&lt;p&gt;Pulling in external skills is another route. Drop a ready-made &lt;code&gt;SKILL.md&lt;/code&gt; like &lt;code&gt;avoid-ai-writing&lt;/code&gt; into &lt;code&gt;~/.claude/skills/&lt;/code&gt; and your stance as a writer doesn't change — only the verification gets handed off to the machine.&lt;/p&gt;

&lt;p&gt;Teams already running Kiro or spec-kit are probably at the stage where they could try referencing &lt;code&gt;DESIGN.md&lt;/code&gt; tokens from inside &lt;code&gt;.kiro/specs/{feature}/design.md&lt;/code&gt;. The cross-reference between feature specs and persistent specs is still a thin area in terms of public examples.&lt;/p&gt;

&lt;p&gt;The shared stance: don't try to spec everything at once. Document split → operational trial → speccing — staged migration is the realistic path. The three-layer split isn't a finished form. It's a movement still in progress, and that's the safer way to read it.&lt;/p&gt;

&lt;p&gt;AI rules started splitting from a single natural-language document into three spec formats. That's another side of the same movement as SDD.&lt;/p&gt;

&lt;p&gt;Not everything becomes a spec, but managing different roles in different files — that ordinary structuring is starting to apply to AI agents, too.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>designsystem</category>
    </item>
    <item>
      <title>What Is Apache Polaris? Why Open Data Catalogs Matter and How to Use Them with AWS</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Sat, 02 May 2026 06:27:16 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/what-is-apache-polaris-why-open-data-catalogs-matter-and-how-to-use-them-with-aws-5gal</link>
      <guid>https://vibe.forem.com/aws-builders/what-is-apache-polaris-why-open-data-catalogs-matter-and-how-to-use-them-with-aws-5gal</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/28aa29c2f9fbeb" rel="noopener noreferrer"&gt;Apache Polarisとは何か？オープンなデータカタログが求められる理由とAWSとの組み合わせ方を整理する&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In recent years, lakehouse architectures centered around Apache Iceberg have been rapidly expanding.&lt;/p&gt;

&lt;p&gt;By placing Iceberg tables on object storage such as S3, it has become possible to query the same data from multiple engines such as Athena, Snowflake, Spark, Trino, and Dremio.&lt;br&gt;
As a result, the discussion has shifted from &lt;em&gt;“Where should data be placed, and which engine should be used for analysis?”&lt;/em&gt; to &lt;em&gt;“Where should data ownership reside, and which catalog should be used to unify governance?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amid this trend, &lt;strong&gt;Apache Polaris&lt;/strong&gt; has been attracting attention.&lt;br&gt;
Apache Polaris is an open-source implementation of the Iceberg REST Catalog, led by Snowflake and donated to the Apache Software Foundation.&lt;/p&gt;

&lt;p&gt;Multiple vendors—including Dremio, AWS, Google, Microsoft, and Confluent—are contributing to it, and it is positioned as an &lt;strong&gt;“open catalog”&lt;/strong&gt; that enables cross-platform management of Iceberg tables while avoiding vendor lock-in.&lt;/p&gt;

&lt;p&gt;In this article, I would like to think through the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Apache Polaris is&lt;/li&gt;
&lt;li&gt;Why open data catalogs are required&lt;/li&gt;
&lt;li&gt;Differences from AWS Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Differences from Snowflake Horizon Catalog&lt;/li&gt;
&lt;li&gt;How responsibilities should be divided when combining with AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In conclusion, Apache Polaris is not something that &lt;em&gt;competes&lt;/em&gt; with AWS Glue Catalog or Snowflake Horizon Catalog; rather, they are catalogs that operate at different layers.&lt;/p&gt;

&lt;p&gt;It may be easier to understand Apache Polaris as a component that enables an architecture such as:&lt;br&gt;
&lt;strong&gt;“The data itself resides in AWS, the catalog is open, and analysis engines are selected based on use cases.”&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What is Apache Polaris?
&lt;/h2&gt;

&lt;p&gt;Apache Polaris is an open-source catalog implementation compliant with the Apache Iceberg REST Catalog specification.&lt;br&gt;
It was announced by Snowflake in 2024 and later became an incubation project under the Apache Software Foundation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The project has now graduated from incubation and has been promoted to a top-level Apache project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Official site:&lt;br&gt;
&lt;a href="https://polaris.apache.org/" rel="noopener noreferrer"&gt;https://polaris.apache.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What Polaris aims to achieve is a &lt;strong&gt;common metadata and governance foundation in a lakehouse centered around Iceberg tables&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A major characteristic is that it is not tied to any specific query engine or cloud vendor, and anyone can access it using the same specification via REST APIs.&lt;/p&gt;
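&lt;p&gt;To make that concrete, here is a hedged sketch of talking to a Polaris catalog from Python with pyiceberg; the endpoint, catalog name, namespace, table, and client credentials are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyiceberg.catalog import load_catalog

# Placeholders: your Polaris endpoint, catalog name, and OAuth client credentials.
catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "https://polaris.example.com/api/catalog",
        "warehouse": "my_polaris_catalog",
        "credential": "example-client-id:example-client-secret",
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)

# Any client or engine that speaks the Iceberg REST spec sees the same tables.
print(catalog.list_namespaces())
table = catalog.load_table("analytics.events")
print(table.schema())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;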


&lt;h3&gt;
  
  
  Key Features of Apache Polaris
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Implementation of Iceberg REST Catalog&lt;/td&gt;
&lt;td&gt;Accessible via standardized REST APIs. Can be directly used from engines such as Spark, Trino, Flink, Snowflake, and Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-catalog architecture&lt;/td&gt;
&lt;td&gt;Multiple catalogs can be defined within a single Polaris instance. Enables separation and management by team or business domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RBAC (Role-Based Access Control)&lt;/td&gt;
&lt;td&gt;Provides a permission model combining principals, principal roles, and catalog roles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External catalog integration&lt;/td&gt;
&lt;td&gt;Can connect to other catalogs compliant with the Iceberg REST specification (e.g., Nessie, Gravitino)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS / Managed support&lt;/td&gt;
&lt;td&gt;Can be self-hosted as OSS, or used as managed offerings such as Snowflake Open Catalog or Dremio Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  What Apache Polaris Solves
&lt;/h3&gt;

&lt;p&gt;As Apache Iceberg has become more widely adopted, multiple Iceberg-compatible catalogs have emerged, including Hive Metastore, JDBC, Nessie, AWS Glue, and Snowflake.&lt;/p&gt;

&lt;p&gt;Since each has its own client libraries and interfaces, the following challenges have arisen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The need to implement catalog clients for each programming language&lt;/li&gt;
&lt;li&gt;Inconsistent access control specifications across catalogs&lt;/li&gt;
&lt;li&gt;Difficulty enforcing governance across multiple catalogs&lt;/li&gt;
&lt;li&gt;As a result, the overall architecture becomes constrained by the chosen catalog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve these challenges, the Iceberg REST Catalog specification was introduced.&lt;br&gt;
Apache Polaris is an open-source implementation of that specification, further enhanced with multi-catalog support and RBAC.&lt;/p&gt;

&lt;p&gt;In other words, you can think of it as an &lt;strong&gt;open catalog for Apache Iceberg&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Polaris Security Model
&lt;/h3&gt;

&lt;p&gt;The Polaris security model can be organized into the following three concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Principal&lt;/strong&gt;: An entity representing a user or service. Accesses Polaris via client ID/secret, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Principal Role&lt;/strong&gt;: A grouping of multiple catalog roles. Assigned to principals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog Role&lt;/strong&gt;: A set of permissions within a specific catalog. Includes permissions such as &lt;code&gt;TABLE_READ_DATA&lt;/code&gt;, &lt;code&gt;TABLE_CREATE&lt;/code&gt;, and &lt;code&gt;NAMESPACE_LIST&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, you can design it such that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;data_engineer&lt;/code&gt; principal role is assigned both &lt;em&gt;write access to prod_catalog&lt;/em&gt; and &lt;em&gt;administrative access to dev_catalog&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;data_analyst&lt;/code&gt; principal role is assigned only &lt;em&gt;read access to prod_catalog&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An important point is that RBAC is centralized on the catalog side, eliminating the need to implement access control separately for each engine.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Open Data Catalogs Are Required
&lt;/h2&gt;

&lt;p&gt;Let us first consider why open data catalogs are required in the first place.&lt;/p&gt;


&lt;h3&gt;
  
  
  Separation of Data and Engines Has Become a Premise
&lt;/h3&gt;

&lt;p&gt;The greatest value of open table formats such as Apache Iceberg is the ability to separate data storage from query engines.&lt;/p&gt;

&lt;p&gt;When querying Iceberg tables on S3, it has become possible to freely choose engines such as Athena, Glue, Spark, Snowflake, Dremio, and DuckDB depending on the use case.&lt;/p&gt;

&lt;p&gt;As a result, the key question in data platforms has shifted from &lt;em&gt;“Which product should we use?”&lt;/em&gt; to &lt;em&gt;“Where should data ownership reside, and who should be responsible for governance at which layer?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, while engines can now be freely selected, the remaining challenge is the &lt;strong&gt;catalog&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  What Happens When Catalogs Are Tied to Engines
&lt;/h3&gt;

&lt;p&gt;When using catalogs tightly coupled with query engines, the following situations tend to occur:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data itself is open (S3 + Iceberg), but the catalog is tied to a specific engine&lt;/li&gt;
&lt;li&gt;You want to reference the same table from another engine, but the catalog does not support it&lt;/li&gt;
&lt;li&gt;Access control is fragmented across engines, making governance difficult&lt;/li&gt;
&lt;li&gt;Every time the catalog is changed, all engine-side configurations must be redone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, even if storage and formats are open, a closed catalog significantly reduces the benefits of a lakehouse.&lt;/p&gt;

&lt;p&gt;Especially in today’s environments where multi-cloud, multiple products, and multiple engines are commonly combined, how to unify catalogs becomes a key challenge.&lt;/p&gt;


&lt;h3&gt;
  
  
  Requirements for an Open Catalog
&lt;/h3&gt;

&lt;p&gt;Based on this background, lakehouse catalogs are expected to meet the following requirements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compliance with standard APIs&lt;/td&gt;
&lt;td&gt;Support vendor-neutral APIs such as the Iceberg REST Catalog specification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-engine support&lt;/td&gt;
&lt;td&gt;Usable across engines such as Spark, Trino, Flink, Snowflake, and Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Centralized RBAC&lt;/td&gt;
&lt;td&gt;Define permissions at the catalog level and apply consistent governance across all engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud / hybrid&lt;/td&gt;
&lt;td&gt;Not dependent on a specific cloud and capable of running on-premises when necessary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS sustainability&lt;/td&gt;
&lt;td&gt;Not discontinued based on vendor decisions; continuously developed in a community-driven manner&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Apache Polaris is a catalog designed to satisfy these requirements.&lt;/p&gt;


&lt;h2&gt;
  
  
  Differences from AWS Glue Data Catalog
&lt;/h2&gt;

&lt;p&gt;When building on AWS, AWS Glue Data Catalog is often positioned as the central data catalog.&lt;br&gt;
Here, we will organize the differences between AWS Glue Data Catalog and Apache Polaris.&lt;/p&gt;


&lt;h3&gt;
  
  
  Positioning of AWS Glue Data Catalog
&lt;/h3&gt;

&lt;p&gt;AWS Glue Data Catalog is a core metadata management service in AWS.&lt;/p&gt;

&lt;p&gt;It is natively integrated with AWS analytics services such as Athena, Glue, Redshift Spectrum, and EMR, and plays the role of managing data on S3 as a catalog.&lt;/p&gt;

&lt;p&gt;As discussed in previous articles, Glue Data Catalog is an excellent &lt;strong&gt;technical catalog used by data platforms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/is-aws-glue-data-catalog-sufficient-as-a-data-catalog-organizing-its-design-limitations-and-kih"&gt;Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Functional Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;AWS Glue Data Catalog&lt;/th&gt;
&lt;th&gt;Apache Polaris&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Offering&lt;/td&gt;
&lt;td&gt;AWS-managed (closed)&lt;/td&gt;
&lt;td&gt;OSS / Managed (Snowflake Open Catalog, Dremio Catalog, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;AWS proprietary API (recently also provides Iceberg REST compatibility)&lt;/td&gt;
&lt;td&gt;Iceberg REST Catalog specification (open)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud support&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Multi-cloud / on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engines&lt;/td&gt;
&lt;td&gt;Athena, Glue, Redshift, EMR, Spark&lt;/td&gt;
&lt;td&gt;Spark, Trino, Flink, Snowflake, Dremio, StarRocks, DuckDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-catalog&lt;/td&gt;
&lt;td&gt;Account-level (logical separation via Lake Formation)&lt;/td&gt;
&lt;td&gt;Native support for multiple catalogs within a single instance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access control&lt;/td&gt;
&lt;td&gt;IAM + Lake Formation&lt;/td&gt;
&lt;td&gt;Built-in RBAC (Principal / Principal Role / Catalog Role)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External catalog integration&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Can integrate with Iceberg REST-compliant catalogs (Nessie, Gravitino, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-Iceberg formats&lt;/td&gt;
&lt;td&gt;Supports Hive, JSON, CSV, Parquet, etc.&lt;/td&gt;
&lt;td&gt;Currently Iceberg-centric (Generic Table support on roadmap)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  How to Interpret the Difference
&lt;/h3&gt;

&lt;p&gt;Rather than being in a competitive relationship, it is easier to understand them as catalogs with different roles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;: Strong integration with AWS services, making it the primary choice for workloads completed within AWS. It supports a wide range of data lake formats beyond Iceberg and features such as S3 crawling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Polaris&lt;/strong&gt;: A catalog that enables governance across multiple engines and clouds based on the industry-standard Iceberg REST API. It is effective when you want to enforce consistent RBAC across engines outside AWS (e.g., Snowflake, Dremio).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your use case is &lt;strong&gt;AWS-contained and includes formats beyond Iceberg&lt;/strong&gt;, Glue Data Catalog is a practical choice&lt;/li&gt;
&lt;li&gt;If you want &lt;strong&gt;common management of Iceberg across multiple engines and a vendor-neutral catalog layer&lt;/strong&gt;, Polaris is suitable&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Differences from Snowflake Horizon Catalog
&lt;/h2&gt;

&lt;p&gt;This is often confused, so let’s clarify the difference between Snowflake Horizon Catalog and Apache Polaris.&lt;br&gt;
Note that it is different from “Snowflake Open Catalog,” despite the similar name.&lt;/p&gt;


&lt;h3&gt;
  
  
  What is Snowflake Horizon Catalog?
&lt;/h3&gt;

&lt;p&gt;Snowflake Horizon Catalog is a data governance and discovery suite provided by Snowflake.&lt;/p&gt;

&lt;p&gt;For data managed within Snowflake (Snowflake-managed tables, stages, views, shared data, etc.), it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data discovery (search, tagging, descriptions)&lt;/li&gt;
&lt;li&gt;Lineage&lt;/li&gt;
&lt;li&gt;Data quality monitoring&lt;/li&gt;
&lt;li&gt;Masking policies and row access policies&lt;/li&gt;
&lt;li&gt;Automatic classification of sensitive data&lt;/li&gt;
&lt;li&gt;Compliance management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In terms of positioning, it is similar to Amazon DataZone + Lake Formation + Glue Data Quality in AWS.&lt;/p&gt;

&lt;p&gt;In other words, it is the layer responsible for &lt;strong&gt;cataloging and governance so that people can discover, understand, and trust data&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  What is Snowflake Open Catalog (Relation to Polaris)
&lt;/h3&gt;

&lt;p&gt;On the other hand, Snowflake Open Catalog is a managed offering of Apache Polaris.&lt;/p&gt;

&lt;p&gt;Although the name is confusing, this is the lakehouse catalog that serves as an Iceberg REST Catalog.&lt;/p&gt;

&lt;p&gt;In Snowflake’s model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake-managed data&lt;/li&gt;
&lt;li&gt;Snowflake Open Catalog (= Apache Polaris): Lakehouse catalog layer for open table formats such as Iceberg&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Functional Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Snowflake Horizon Catalog&lt;/th&gt;
&lt;th&gt;Apache Polaris&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary target&lt;/td&gt;
&lt;td&gt;Data in Snowflake (internal tables, shared data, etc.)&lt;/td&gt;
&lt;td&gt;Iceberg (Generic Table support for other formats is planned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer&lt;/td&gt;
&lt;td&gt;Business catalog / governance layer&lt;/td&gt;
&lt;td&gt;Lakehouse catalog layer (technical catalog)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offering&lt;/td&gt;
&lt;td&gt;Built into Snowflake (closed)&lt;/td&gt;
&lt;td&gt;OSS / Managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Snowflake proprietary&lt;/td&gt;
&lt;td&gt;Iceberg REST Catalog specification (open)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data location&lt;/td&gt;
&lt;td&gt;Snowflake internal storage or recognized external data&lt;/td&gt;
&lt;td&gt;Iceberg tables on cloud storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Within Snowflake organizations&lt;/td&gt;
&lt;td&gt;Across multiple engines and clouds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  How to Interpret the Difference
&lt;/h3&gt;

&lt;p&gt;Again, these are not in opposition but complementary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake Horizon Catalog&lt;/strong&gt;: Upper layer that provides data to business users, handling discovery, quality, masking, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Polaris&lt;/strong&gt;: Lower layer (metadata foundation) that exposes Iceberg tables to multiple engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually, the structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────┐
│  Business Catalog / Governance Layer         │ ← Snowflake Horizon Catalog
│  (Discovery / Lineage / Quality / Masking)   │   Amazon DataZone, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Lakehouse Catalog Layer                     │ ← Apache Polaris
│  (Iceberg REST Catalog / RBAC)               │   AWS Glue Data Catalog, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Data Lake (S3 / GCS / Azure Blob)           │
│  Iceberg / Parquet                           │
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you think of Snowflake Horizon Catalog and Apache Polaris as “choosing one or the other,” it feels unnatural, but when organized as different layers, the division of responsibilities becomes clear.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Combine with AWS
&lt;/h2&gt;

&lt;p&gt;From here, we will consider cases where Apache Polaris is introduced into an AWS environment.&lt;br&gt;
Since AWS already has a powerful catalog called Glue Data Catalog, it is important to clarify &lt;strong&gt;how Polaris should be positioned&lt;/strong&gt; and &lt;strong&gt;who is responsible for what&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Expected Architecture
&lt;/h3&gt;

&lt;p&gt;Representative configurations can be organized into the following three patterns.&lt;/p&gt;


&lt;h4&gt;
  
  
  Pattern 1: AWS-only (Glue Data Catalog-centered)
&lt;/h4&gt;

&lt;p&gt;This is the simplest configuration.&lt;br&gt;
It is a typical setup using S3 + Iceberg + Glue Data Catalog, along with Athena / Glue / Redshift Spectrum.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catalog: AWS Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Governance: IAM + Lake Formation&lt;/li&gt;
&lt;li&gt;Query engines: Athena, Redshift Spectrum, Glue ETL, EMR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If everything stays within AWS and there is no strong need to share data with external engines, this configuration remains the most practical.&lt;br&gt;
There is no need to introduce Apache Polaris just for the sake of it.&lt;/p&gt;


&lt;h4&gt;
  
  
  Pattern 2: AWS + Snowflake (Using Polaris as a shared catalog foundation)
&lt;/h4&gt;

&lt;p&gt;This configuration is effective when you want to reference the same Iceberg tables from both AWS (e.g., Athena) and Snowflake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data storage: S3 + Iceberg&lt;/li&gt;
&lt;li&gt;Catalog: Apache Polaris (OSS self-hosted or Snowflake Open Catalog)&lt;/li&gt;
&lt;li&gt;AWS side: Reference Polaris as an Iceberg REST Catalog (via Spark or third-party tools; see the Spark sketch below)&lt;/li&gt;
&lt;li&gt;Snowflake side: Connect to Polaris using External Volume and Catalog Integration (&lt;code&gt;CATALOG_SOURCE = POLARIS&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
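&lt;p&gt;From the AWS side, for example Spark on EMR or Glue, Polaris can be wired in as a standard Iceberg REST catalog. A hedged sketch, assuming the Iceberg Spark runtime is on the classpath and using a placeholder endpoint, catalog name, table, and credentials:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession

# Placeholders: Polaris endpoint, catalog name, and OAuth client credentials.
spark = (
    SparkSession.builder.appName("polaris-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://polaris.example.com/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "my_polaris_catalog")
    .config("spark.sql.catalog.polaris.credential", "example-client-id:example-client-secret")
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    .getOrCreate()
)

# Query the same Iceberg table that Snowflake sees through the Catalog Integration below.
spark.sql("SELECT count(*) FROM polaris.analytics.events").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;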

&lt;p&gt;From the Snowflake side, Polaris can be referenced directly as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;INTEGRATION&lt;/span&gt; &lt;span class="n"&gt;polaris_catalog_int&lt;/span&gt;
  &lt;span class="n"&gt;CATALOG_SOURCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;POLARIS&lt;/span&gt;
  &lt;span class="n"&gt;TABLE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ICEBERG&lt;/span&gt;
  &lt;span class="n"&gt;REST_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CATALOG_URI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'https://&amp;lt;polaris-host&amp;gt;/api/catalog'&lt;/span&gt;
    &lt;span class="k"&gt;CATALOG_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;your_polaris_catalog&amp;gt;'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;REST_AUTHENTICATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OAUTH&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_CLIENT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;polaris_client_id&amp;gt;'&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_CLIENT_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;polaris_client_secret&amp;gt;'&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_ALLOWED_SCOPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'PRINCIPAL_ROLE:ALL'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  Pattern 3: Multi-engine / multi-cloud configuration
&lt;/h4&gt;

&lt;p&gt;This configuration goes beyond Snowflake and adds further engines such as Dremio, Databricks, Trino, and Flink.&lt;/p&gt;

&lt;p&gt;In this case, all engines reference Polaris as a common Iceberg REST Catalog.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data storage: S3 (and other cloud storage if needed)&lt;/li&gt;
&lt;li&gt;Catalog: Apache Polaris (center of governance)&lt;/li&gt;
&lt;li&gt;Query engines: Snowflake, Dremio, Spark, Trino, Flink, etc.&lt;/li&gt;
&lt;li&gt;Governance: Polaris provides unified RBAC across all engines&lt;/li&gt;
&lt;/ul&gt;
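
&lt;p&gt;To make the idea of a shared catalog concrete, this is a rough sketch of what reading the same table looks like from Spark SQL, assuming the session was started with the Iceberg runtime and a REST catalog named &lt;code&gt;polaris&lt;/code&gt; configured against the Polaris endpoint; the catalog, namespace, and table names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rough sketch, assuming a Spark session configured with the Iceberg runtime and
-- a REST catalog named 'polaris' pointing at https://&amp;lt;polaris-host&amp;gt;/api/catalog
-- (e.g. spark.sql.catalog.polaris.type = rest). Names below are placeholders.
SHOW NAMESPACES IN polaris;

-- The same table that Snowflake reaches through its catalog integration
SELECT order_id, amount
FROM polaris.analytics.orders
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Trino, Flink, and Dremio have equivalent REST catalog configuration; the point of this pattern is that every engine resolves metadata, and the table-level permissions defined in Polaris, through the same endpoint.&lt;/p&gt;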




&lt;h3&gt;
  
  
  How to Think About Responsibility Separation
&lt;/h3&gt;

&lt;p&gt;This is the key point.&lt;br&gt;
When combining Polaris, AWS, Snowflake, and others, it is important to clearly define &lt;strong&gt;who is responsible for which layer&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Primary Owner&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data storage (files)&lt;/td&gt;
&lt;td&gt;AWS (S3)&lt;/td&gt;
&lt;td&gt;Storage location of the data. Single Source of Truth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage access control&lt;/td&gt;
&lt;td&gt;AWS (IAM)&lt;/td&gt;
&lt;td&gt;Access permissions to S3 buckets/prefixes are defined on the AWS side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table metadata&lt;/td&gt;
&lt;td&gt;Apache Polaris&lt;/td&gt;
&lt;td&gt;Source of Truth for Iceberg metadata such as schema, snapshots, partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table-level RBAC&lt;/td&gt;
&lt;td&gt;Apache Polaris&lt;/td&gt;
&lt;td&gt;Applies consistent permission rules across engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL / pipelines&lt;/td&gt;
&lt;td&gt;AWS Glue / Lambda / EMR / Spark&lt;/td&gt;
&lt;td&gt;Responsible for ingestion and transformation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query execution&lt;/td&gt;
&lt;td&gt;Athena / Snowflake / Dremio / Spark&lt;/td&gt;
&lt;td&gt;Engines selected based on use case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business catalog / discovery&lt;/td&gt;
&lt;td&gt;Snowflake Horizon Catalog / Amazon DataZone&lt;/td&gt;
&lt;td&gt;Higher-layer features for end users such as search, lineage, and quality visibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data quality&lt;/td&gt;
&lt;td&gt;Glue Data Quality / Snowflake DMF&lt;/td&gt;
&lt;td&gt;Implemented at engine or quality service layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What is especially important is the &lt;strong&gt;three-layer separation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data resides in AWS, the catalog is Polaris, and usage is handled by each engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By making this separation explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS can focus on storage and IAM management&lt;/li&gt;
&lt;li&gt;Polaris can focus on metadata and access control&lt;/li&gt;
&lt;li&gt;Each query engine can focus on its strengths&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Considerations When Adopting Polaris
&lt;/h3&gt;

&lt;p&gt;Polaris is powerful, but there are also important considerations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational cost when self-hosting OSS&lt;/strong&gt;: Running on EKS or EC2 requires a metastore (e.g., PostgreSQL), authentication infrastructure, monitoring, and upgrade handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed services are often more practical&lt;/strong&gt;: Using Snowflake Open Catalog or Dremio Catalog significantly reduces operational burden&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less seamless integration with AWS services compared to Glue&lt;/strong&gt;: For AWS-native services such as Athena, Redshift, and QuickSight, using Glue Data Catalog is far more straightforward&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need to avoid double governance&lt;/strong&gt;: If IAM policies on S3 and RBAC in Polaris are inconsistent, troubleshooting becomes complex&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, when deciding whether to adopt Apache Polaris in an AWS environment, it is practical to evaluate it against the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether multi-engine requirements exist&lt;/li&gt;
&lt;li&gt;The organization’s stance on vendor lock-in&lt;/li&gt;
&lt;li&gt;Whether operational cost is acceptable (or managed services can be used)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  A Practical Approach
&lt;/h3&gt;

&lt;p&gt;Personally, when considering Polaris in an AWS environment, I find the following phased approach practical:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a lakehouse within AWS using Glue Data Catalog + Iceberg&lt;/li&gt;
&lt;li&gt;When integration with other engines such as Snowflake becomes necessary, consider introducing an Iceberg REST layer&lt;/li&gt;
&lt;li&gt;At that point, compare “Glue Iceberg REST endpoint,” “Apache Polaris OSS,” and “Snowflake Open Catalog” based on requirements&lt;/li&gt;
&lt;li&gt;If multi-engine / multi-cloud requirements become clear, redesign with Polaris (especially managed) at the center&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rather than designing with Polaris from the beginning, it is often more practical to &lt;strong&gt;replace the catalog layer with an open one when requirements mature&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we walked through the key points of Apache Polaris.&lt;/p&gt;

&lt;p&gt;In the world of data platforms, storage and formats have become open, but a closed catalog undercuts much of the benefit of a lakehouse.&lt;/p&gt;

&lt;p&gt;Therefore, there is a need for an &lt;strong&gt;open catalog&lt;/strong&gt; that complies with the Iceberg REST Catalog specification and enables unified governance across multiple engines and clouds.&lt;br&gt;
Apache Polaris is designed to fulfill exactly that role.&lt;/p&gt;

&lt;p&gt;However, it is important to think not in terms of “which one to choose” among Polaris, AWS Glue Data Catalog, and Snowflake Horizon Catalog, but rather &lt;strong&gt;which layer each is responsible for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Glue Data Catalog: Technical catalog within AWS (still the primary choice for AWS-only workloads)&lt;/li&gt;
&lt;li&gt;Apache Polaris: Lakehouse catalog centered on Iceberg, shared across multiple engines&lt;/li&gt;
&lt;li&gt;Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even when combining with AWS, by consciously separating responsibilities as&lt;br&gt;
&lt;strong&gt;“data in AWS, catalog in Polaris, analytics in engines, business catalog in another layer”&lt;/strong&gt;,&lt;br&gt;
you can design an architecture that leverages the strengths of each.&lt;/p&gt;

&lt;p&gt;Going forward, lakehouse architectures are expected to increasingly adopt vendor-neutral designs.&lt;br&gt;
Apache Polaris is likely to become an important component supporting that openness.&lt;/p&gt;

&lt;p&gt;I hope this article will be helpful for those considering Apache Polaris or designing lakehouse architectures across multiple platforms such as AWS and Snowflake.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>snowflake</category>
      <category>dataengineering</category>
      <category>iceberg</category>
    </item>
  </channel>
</rss>
