<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Vibe Coding Forem: AWS Community Builders </title>
    <description>The latest articles on Vibe Coding Forem by AWS Community Builders  (@aws-builders).</description>
    <link>https://vibe.forem.com/aws-builders</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png</url>
      <title>Vibe Coding Forem: AWS Community Builders </title>
      <link>https://vibe.forem.com/aws-builders</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://vibe.forem.com/feed/aws-builders"/>
    <language>en</language>
    <item>
      <title>The Model Is the Brain. The Harness Is the Body. Here's Why That Matters</title>
      <dc:creator>Ajit</dc:creator>
      <pubDate>Mon, 04 May 2026 05:18:31 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/the-model-is-the-brain-the-harness-is-the-body-heres-why-that-matters-2961</link>
      <guid>https://vibe.forem.com/aws-builders/the-model-is-the-brain-the-harness-is-the-body-heres-why-that-matters-2961</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I built the same browser agent twice — once with 500 lines of Python, once with 7 lines of JSON. The second one took 5 minutes. The agent harness layer is becoming the real competitive advantage, not the model.&lt;/p&gt;

&lt;p&gt;Last month, I built a browser automation agent. Playwright. Custom orchestration. Login handlers. Error retries. Session management. React-aware form filling. Anti-detection scripts. &lt;strong&gt;500+ lines of Python.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This week, I built the same thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bedrock"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"modelId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us.anthropic.claude-sonnet-4-6"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agentcore_browser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"browser"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"systemPrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a web browsing assistant."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy. Invoke. It browses websites, extracts data, fills forms. &lt;strong&gt;Seven lines. Zero orchestration code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But here's the thing most people miss: &lt;strong&gt;I kept both versions.&lt;/strong&gt; And that's the real insight.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed (and What Didn't)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;500-Line Script&lt;/th&gt;
&lt;th&gt;7-Line Harness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automates a specific multi-site workflow&lt;/td&gt;
&lt;td&gt;Browses any website, extracts info&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How it decides&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;I wrote every step&lt;/td&gt;
&lt;td&gt;AI decides the steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 (Playwright, local)&lt;/td&gt;
&lt;td&gt;~$0.10-0.50 (Bedrock tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;95%+ (deterministic)&lt;/td&gt;
&lt;td&gt;~80% (AI reasoning varies)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only does what I coded&lt;/td&gt;
&lt;td&gt;Handles any browsing task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 days of debugging&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The 500-line script is better for its specific job.&lt;/strong&gt; It runs faster, costs less, and is more reliable, because the steps are known in advance and it doesn't need AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 7-line harness is better for everything else.&lt;/strong&gt; Research tasks. Data extraction from unfamiliar sites. Competitive analysis. Anything where the steps aren't known in advance.&lt;/p&gt;

&lt;p&gt;This is my POV: &lt;strong&gt;deterministic + AI is the right architecture.&lt;/strong&gt; Don't use a $0.03/call model to click a button you can click with Playwright for free. But don't write 500 lines of Playwright when 7 lines of config can handle it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Harness Is the New Battleground
&lt;/h2&gt;

&lt;p&gt;Everyone's talking about which model is best. Claude vs GPT vs Gemini. Benchmarks. Context windows. Reasoning scores.&lt;/p&gt;

&lt;p&gt;That conversation is becoming irrelevant.&lt;/p&gt;

&lt;p&gt;Models are commoditizing. Claude Sonnet 4.6 and GPT-5.5 are both "good enough" for most agent tasks. The real question is: &lt;strong&gt;what wraps around the model to make it actually work in production?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the harness — the orchestration loop, tool execution, memory, security, compute isolation. And every cloud provider is racing to own it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Harness Product&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AgentCore Harness&lt;/td&gt;
&lt;td&gt;Preview (Apr 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bedrock Managed Agents (OpenAI-specific)&lt;/td&gt;
&lt;td&gt;Limited Preview&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini Enterprise Agent Platform&lt;/td&gt;
&lt;td&gt;GA (Apr 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Microsoft&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Azure AI Agent Service&lt;/td&gt;
&lt;td&gt;GA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Salesforce&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agentforce&lt;/td&gt;
&lt;td&gt;GA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the container orchestration war all over again. In 2015, everyone had containers. The question was who would manage running them. Kubernetes won, and whoever controlled K8s controlled where workloads ran.&lt;/p&gt;

&lt;p&gt;In 2026, everyone has models. The question is who manages running agents. &lt;strong&gt;Whoever controls the harness controls the next decade of cloud spend.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How AgentCore Harness Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You (prompt) → AgentCore Harness → Bedrock Model (reasoning)
                    ↓                      ↓
              Firecracker microVM    Tool selection
              (isolated per session)       ↓
                    ↓              AgentCore Browser / Shell / Code
              Persistent memory    
              (across sessions)    
                    ↓
              Streamed response → You
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What AWS handles: compute, orchestration loop, tool invocation, memory, auth, observability.&lt;br&gt;
What you handle: a JSON config and a prompt.&lt;/p&gt;

&lt;p&gt;Each session runs in its own &lt;strong&gt;Firecracker microVM&lt;/strong&gt; — the same isolation technology behind Lambda. Not a container. A VM. One session can't see another's data, cookies, or credentials.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started (I Actually Ran This)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install CLI&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @aws/agentcore@preview

&lt;span class="c"&gt;# Create project&lt;/span&gt;
agentcore create &lt;span class="nt"&gt;--name&lt;/span&gt; browseragent &lt;span class="nt"&gt;--model-provider&lt;/span&gt; bedrock
&lt;span class="nb"&gt;cd &lt;/span&gt;browseragent

&lt;span class="c"&gt;# Add browser tool&lt;/span&gt;
agentcore add tool &lt;span class="nt"&gt;--harness&lt;/span&gt; browseragent &lt;span class="nt"&gt;--type&lt;/span&gt; agentcore_browser &lt;span class="nt"&gt;--name&lt;/span&gt; browser

&lt;span class="c"&gt;# Set target account + region&lt;/span&gt;
&lt;span class="c"&gt;# Edit agentcore/aws-targets.json: [{"name":"default","region":"us-west-2","account":"YOUR_ACCOUNT"}]&lt;/span&gt;

&lt;span class="c"&gt;# Deploy (~3 min)&lt;/span&gt;
agentcore deploy &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# Use it&lt;/span&gt;
agentcore invoke &lt;span class="nt"&gt;--harness&lt;/span&gt; browseragent &lt;span class="nt"&gt;--stream&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"Go to example.com and describe what you see"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output from my actual run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔧 Tool: browser
⚡ 6005 in · 110 out · 2.2s
Here's what's on the page at example.com:
### Example Domain
The page contains: "Example Domain" heading, body text about documentation use,
and a "Learn more" link to IANA documentation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real. Not a demo. Not a screenshot from someone else's blog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;What I Found&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No harness charge. You pay for Bedrock tokens + Browser session time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;us-west-2, us-east-1, eu-central-1, ap-southeast-2 (preview)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Models&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any Bedrock model, plus OpenAI and Gemini. Switch mid-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Firecracker microVM isolation, IAM execution role, Cedar policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Limitation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Preview — not for production workloads yet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;⚠️ &lt;strong&gt;Gotcha I hit:&lt;/strong&gt; The harness execution role needs &lt;code&gt;bedrock:Converse&lt;/code&gt; and &lt;code&gt;bedrock:ConverseStream&lt;/code&gt; permissions, plus &lt;code&gt;aws-marketplace:ViewSubscriptions&lt;/code&gt; for 3P models. The default CDK policy only includes &lt;code&gt;bedrock:InvokeModel&lt;/code&gt;. I had to add permissions manually.&lt;/p&gt;
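
&lt;p&gt;For reference, this is roughly the statement I added to the execution role (resource scoping is simplified here; tighten it for your account):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Effect": "Allow",
  "Action": [
    "bedrock:Converse",
    "bedrock:ConverseStream",
    "aws-marketplace:ViewSubscriptions"
  ],
  "Resource": "*"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;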

&lt;h2&gt;
  
  
  When NOT to Use Harness
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic automation&lt;/strong&gt; (same steps every time) → Playwright. Cheaper, faster, more reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex multi-agent workflows&lt;/strong&gt; → Strands Agents SDK with AgentCore Runtime. More control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing framework investment&lt;/strong&gt; (LangChain/CrewAI) → Use AgentCore tools standalone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production workloads&lt;/strong&gt; → Wait for GA. It's preview.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;The model is the brain. The harness is the body. Most teams are spending all their time picking the brain and hand-building the body from scratch every time.&lt;/p&gt;

&lt;p&gt;AgentCore Harness lets you stop building bodies and start building solutions. For 80% of agent use cases, config beats code. For the other 20%, write code — but use the harness infrastructure underneath.&lt;/p&gt;

&lt;p&gt;The teams still hand-coding agent orchestration loops are building technical debt. The same way teams hand-coding REST APIs built technical debt before API Gateway existed.&lt;/p&gt;

&lt;p&gt;The question isn't whether to adopt managed agent infrastructure. It's whether you'll be building on it — or competing against someone who already is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ajit NK — AWS Community Builder, APN FasTrack Partner. Building AI agent solutions at CloudNestle.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;"The model is the brain. The harness is the body. I build the body."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;📚 &lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/harness.html" rel="noopener noreferrer"&gt;AgentCore Harness docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/04/agentcore-new-features-to-build-agents-faster/" rel="noopener noreferrer"&gt;What's New announcement (Apr 22, 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://strandsagents.com" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>aws</category>
      <category>ai</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Serverless Workflow Decomposition: When a Step Function Becomes a Monolith</title>
      <dc:creator>Renaldi</dc:creator>
      <pubDate>Sun, 03 May 2026 23:30:00 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/serverless-workflow-decomposition-when-a-step-function-becomes-a-monolith-1hch</link>
      <guid>https://vibe.forem.com/aws-builders/serverless-workflow-decomposition-when-a-step-function-becomes-a-monolith-1hch</guid>
      <description>&lt;p&gt;There is a point in many serverless platforms where a Step Functions workflow that once felt elegant starts to feel like a mini application platform of its own.&lt;/p&gt;

&lt;p&gt;I have seen this happen in teams that are doing many things correctly: they standardized orchestration, they improved visibility, and they moved fragile glue logic out of Lambdas. Then six months later, the workflow has 100+ states, a maze of &lt;code&gt;Choice&lt;/code&gt; branches, deeply nested payload transformations, and a deployment blast radius that makes everyone nervous.&lt;/p&gt;

&lt;p&gt;This post is about &lt;strong&gt;recognizing workflow sprawl early&lt;/strong&gt; and decomposing a Step Functions workflow into a more maintainable architecture without losing the benefits of orchestration.&lt;/p&gt;

&lt;p&gt;I will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Signs of workflow sprawl&lt;/li&gt;
&lt;li&gt;Splitting by domain and subprocess boundaries&lt;/li&gt;
&lt;li&gt;Parent-child workflow patterns&lt;/li&gt;
&lt;li&gt;Contracting inputs and outputs&lt;/li&gt;
&lt;li&gt;Versioning workflows safely&lt;/li&gt;
&lt;li&gt;An end-to-end walkthrough with architecture and code&lt;/li&gt;
&lt;li&gt;Implementation discussion and migration guidance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will use AWS Step Functions terminology throughout, but the architectural thinking applies broadly to workflow systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;A large workflow is not automatically a bad workflow.&lt;/p&gt;

&lt;p&gt;In fact, I often start with a single orchestration when I want to make the business process visible quickly. The problem is not “too many states” by itself. The problem is when a workflow stops reflecting a coherent business flow and instead becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a catch-all for multiple domains&lt;/li&gt;
&lt;li&gt;a deployment bottleneck&lt;/li&gt;
&lt;li&gt;a fragile contract hub&lt;/li&gt;
&lt;li&gt;a place where teams are afraid to change anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, I treat it like I would a code monolith that has outgrown its boundaries: &lt;strong&gt;decompose intentionally, not reactively&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I mean by a "Step Function monolith"
&lt;/h2&gt;

&lt;p&gt;For this post, a Step Function becomes a monolith when one state machine accumulates responsibilities that should be owned by separate domains or subprocesses.&lt;/p&gt;

&lt;p&gt;Typical symptoms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Order orchestration, payment rules, inventory logic, fraud checks, and notifications all embedded in one ASL definition&lt;/li&gt;
&lt;li&gt;Repeated transformation states to make one team's output fit another team's input&lt;/li&gt;
&lt;li&gt;Error handling branches duplicated across unrelated parts of the flow&lt;/li&gt;
&lt;li&gt;A single workflow release requiring coordination across multiple teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not just a readability issue. It affects operability, testing, and change safety.&lt;/p&gt;




&lt;h2&gt;
  
  
  Signs of workflow sprawl
&lt;/h2&gt;

&lt;p&gt;These are the patterns I look for during architecture reviews.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) One workflow owns too many domains
&lt;/h3&gt;

&lt;p&gt;If a single state machine is enforcing rules that belong to Payments, Inventory, Fraud, Fulfillment, and Notifications, it is likely doing too much.&lt;/p&gt;

&lt;p&gt;A good orchestrator should coordinate domains, not absorb their internal logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) The ASL definition becomes hard to reason about
&lt;/h3&gt;

&lt;p&gt;Signs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;many long &lt;code&gt;Choice&lt;/code&gt; chains&lt;/li&gt;
&lt;li&gt;repeated &lt;code&gt;Pass&lt;/code&gt;/transform states just to reshape data&lt;/li&gt;
&lt;li&gt;large &lt;code&gt;Catch&lt;/code&gt; and &lt;code&gt;Retry&lt;/code&gt; blocks copied across multiple branches&lt;/li&gt;
&lt;li&gt;difficulty tracing the happy path from start to finish&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I need a map just to explain the workflow in a design review, decomposition is usually overdue.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Payloads become "workflow-shaped" instead of domain-shaped
&lt;/h3&gt;

&lt;p&gt;A common smell is a giant state payload that keeps growing because every future step might need something.&lt;/p&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;many fields carried "just in case"&lt;/li&gt;
&lt;li&gt;internal step-specific fields leaking into later steps&lt;/li&gt;
&lt;li&gt;brittle JSONPath references across distant states&lt;/li&gt;
&lt;li&gt;accidental coupling to intermediate output shapes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is often the strongest signal that input/output contracts need to be tightened.&lt;/p&gt;
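
&lt;p&gt;An exaggerated, purely illustrative sketch of what that payload ends up looking like (every field name here is made up, but the shape will be familiar):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "order": { "orderId": "ORD-100045", "items": ["..."] },
  "fraudRawResponse": { "provider": "...", "score": 0.12, "debug": "..." },
  "pspHttpResponse": { "statusCode": 200, "headers": {}, "body": "..." },
  "inventoryTempHolds": ["..."],
  "notificationTemplateDraft": "...",
  "retryCountForStep7": 2,
  "carriedJustInCase": { "warehouseHints": "...", "legacyFlags": "..." }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;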

&lt;h3&gt;
  
  
  4) Change blast radius is too large
&lt;/h3&gt;

&lt;p&gt;If a small payment change forces re-testing the full order pipeline end-to-end, you are paying a monolith tax in a serverless system.&lt;/p&gt;

&lt;p&gt;I watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;frequent merge conflicts in the same workflow definition&lt;/li&gt;
&lt;li&gt;unrelated teams blocking each other&lt;/li&gt;
&lt;li&gt;release windows for “workflow changes”&lt;/li&gt;
&lt;li&gt;fear of touching central error paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5) Execution histories are huge and troubleshooting is slow
&lt;/h3&gt;

&lt;p&gt;When executions become long and noisy, step histories are harder to navigate. Even when the workflow is functionally correct, operator experience degrades.&lt;/p&gt;

&lt;p&gt;This matters during incidents. The fastest diagnosis usually comes from clear orchestration boundaries and localized subprocess execution histories.&lt;/p&gt;

&lt;h3&gt;
  
  
  6) Reuse pressure leads to copy/paste orchestration
&lt;/h3&gt;

&lt;p&gt;If teams are duplicating chunks of states for common subprocesses (for example, document validation, payment authorization, fraud scoring), that is a strong indicator those chunks should become child workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  7) Mixed execution profiles are forced into one workflow
&lt;/h3&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a mostly synchronous checkout path mixed with long-running fulfillment polling&lt;/li&gt;
&lt;li&gt;high-throughput lightweight paths mixed with complex human approval steps&lt;/li&gt;
&lt;li&gt;latency-sensitive branches mixed with eventual-consistency branches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These often want different execution patterns, retry policies, and operational ownership.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decomposition principles I use
&lt;/h2&gt;

&lt;p&gt;When I decompose a Step Functions workflow, I do not split it by "number of states." I split it by &lt;strong&gt;architectural responsibility&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 1: Keep the parent workflow focused on orchestration decisions
&lt;/h3&gt;

&lt;p&gt;The parent should answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which subprocess runs next?&lt;/li&gt;
&lt;li&gt;Should we continue or compensate?&lt;/li&gt;
&lt;li&gt;What is the overall status?&lt;/li&gt;
&lt;li&gt;Which events should be emitted?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It should not implement deep domain logic that belongs in a domain-owned subprocess.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 2: Split by domain or stable subprocess boundary
&lt;/h3&gt;

&lt;p&gt;Great candidates for child workflows are subprocesses that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;domain-owned (Payments, KYC, Inventory)&lt;/li&gt;
&lt;li&gt;reusable across multiple parent workflows&lt;/li&gt;
&lt;li&gt;likely to evolve independently&lt;/li&gt;
&lt;li&gt;complex enough to justify dedicated retries/error handling&lt;/li&gt;
&lt;li&gt;testable as a standalone business unit&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Principle 3: Define explicit input and output contracts
&lt;/h3&gt;

&lt;p&gt;Do not pass the entire parent state to every child.&lt;/p&gt;

&lt;p&gt;Instead, define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a minimal child input contract&lt;/li&gt;
&lt;li&gt;a stable child output contract&lt;/li&gt;
&lt;li&gt;an error/failure contract (where applicable)&lt;/li&gt;
&lt;li&gt;version metadata in the contract or state machine aliasing strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the workflow equivalent of well-designed service APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 4: Decompose to reduce blast radius, not to maximize nesting
&lt;/h3&gt;

&lt;p&gt;Nested workflows are powerful, but over-nesting can create its own complexity.&lt;/p&gt;

&lt;p&gt;I avoid decomposition that creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wrappers around trivial single-step tasks&lt;/li&gt;
&lt;li&gt;nested workflows with no clear ownership&lt;/li&gt;
&lt;li&gt;chains of parent -&amp;gt; child -&amp;gt; grandchild just for aesthetics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is better changeability and operability, not "micro-workflows everywhere."&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 5: Preserve the business narrative
&lt;/h3&gt;

&lt;p&gt;After decomposition, I still want to be able to explain the parent workflow in plain language.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Validate order -&amp;gt; Process payment -&amp;gt; Reserve inventory -&amp;gt; Create shipment -&amp;gt; Notify customer&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the parent becomes an opaque set of “InvokeChildX” states with no business story, the design needs refinement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parent-child workflow patterns
&lt;/h2&gt;

&lt;p&gt;There is no single nesting pattern that fits every case. I typically use a small set of patterns and choose deliberately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern A: Synchronous child workflow (request/response style orchestration)
&lt;/h3&gt;

&lt;p&gt;The parent waits for the child to finish and uses the output immediately.&lt;/p&gt;

&lt;p&gt;Use when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the next parent decision depends on child output&lt;/li&gt;
&lt;li&gt;the subprocess is part of the critical path&lt;/li&gt;
&lt;li&gt;you want localized retries inside the child workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payment authorization&lt;/li&gt;
&lt;li&gt;fraud decision&lt;/li&gt;
&lt;li&gt;document validation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pattern B: Asynchronous child workflow (fire and track)
&lt;/h3&gt;

&lt;p&gt;The parent starts a child workflow and continues later based on an event, callback, or polling strategy.&lt;/p&gt;

&lt;p&gt;Use when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the subprocess is long-running&lt;/li&gt;
&lt;li&gt;an external system controls timing&lt;/li&gt;
&lt;li&gt;human approval or batch windows are involved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fulfillment handoff&lt;/li&gt;
&lt;li&gt;partner settlement&lt;/li&gt;
&lt;li&gt;manual review&lt;/li&gt;
&lt;/ul&gt;
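
&lt;p&gt;A minimal sketch of the fire-and-track invocation from the parent, using the plain &lt;code&gt;startExecution&lt;/code&gt; integration (not &lt;code&gt;.sync&lt;/code&gt;) so the parent does not block on the child; the ARN and state names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "StartFulfillment": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution",
    "Parameters": {
      "StateMachineArn": "${FulfillmentWorkflowAliasArn}",
      "Input": {
        "meta.$": "$.meta",
        "request.$": "$.fulfillmentRequest"
      }
    },
    "ResultPath": "$.fulfillmentStart",
    "Next": "PersistPendingStatus"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;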

&lt;h3&gt;
  
  
  Pattern C: Parallel child workflows for independent branches
&lt;/h3&gt;

&lt;p&gt;The parent starts independent subprocesses in parallel and joins after they complete.&lt;/p&gt;

&lt;p&gt;Use when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tasks are independent and safe to run concurrently&lt;/li&gt;
&lt;li&gt;you want to reduce overall latency&lt;/li&gt;
&lt;li&gt;failures should be isolated per branch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fraud + tax calculation + personalization scoring (depending on domain semantics)&lt;/li&gt;
&lt;/ul&gt;
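
&lt;p&gt;A sketch of how the parent can wrap independent child invocations in a &lt;code&gt;Parallel&lt;/code&gt; state and join after both branches complete (ARNs and state names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "RunIndependentChecks": {
    "Type": "Parallel",
    "Branches": [
      {
        "StartAt": "InvokeFraudChild",
        "States": {
          "InvokeFraudChild": {
            "Type": "Task",
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Parameters": {
              "StateMachineArn": "${FraudWorkflowAliasArn}",
              "Input": { "meta.$": "$.meta", "request.$": "$.fraudRequest" }
            },
            "End": true
          }
        }
      },
      {
        "StartAt": "InvokeTaxChild",
        "States": {
          "InvokeTaxChild": {
            "Type": "Task",
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Parameters": {
              "StateMachineArn": "${TaxWorkflowAliasArn}",
              "Input": { "meta.$": "$.meta", "request.$": "$.taxRequest" }
            },
            "End": true
          }
        }
      }
    ],
    "ResultPath": "$.parallelResults",
    "Next": "JoinResults"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;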

&lt;h3&gt;
  
  
  Pattern D: Domain subprocess library
&lt;/h3&gt;

&lt;p&gt;Create reusable child workflows that multiple parents can call.&lt;/p&gt;

&lt;p&gt;Use when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you repeatedly implement the same orchestration chunk&lt;/li&gt;
&lt;li&gt;the subprocess is clearly owned by one team&lt;/li&gt;
&lt;li&gt;contract stability is good enough for reuse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identity verification&lt;/li&gt;
&lt;li&gt;payment capture&lt;/li&gt;
&lt;li&gt;notification fan-out preparation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Contracting inputs and outputs (the most important part)
&lt;/h2&gt;

&lt;p&gt;In my experience, decomposition succeeds or fails based on contract discipline.&lt;/p&gt;

&lt;p&gt;If I split a workflow but still pass the full parent payload into every child, I have only moved complexity around. I have not reduced coupling.&lt;/p&gt;

&lt;h3&gt;
  
  
  What a good child contract looks like
&lt;/h3&gt;

&lt;p&gt;A child workflow contract should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;minimal&lt;/strong&gt;: only fields the child needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;explicit&lt;/strong&gt;: named fields, stable structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;typed&lt;/strong&gt;: validated at boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;versionable&lt;/strong&gt;: compatible evolution plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auditable&lt;/strong&gt;: includes correlation metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I usually use an envelope like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"correlationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"corr-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"causationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exec-parent-abc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"orderId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ORD-100045"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"customerId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CUST-9001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;119.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AUD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"paymentMethodToken"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tok_123"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And I expect a child output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"correlationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"corr-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"authorized"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"authorizationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth_789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"processorReference"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"psp-456"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Contract boundaries I define explicitly
&lt;/h3&gt;

&lt;p&gt;For each child workflow, I define:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Input shape&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Success output shape&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business failure output shape&lt;/strong&gt; (if returned rather than thrown)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical failure behavior&lt;/strong&gt; (exception / failed execution)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeout expectations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotency expectations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ownership and support team&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This makes nested workflows composable, not just callable.&lt;/p&gt;
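
&lt;p&gt;One lightweight way I capture the input shape and its version is a small JSON Schema per contract version, validated at the child boundary. A sketch for the payment input contract shown earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "PaymentProcessingWorkflow input, contractVersion 1.0",
  "type": "object",
  "required": ["meta", "request"],
  "additionalProperties": false,
  "properties": {
    "meta": {
      "type": "object",
      "required": ["correlationId", "contractVersion"],
      "properties": {
        "correlationId": { "type": "string" },
        "causationId": { "type": "string" },
        "contractVersion": { "type": "string", "const": "1.0" }
      }
    },
    "request": {
      "type": "object",
      "required": ["orderId", "customerId", "amount", "currency", "paymentMethodToken"],
      "properties": {
        "orderId": { "type": "string" },
        "customerId": { "type": "string" },
        "amount": { "type": "number" },
        "currency": { "type": "string" },
        "paymentMethodToken": { "type": "string" }
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;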

&lt;h3&gt;
  
  
  Keep transformation logic close to the boundary
&lt;/h3&gt;

&lt;p&gt;If the parent needs to adapt a parent model into a child request, I do that immediately before the child call. I do not let “temporary shape conversion” leak across the rest of the workflow.&lt;/p&gt;

&lt;p&gt;Likewise, I normalize child output once after return, then continue with a clean parent-level model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Versioning workflows safely
&lt;/h2&gt;

&lt;p&gt;Workflow decomposition increases the number of deployable units. That is good for limiting blast radius, but it also means you need a safe versioning strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  My rule: version the workflow &lt;em&gt;and&lt;/em&gt; the contract
&lt;/h3&gt;

&lt;p&gt;I treat these as separate concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow version&lt;/strong&gt;: the ASL implementation/version/alias of the child state machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract version&lt;/strong&gt;: the input/output schema version the parent and child agree on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes a workflow changes without changing the contract. Sometimes a contract changes while the business purpose remains the same. I do not force those to be the same version number.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safe versioning practices I use
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1) Invoke child workflows through aliases
&lt;/h4&gt;

&lt;p&gt;The parent should usually call a &lt;strong&gt;child alias ARN&lt;/strong&gt; (for example, &lt;code&gt;:PROD&lt;/code&gt;) rather than the unqualified state machine ARN, which always resolves to the latest definition.&lt;/p&gt;

&lt;p&gt;This gives me a stable target I can move during deployment rollouts and rollbacks.&lt;/p&gt;
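
&lt;p&gt;Concretely, the task resource stays the same; only the target ARN switches to the alias form (region and account are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Parameters": {
    "StateMachineArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentProcessingWorkflow:PROD",
    "Input": {
      "meta.$": "$.paymentCall.meta",
      "request.$": "$.paymentCall.request"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;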

&lt;h4&gt;
  
  
  2) Use immutable workflow versions behind aliases
&lt;/h4&gt;

&lt;p&gt;For production workflows, I want immutable versions behind aliases so I can answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which version processed this execution?&lt;/li&gt;
&lt;li&gt;Can I roll back without redefining the workflow?&lt;/li&gt;
&lt;li&gt;Can I shift traffic gradually?&lt;/li&gt;
&lt;/ul&gt;
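
&lt;p&gt;Step Functions versions and aliases support this directly: the alias routing configuration is where gradual traffic shifting happens. A sketch of an &lt;code&gt;UpdateStateMachineAlias&lt;/code&gt; routing payload, with version ARNs as placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "stateMachineAliasArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentProcessingWorkflow:PROD",
  "routingConfiguration": [
    {
      "stateMachineVersionArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentProcessingWorkflow:10",
      "weight": 90
    },
    {
      "stateMachineVersionArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentProcessingWorkflow:11",
      "weight": 10
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;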

&lt;h4&gt;
  
  
  3) Keep contract compatibility during rollout windows
&lt;/h4&gt;

&lt;p&gt;If Parent v3 is rolling out while Child &lt;code&gt;Payments:PROD&lt;/code&gt; shifts from v10 to v11, I want a compatibility window where both versions honor the same contract or the parent chooses a matching alias (&lt;code&gt;PAYMENTS_V1&lt;/code&gt;, &lt;code&gt;PAYMENTS_V2&lt;/code&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  4) Prefer additive contract changes
&lt;/h4&gt;

&lt;p&gt;Safer changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add optional output fields&lt;/li&gt;
&lt;li&gt;add optional input fields&lt;/li&gt;
&lt;li&gt;add new reason codes without changing existing semantics (with care)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Riskier changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;renaming fields&lt;/li&gt;
&lt;li&gt;changing meaning of status codes&lt;/li&gt;
&lt;li&gt;changing failure behavior from “return business failure” to “throw”&lt;/li&gt;
&lt;li&gt;changing data types&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5) Test parent-child compatibility explicitly
&lt;/h4&gt;

&lt;p&gt;I maintain fixtures and contract tests for parent-child integration, especially around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing optional fields&lt;/li&gt;
&lt;li&gt;unexpected extra fields&lt;/li&gt;
&lt;li&gt;business failure responses&lt;/li&gt;
&lt;li&gt;timeout and retry behavior&lt;/li&gt;
&lt;/ul&gt;
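
&lt;p&gt;For example, a business-failure fixture I would keep alongside the payment contract (the failure fields are illustrative, matching the envelope shown earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "meta": { "correlationId": "corr-123", "contractVersion": "1.0" },
  "result": {
    "authorized": false,
    "failure": {
      "type": "BUSINESS",
      "reasonCode": "CARD_DECLINED",
      "retryable": false
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;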




&lt;h2&gt;
  
  
  Reference Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2hecouuo3am4miqjqqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2hecouuo3am4miqjqqa.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  End-to-end walkthrough: decomposing an Order Processing workflow
&lt;/h2&gt;

&lt;p&gt;I will use a realistic example because this is where the trade-offs become visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  The original monolithic workflow (before)
&lt;/h3&gt;

&lt;p&gt;We start with one large &lt;code&gt;OrderProcessing&lt;/code&gt; state machine that does all of this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate order&lt;/li&gt;
&lt;li&gt;fraud check&lt;/li&gt;
&lt;li&gt;authorize payment&lt;/li&gt;
&lt;li&gt;reserve inventory&lt;/li&gt;
&lt;li&gt;create shipment request&lt;/li&gt;
&lt;li&gt;send notifications&lt;/li&gt;
&lt;li&gt;persist status updates&lt;/li&gt;
&lt;li&gt;handle retries and compensation for multiple domains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works, but over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payments team changes create merge conflicts with Fulfillment changes&lt;/li&gt;
&lt;li&gt;The workflow definition is difficult to review&lt;/li&gt;
&lt;li&gt;Troubleshooting a failed shipment step requires scrolling through unrelated payment/fraud logic&lt;/li&gt;
&lt;li&gt;Reusable subprocesses (payments, notifications) are duplicated elsewhere&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The decomposed target architecture (after)
&lt;/h3&gt;

&lt;p&gt;I split the design into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parent workflow: &lt;code&gt;OrderOrchestrator&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coordinates the overall business flow&lt;/li&gt;
&lt;li&gt;invokes child workflows&lt;/li&gt;
&lt;li&gt;makes continuation/compensation decisions&lt;/li&gt;
&lt;li&gt;emits parent-level events/status transitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Child workflows&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PaymentProcessingWorkflow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;InventoryReservationWorkflow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FulfillmentSubmissionWorkflow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CustomerNotificationWorkflow&lt;/code&gt; (optional, often event-driven instead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each child workflow owns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local retries&lt;/li&gt;
&lt;li&gt;domain-specific branching&lt;/li&gt;
&lt;li&gt;domain telemetry&lt;/li&gt;
&lt;li&gt;domain-specific error normalization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why this split works
&lt;/h3&gt;

&lt;p&gt;This decomposition aligns with domain boundaries and independent change cadence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payments evolves frequently due to PSP integration and fraud strategy&lt;/li&gt;
&lt;li&gt;Inventory may change due to warehouse logic&lt;/li&gt;
&lt;li&gt;Fulfillment is often async and externally coupled&lt;/li&gt;
&lt;li&gt;Notifications are loosely coupled and may be event-driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The parent remains readable and focused on business progression.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture and flow (walkthrough narrative)
&lt;/h2&gt;

&lt;p&gt;Here is the end-to-end flow in the decomposed design.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) API receives &lt;code&gt;CreateOrder&lt;/code&gt; request
&lt;/h3&gt;

&lt;p&gt;The API layer validates basic request shape, stamps a correlation ID, and starts the parent &lt;code&gt;OrderOrchestrator&lt;/code&gt; workflow (or publishes a command that triggers it, depending on your system style).&lt;/p&gt;
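
&lt;p&gt;For illustration, the initial parent input can be as small as this; the field names line up with the JSONPath references used in the ASL later in this post (the items shape is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "meta": {
    "correlationId": "corr-123"
  },
  "order": {
    "orderId": "ORD-100045",
    "customerId": "CUST-9001",
    "totalAmount": 119.85,
    "currency": "AUD",
    "paymentMethodToken": "tok_123",
    "items": [
      { "sku": "SKU-1", "quantity": 2 }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;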

&lt;h3&gt;
  
  
  2) Parent workflow performs lightweight order validation
&lt;/h3&gt;

&lt;p&gt;The parent performs only orchestration-level checks (for example, required-field presence checks if they have not already been done), then constructs a &lt;strong&gt;contracted input&lt;/strong&gt; for the payment child workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Parent invokes &lt;code&gt;PaymentProcessingWorkflow&lt;/code&gt; as a synchronous child
&lt;/h3&gt;

&lt;p&gt;The parent waits for payment output because the next step depends on authorization success.&lt;/p&gt;

&lt;p&gt;The child workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;performs fraud/risk checks (if owned by Payments)&lt;/li&gt;
&lt;li&gt;authorizes payment with PSP&lt;/li&gt;
&lt;li&gt;normalizes provider-specific responses&lt;/li&gt;
&lt;li&gt;returns a stable result contract&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The parent receives only what it needs, not the child’s full internal state.&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Parent invokes &lt;code&gt;InventoryReservationWorkflow&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;If payment is authorized, the parent calls inventory reservation as another synchronous child and receives a normalized reservation result.&lt;/p&gt;

&lt;h3&gt;
  
  
  5) Parent branches based on combined business outcomes
&lt;/h3&gt;

&lt;p&gt;The parent now makes a high-level decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;continue to fulfillment&lt;/li&gt;
&lt;li&gt;compensate payment if inventory failed&lt;/li&gt;
&lt;li&gt;reject order&lt;/li&gt;
&lt;li&gt;send manual review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly where a parent orchestrator adds value.&lt;/p&gt;

&lt;h3&gt;
  
  
  6) Parent starts &lt;code&gt;FulfillmentSubmissionWorkflow&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This may be synchronous or asynchronous depending on downstream fulfillment systems.&lt;/p&gt;

&lt;p&gt;If asynchronous:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the parent may start the child and persist a pending status&lt;/li&gt;
&lt;li&gt;later completion may resume a follow-up workflow or emit events that drive downstream steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7) Notifications and analytics are triggered
&lt;/h3&gt;

&lt;p&gt;I often prefer event-driven notification/analytics fan-out instead of keeping them in the critical path. If they are kept as a child workflow, I keep the contract minimal and the failure policy explicit (for example, a notification failure should not fail order creation).&lt;/p&gt;
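
&lt;p&gt;If notifications stay in-line as a child workflow, this is roughly how I express that failure policy in the parent: catch everything, record the failure, and keep going (state names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "InvokeNotificationChild": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution.sync:2",
    "Parameters": {
      "StateMachineArn": "${NotificationWorkflowAliasArn}",
      "Input": { "meta.$": "$.meta", "request.$": "$.notificationRequest" }
    },
    "ResultPath": "$.notification",
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.notificationError",
        "Next": "PublishFinalOrderStatus"
      }
    ],
    "Next": "PublishFinalOrderStatus"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;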

&lt;h3&gt;
  
  
  8) Parent publishes final order status and completes
&lt;/h3&gt;

&lt;p&gt;The parent emits a domain event (for example, &lt;code&gt;OrderAccepted&lt;/code&gt;, &lt;code&gt;OrderPendingFulfillment&lt;/code&gt;, or &lt;code&gt;OrderRejected&lt;/code&gt;) and completes with a stable external result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation discussion
&lt;/h2&gt;

&lt;p&gt;Now I will show concrete examples of how I implement this pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parent workflow (ASL) using nested child workflows
&lt;/h2&gt;

&lt;p&gt;This example uses Step Functions service integration to start child workflows and wait for results. I use &lt;code&gt;startExecution.sync:2&lt;/code&gt; because it returns child output as JSON rather than a JSON-encoded string, which makes downstream data handling cleaner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Order orchestrator parent workflow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BuildPaymentRequest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"BuildPaymentRequest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"causationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"orderId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"customerId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.customerId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"amount.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.totalAmount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"currency.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.currency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"paymentMethodToken.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.paymentMethodToken"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentCall"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"InvokePaymentChild"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"InvokePaymentChild"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::states:startExecution.sync:2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StateMachineArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${PaymentWorkflowAliasArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"meta.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentCall.meta"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"request.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentCall.request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Retry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ErrorEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"StepFunctions.ExecutionLimitExceeded"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"IntervalSeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"BackoffRate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"MaxAttempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NormalizePaymentResult"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"NormalizePaymentResult"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"authorized.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentExecution.Output.result.authorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"authorizationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentExecution.Output.result.authorizationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"processorReference.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentExecution.Output.result.processorReference"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.payment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PaymentDecision"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PaymentDecision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Choice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.payment.authorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"BooleanEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BuildInventoryRequest"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RejectOrder"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"BuildInventoryRequest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"causationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"orderId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"items.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.items"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"warehousePreference.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.warehousePreference"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryCall"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"InvokeInventoryChild"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"InvokeInventoryChild"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::states:startExecution.sync:2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StateMachineArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${InventoryWorkflowAliasArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"meta.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryCall.meta"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"request.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryCall.request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"InventoryDecision"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"InventoryDecision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Choice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryExecution.Output.result.reserved"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"BooleanEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"StartFulfillmentChild"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CompensatePayment"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StartFulfillmentChild"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::states:startExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StateMachineArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${FulfillmentWorkflowAliasArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"causationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"orderId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"reservationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.inventoryExecution.Output.result.reservationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"deliveryAddress.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.deliveryAddress"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.fulfillmentStart"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CompleteAccepted"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"CompensatePayment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::states:startExecution.sync:2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StateMachineArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${PaymentCompensationWorkflowAliasArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"causationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"orderId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.order.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"authorizationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.payment.authorizationId"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$$.Execution.Id"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.paymentCompensation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RejectOrder"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RejectOrder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Succeed"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"CompleteAccepted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Succeed"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why this parent is easier to maintain
&lt;/h3&gt;

&lt;p&gt;The parent workflow now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;focuses on sequencing and business decisions&lt;/li&gt;
&lt;li&gt;calls domain-owned child workflows through aliases&lt;/li&gt;
&lt;li&gt;passes minimal, explicit contracts&lt;/li&gt;
&lt;li&gt;can evolve orchestration without rewriting domain subprocess internals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of decomposition I want.&lt;/p&gt;




&lt;h2&gt;
  
  
  Child workflow example: &lt;code&gt;PaymentProcessingWorkflow&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I keep the child focused and domain-owned. This example is simplified, but it shows the pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Payment processing child workflow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ValidateContract"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ValidateContract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Choice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"And"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.contractVersion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.request.orderId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"IsPresent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.request.amount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"IsPresent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Variable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.request.paymentMethodToken"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"IsPresent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AuthorizePayment"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ContractError"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AuthorizePayment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::lambda:invoke"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"FunctionName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${AuthorizePaymentFnArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Payload.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultSelector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"result.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.Payload"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ResultPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Retry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ErrorEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Lambda.ServiceException"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Lambda.SdkClientException"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.TaskFailed"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"IntervalSeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"BackoffRate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"MaxAttempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BuildSuccessResponse"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"BuildSuccessResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"correlationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.meta.correlationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contractVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"authorized.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.auth.result.authorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"authorizationId.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.auth.result.authorizationId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"processorReference.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.auth.result.processorReference"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ContractError"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fail"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ContractValidationError"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Cause"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Invalid child workflow input contract"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Design choice I recommend
&lt;/h3&gt;

&lt;p&gt;Notice that the child returns a &lt;strong&gt;normalized result contract&lt;/strong&gt;, not raw PSP payloads. This prevents the parent from becoming coupled to provider-specific fields and keeps domain ownership intact.&lt;/p&gt;
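
&lt;p&gt;A minimal sketch of that normalization at the edge of the child's &lt;code&gt;AuthorizePayment&lt;/code&gt; Lambda; the PSP response field names here are invented for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Invented PSP response shape, for illustration only.
interface PspAuthResponse {
  status: string; // e.g. "APPROVED" or "DECLINED"
  auth_code: string;
  psp_txn_ref: string;
}

// The normalized result the child promises to its parent (see the ASL above).
interface NormalizedAuthResult {
  authorized: boolean;
  authorizationId: string;
  processorReference: string;
}

// Translate provider-specific fields into the stable contract at the boundary,
// so PSP details never leak into the parent workflow.
export function toNormalizedResult(psp: PspAuthResponse): NormalizedAuthResult {
  return {
    authorized: psp.status === "APPROVED",
    authorizationId: psp.auth_code,
    processorReference: psp.psp_txn_ref,
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;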




&lt;h2&gt;
  
  
  TypeScript contract definitions (shared library)
&lt;/h2&gt;

&lt;p&gt;I typically create a small shared library for workflow contracts (or generate types from JSON Schema/OpenAPI where appropriate).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// packages/workflow-contracts/src/payment.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;WorkflowMeta&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;causationId&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;contractVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PaymentChildRequestV1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;WorkflowMeta&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;contractVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;paymentMethodToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PaymentChildSuccessV1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;contractVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;authorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;authorizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;processorReference&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PaymentChildBusinessFailureV1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;contractVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;authorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;reasonCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;RISK_REJECTED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INSUFFICIENT_FUNDS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PROCESSOR_DECLINED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;processorReference&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This type layer does not replace runtime validation, but it dramatically improves correctness in parent-child integration code and tests.&lt;/p&gt;
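
&lt;p&gt;Since Step Functions passes plain JSON at runtime, I pair these types with a small guard that the child's Lambda handlers call before doing any work. The sketch below complements the &lt;code&gt;ValidateContract&lt;/code&gt; state above, which only checks field presence, by also checking types; a JSON Schema validator is a drop-in replacement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// packages/workflow-contracts/src/payment.guard.ts (illustrative path)
import { PaymentChildRequestV1 } from "./payment";

// Minimal structural guard for the v1 request contract.
export function isPaymentChildRequestV1(input: unknown): input is PaymentChildRequestV1 {
  if (typeof input !== "object" || input === null) {
    return false;
  }
  const candidate = input as { meta?: any; request?: any };
  if (candidate.meta === undefined || candidate.meta.contractVersion !== "1.0") {
    return false;
  }
  if (candidate.request === undefined) {
    return false;
  }
  if (typeof candidate.request.orderId !== "string") {
    return false;
  }
  if (typeof candidate.request.amount !== "number") {
    return false;
  }
  if (typeof candidate.request.paymentMethodToken !== "string") {
    return false;
  }
  return true;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;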




&lt;h2&gt;
  
  
  CDK wiring example (parent and child aliases)
&lt;/h2&gt;

&lt;p&gt;This example shows the shape of how I wire aliases and pass alias ARNs to the parent workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;aws-cdk-lib/aws-stepfunctions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderWorkflowsStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Assume these are already defined with actual definitions&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;paymentChild&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StateMachine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PaymentWorkflow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;definitionBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DefinitionBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{"StartAt":"Done","States":{"Done":{"Type":"Succeed"}}}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inventoryChild&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StateMachine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;InventoryWorkflow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;definitionBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DefinitionBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{"StartAt":"Done","States":{"Done":{"Type":"Succeed"}}}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Publish immutable versions (illustrative)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;paymentVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnStateMachineVersion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PaymentWorkflowVersion&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;stateMachineArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;paymentChild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stateMachineArn&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnStateMachineAlias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PaymentWorkflowProdAlias&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PROD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;routingConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;stateMachineVersionArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;paymentVersion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attrStateMachineVersionArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inventoryVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnStateMachineVersion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;InventoryWorkflowVersion&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;stateMachineArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inventoryChild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stateMachineArn&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sfn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnStateMachineAlias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;InventoryWorkflowProdAlias&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PROD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;routingConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;stateMachineVersionArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inventoryVersion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attrStateMachineVersionArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Parent definition would consume these alias ARNs (via substitutions/templating)&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PaymentWorkflowAliasArn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;paymentChild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stateMachineArn&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:PROD`&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;InventoryWorkflowAliasArn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;inventoryChild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stateMachineArn&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:PROD`&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// In production, ensure the parent role has least-privilege for nested calls.&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What I pay attention to in deployment pipelines
&lt;/h3&gt;

&lt;p&gt;For child workflows, I want CI/CD to support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;running contract tests&lt;/li&gt;
&lt;li&gt;running workflow unit/integration tests&lt;/li&gt;
&lt;li&gt;publishing a new immutable version&lt;/li&gt;
&lt;li&gt;shifting the alias gradually (canary/linear where appropriate)&lt;/li&gt;
&lt;li&gt;rolling the alias back quickly if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where decomposition pays off operationally. I can deploy a Payment child workflow change without touching the Inventory child or the parent orchestrator if the contract remains stable.&lt;/p&gt;
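
&lt;p&gt;As a sketch, the publish-and-shift step can be two calls with &lt;code&gt;@aws-sdk/client-sfn&lt;/code&gt;, assuming the &lt;code&gt;PublishStateMachineVersion&lt;/code&gt; and &lt;code&gt;UpdateStateMachineAlias&lt;/code&gt; APIs; canary bake time, alarms, and rollback triggers are omitted here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import {
  SFNClient,
  PublishStateMachineVersionCommand,
  UpdateStateMachineAliasCommand,
} from "@aws-sdk/client-sfn";

const client = new SFNClient({});

// Publish the just-deployed revision as an immutable version, then point the
// alias at it. A canary would instead keep two routingConfiguration entries
// whose weights sum to 100 and shift the weights over time.
export async function promoteChildWorkflow(stateMachineArn: string, aliasArn: string) {
  const version = await client.send(
    new PublishStateMachineVersionCommand({ stateMachineArn })
  );

  await client.send(
    new UpdateStateMachineAliasCommand({
      stateMachineAliasArn: aliasArn,
      routingConfiguration: [
        { stateMachineVersionArn: version.stateMachineVersionArn, weight: 100 },
      ],
    })
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;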




&lt;h2&gt;
  
  
  IAM and permissions for nested workflows (important operational detail)
&lt;/h2&gt;

&lt;p&gt;Nested workflows are straightforward conceptually, but the IAM details matter.&lt;/p&gt;

&lt;p&gt;When the parent waits synchronously for a child (the &lt;code&gt;.sync&lt;/code&gt; integration), the parent execution role needs more than &lt;code&gt;states:StartExecution&lt;/code&gt;: it also needs &lt;code&gt;states:DescribeExecution&lt;/code&gt; and &lt;code&gt;states:StopExecution&lt;/code&gt; on the child's executions, plus &lt;code&gt;events:PutTargets&lt;/code&gt;, &lt;code&gt;events:PutRule&lt;/code&gt;, and &lt;code&gt;events:DescribeRule&lt;/code&gt; on the managed EventBridge rule Step Functions uses to track completion. I always validate the parent execution role permissions for nested patterns during deployment and in pre-prod tests, because a missing permission tends to surface as confusing delays or stuck executions rather than a clear error.&lt;/p&gt;

&lt;p&gt;I also scope permissions narrowly to the child workflows the parent is actually allowed to call. Decomposition should improve boundaries, not weaken them.&lt;/p&gt;
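
&lt;p&gt;A sketch of how I scope that in CDK, assuming the parent state machine and child constructs from the wiring example above. The EventBridge rule at the end is the managed rule the &lt;code&gt;.sync&lt;/code&gt; integration relies on to learn about child completion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import * as cdk from "aws-cdk-lib";
import * as iam from "aws-cdk-lib/aws-iam";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";

// Grant a parent state machine just enough to run one child via
// startExecution.sync, scoped to that child only.
export function grantNestedSyncCall(
  stack: cdk.Stack,
  parent: sfn.StateMachine,
  child: sfn.StateMachine
): void {
  // Start the child (unqualified and alias-qualified ARNs).
  parent.addToRolePolicy(new iam.PolicyStatement({
    actions: ["states:StartExecution"],
    resources: [child.stateMachineArn, `${child.stateMachineArn}:*`],
  }));

  // The .sync pattern also describes and can stop the child's executions.
  parent.addToRolePolicy(new iam.PolicyStatement({
    actions: ["states:DescribeExecution", "states:StopExecution"],
    resources: [
      `arn:aws:states:${stack.region}:${stack.account}:execution:${child.stateMachineName}:*`,
    ],
  }));

  // Completion of the child is tracked through a managed EventBridge rule.
  parent.addToRolePolicy(new iam.PolicyStatement({
    actions: ["events:PutTargets", "events:PutRule", "events:DescribeRule"],
    resources: [
      `arn:aws:events:${stack.region}:${stack.account}:rule/StepFunctionsGetEventsForStepFunctionsExecutionRule`,
    ],
  }));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;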




&lt;h2&gt;
  
  
  Observability after decomposition
&lt;/h2&gt;

&lt;p&gt;A common concern is that decomposition makes tracing harder because the work is spread across multiple executions.&lt;/p&gt;

&lt;p&gt;In practice, I have found the opposite to be true &lt;strong&gt;when I propagate correlation metadata correctly&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I propagate into every child
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;correlationId&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;causationId&lt;/code&gt; (usually the parent execution ID)&lt;/li&gt;
&lt;li&gt;contract version&lt;/li&gt;
&lt;li&gt;domain entity ID (for example, &lt;code&gt;orderId&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I log in each child
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;child workflow name and alias/version (where possible)&lt;/li&gt;
&lt;li&gt;start/end timestamps&lt;/li&gt;
&lt;li&gt;business outcome&lt;/li&gt;
&lt;li&gt;retry counts / terminal error classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it much easier to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which child failed?&lt;/li&gt;
&lt;li&gt;Was it a contract issue or domain issue?&lt;/li&gt;
&lt;li&gt;Which version of the child handled the request?&lt;/li&gt;
&lt;li&gt;Did rollback change the outcome?&lt;/li&gt;
&lt;/ul&gt;
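
&lt;p&gt;To make those questions answerable from logs alone, I have each child Lambda emit one structured log line per invocation. A minimal sketch; the field names are illustrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Child Lambda sketch: one structured line per invocation, so parent and child
// executions can be joined on correlationId in CloudWatch Logs Insights.
interface ChildEvent {
  meta: { correlationId: string; causationId?: string; contractVersion: string };
  request: { orderId: string };
}

export async function handler(event: ChildEvent) {
  const startedAt = Date.now();

  // ... domain work happens here ...
  const outcome = { authorized: true }; // illustrative business outcome

  console.log(JSON.stringify({
    workflow: "PaymentProcessingWorkflow", // plus alias/version when available
    correlationId: event.meta.correlationId,
    causationId: event.meta.causationId,
    contractVersion: event.meta.contractVersion,
    orderId: event.request.orderId,
    outcome,
    durationMs: Date.now() - startedAt,
  }));

  return outcome;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;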




&lt;h2&gt;
  
  
  How to split by domain and subprocess in practice
&lt;/h2&gt;

&lt;p&gt;When teams ask me “where exactly should we split?”, I usually run a quick decomposition workshop with these prompts:&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 1: Which parts change for different business reasons?
&lt;/h3&gt;

&lt;p&gt;If payment changes because of PSP behavior and inventory changes because of warehouse logic, those belong in different subprocesses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 2: Which parts require different failure semantics?
&lt;/h3&gt;

&lt;p&gt;If notification failure should not fail order acceptance, that is a strong candidate for decoupling from the parent critical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 3: Which parts are reusable?
&lt;/h3&gt;

&lt;p&gt;If onboarding, checkout, and subscription renewal all need the same payment authorization flow, that is a candidate child workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 4: Which parts have different owners/on-call teams?
&lt;/h3&gt;

&lt;p&gt;Team boundaries are not the only factor, but they matter operationally. A child workflow with clear ownership improves support and release confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 5: Which parts make the parent harder to read than the business process itself?
&lt;/h3&gt;

&lt;p&gt;That is usually the part I extract first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Migration strategy: from one monolith workflow to decomposed workflows safely
&lt;/h2&gt;

&lt;p&gt;I do not recommend a big-bang rewrite. I prefer incremental extraction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Identify one extraction candidate
&lt;/h3&gt;

&lt;p&gt;Pick a subprocess with clear boundaries (for example, Payments).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Define the contract before extracting
&lt;/h3&gt;

&lt;p&gt;Write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;child input schema/type&lt;/li&gt;
&lt;li&gt;child output schema/type&lt;/li&gt;
&lt;li&gt;failure behavior&lt;/li&gt;
&lt;li&gt;timeouts and retries&lt;/li&gt;
&lt;/ul&gt;
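
&lt;p&gt;The input and output types can live in the shared contracts library shown earlier. For failure behavior, timeouts, and retries, I like a small constant that the child's tests and the parent's Task configuration can both reference. A sketch with illustrative values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// packages/workflow-contracts/src/payment.contract.ts (illustrative)
export const PaymentChildContractV1 = {
  contractVersion: "1.0",
  // How long the parent should be willing to wait for the child.
  parentTaskTimeoutSeconds: 300,
  // Infrastructure errors the caller may retry.
  retryableErrors: [
    "Lambda.ServiceException",
    "Lambda.SdkClientException",
    "States.TaskFailed",
  ],
  // Business failures the child reports in-band, not as workflow errors.
  businessFailureReasonCodes: [
    "RISK_REJECTED",
    "INSUFFICIENT_FUNDS",
    "PROCESSOR_DECLINED",
  ],
} as const;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;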

&lt;h3&gt;
  
  
  Step 3: Extract the logic into a child workflow
&lt;/h3&gt;

&lt;p&gt;Keep behavior equivalent first. Avoid redesigning everything in the same change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Update parent to call child via alias
&lt;/h3&gt;

&lt;p&gt;Use a stable alias (for example, &lt;code&gt;PROD&lt;/code&gt;) so future child changes do not require parent definition changes.&lt;/p&gt;
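
&lt;p&gt;The alias itself is where the deployment flexibility lives: it maps a stable name to one or two immutable versions with weights, so the child team can shift traffic without touching the parent. Roughly, an alias definition looks like this (ARNs, version numbers, and weights are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "name": "PROD",
  "description": "Stable alias that parent workflows target",
  "routingConfiguration": [
    { "stateMachineVersionArn": "arn:aws:states:eu-west-1:123456789012:stateMachine:PaymentWorkflow:11", "weight": 90 },
    { "stateMachineVersionArn": "arn:aws:states:eu-west-1:123456789012:stateMachine:PaymentWorkflow:12", "weight": 10 }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;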

&lt;h3&gt;
  
  
  Step 5: Add compatibility and regression tests
&lt;/h3&gt;

&lt;p&gt;Test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;happy path&lt;/li&gt;
&lt;li&gt;business failure path&lt;/li&gt;
&lt;li&gt;timeout/retry path&lt;/li&gt;
&lt;li&gt;malformed contract path&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6: Repeat for the next extraction
&lt;/h3&gt;

&lt;p&gt;After 1-2 successful extractions, teams usually become much more comfortable with the pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  What not to do
&lt;/h2&gt;

&lt;p&gt;I have seen a few anti-patterns appear during decomposition efforts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 1: "Micro-workflow everything"
&lt;/h3&gt;

&lt;p&gt;Creating a child workflow for every tiny step adds ceremony without improving maintainability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 2: Passing the entire parent payload into every child
&lt;/h3&gt;

&lt;p&gt;This preserves hidden coupling and makes contracts meaningless.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 3: Parent depends on child internals
&lt;/h3&gt;

&lt;p&gt;If the parent reads deeply nested provider-specific details returned by a child, you have recreated coupling through outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 4: No versioning strategy
&lt;/h3&gt;

&lt;p&gt;Without aliases/versions and contract discipline, decomposition can increase operational risk instead of reducing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern 5: Decomposition without ownership
&lt;/h3&gt;

&lt;p&gt;If nobody owns a child workflow end-to-end, incidents become harder, not easier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;A Step Functions workflow becoming “too large” is not the real problem. The real problem is when &lt;strong&gt;workflow boundaries stop matching business and domain boundaries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When that happens, decomposition is not about making the diagram prettier. It is about restoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;change safety&lt;/li&gt;
&lt;li&gt;testability&lt;/li&gt;
&lt;li&gt;ownership&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;architectural clarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern I keep coming back to is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parent workflow&lt;/strong&gt; for orchestration decisions and business progression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Child workflows&lt;/strong&gt; for domain-owned subprocesses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit contracts&lt;/strong&gt; for inputs/outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioned deployments&lt;/strong&gt; via immutable versions + aliases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong observability metadata&lt;/strong&gt; across execution boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is how I keep Step Functions as an orchestration asset, rather than letting it become a serverless monolith.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Step Functions Developer Guide (nested workflows, service integrations)&lt;/li&gt;
&lt;li&gt;AWS Step Functions Developer Guide (starting workflows from a task state / &lt;code&gt;StartExecution&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;AWS Step Functions Developer Guide (versions and aliases)&lt;/li&gt;
&lt;li&gt;AWS Step Functions Developer Guide (continuous deployments with versions and aliases)&lt;/li&gt;
&lt;li&gt;AWS Step Functions Developer Guide (best practices)&lt;/li&gt;
&lt;li&gt;AWS Step Functions service quotas documentation&lt;/li&gt;
&lt;li&gt;AWS IAM documentation (least privilege for service integrations)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>KiroHub: Generate a Kiro Skill in 60 Seconds Built With Bedrock Registry and AgentCore Harness</title>
      <dc:creator>Alvaro Llamojha</dc:creator>
      <pubDate>Sun, 03 May 2026 20:41:28 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/kirohub-generate-a-kiro-skill-in-60-seconds-built-with-bedrock-registry-and-agentcore-harness-35bf</link>
      <guid>https://vibe.forem.com/aws-builders/kirohub-generate-a-kiro-skill-in-60-seconds-built-with-bedrock-registry-and-agentcore-harness-35bf</guid>
      <description>&lt;p&gt;I used two Amazon Bedrock AgentCore capabilities, Amazon Bedrock Registry for hybrid search over 10k+ Kiro resources, and AgentCore Harness for testing generated skills against a real agent, to build an AI-powered skill generator for Kiro Hub. Try it at &lt;a href="https://kirohub.dev/generate" rel="noopener noreferrer"&gt;kirohub.dev/generate&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The blank file problem
&lt;/h2&gt;

&lt;p&gt;I've been building &lt;a href="https://kirohub.dev" rel="noopener noreferrer"&gt;Kiro Hub&lt;/a&gt; for a few months now. The hub has over 10,000 community resources, including steering files, hooks, agents, and skills. You can browse, search, and install any of them with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx kirohub add &amp;lt;slug&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wanted to expand Kiro Hub, and the next logical step was creating new resources based on the existing dataset of 10k+. I decided to start with Agent Skills. That meant I needed a better, more secure way to ingest custom-made skills. But there was another problem: how do you test a skill? &lt;/p&gt;

&lt;p&gt;So I decided to adopt Bedrock Registry to evolve Kiro Hub into a proper AI context registry, with a status and steps to move a resource from draft to available. Bedrock AgentCore Harness is a solid, secure way to run agents, and it also supports Skills, which matches my requirement of testing agent skills in a sandbox. Why not connect those pieces?&lt;/p&gt;

&lt;h2&gt;
  
  
  Create meaningful Skills
&lt;/h2&gt;

&lt;p&gt;The feature lives at &lt;a href="https://kirohub.dev/generate" rel="noopener noreferrer"&gt;kirohub.dev/generate&lt;/a&gt;. You describe what you need in plain language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a skill for AWS Lambda error handling best practices&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;or:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I need a skill that helps me write Haiku poems and explains them &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7tfw20klyrxjhhrlswl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7tfw20klyrxjhhrlswl.png" alt="Haiku skill generation" width="800" height="764"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system generates a complete, structured &lt;code&gt;SKILL.md&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;It is a chat-based interface. You can refine the skill with follow-ups, test it against a real agent to see whether the instructions actually work, and publish it to the hub with one click. From prompt to published, installable skill, the normal path takes under a minute.&lt;/p&gt;

&lt;p&gt;The interesting part is not the editor or the Lambda functions. The interesting part is the combination of retrieval and testing. Registry makes the generated skill more specific. Harness makes the test more realistic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval and storage with Amazon Bedrock Registry
&lt;/h2&gt;

&lt;p&gt;The obvious approach to skill generation is simple but naive: give a model a prompt, explain the &lt;code&gt;SKILL.md&lt;/code&gt; format, and ask it to generate something.&lt;/p&gt;

&lt;p&gt;What makes a skill useful is specificity. Concrete patterns, opinionated guidance, real-world trade-offs, and a structure that an agent can follow. That kind of content already exists across the 10,000+ resources in Kiro Hub. The question was how to get the right examples in front of the model at generation time. And for this I had to evolve Kiro Hub into a proper registry. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/registry.html" rel="noopener noreferrer"&gt;AWS Agent Registry&lt;/a&gt; solves that part. Kiro Hub resources are synced to the Registry as descriptors with names, descriptions, content references, and metadata. Kiro Hub can then resolve matched records back to the full source content used as generation context.&lt;/p&gt;

&lt;p&gt;The Registry exposes a built-in MCP endpoint. The &lt;code&gt;generate-skill&lt;/code&gt; Lambda calls it server-side with JSON-RPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"jsonrpc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tools/call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_registry_records"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"searchQuery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS Lambda error handling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Registry search uses both semantic and keyword matching, so the query does not need to match exact words. A search for “Lambda error handling” can surface related resources about serverless observability, retry strategies, operational debugging, and CloudWatch logging.&lt;/p&gt;

&lt;p&gt;On the generation side, the Lambda exposes this as a &lt;code&gt;search_skills&lt;/code&gt; tool to the model. The model decides what to search for and when. For a PostgreSQL migration skill, it might search for “database migration patterns,” “PostgreSQL best practices,” and “schema versioning” separately, then synthesize the useful parts into a new skill.&lt;/p&gt;
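
&lt;p&gt;As a rough sketch, a &lt;code&gt;search_skills&lt;/code&gt; tool definition in the Bedrock Converse tool format could look like the following; the schema fields here are illustrative, not the exact ones used in the Lambda:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "toolSpec": {
    "name": "search_skills",
    "description": "Search existing Kiro Hub resources for patterns and inspiration",
    "inputSchema": {
      "json": {
        "type": "object",
        "properties": {
          "query": { "type": "string", "description": "what to search the registry for" },
          "maxResults": { "type": "integer" }
        },
        "required": ["query"]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;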

&lt;p&gt;That changes the output. Without retrieval, the model writes from general knowledge. With retrieval, it has seen how other skill authors structured similar guidance, what sections they included, what tools they referenced, and how specific they were.&lt;/p&gt;

&lt;p&gt;Personally, I find transparency very important, so the inspiration sources also show up in the UI. You can see which existing resources influenced the generated skill and click through to the originals on Kiro Hub. That is useful during refinement. If the model pulled in something that is not quite relevant, you can steer it in another direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiizwljq5ndk3rhxyw6x9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiizwljq5ndk3rhxyw6x9.png" alt="Inspiration sources from Registry" width="349" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing skills with Amazon Bedrock AgentCore Harness
&lt;/h2&gt;

&lt;p&gt;We now have a skill based on other working skills. But how do we trust that the newly generated skill works as expected? &lt;/p&gt;

&lt;p&gt;A skill is not just markdown. It is a set of instructions that an agent has to discover, load, and follow. You cannot properly evaluate that by reading the file. You need to run it in something close to the environment where it will actually be used.&lt;/p&gt;

&lt;p&gt;That is where Amazon Bedrock &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/harness.html" rel="noopener noreferrer"&gt;AgentCore Harness&lt;/a&gt; fits in. &lt;/p&gt;

&lt;p&gt;A Harness is a managed, config-based agent environment. You configure the model, system prompt, skills, tools, memory, limits, and runtime environment. Each session runs in an isolated environment, and reusing the same session ID lets you continue the conversation for follow-up tests. This allows me to test 'risky' skills without having to compromise my environments. &lt;/p&gt;

&lt;p&gt;When a user tests a generated skill, the system does three things:&lt;/p&gt;

&lt;p&gt;First, the &lt;code&gt;test-skill&lt;/code&gt; Lambda writes the generated &lt;code&gt;SKILL.md&lt;/code&gt; into the session filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/workspace/skills/test-skill/SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then it invokes the Harness with the skill path and the user’s test scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/workspace/skills/test-skill"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I need help setting up error handling for my Node.js Lambda"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, it streams the agent response back to the UI.&lt;/p&gt;

&lt;p&gt;The important detail is that this is not just “put the skill in the system prompt and call a model.” The skill is loaded from a path, discovered through its frontmatter, and activated when the scenario is relevant. That is the Agent Skill behavior I care about testing.&lt;/p&gt;

&lt;p&gt;If the frontmatter description is vague, the agent may not activate the skill. If the instructions are too broad, the response will show it. If the examples are weak, that becomes obvious quickly. &lt;/p&gt;

&lt;p&gt;This is a feature I wanted to have across Kiro Hub: being able to test whether a resource works as expected and has no side effects (like prompt injection). It is the difference between checking whether the markdown looks good and checking whether an agent can actually use it. &lt;/p&gt;

&lt;p&gt;Harness gives me session isolation, filesystem access, stateful follow-up testing, and standard skill activation. One Harness can serve many test requests safely because isolation comes from the session. If the user wants to keep probing, the same session can continue the conversation with the skill still available.&lt;/p&gt;

&lt;p&gt;That matters for the product experience. You can generate a skill, run a realistic scenario, ask a follow-up, see what breaks, then go back and refine the instructions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxft2nsqt7mpbjtnpkfon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxft2nsqt7mpbjtnpkfon.png" alt="Testing a skill with AgentCore Harness" width="800" height="706"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The full flow
&lt;/h2&gt;

&lt;p&gt;You describe what you need in the side panel chat. The model searches the Registry for relevant resources and generates a &lt;code&gt;SKILL.md&lt;/code&gt;. You refine it in chat if needed. Then you switch to the Test tab, run it against the AgentCore Harness, inspect the response, and make changes if something is unclear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqd2fwjkg1ixmu6u6hdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqd2fwjkg1ixmu6u6hdd.png" alt="Haiku Skill Resource" width="800" height="863"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you publish, the skill is written to DynamoDB and S3, then registered in AWS Agent Registry as an &lt;code&gt;AGENT_SKILLS&lt;/code&gt; descriptor. An EventBridge rule triggers auto-validation. A Lambda function scores the skill with Bedrock across documentation quality, reusability, completeness, clarity, and specificity, then approves or rejects it based on the result. &lt;/p&gt;

&lt;p&gt;Once approved, the skill is live on Kiro Hub and installable with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx kirohub add &amp;lt;slug&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;p&gt;The next piece is Agent Builders: a guided form for creating full Kiro agent configurations in &lt;code&gt;.kiro/agents/*.json&lt;/code&gt;, not just skills. The spec is written; implementation is next. After that, the plan is to generate and test steering files, hooks, and prompts following the same approach. &lt;/p&gt;

&lt;p&gt;I am also working on Stacks: curated bundles of resources, agents, skills, and steering files, installable with one command. Think starter kits for common project types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Head to &lt;a href="https://kirohub.dev/generate" rel="noopener noreferrer"&gt;kirohub.dev/generate&lt;/a&gt;, describe what you need, and see what comes out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kiro Hub: &lt;a href="https://kirohub.dev" rel="noopener noreferrer"&gt;kirohub.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Generate a Skill: &lt;a href="https://kirohub.dev/generate" rel="noopener noreferrer"&gt;kirohub.dev/generate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS Agent Registry: &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/registry.html" rel="noopener noreferrer"&gt;docs.aws.amazon.com/bedrock-agentcore, Registry&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AgentCore Harness: &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/harness.html" rel="noopener noreferrer"&gt;docs.aws.amazon.com/bedrock-agentcore, Harness&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kiro Skills docs: &lt;a href="https://kiro.dev/docs/skills/" rel="noopener noreferrer"&gt;kiro.dev/docs/skills&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kiro</category>
      <category>ai</category>
      <category>aws</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>AWS Bedrock KB with Glue data catalog</title>
      <dc:creator>Shakir</dc:creator>
      <pubDate>Sun, 03 May 2026 17:57:36 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/aws-bedrock-kb-with-glue-data-catalog-1j9g</link>
      <guid>https://vibe.forem.com/aws-builders/aws-bedrock-kb-with-glue-data-catalog-1j9g</guid>
      <description>&lt;p&gt;Hi 👋, In this post we shall explore Bedrock's structured KB with this architecture: &lt;code&gt;Upload CSVs to S3 &amp;gt; SNS Queue &amp;gt; Crawl data with Glue &amp;gt; Query with Redshift &amp;gt; Bedrock KB &amp;gt; Query with LLM&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Let's do some of this with code. Let's get started.&lt;/p&gt;

&lt;p&gt;Clone the repo and switch to the project directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone git@github.com:networkandcode/networkandcode.github.io.git
&lt;span class="nb"&gt;cd &lt;/span&gt;structured-kb-demo/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do a &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; sync.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up the &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/.env.example" rel="noopener noreferrer"&gt;environment&lt;/a&gt; variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .env
&lt;span class="nv"&gt;AWS_ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="nv"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap-south-1
&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;

&lt;span class="nv"&gt;BEDROCK_KB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKb
&lt;span class="nv"&gt;BEDROCK_KB_IAM_POLICY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbIamPolicy
&lt;span class="nv"&gt;BEDROCK_KB_IAM_ROLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbIamRole

&lt;span class="nv"&gt;GLUE_CRAWLER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-glue-crawler
&lt;span class="nv"&gt;GLUE_CRAWLER_IAM_POLICY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbGlueCrawlerIamPolicy
&lt;span class="nv"&gt;GLUE_CRAWLER_IAM_ROLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbGlueCrawlerIamRole
&lt;span class="nv"&gt;GLUE_DB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-glue-db

&lt;span class="nv"&gt;REDSHIFT_IAM_ROLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;StructKbRedshiftIamRole
&lt;span class="nv"&gt;REDSHIFT_NAMESPACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-rs-ns
&lt;span class="nv"&gt;REDSHIFT_WORKGROUP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-rs-wg

&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-bucket
&lt;span class="nv"&gt;S3_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;inventory

&lt;span class="nv"&gt;SQS_QUEUE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;struct-kb-queue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common files
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/vars.py" rel="noopener noreferrer"&gt;vars&lt;/a&gt; file loads all the env vars once. The &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/arns.py" rel="noopener noreferrer"&gt;arns&lt;/a&gt; file is used to form some of the ARNs we need. And the logger file is used to set up a common &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/logger.py" rel="noopener noreferrer"&gt;logger&lt;/a&gt; for the rest of the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bucket
&lt;/h2&gt;

&lt;p&gt;Set up an S3 &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_s3_bucket.py" rel="noopener noreferrer"&gt;bucket&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_s3_bucket.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Bucket struct-kb-s3-bucket created successfully
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Queue
&lt;/h2&gt;

&lt;p&gt;Set up an SQS &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_sqs_queue.py" rel="noopener noreferrer"&gt;queue&lt;/a&gt; with an access policy that allows the S3 bucket to send messages to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_sqs_queue.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Queue created successfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
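
&lt;p&gt;The exact policy is defined in the linked &lt;code&gt;setup_sqs_queue.py&lt;/code&gt; script; for this S3-to-SQS pattern it typically looks something like the following (the account ID below is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "s3.amazonaws.com" },
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:ap-south-1:123456789012:struct-kb-queue",
      "Condition": {
        "ArnLike": { "aws:SourceArn": "arn:aws:s3:::struct-kb-bucket" }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;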



&lt;h2&gt;
  
  
  Event notification
&lt;/h2&gt;

&lt;p&gt;Update the S3 bucket to &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_s3_event_notification.py" rel="noopener noreferrer"&gt;notify&lt;/a&gt; the SQS queue on events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_s3_event_notification.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO:logger:Successfully added event notifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
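
&lt;p&gt;Under the hood this is a bucket notification configuration pointing at the queue. The script's version may differ slightly, but the shape is roughly the following (filtering on the &lt;code&gt;inventory/&lt;/code&gt; prefix is an assumption based on the &lt;code&gt;S3_FOLDER&lt;/code&gt; env var):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "QueueConfigurations": [
    {
      "QueueArn": "arn:aws:sqs:ap-south-1:123456789012:struct-kb-queue",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": { "FilterRules": [{ "Name": "prefix", "Value": "inventory/" }] }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;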



&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;Set up a Glue &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_glue_db.py" rel="noopener noreferrer"&gt;database&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_glue_db.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Glue database created successfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Crawler
&lt;/h2&gt;

&lt;p&gt;Set up an IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_glue_crawler_iam_policy.py" rel="noopener noreferrer"&gt;policy&lt;/a&gt; that allows access to the S3 bucket and SQS queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_glue_crawler_iam_policy.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Policy created successfully!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up an IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_glue_crawler_iam_role.py" rel="noopener noreferrer"&gt;role&lt;/a&gt; that attaches the policy we just defined as well as the AWS managed Glue policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_glue_crawler_iam_role.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Created role
INFO:logger:AWS Glue Service Role policy attached.
INFO:logger:Custom Glue Crawler policy attached.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now provision a Glue &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_glue_crawler.py" rel="noopener noreferrer"&gt;crawler&lt;/a&gt; and attach the role above to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_glue_crawler.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Crawler created successfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Redshift
&lt;/h2&gt;

&lt;p&gt;We shall set up a Redshift IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_redshift_iam_role.py" rel="noopener noreferrer"&gt;role&lt;/a&gt; by attaching the AWS managed policy to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_redshift_iam_role.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Created role: StructKbRedshiftIamRole
INFO:logger:Attached AmazonRedshiftAllCommandsFullAccess to StructKbRedshiftIamRole
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Provision a namespace, attach the role above to it, and also provision a &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_redshift_workgroup.py" rel="noopener noreferrer"&gt;workgroup&lt;/a&gt; to run the namespace workloads on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_redshift_workgroup.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Namespace creation initiated.
INFO:logger:Workgroup creation initiated.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  See the data
&lt;/h2&gt;

&lt;p&gt;There are two small files with sample inventory data: &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/inventory_day_1.csv" rel="noopener noreferrer"&gt;inventory1&lt;/a&gt;, &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/inventory_day_2.csv" rel="noopener noreferrer"&gt;inventory2&lt;/a&gt;.&lt;br&gt;
Let's &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/upload_csv_to_s3.py" rel="noopener noreferrer"&gt;upload&lt;/a&gt; the first one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run upload_csv_to_s3.py inventory_day_1.csv 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Upload Successful: inventory/inventory_day_1.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/run_glue_crawler.py" rel="noopener noreferrer"&gt;Run&lt;/a&gt; the crawler so that it fetches data from S3 and adds a table to the Glue database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run run_glue_crawler.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Crawler started.
INFO:logger:Crawler is still running...
INFO:logger:Crawler is still running...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler finished. Final State: READY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We did a lot with the CLI; let's do some verification from the GUI, on the web console. We can see the table in the Glue DB in the hierarchy &lt;code&gt;AWS Glue &amp;gt; Data Catalog &amp;gt; Tables&lt;/code&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnahmbdtn7si7ub8qbug1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnahmbdtn7si7ub8qbug1.png" alt="Table on glue db" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, go to &lt;code&gt;Amazon Redshift &amp;gt; Serverless &amp;gt; Query editor v2&lt;/code&gt;. Click on the workgroup, and use the default settings to connect. Run this query in the editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"awsdatacatalog"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"struct-kb-glue-db"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"inventory"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case the table name is inventory, which is the same as the S3 folder name. I got results like the ones below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9d9514msscvnwdt2mod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9d9514msscvnwdt2mod.png" alt="Redshift query result for 1 day" width="800" height="336"&gt;&lt;/a&gt;&lt;br&gt;
Note that there are 10 records.&lt;/p&gt;
&lt;h2&gt;
  
  
  Incremental data
&lt;/h2&gt;

&lt;p&gt;Now, let's add another csv file for &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/inventory_day_2.csv" rel="noopener noreferrer"&gt;day 2&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run upload_csv_to_s3.py inventory_day_2.csv 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQS queue should show there is one message available.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n0yebymrq0ehmxb6wr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n0yebymrq0ehmxb6wr2.png" alt="SQS queue status before crawler run" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can run the crawler to fetch the change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run run_glue_crawler.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQS messages available should become 0.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrz227ew4asna6tys4cd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrz227ew4asna6tys4cd.png" alt="SQS status after crawler run" width="800" height="336"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The same query in Redshift should now give 20 records.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1afel59aym0y0hqjg25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1afel59aym0y0hqjg25.png" alt="Redshift query result for 2 days" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Bedrock KB
&lt;/h2&gt;

&lt;p&gt;We got results in the Redshift editor through SQL. Now we can try to retrieve results via the Bedrock KB using natural language.&lt;/p&gt;

&lt;p&gt;Set up an IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_bedrock_kb_iam_policy.py" rel="noopener noreferrer"&gt;policy&lt;/a&gt; for the Bedrock KB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_bedrock_kb_iam_policy.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up an IAM &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_bedrock_kb_iam_role.py" rel="noopener noreferrer"&gt;role&lt;/a&gt; and attach this policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_bedrock_kb_iam_role.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:logger:Created role: StructKbBedrockKbIamRole
INFO:logger:Attached IAM policy to BedrockKB IAM role.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create and sync the &lt;a href="https://github.com/networkandcode/networkandcode.github.io/blob/main/structured-kb-demo/setup_bedrock_kb.py" rel="noopener noreferrer"&gt;knowledge base&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run setup_bedrock_kb.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can go to &lt;code&gt;Amazon Bedrock &amp;gt; Knowledge Bases&lt;/code&gt; on the web console, click on the knowledge base that was created, and test it. I've used the following settings with a test prompt.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55e1y3zwcxjl84gpnych.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55e1y3zwcxjl84gpnych.png" alt="Test knowledge base" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;
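
&lt;p&gt;If you want to ask the same question programmatically rather than in the console, the &lt;code&gt;RetrieveAndGenerate&lt;/code&gt; API takes a request along these lines; the knowledge base ID and model ARN below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "input": { "text": "How many inventory records do we have in total?" },
  "retrieveAndGenerateConfiguration": {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
      "knowledgeBaseId": "KBID1234",
      "modelArn": "arn:aws:bedrock:ap-south-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;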

&lt;p&gt;Alright, that's it for this post. It was a somewhat heavy exercise overall, but I think it pays off far more with large datasets than with the simple sample data we used here. So far we have only tested with the test prompt option in the Bedrock KB; we could expand this logic and use the KB with agents built using frameworks like Strands or LangGraph. Thank you for reading!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sql</category>
      <category>claude</category>
      <category>ai</category>
    </item>
    <item>
      <title>It's All About That Memory - Using Long and Short Term Memory with Agents</title>
      <dc:creator>Darryl Ruggles</dc:creator>
      <pubDate>Sun, 03 May 2026 17:57:19 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/its-all-about-that-memory-using-long-and-short-term-memory-with-agents-2m21</link>
      <guid>https://vibe.forem.com/aws-builders/its-all-about-that-memory-using-long-and-short-term-memory-with-agents-2m21</guid>
      <description>&lt;p&gt;&lt;em&gt;Building a multi-session detective game with AgentCore Memory's 4 built-in strategies&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every AI memory demo starts the same way. "Hi, my name is Bob." Close the session, open a new one. "What's my name?" "Your name is Bob!" Confetti. Blog post done.&lt;/p&gt;

&lt;p&gt;That's not interesting.&lt;/p&gt;

&lt;p&gt;What if memory wasn't a feature bolted onto an agent - what if it was the entire product? What if the agent couldn't function without it? I wanted to build something where forgetting wasn't a minor inconvenience but a catastrophic failure. A detective who forgets the alibi they just disproved. A narrator who can't recall which suspects have been interviewed. A case file that resets to blank every time you close your browser.&lt;/p&gt;

&lt;p&gt;That's the project: a noir detective mystery game called "The Blackwell Murder," built on Amazon Bedrock AgentCore, where all 4 long-term memory strategies plus short-term memory work together to make the investigation feel continuous across sessions. The detective arrives at a crime scene, interviews suspects, examines evidence, and builds a case - and when they come back the next day, the narrator picks up exactly where they left off.&lt;/p&gt;

&lt;p&gt;The source code is on GitHub: &lt;a href="https://github.com/RDarrylR/agentcore-memory-murder-mystery" rel="noopener noreferrer"&gt;agentcore-memory-murder-mystery&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbkwkmqqv566s2qo2qfr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbkwkmqqv566s2qo2qfr.png" alt="The Blackwell Murder - Architecture" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key mental model: STM is working memory within a conversation, LTM is the case file that persists between sessions. The agent needs both.&lt;/p&gt;

&lt;p&gt;The architecture is deliberately simple. It uses a local FastAPI proxy that sits between the React frontend and AgentCore. The example doesn't include CloudFront, Lambda, or API Gateway. The point of this project is memory rather than AWS networking. If you have AWS credentials and Terraform installed, you can clone the repo and be playing in 15 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This demo uses PUBLIC network mode with no authentication on the proxy for simplicity. Production deployments should use VPC mode with private subnets, authentication on the proxy layer, and VPC endpoints for Bedrock and AgentCore services.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a local proxy?&lt;/strong&gt; The browser can't call AgentCore directly - it requires IAM SigV4 signing. The FastAPI server handles that, plus it gives us a clean place to filter out model artifacts like &lt;code&gt;\&lt;/code&gt; tags before they reach the UI. In production, this proxy would need authentication (Cognito, API keys, or similar) - the demo version accepts any request from localhost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not the &lt;code&gt;agentcore invoke&lt;/code&gt; CLI?&lt;/strong&gt; The Python SDK (&lt;code&gt;bedrock-agentcore&lt;/code&gt;) supports streaming and integrates cleanly with FastAPI's &lt;code&gt;StreamingResponse&lt;/code&gt;. No subprocess overhead, no output parsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Memory - The 4 Strategies
&lt;/h2&gt;

&lt;p&gt;AgentCore Memory has two layers: short-term memory (STM) that handles turn-by-turn conversation within a session, and long-term memory (LTM) with four built-in strategies that extract, organize, and recall information across sessions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqrlcbexjzyykewt96wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqrlcbexjzyykewt96wk.png" alt="AgentCore Memory Strategies" width="800" height="1106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this interesting for a detective game is that every strategy maps naturally to how real investigators work. Detectives maintain fact files, write case summaries, track interrogation patterns, and adapt their approach based on what is working. The four LTM strategies do exactly this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Short-Term Memory (STM)
&lt;/h3&gt;

&lt;p&gt;STM captures the raw conversation - detective actions, narrator descriptions, tool calls and results - within a single session. The agent reads back the last few turns automatically so it knows what just happened.&lt;/p&gt;

&lt;p&gt;When the detective says "examine the broken window" and then follows up with "check for fingerprints on the frame," STM is why the agent knows which window you're talking about without you having to repeat the context. STM events in this project expire after 30 days (configurable from 7 to 365 days via the &lt;code&gt;event_expiry_duration&lt;/code&gt; parameter).&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Strategy - "CaseFiles"
&lt;/h3&gt;

&lt;p&gt;Extracts and indexes factual information from conversations for retrieval by meaning, not keywords.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: &lt;code&gt;/cases/{actorId}/facts/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the detective's fact file. Every time the agent learns something concrete - a suspect's alibi, a piece of evidence, a relationship between characters - the semantic strategy extracts it and stores it as a retrievable fact.&lt;/p&gt;

&lt;p&gt;When the detective returns 3 sessions later and asks "what do we know about Helena's alibi?", the agent retrieves everything related to Helena: she claims she was at the Grand Hotel bar until midnight, the bartender says she left at 11:30 PM, there's a 17-minute gap, and the hotel security cameras had a convenient "glitch" during that window. No contradictions slip through. No established facts get lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary Strategy - "CaseNotes"
&lt;/h3&gt;

&lt;p&gt;Creates condensed summaries of each session - the detective's case notes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: &lt;code&gt;/cases/{actorId}/{sessionId}/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;At the end of each session, the summary strategy distills the conversation into a concise case update: what evidence was discovered, which suspects were interviewed, what leads are open, and where the investigation stands.&lt;/p&gt;

&lt;p&gt;When the player starts a new session, the agent retrieves the last summary and opens with a case file briefing: "Case #1247 - The Blackwell Murder. Day 3. Last session you discovered the staged break-in and the 17-minute gap in Helena's alibi. Two leads remain open..." This is how real detectives work. They write case notes so they can pick up where they left off.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Preferences Strategy - "DetectiveStyle"
&lt;/h3&gt;

&lt;p&gt;Automatically identifies and tracks the player's investigation approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: &lt;code&gt;/detectives/{actorId}/preferences/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This strategy watches how the player investigates and adapts the experience. If the player consistently chooses indirect questioning over confrontation, the narrator starts offering more subtle conversation options. If they prefer forensic evidence over witness interviews, crime scenes get richer physical detail.&lt;/p&gt;

&lt;p&gt;It picks up on investigation style (methodical vs. intuitive), interrogation preference (confrontational, sympathetic, indirect), detail level (forensic deep-dives vs. big-picture summaries), and pacing preference (slow reveals vs. rapid progress).&lt;/p&gt;

&lt;p&gt;The preference strategy is subtle. You don't notice it working until the third or fourth session, when the narrator's suggestions start feeling tailored to exactly how you like to play.&lt;/p&gt;

&lt;h3&gt;
  
  
  Episodic Strategy - "Interrogations"
&lt;/h3&gt;

&lt;p&gt;Captures key interactions as structured episodes, then generates cross-session reflections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: &lt;code&gt;/episodes/{actorId}/{sessionId}/&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Reflection namespace&lt;/strong&gt;: &lt;code&gt;/episodes/{actorId}/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the most compelling strategy, and the one that makes the detective game feel genuinely intelligent. Episodic memory doesn't just store what happened - it reflects on patterns across interactions.&lt;/p&gt;

&lt;p&gt;An episode captures structured fields - the AWS docs define these as situation, intent, assessment, justification, and episode-level reflection. In practice, the output for this project looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Situation&lt;/strong&gt;: Interrogation of Helena Voss regarding her whereabouts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent&lt;/strong&gt;: Catch Helena in a lie about the hotel bar timeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assessment&lt;/strong&gt;: Presented the bartender's statement showing she left at 11:30, not midnight&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Justification&lt;/strong&gt;: Helena became defensive, changed story to "went for a walk," refused further questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reflections synthesize across episodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"This detective excels at catching timeline inconsistencies - present evidence contradictions early"&lt;/li&gt;
&lt;li&gt;"Direct confrontation causes suspects to shut down - this player gets better results with patience"&lt;/li&gt;
&lt;li&gt;"Helena's changing story pattern matches classic alibi fabrication - flag for cross-reference"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical result is detective intuition. The narrator can say things like "You have noticed Helena's story shifts every time you press on the timeline. Your instinct says the 17 minutes matter." The player didn't ask for that observation - the episodic reflection surfaced it automatically.&lt;/p&gt;
&lt;h3&gt;
  
  
  How the Strategies Map to the Investigation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Detective Equivalent&lt;/th&gt;
&lt;th&gt;What Gets Stored&lt;/th&gt;
&lt;th&gt;When It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STM&lt;/td&gt;
&lt;td&gt;Working memory&lt;/td&gt;
&lt;td&gt;Current conversation&lt;/td&gt;
&lt;td&gt;Within a session - "which window?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;Fact file&lt;/td&gt;
&lt;td&gt;Suspects, alibis, evidence, relationships&lt;/td&gt;
&lt;td&gt;Re-interviewing a suspect 3 sessions later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summary&lt;/td&gt;
&lt;td&gt;Case notes&lt;/td&gt;
&lt;td&gt;Per-session investigation summary&lt;/td&gt;
&lt;td&gt;Opening a new session - "where were we?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User Preference&lt;/td&gt;
&lt;td&gt;Detective instinct&lt;/td&gt;
&lt;td&gt;Play style, interrogation approach&lt;/td&gt;
&lt;td&gt;Narrator adapts tone and suggestions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Episodic&lt;/td&gt;
&lt;td&gt;Interrogation log + intuition&lt;/td&gt;
&lt;td&gt;Key interactions + cross-session reflections&lt;/td&gt;
&lt;td&gt;"Helena's story keeps changing..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  A Note on Namespace Design
&lt;/h3&gt;

&lt;p&gt;The AWS docs recommend a default namespace pattern like &lt;code&gt;/strategy/{memoryStrategyId}/actor/{actorId}/session/{sessionId}/&lt;/code&gt;. This project uses custom descriptive namespaces instead - &lt;code&gt;/cases/{actorId}/facts/&lt;/code&gt;, &lt;code&gt;/detectives/{actorId}/preferences/&lt;/code&gt;, etc. - because they map directly to how a detective organizes information. When you're debugging and see a record in &lt;code&gt;/cases/sloane/facts/&lt;/code&gt;, you immediately know what it is.&lt;/p&gt;

&lt;p&gt;The tradeoff is that without &lt;code&gt;{memoryStrategyId}&lt;/code&gt; in the path, multiple strategies could theoretically write to overlapping namespaces if you configure them carelessly. In practice, each strategy in this project has a distinct namespace root (&lt;code&gt;/cases/&lt;/code&gt;, &lt;code&gt;/detectives/&lt;/code&gt;, &lt;code&gt;/episodes/&lt;/code&gt;), so there's no overlap. If you're building a system with many strategies, the AWS-recommended pattern with strategy IDs in the path is safer.&lt;/p&gt;
&lt;h3&gt;
  
  
  Where Does the Extraction Logic Live?
&lt;/h3&gt;

&lt;p&gt;This is the thing that took me the longest to internalize: you don't write extraction logic. There's no code in this project that says "pull out facts for semantic memory" or "summarize this session." The platform does all of it.&lt;/p&gt;

&lt;p&gt;When you define a strategy, you provide a type, a name, a description, and namespaces. That's it. The extraction pipeline reads your STM events - the raw conversation messages - and applies each strategy's built-in logic to decide what to extract. You never see the extraction prompt. For the built-in strategies used in this project, customization is limited to the strategy description field. AWS also offers built-in overrides (custom prompts, custom model selection) and self-managed strategies (full pipeline control) for deeper customization - see the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/memory-strategies.html" rel="noopener noreferrer"&gt;AgentCore Memory documentation&lt;/a&gt; for details.&lt;/p&gt;
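
&lt;p&gt;As a sketch of how little you actually declare, the strategy definitions for this project boil down to entries like the following. The names, descriptions, and namespaces are the ones described above; the exact top-level field names belong to the AgentCore control plane API and may differ from this shorthand:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "memoryStrategies": [
    {
      "semanticMemoryStrategy": {
        "name": "CaseFiles",
        "description": "Extracts and indexes case facts for semantic retrieval",
        "namespaces": ["/cases/{actorId}/facts/"]
      }
    },
    {
      "userPreferenceMemoryStrategy": {
        "name": "DetectiveStyle",
        "description": "Tracks detective communication style and investigation preferences",
        "namespaces": ["/detectives/{actorId}/preferences/"]
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;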

&lt;p&gt;For built-in strategies, your actual levers for influencing LTM quality are indirect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Strategy descriptions&lt;/strong&gt; - the only direct hint you give the extraction model. "Extracts and indexes case facts for semantic retrieval" tells it to focus on facts. "Tracks detective communication style and investigation preferences" tells it to watch for behavioral patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your system prompt&lt;/strong&gt; - shapes how the agent talks, which shapes what the extraction pipeline has to work with. A system prompt that produces atmospheric noir prose gives the summarization strategy rich material. A prompt that produces terse responses gives it less.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your tools&lt;/strong&gt; - return structured data that becomes part of the conversation. When &lt;code&gt;examine_evidence&lt;/code&gt; returns forensic details about tool marks on a window frame, that structured output gives the semantic strategy concrete facts to extract. When &lt;code&gt;interrogate_witness&lt;/code&gt; returns a suspect's shifting alibi, the episodic strategy captures it as a meaningful interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The conversation itself&lt;/strong&gt; - longer, richer conversations produce more extraction material. A single-turn "look at the window" produces less than a multi-turn investigation where the detective examines evidence, cross-references alibis, and confronts a suspect with contradictions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The practical implication is that designing good prompts and tools is indirectly designing your memory. I didn't set out to optimize for LTM quality, but the debug watch tool showed me that conversations where the detective digs deeper - following up on inconsistencies, asking witnesses about specific details, comparing evidence across locations - produce significantly richer LTM records than surface-level interactions. The extraction pipeline rewards conversational depth.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building the Agent
&lt;/h2&gt;

&lt;p&gt;The agent runs on AgentCore via the Strands SDK. Three things matter: the system prompt, the tools, and the memory integration.&lt;/p&gt;
&lt;h3&gt;
  
  
  System Prompt - Noir Narrator Persona
&lt;/h3&gt;

&lt;p&gt;The agent isn't the detective. It's the narrator - the voice in the dark that describes what the detective sees, hears, and feels. The system prompt establishes this firmly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are the narrator for "The Blackwell Murder," a noir detective mystery.
You speak in the style of classic noir fiction - rain-slicked streets, long
shadows, moral ambiguity, and the kind of truth that cuts deeper than any blade.
You are not the detective. You are the voice in the dark that describes what
the detective sees, hears, and feels.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt also includes the full case briefing (locations, suspects, the solution), narrator rules, and memory integration instructions. The two rules that matter most:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never break character.&lt;/strong&gt; The model must never mention tools, functions, errors, or its own reasoning. If a tool fails, the narrator says "the trail goes cold" - not "there was an error in the category specified."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory integration on session start.&lt;/strong&gt; On the first message in a new session with no prior history, set the scene. On returning sessions where memory context is available, open with a case file briefing.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Custom Tools
&lt;/h3&gt;

&lt;p&gt;Four tools drive the investigation. Each is a &lt;code&gt;@tool&lt;/code&gt;-decorated function that returns narrative text and silently tracks state in a case file:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;examine_evidence(item, method)&lt;/code&gt;&lt;/strong&gt; - Three examination methods (visual, forensic, compare) reveal different details about the same evidence. The broken window looks suspicious on visual inspection, reveals tool marks under forensic analysis, and confirms the staged break-in on comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;interrogate_witness(witness, approach, topic)&lt;/code&gt;&lt;/strong&gt; - Four interview approaches (neutral, sympathetic, confrontational, indirect) produce different responses from the same witness. Confrontation shuts Marcus down. Sympathy gets Clara to reveal the shadow she saw. Indirect questioning catches Marcus mentioning the service passage he claims was sealed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;search_location(location, area)&lt;/code&gt;&lt;/strong&gt; - Five locations with multiple searchable areas. The study alone has the desk, window, bookcase, safe, and floor - each hiding different clues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;check_case_file(query, category)&lt;/code&gt;&lt;/strong&gt; - The detective's notebook. Reviews all discovered evidence, suspect information, alibis, timeline events, and connections between suspects. Supports free-text search across all categories.&lt;/p&gt;

&lt;p&gt;Every tool call that discovers something new pushes a notification to the frontend, which updates the Case Board and Persons of Interest panels in real time.&lt;/p&gt;
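&lt;p&gt;For a sense of the shape, here's a minimal sketch of what one of these tools might look like with the Strands &lt;code&gt;@tool&lt;/code&gt; decorator. The findings lookup and case-file bookkeeping are illustrative rather than the project's actual implementation, and the import path assumes the SDK's &lt;code&gt;from strands import tool&lt;/code&gt; convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from strands import tool

# Illustrative in-memory case file - the real project tracks much richer state.
CASE_FILE = {"evidence": []}

@tool
def examine_evidence(item: str, method: str = "visual"):
    """Examine a piece of evidence using a visual, forensic, or compare method."""
    # Hypothetical findings lookup - the real tool has per-item, per-method detail.
    findings = {
        ("broken window", "forensic"): "The frame has been wiped clean. Tool marks run from the inside.",
    }
    result = findings.get(
        (item.lower(), method),
        f"Nothing new about the {item} under {method} inspection.",
    )

    # Silently track the discovery; a notification to the frontend would be pushed here.
    CASE_FILE["evidence"].append({"item": item, "method": method, "finding": result})
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;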

&lt;h3&gt;
  
  
  Memory Integration with Strands
&lt;/h3&gt;

&lt;p&gt;The Strands SDK's &lt;code&gt;AgentCoreMemorySessionManager&lt;/code&gt; handles the memory lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentCoreMemoryConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;memory_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MEMORY_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;detective_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieval_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/cases/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detective_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/facts/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RetrievalConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/detectives/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detective_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/preferences/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RetrievalConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/episodes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detective_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RetrievalConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AgentCoreMemorySessionManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;REGION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;examine_evidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interrogate_witness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_case_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_location&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;retrieval_config&lt;/code&gt; tells the session manager which LTM namespaces to query when loading context for a new request. Without it, the agent only gets STM conversation history - it wouldn't recall facts, preferences, or episode patterns from prior sessions.&lt;/p&gt;

&lt;p&gt;The session manager does two things: on entry, it loads relevant memories (STM conversation history, LTM strategy results) into the agent's context. On exit, it persists the current conversation as new memory events. The &lt;code&gt;actor_id&lt;/code&gt; is the detective's name, which namespaces all memory operations so multiple detectives could theoretically investigate the same case without cross-contamination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Configuration
&lt;/h3&gt;

&lt;p&gt;Nova Pro is the default because it offers a good balance of narrative quality and cost for iterative development. The model is switchable at deploy time via the &lt;code&gt;ACTIVE_LLM&lt;/code&gt; environment variable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Model ID&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nova Pro&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-pro-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default - good balance of quality and cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nova 2 Lite&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-2-lite-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1M context, optional extended thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nova Lite&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-lite-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fastest, lowest cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.anthropic.claude-sonnet-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Best narrative quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.anthropic.claude-haiku-4-5-20251001-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast and affordable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference in narrative quality between Nova Pro and Claude Sonnet is noticeable. Claude produces more atmospheric prose and stays in character more consistently. Nova Pro occasionally breaks the fourth wall by mentioning tool names or its own reasoning process - something I had to filter out in the proxy server. For a polished demo, Claude Sonnet is the better choice. For development and iteration, Nova Pro keeps costs low. A typical 15-20 minute play session (10-15 turns, 4 tool calls per session) costs roughly $0.02-0.05 in model inference alone with Nova Pro. Claude Sonnet runs about 5-10x that. Memory operations and KMS add negligible cost on top.&lt;/p&gt;
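&lt;p&gt;The switch itself is small. A minimal sketch of how &lt;code&gt;ACTIVE_LLM&lt;/code&gt; might map onto a Bedrock model, assuming the Strands &lt;code&gt;BedrockModel&lt;/code&gt; wrapper - the mapping keys are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

from strands.models import BedrockModel

# Illustrative mapping from ACTIVE_LLM values to the Bedrock model IDs in the table above.
MODEL_IDS = {
    "nova-pro": "us.amazon.nova-pro-v1:0",
    "nova-2-lite": "us.amazon.nova-2-lite-v1:0",
    "nova-lite": "us.amazon.nova-lite-v1:0",
    "claude-sonnet": "us.anthropic.claude-sonnet-4-6",
    "claude-haiku": "us.anthropic.claude-haiku-4-5-20251001-v1:0",
}

active = os.environ.get("ACTIVE_LLM", "nova-pro")
model = BedrockModel(model_id=MODEL_IDS[active])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;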

&lt;h2&gt;
  
  
  Infrastructure as Code
&lt;/h2&gt;

&lt;p&gt;All durable infrastructure is managed by Terraform using the AWS provider (~&amp;gt; 6.35). The agent itself is deployed via the &lt;code&gt;agentcore&lt;/code&gt; CLI, which handles the zip packaging and runtime provisioning. This is a clean separation: Terraform manages what persists (Memory, IAM, KMS, S3), the CLI manages what deploys (agent code, runtime configuration).&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory and KMS
&lt;/h3&gt;

&lt;p&gt;AgentCore Memory requires a KMS key for encryption. The memory resource itself is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_bedrockagentcore_memory"&lt;/span&gt; &lt;span class="s2"&gt;"detective"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.memory_name}_${var.name_suffix}"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Persistent memory for the AI detective agent"&lt;/span&gt;
  &lt;span class="nx"&gt;event_expiry_duration&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event_expiry_duration_days&lt;/span&gt;
  &lt;span class="nx"&gt;encryption_key_arn&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_kms_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;memory_execution_role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;memory_execution_role_arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the variable name &lt;code&gt;event_expiry_duration_days&lt;/code&gt; - the Terraform attribute is &lt;code&gt;event_expiry_duration&lt;/code&gt; (which takes a value in days), and the variable adds the &lt;code&gt;_days&lt;/code&gt; suffix for clarity so readers don't have to guess the unit.&lt;/p&gt;

&lt;p&gt;The KMS key policy grants three principals access: the root account for administration (full &lt;code&gt;kms:*&lt;/code&gt;), the AgentCore service for memory encryption operations (&lt;code&gt;kms:Encrypt&lt;/code&gt;, &lt;code&gt;kms:Decrypt&lt;/code&gt;, &lt;code&gt;kms:GenerateDataKey&lt;/code&gt;, &lt;code&gt;kms:DescribeKey&lt;/code&gt;), and the memory execution role for runtime access (same encryption actions). All policies use &lt;code&gt;aws_iam_policy_document&lt;/code&gt; data sources - never inline JSON strings. This gives you compile-time validation and readable diffs. Note: &lt;code&gt;resources = ["*"]&lt;/code&gt; in a KMS key policy means "this key" - it's not a wildcard across all keys.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;random_id&lt;/code&gt; suffix is appended to all AWS resources (S3 buckets, KMS aliases, memory names) to ensure global uniqueness. The suffix is generated once and shared across all modules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Strategies via Terraform, One via CLI
&lt;/h3&gt;

&lt;p&gt;Here's the real-world gotcha. The &lt;code&gt;aws_bedrockagentcore_memory_strategy&lt;/code&gt; resource supports three of the four strategy types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_bedrockagentcore_memory_strategy"&lt;/span&gt; &lt;span class="s2"&gt;"case_files"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CaseFiles"&lt;/span&gt;
  &lt;span class="nx"&gt;memory_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_bedrockagentcore_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SEMANTIC"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Extracts and indexes case facts for semantic retrieval"&lt;/span&gt;
  &lt;span class="nx"&gt;namespaces&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/cases/{actorId}/facts/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_bedrockagentcore_memory_strategy"&lt;/span&gt; &lt;span class="s2"&gt;"case_notes"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CaseNotes"&lt;/span&gt;
  &lt;span class="nx"&gt;memory_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_bedrockagentcore_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SUMMARIZATION"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Summarizes investigation sessions into concise case notes"&lt;/span&gt;
  &lt;span class="nx"&gt;namespaces&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/cases/{actorId}/{sessionId}/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_bedrockagentcore_memory_strategy"&lt;/span&gt; &lt;span class="s2"&gt;"detective_style"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DetectiveStyle"&lt;/span&gt;
  &lt;span class="nx"&gt;memory_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_bedrockagentcore_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;detective&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"USER_PREFERENCE"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Tracks detective communication style and investigation preferences"&lt;/span&gt;
  &lt;span class="nx"&gt;namespaces&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/detectives/{actorId}/preferences/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The EPISODIC type isn't yet supported in the Terraform provider as of March 2026. This is tracked in &lt;a href="https://github.com/hashicorp/terraform-provider-aws/issues/45599" rel="noopener noreferrer"&gt;terraform-provider-aws #45599&lt;/a&gt;. The workaround is a &lt;code&gt;make&lt;/code&gt; target that calls the AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws bedrock-agentcore-control update-memory &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory-id&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;MEMORY_ID&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory-strategies&lt;/span&gt; &lt;span class="s1"&gt;'{
    "addMemoryStrategies": [{
      "episodicMemoryStrategy": {
        "name": "Interrogations",
        "description": "Key interrogation episodes with cross-case reflections",
        "namespaces": ["/episodes/{actorId}/{sessionId}/"],
        "reflectionConfiguration": {
          "namespaces": ["/episodes/{actorId}/"]
        }
      }
    }]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to note about the episodic strategy. First, it requires a &lt;code&gt;reflectionConfiguration&lt;/code&gt; with its own namespace - this is where cross-session reflections are stored. Second, the reflection namespace must sit at or above the episode namespace in the hierarchy - reflections are stored less deeply nested than the episodes they summarize. In practice, this means the reflection namespace must be a prefix of the episode namespace (e.g., &lt;code&gt;/episodes/{actorId}/&lt;/code&gt; works as a reflection namespace for episodes stored in &lt;code&gt;/episodes/{actorId}/{sessionId}/&lt;/code&gt;). Get this wrong and the API returns a validation error that doesn't clearly explain the constraint.&lt;/p&gt;

&lt;p&gt;Third, because the episodic strategy lives outside Terraform, &lt;code&gt;terraform destroy&lt;/code&gt; won't clean it up. If you destroy and recreate the infrastructure, you'll get a naming collision or an orphaned strategy. The project includes a corresponding &lt;code&gt;make remove-episodic-strategy&lt;/code&gt; target for teardown. On the Terraform side, the memory resource's attributes don't reflect CLI-managed strategy state, so &lt;code&gt;terraform plan&lt;/code&gt; won't show unexpected diffs after you add the episodic strategy via the CLI - no &lt;code&gt;ignore_changes&lt;/code&gt; block is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Note on IAM Permissions
&lt;/h3&gt;

&lt;p&gt;When you deploy an agent with the &lt;code&gt;agentcore&lt;/code&gt; CLI, it auto-creates an IAM role (&lt;code&gt;AmazonBedrockAgentCoreSDKRuntime-*&lt;/code&gt;) with a baseline policy. This policy covers what the agent needs to run - model invocation, memory read/write, and the basics. The agent works fine out of the box.&lt;/p&gt;

&lt;p&gt;You will need extra IAM permissions, however, if you build debug tools that call the boto3 memory APIs directly - like the watch script in this project. Those tools run under your own IAM identity, not the agent's runtime role, and need explicit permissions for &lt;code&gt;ListMemoryRecords&lt;/code&gt;, &lt;code&gt;RetrieveMemoryRecords&lt;/code&gt;, &lt;code&gt;ListEvents&lt;/code&gt;, and KMS decrypt on the memory encryption key. In production, create a separate narrowly-scoped IAM role for debug tools rather than granting these permissions to developer identities. Budget 15 minutes to set this up if you plan to inspect memory outside the agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on the auto-created runtime role.&lt;/strong&gt; The &lt;code&gt;agentcore&lt;/code&gt; CLI generates a role with broad permissions - for example, &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; with &lt;code&gt;Resource: *&lt;/code&gt; rather than scoped to specific model ARNs. This is fine for a demo, but for production deployments, create a custom IAM role with explicitly scoped permissions. At minimum, scope &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; to the specific model ARNs your agent uses and ensure memory access policies reference only the memory resources that agent needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Naming Constraints
&lt;/h3&gt;

&lt;p&gt;AgentCore resource names must match &lt;code&gt;[a-zA-Z][a-zA-Z0-9_]{0,47}&lt;/code&gt; - letters, numbers, and underscores only, starting with a letter. No hyphens. This tripped me up repeatedly: &lt;code&gt;case-files&lt;/code&gt; fails, &lt;code&gt;CaseFiles&lt;/code&gt; works. &lt;code&gt;detective-memory-abc123&lt;/code&gt; fails, &lt;code&gt;detective_memory_abc123&lt;/code&gt; works. KMS aliases are fine with hyphens, but everything else in AgentCore isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Frontend
&lt;/h2&gt;

&lt;p&gt;A noir-themed React 19 SPA with four components: the narrative log (the main detective story), the detective input, the Case Board (discovered evidence), and the Persons of Interest panel (suspect information with alibi status).&lt;/p&gt;

&lt;p&gt;The narrative log displays the agent's noir prose as it streams in via SSE. Tool use events show as gold italic indicators - "Examining evidence...", "Interrogating witness..." - so the player knows the agent is working.&lt;/p&gt;

&lt;p&gt;The Case Board and Persons of Interest panels update in real time as the investigation progresses. When the agent examines evidence or interviews a suspect, the tools push structured notifications through the SSE stream. New evidence items appear with an amber highlight that fades after a few seconds. Suspects show their interview count and alibi verification status (verified, contradicted, or unverified).&lt;/p&gt;

&lt;p&gt;SSE streaming deserves a note. AgentCore returns the response as a &lt;code&gt;StreamingBody&lt;/code&gt; - but when accessed through the &lt;code&gt;invoke_agent_runtime&lt;/code&gt; API, the entire response arrives as a single read. The SSE events are concatenated inside it, sometimes without newline separators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data: {"chunk": "The rain"}data: {"chunk": " hasn't stopped"}data: {"chunk": " for three days."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy server splits on &lt;code&gt;data:&lt;/code&gt; boundaries using a regex, reassembles all chunk text, strips the stray reasoning tags and &lt;code&gt;&amp;lt;/tool&amp;gt;&lt;/code&gt; XML artifacts that occasionally leak through, and re-emits clean SSE events to the browser. This filtering is model-agnostic - Nova Pro produces these artifacts most often, but any model with reasoning capabilities (including Nova 2 Lite with extended thinking enabled) can emit them. This two-pass approach - collect everything, clean it, then emit - is more reliable than trying to filter partial tags from individual chunks.&lt;/p&gt;
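&lt;p&gt;A rough sketch of that cleanup pass - the artifact tags stripped here are representative examples, not the project's exact filter list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re

def clean_sse_payload(raw):
    """Collect every chunk from a concatenated SSE payload, then strip leaked artifacts."""
    # Pass 1: events can arrive back-to-back with no newlines, so split on the marker itself.
    text = ""
    for piece in re.split(r"data:\s*", raw):
        piece = piece.strip()
        if not piece:
            continue
        try:
            text += json.loads(piece).get("chunk", "")
        except json.JSONDecodeError:
            continue  # partial or non-JSON fragment - skip it
    # Pass 2: strip XML artifacts that occasionally leak through (representative tags only).
    for artifact in ("&amp;lt;thinking&amp;gt;", "&amp;lt;/thinking&amp;gt;", "&amp;lt;/tool&amp;gt;"):
        text = text.replace(artifact, "")
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;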

&lt;h2&gt;
  
  
  Demo Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Session 1: The Crime Scene
&lt;/h3&gt;

&lt;p&gt;The player opens the app. The narrator sets the scene:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The rain has not stopped for three days. You step under the police tape at 42 Ashford Lane, the collar of your coat turned up against the damp. The study door is open now - the locksmith saw to that this morning - but inside, everything is exactly as Edward Blackwell left it. Or as someone left it for him.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The player types "search the study." The agent calls &lt;code&gt;search_location&lt;/code&gt; and describes the room - the scattered desk, the broken window, the bookcase. The Case Board lights up: "Broken Window - glass fragments outside, break pattern radiates outward."&lt;/p&gt;

&lt;p&gt;They examine the broken window forensically. The narrator reveals the wiped frame, the tool marks. The Case Board updates: "Broken Window (forensic) - frame wiped clean, tool marks from inside."&lt;/p&gt;

&lt;p&gt;They interview Clara Webb sympathetically. She breaks down and admits she heard arguing at 11:15 PM. The Persons of Interest panel shows Clara with her first interview note.&lt;/p&gt;

&lt;p&gt;All of this - the evidence, the suspect information, the timeline - flows into AgentCore Memory. STM keeps the turn-by-turn conversation. The semantic strategy extracts the facts. The episodic strategy logs Clara's interrogation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session 2: The Case Continues
&lt;/h3&gt;

&lt;p&gt;The player closes the browser, has lunch, and comes back. They start a new session with the same detective ID. The narrator opens differently now:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Case #1247 - The Blackwell Murder. Your notebook is open on the desk, the pages curling at the edges from the rain. Last time, you found the staged break-in - glass broken outward, frame wiped clean, tool marks from inside. Clara Webb heard arguing at 11:15 PM. Two voices. One was Blackwell. The other was a man she could not identify, but she said the shadow was tall, broad-shouldered. Like Marcus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The summary strategy provided the session recap. The semantic strategy filled in the specific details. The player picks up where they left off and starts pressing on Marcus's alibi. Three sessions in, when the player consistently uses indirect questioning instead of confrontation, the narrator starts offering subtler options - the preference strategy at work.&lt;/p&gt;

&lt;p&gt;And when the player catches Helena in another timeline inconsistency, the narrator adds: "Her story shifts every time you push on the timeline. Your instinct says the 17 minutes matter more than she is letting on." That is the episodic reflection - pattern recognition across sessions that makes the detective feel like they are building real intuition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observing Memory in Real Time
&lt;/h2&gt;

&lt;p&gt;Understanding LTM is abstract until you watch it happen. The project includes a debug watch command that polls AgentCore Memory every 5 seconds and prints new STM events and LTM records as they appear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make debug-memory-watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs &lt;code&gt;python server/debug_memory.py --watch 5&lt;/code&gt;, which seeds with the current state (so you only see new additions) and then streams changes. A typical session looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Seeding current state... done (652 STM events, 156 LTM records)
  Watching for new additions...

[20:11:01] [STM] [fe400085] [user] Use a firm and aggressive approach with Clara
[20:11:11] [STM] [fe400085] [assistant] A confrontational approach with Clara Webb proves
  ineffective. She flinches at the sharp tone and retreats into monosyllables...

[20:11:59] [LTM] [USER_PREFERENCE (DetectiveStyle)]
{"context":"The user initially requested a softer approach when interrogating Clara Webb
but later explicitly requested to use a firm and aggressive approach, indicating a shift
toward more confrontational interrogation tactics with witnesses.",
"preference":"Prefers firm and aggressive interrogation approach with witnesses"}

[20:12:38] [LTM] [SUMMARIZATION (CaseNotes)]
&amp;lt;topic name="Witness Interview - Clara Webb (Confrontational Approach - Failed)"&amp;gt;
Detective Sloane attempted a firm and aggressive approach with Clara Webb. The
confrontational strategy proved completely ineffective. Clara flinched at the sharp
tone and retreated into monosyllables. This failed interrogation confirms Clara's
fear is a significant barrier and indicates a gentler approach is necessary.
&amp;lt;/topic&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;STM events appear immediately as the conversation flows. LTM records follow 30-60 seconds later as the platform's extraction pipeline processes the events. You can see exactly what each strategy produces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SEMANTIC&lt;/strong&gt; records are plain factual statements - "Helena Blackwell was found dead in the study at 10:42 PM"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SUMMARIZATION&lt;/strong&gt; records are topic-tagged XML with detailed session notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;USER_PREFERENCE&lt;/strong&gt; records are structured JSON with context, preference, and categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EPISODIC&lt;/strong&gt; records come in two flavors: situation recaps (&lt;code&gt;"situation": "A detective begins investigating..."&lt;/code&gt;) and cross-session strategy patterns (&lt;code&gt;"title": "Escalating Interrogation Pressure with Evidence Leverage"&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing these raw values is what made the strategies click for me. Reading the documentation, I understood that "semantic extracts facts" and "episodic captures patterns." But watching the actual output - seeing the platform independently decide that a failed interrogation was worth logging as an episode, or that a shift from soft to aggressive questioning counted as a preference change - made the system feel real. The extraction isn't just summarizing what happened. It's interpreting the conversation through each strategy's lens and producing genuinely different representations of the same events.&lt;/p&gt;

&lt;p&gt;The watch also exposed a debugging gotcha. As of &lt;code&gt;bedrock-agentcore&lt;/code&gt; SDK version 1.4.4, the AgentCore &lt;code&gt;list_memory_records&lt;/code&gt; and &lt;code&gt;retrieve_memory_records&lt;/code&gt; APIs return results under the key &lt;code&gt;memoryRecordSummaries&lt;/code&gt;, not &lt;code&gt;memoryRecords&lt;/code&gt;. The SDK's &lt;code&gt;retrieve_memories()&lt;/code&gt; method handles this correctly, so the agent works fine - but if you write your own debug scripts using boto3 directly, you'll get empty results and spend hours investigating an extraction pipeline that was working all along. The watch script in this repo has the correct key. Check the latest SDK docs if you're reading this in the future - response key names can change between versions.&lt;/p&gt;
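&lt;p&gt;If you do roll your own inspection script, the read side looks roughly like this - a minimal boto3 sketch where the memory ID and namespace are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

# Data-plane client for AgentCore Memory (the control plane is "bedrock-agentcore-control").
client = boto3.client("bedrock-agentcore", region_name="us-east-1")

response = client.list_memory_records(
    memoryId="detective_memory_abc123",   # placeholder
    namespace="/cases/sloane/facts/",     # placeholder
)

# Records come back under "memoryRecordSummaries" (as of SDK 1.4.4), not "memoryRecords".
for record in response.get("memoryRecordSummaries", []):
    print(record.get("content", {}).get("text", ""))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;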

&lt;p&gt;Other debug modes are available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dump everything - strategies, STM events, and LTM records&lt;/span&gt;
uv run python server/debug_memory.py

&lt;span class="c"&gt;# Only LTM records (skip raw conversation events)&lt;/span&gt;
uv run python server/debug_memory.py &lt;span class="nt"&gt;--ltm-only&lt;/span&gt;

&lt;span class="c"&gt;# Only STM events&lt;/span&gt;
uv run python server/debug_memory.py &lt;span class="nt"&gt;--stm-only&lt;/span&gt;

&lt;span class="c"&gt;# Show all sessions (default: most recent only)&lt;/span&gt;
uv run python server/debug_memory.py &lt;span class="nt"&gt;--all-sessions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;STM vs LTM isn't either/or - they serve completely different functions.&lt;/strong&gt; STM is working memory within a conversation. LTM is the case file that persists between sessions. You need both, and trying to use one for the other's job leads to problems. STM without LTM means the detective forgets everything between sessions. LTM without STM means the agent can't follow a multi-turn investigation within a single session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic reflections are the most compelling strategy.&lt;/strong&gt; The semantic strategy is the workhorse - it stores facts and retrieves them reliably. But the episodic strategy's cross-session reflections are what make the agent feel genuinely intelligent. When the narrator surfaces a pattern the player didn't explicitly ask about, it creates a moment that feels like the detective is actually thinking. This is the strategy I would lead with in any demo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model choice matters more than I expected for character consistency.&lt;/strong&gt; Nova Pro occasionally breaks character - mentioning tool names, exposing its reasoning process, or dropping the noir tone mid-paragraph. Claude Sonnet stays in character almost perfectly. For a narrative application where immersion matters, the model's ability to maintain a persona is as important as its raw capability. I ended up adding server-side filtering to strip the stray reasoning tags and &lt;code&gt;&amp;lt;/tool&amp;gt;&lt;/code&gt; XML artifacts that leaked through from Nova Pro.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering is still the job - the prompt is the product.&lt;/strong&gt; The system prompt went through more revisions than any other file in this project. The first version let the model call six tools in a single turn, drowning the player in information before they had asked a single question. Another version produced beautiful prose but kept breaking character to mention tool names. Getting the narrator to call exactly one tool per turn, stay in character when tools error, and set the scene without immediately investigating required specific, firm language - "do not chain multiple tool calls" works where "one action per turn" didn't. If you're building an agent-based application, expect to spend as much time tuning the system prompt as you do writing the code around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Terraform provider gap is a real-world pattern.&lt;/strong&gt; Three of four strategies are supported in Terraform. The fourth requires a CLI workaround. This is a common pattern with new AWS services - Terraform support lags behind the API by weeks or months. The pragmatic approach is to manage what you can in Terraform and script the rest in your Makefile, documenting the gap clearly so your future self (or your team) knows what to update when provider support arrives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build a memory watch tool early.&lt;/strong&gt; The single most useful debugging aid was a script that polls memory and prints new STM events and LTM records in real time. Without it, memory's a black box - events go in, and you hope the right things come out. With it, you can see exactly what the platform extracts, how long extraction takes (30-60 seconds typically), and whether your namespace configuration is producing records where you expect them. I would build this before writing any agent code on my next project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Going to production would add several layers.&lt;/strong&gt; This demo runs in PUBLIC network mode with an unauthenticated local proxy. A production deployment would need: VPC mode with private subnets, VPC endpoints for Bedrock and AgentCore services (avoiding public internet for API calls), CloudFront distribution with WAF, Cognito or API key authentication on the proxy, a custom IAM role with least-privilege permissions (scoped &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; to specific model ARNs, scoped memory access to specific resources), an S3 backend for Terraform state, and Bedrock Guardrails for input validation. The architecture section of this post shows the demo setup. The production architecture is a different article.&lt;/p&gt;




&lt;p&gt;The full source code, Terraform configurations, and Makefile workflow are available on GitHub &lt;a href="https://github.com/RDarrylR/agentcore-memory-murder-mystery" rel="noopener noreferrer"&gt;agentcore-memory-murder-mystery&lt;/a&gt;. Clone the repo, run &lt;code&gt;make init &amp;amp;&amp;amp; make apply &amp;amp;&amp;amp; make deploy-agent &amp;amp;&amp;amp; make serve&lt;/code&gt;, and start investigating. The rain is still falling on Ashford Lane.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Connect with me on&lt;/em&gt; &lt;a href="https://x.com/RDarrylR" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://bsky.app/profile/darrylruggles.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://www.linkedin.com/in/darryl-ruggles/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://github.com/RDarrylR" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://medium.com/@RDarrylR" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://dev.to/rdarrylr"&gt;Dev.to&lt;/a&gt;&lt;em&gt;, or the&lt;/em&gt; &lt;a href="https://community.aws/@darrylr" rel="noopener noreferrer"&gt;AWS Community&lt;/a&gt;&lt;em&gt;. Check out more of my projects at&lt;/em&gt; &lt;a href="https://darryl-ruggles.cloud" rel="noopener noreferrer"&gt;darryl-ruggles.cloud&lt;/a&gt; &lt;em&gt;and join the&lt;/em&gt; &lt;a href="https://www.believeinserverless.com/" rel="noopener noreferrer"&gt;Believe In Serverless&lt;/a&gt; &lt;em&gt;community.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>agents</category>
      <category>agentcore</category>
      <category>serverless</category>
    </item>
    <item>
      <title>I Injected Three Faults. The Agent Found All of Them.</title>
      <dc:creator>Romar Cablao</dc:creator>
      <pubDate>Sun, 03 May 2026 14:24:37 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/i-injected-three-faults-the-agent-found-all-of-them-5pi</link>
      <guid>https://vibe.forem.com/aws-builders/i-injected-three-faults-the-agent-found-all-of-them-5pi</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Let's get our hands dirty. This part covers the full setup and the actual demo: deploy PayLedger to both regions, wire up Route 53 failover, configure the Agent Space, inject three simultaneous faults, and walk through exactly what the agent found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick recap from Part 1:&lt;/strong&gt; PayLedger is a demo payment ledger deployed to ap-southeast-1 (primary) and ap-northeast-1 (secondary) with Route 53 failover, DynamoDB Global Tables, and a Next.js frontend showing which region is serving. DevOps Agent sits in ap-southeast-2 monitoring both. If you haven't read the first part, you can check it out here:&lt;/p&gt;


&lt;p&gt;&lt;a href="https://dev.to/aws-builders/runbooks-dont-investigate-aws-devops-agent-does-44p8" rel="noopener noreferrer"&gt;Runbooks Don't Investigate. AWS DevOps Agent Does.&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;
  
  
  Before You Start
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS account&lt;/td&gt;
&lt;td&gt;IAM admin permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain in Route 53&lt;/td&gt;
&lt;td&gt;Hosted zone for custom domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless Framework v4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npm install -g serverless&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python 3.12&lt;/td&gt;
&lt;td&gt;Lambda runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACM certificates&lt;/td&gt;
&lt;td&gt;In both apse1 and apne1 for the API subdomain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;New customers get a 2-month free trial for AWS DevOps Agent. After that, billing is per second when the agent is active. Support credits vary by tier.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://aws.amazon.com/devops-agent/pricing/" rel="noopener noreferrer"&gt;AWS DevOps Agent Pricing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 1: Create the Agent Space
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7umud8kjgu2tzwbm2rx3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7umud8kjgu2tzwbm2rx3.png" alt="Create an Agent Space" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before deploying anything in your workload regions, set up the Agent Space first. The webhook credentials produced here are needed later when you wire up alarm forwarding.&lt;/p&gt;

&lt;p&gt;Switch to &lt;strong&gt;ap-southeast-2&lt;/strong&gt; in the AWS Console. Navigate to AWS DevOps Agent and create a new Agent Space. AWS creates the required IAM roles automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DevOpsAgentRole-AgentSpace&lt;/strong&gt; uses &lt;code&gt;AIDevOpsAgentAccessPolicy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOpsAgentRole-WebappAdmin&lt;/strong&gt; uses &lt;code&gt;AIDevOpsOperatorAppAccessPolicy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Link your AWS account. Both workload regions (apse1 and apne1) are in the same account, so a single association gives the agent visibility into both.&lt;/p&gt;

&lt;p&gt;Once the Agent Space is up, grab the webhook URL and HMAC key from the integrations page. You'll use them in Step 5.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-what-are-devops-agent-spaces.html" rel="noopener noreferrer"&gt;What are DevOps Agent Spaces?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 2: Deploy to Both Regions
&lt;/h2&gt;

&lt;p&gt;Copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt; and fill in your values, then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; deploy-backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This deploys to ap-southeast-1 first (which creates the DynamoDB table), then ap-northeast-1 (which skips table creation via a CloudFormation Condition). API Gateway IDs are auto-discovered from CloudFormation and written back to &lt;code&gt;.env&lt;/code&gt;. No manual copy-pasting.&lt;/p&gt;

&lt;p&gt;If you prefer to run the deploys individually:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Primary (creates the DynamoDB table)&lt;/span&gt;
npx serverless deploy &lt;span class="nt"&gt;--stage&lt;/span&gt; dev &lt;span class="nt"&gt;--region&lt;/span&gt; ap-southeast-1

&lt;span class="c"&gt;# Secondary (skips DynamoDB creation via CloudFormation Condition)&lt;/span&gt;
npx serverless deploy &lt;span class="nt"&gt;--stage&lt;/span&gt; dev &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Verify both health endpoints are up:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://&amp;lt;APSE1_ID&amp;gt;.execute-api.ap-southeast-1.amazonaws.com/health
&lt;span class="c"&gt;# {"status": "healthy", "region": "ap-southeast-1", "service": "payledger", "timestamp": "..."}&lt;/span&gt;

curl https://&amp;lt;APNE1_ID&amp;gt;.execute-api.ap-northeast-1.amazonaws.com/health
&lt;span class="c"&gt;# {"status": "healthy", "region": "ap-northeast-1", "service": "payledger", "timestamp": "..."}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Enable DynamoDB Global Table
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; setup-global-table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This adds the ap-northeast-1 replica and polls until it reaches &lt;code&gt;ACTIVE&lt;/code&gt; status (typically 2-5 minutes). Under the hood it runs &lt;code&gt;update-table&lt;/code&gt; with &lt;code&gt;replica-updates Create={RegionName=ap-northeast-1}&lt;/code&gt; and waits.&lt;/p&gt;
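&lt;p&gt;The equivalent boto3 call, if you'd rather script it yourself, is roughly this - the table name is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import boto3

dynamodb = boto3.client("dynamodb", region_name="ap-southeast-1")
TABLE_NAME = "payledger-transactions"  # placeholder - use the table the primary stack created

# Add the Tokyo replica to the existing table.
dynamodb.update_table(
    TableName=TABLE_NAME,
    ReplicaUpdates=[{"Create": {"RegionName": "ap-northeast-1"}}],
)

# Poll until the replica reports ACTIVE, mirroring what the setup script does.
while True:
    replicas = dynamodb.describe_table(TableName=TABLE_NAME)["Table"].get("Replicas", [])
    if any(r.get("ReplicaStatus") == "ACTIVE" for r in replicas):
        break
    time.sleep(15)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;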

&lt;p&gt;Seed some transactions so the UI has data to show:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/seed_transactions.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;Amazon DynamoDB Global Tables&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Step 4: Configure Custom Domains and Route 53 Failover
&lt;/h2&gt;

&lt;p&gt;Two sub-steps here. Before running them, make sure ACM certificates exist in both regions covering the API subdomain and the failover domain.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create API GW custom domains + Alias A records in Route 53&lt;/span&gt;
bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; setup-custom-domains

&lt;span class="c"&gt;# Create Route 53 health checks + PRIMARY/SECONDARY failover CNAME records&lt;/span&gt;
bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; setup-route53
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;setup-custom-domains&lt;/code&gt; creates the regional custom domains (&lt;code&gt;apse1-api-payledger.yourdomain.com&lt;/code&gt;, &lt;code&gt;apne1-api-payledger.yourdomain.com&lt;/code&gt;) and registers both with the failover domain (&lt;code&gt;api-payledger.yourdomain.com&lt;/code&gt;) so API Gateway accepts the Host header from either path.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setup-route53&lt;/code&gt; creates health checks (10s interval, FailureThreshold 2) and the PRIMARY/SECONDARY CNAME failover pair. It polls until both health checks pass before returning.&lt;/p&gt;
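&lt;p&gt;For reference, the PRIMARY/SECONDARY pair the script creates corresponds roughly to this boto3 call - the hosted zone ID, record values, and health check IDs below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

route53 = boto3.client("route53")

def upsert_failover_record(role, target, health_check_id):
    # One half of the failover pair: Route 53 serves PRIMARY while its health check passes.
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",  # placeholder
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api-payledger.yourdomain.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": f"payledger-{role.lower()}",
                "Failover": role,               # "PRIMARY" or "SECONDARY"
                "HealthCheckId": health_check_id,
                "ResourceRecords": [{"Value": target}],
            },
        }]},
    )

upsert_failover_record("PRIMARY", "apse1-api-payledger.yourdomain.com", "primary-health-check-id")
upsert_failover_record("SECONDARY", "apne1-api-payledger.yourdomain.com", "secondary-health-check-id")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;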

&lt;p&gt;After setup, all traffic to &lt;code&gt;api-payledger.yourdomain.com&lt;/code&gt; goes to Singapore. If the health check fails twice (around 20 seconds), Route 53 fails over to Tokyo automatically.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify, should hit primary&lt;/span&gt;
curl https://api-payledger.yourdomain.com/health
&lt;span class="c"&gt;# {"status": "healthy", "region": "ap-southeast-1", "service": "payledger", "timestamp": "..."}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html" rel="noopener noreferrer"&gt;Amazon Route 53 Failover Routing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Step 5: Store the DevOps Agent Webhook Credentials
&lt;/h2&gt;

&lt;p&gt;The alarm notification flow uses a webhook: CloudWatch Alarm → SNS Topic → &lt;code&gt;devopsAgentTrigger&lt;/code&gt; Lambda → DevOps Agent webhook. The &lt;code&gt;setup.sh&lt;/code&gt; script handles this via the &lt;code&gt;setup-webhook&lt;/code&gt; step, which stores the webhook URL and HMAC key from the DevOps Agent console in Secrets Manager.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; setup-webhook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Before running that step, grab the webhook URL and HMAC key from your Agent Space in the DevOps Agent console and set them in your &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;DEVOPS_AGENT_WEBHOOK_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://event-ai.ap-southeast-2.api.aws/webhook/generic/your-webhook-id&lt;/span&gt;
&lt;span class="py"&gt;DEVOPS_AGENT_HMAC_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your-hmac-key-here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
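
&lt;p&gt;Under the hood, the step essentially pushes those two values into Secrets Manager so the &lt;code&gt;devopsAgentTrigger&lt;/code&gt; Lambda can read them at runtime. A minimal boto3 sketch (the secret name is an assumption):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os
import boto3

sm = boto3.client("secretsmanager", region_name="ap-southeast-1")

secret_value = json.dumps({
    "webhook_url": os.environ["DEVOPS_AGENT_WEBHOOK_URL"],
    "hmac_key": os.environ["DEVOPS_AGENT_HMAC_KEY"],
})

SECRET_NAME = "payledger/devops-agent-webhook"  # assumed secret name

# Create the secret, or update it if it already exists
try:
    sm.create_secret(Name=SECRET_NAME, SecretString=secret_value)
except sm.exceptions.ResourceExistsException:
    sm.put_secret_value(SecretId=SECRET_NAME, SecretString=secret_value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
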



&lt;h2&gt;
  
  
  Step 6: Deploy the Frontend
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; deploy-frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This provisions the S3 bucket and CloudFront distribution if they don't exist, registers &lt;code&gt;FRONTEND_DOMAIN&lt;/code&gt; in Route 53, builds the Next.js app, syncs the output to S3, and invalidates the CloudFront cache. If you just want to run it locally without the cloud provisioning:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; deploy-frontend &lt;span class="nt"&gt;--local&lt;/span&gt;
&lt;span class="c"&gt;# Writes frontend/.env.local only. Run with: npm run dev --prefix frontend&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
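
&lt;p&gt;The cloud path boils down to a build, an upload, and a cache invalidation. A rough boto3 sketch of the last two steps (bucket name and distribution ID are placeholders from the setup output):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from pathlib import Path
import boto3

BUCKET = "payledger-frontend-bucket"   # placeholder
DISTRIBUTION_ID = "E0000000000000"     # placeholder
BUILD_DIR = Path("frontend/out")       # static build output

# Upload the built assets (a production sync would also set Content-Type per file)
s3 = boto3.client("s3")
for path in BUILD_DIR.rglob("*"):
    if path.is_file():
        s3.upload_file(str(path), BUCKET, str(path.relative_to(BUILD_DIR)))

# Invalidate the cache so CloudFront serves the new build immediately
boto3.client("cloudfront").create_invalidation(
    DistributionId=DISTRIBUTION_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": str(time.time()),
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
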


&lt;p&gt;The UI polls &lt;code&gt;/health&lt;/code&gt; every 5 seconds. Green banner = Singapore (PRIMARY). Amber banner = Tokyo (FAILOVER). When the region changes, a "Failover detected" banner appears automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39pzzkwok2ftcgic4x1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39pzzkwok2ftcgic4x1s.png" alt="Topology - Healthy State" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 7: Verify Topology
&lt;/h2&gt;

&lt;p&gt;After linking the account, DevOps Agent builds the topology automatically from CloudFormation stacks. Serverless Framework deploys via CloudFormation, so all resources in both regions are discovered without manual setup.&lt;/p&gt;

&lt;p&gt;Three views in the web app: System view (account/region boundaries), Container view (CloudFormation stacks), Resource view (full resource graph with cross-region DynamoDB relationship).&lt;/p&gt;

&lt;p&gt;The topology view is driven by the &lt;strong&gt;Agent Space Understanding&lt;/strong&gt; learned skill, which is auto-generated once the account integrations are configured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fiospcds1hoflq4bku4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fiospcds1hoflq4bku4.png" alt="AWS DevOps Agent - PayLedger Topology" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-what-is-a-devops-agent-topology.html" rel="noopener noreferrer"&gt;What is a DevOps Agent Topology?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Step 8: Verify the Full Stack
&lt;/h2&gt;

&lt;p&gt;Run the verify step to confirm all endpoints are reachable through the failover URL before injecting any faults:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/setup.sh &lt;span class="nt"&gt;--step&lt;/span&gt; verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This runs health checks against both regional endpoints directly, then tests all four endpoints through the Route 53 failover URL including a POST to &lt;code&gt;/transactions&lt;/code&gt;. All checks should pass and return 2xx before you continue.&lt;/p&gt;
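
&lt;p&gt;If you prefer to run the same checks by hand, here is a quick Python sketch (the &lt;code&gt;/balance&lt;/code&gt; path and the POST body fields are assumptions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

BASE = "https://api-payledger.yourdomain.com"

# Health through the failover URL should report the primary region
health = requests.get(f"{BASE}/health", timeout=10)
print(health.status_code, health.json().get("region"))

# Read endpoints
assert requests.get(f"{BASE}/transactions", timeout=10).ok
assert requests.get(f"{BASE}/balance", timeout=10).ok

# Write path: record a synthetic transaction
resp = requests.post(
    f"{BASE}/transactions",
    json={"amount": 10.0, "description": "verify-check"},
    timeout=10,
)
assert resp.ok, resp.text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
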


&lt;h2&gt;
  
  
  Optional Integrations
&lt;/h2&gt;

&lt;p&gt;The Agent Space works without these, but they make findings easier to consume.&lt;/p&gt;
&lt;h3&gt;
  
  
  Slack
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;AWS DevOps Agent console -&amp;gt; Settings -&amp;gt; Communications -&amp;gt; Slack -&amp;gt; Register (OAuth)&lt;/li&gt;
&lt;li&gt;Agent Space -&amp;gt; Capabilities -&amp;gt; Communications -&amp;gt; Slack -&amp;gt; select channel -&amp;gt; Create&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Agent Space web app shows all investigation findings regardless. Slack is useful if you want findings posted to a channel without keeping the web app open.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/configuring-capabilities-connecting-ticketing-and-chat-slack.html" rel="noopener noreferrer"&gt;Connecting Slack&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  GitHub
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Agent Space -&amp;gt; Capabilities -&amp;gt; Pipeline -&amp;gt; Connect -&amp;gt; GitHub&lt;/li&gt;
&lt;li&gt;Install the AWS DevOps Agent GitHub App on your account&lt;/li&gt;
&lt;li&gt;Grant access to the &lt;code&gt;payledger-aws-devops-agent&lt;/code&gt; repository&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent investigates all three faults without GitHub. What GitHub adds is deployment correlation: for config-related faults, the agent can tie errors back to recent config changes and deployment history.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/configuring-capabilities-connecting-ci-cd-pipelines-github.html" rel="noopener noreferrer"&gt;Connecting GitHub&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The Demo: Three Faults at Once
&lt;/h2&gt;

&lt;p&gt;With everything set up, I ran &lt;code&gt;python scripts/fault.py inject&lt;/code&gt;. The default mode assigns one distinct fault per service simultaneously:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/fault.py inject
&lt;span class="c"&gt;# health       -&amp;gt; throttle   (reserved concurrency = 0)&lt;/span&gt;
&lt;span class="c"&gt;# transactions -&amp;gt; envvar     (TABLE_NAME removed)&lt;/span&gt;
&lt;span class="c"&gt;# balance      -&amp;gt; iam        (role swapped to fault-iam, no DynamoDB access)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The CloudWatch 5xx alarm for ap-southeast-1 fired at 21:30:02. Route 53 detected the failing health checks and routed traffic to ap-northeast-1. PayLedger continued serving from Tokyo. DevOps Agent started investigating automatically.&lt;/p&gt;

&lt;p&gt;Here is the full failover in action. You can see the region indicator shift from Singapore to Tokyo in real time:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xtiF5KeZdSs"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;


&lt;h2&gt;
  
  
  The Investigation
&lt;/h2&gt;

&lt;p&gt;The alarm triggered at 21:30:02. The investigation completed at 21:37:05. Total time: &lt;strong&gt;7 minutes and 3 seconds.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Investigation Timeline
&lt;/h3&gt;

&lt;p&gt;The agent opened by reading two things before making a single AWS API call: the Agent Space Understanding skill and the PayLedger component reference file, both auto-generated learned skills from the connected account. Before any CloudWatch or CloudTrail queries had returned, the agent already had context about the service architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq590s0g02h121dxx55wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq590s0g02h121dxx55wt.png" alt="Screenshot: Investigation timeline: start, skill reads, first observations" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From there it split into three parallel tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda logs:&lt;/strong&gt; 11 tool calls over 1 minute, comparing a baseline window (13:00-13:05 UTC) against the incident window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail changes:&lt;/strong&gt; 19 tool calls over 2 minutes 4 seconds, pulling config change events for the account and region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda metrics:&lt;/strong&gt; 7 tool calls over 1 minute 43 seconds, error counts, throttle counts, duration, and invocation counts per function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14oang107pquu6ptxsh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14oang107pquu6ptxsh9.png" alt="Screenshot: Investigation timeline: logs, metrics, audit trail" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By +2m16s, findings were coming back from all three tracks simultaneously.&lt;/p&gt;


&lt;h3&gt;
  
  
  Findings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Finding 1: listTransactions Lambda missing TABLE_NAME causing init crash&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every invocation of &lt;code&gt;payledger-dev-listTransactions&lt;/code&gt; failed during module initialization. The agent pulled the actual log entry from CloudWatch:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[2026-05-02T13:28:06.250Z] [ERROR] KeyError: 'TABLE_NAME'
Traceback (most recent call last):
&lt;/span&gt;&lt;span class="gp"&gt;  File "/var/task/functions/list_transactions.py", line 29, in &amp;lt;module&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;    TABLE_NAME = os.environ["TABLE_NAME"]
INIT_REPORT Phase: init  Status: error  Error Type: Runtime.Unknown
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There were 26 error records in the incident window and zero in the baseline. The agent confirmed the missing variable by inspecting the live function configuration directly: &lt;code&gt;ALLOWED_ORIGINS&lt;/code&gt;, &lt;code&gt;POWERTOOLS_SERVICE_NAME&lt;/code&gt;, &lt;code&gt;LOG_LEVEL&lt;/code&gt;, and &lt;code&gt;REGION&lt;/code&gt; were all present. No &lt;code&gt;TABLE_NAME&lt;/code&gt;. The function never got past initialization; every cold start failed before the handler could run.&lt;/p&gt;
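
&lt;p&gt;The traceback shows why the handler never got a chance: the variable is read at module scope, so the failure happens during Lambda's init phase. A minimal illustration of the pattern (not the actual function):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# Runs at import time, during Lambda init. If the variable is missing,
# the module never loads and every cold start reports an init error.
TABLE_NAME = os.environ["TABLE_NAME"]   # KeyError: 'TABLE_NAME'

def handler(event, context):
    # Never reached when init fails
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
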

&lt;p&gt;&lt;strong&gt;Finding 2: getBalance Lambda using fault-iam role with no DynamoDB permissions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The function was assigned &lt;code&gt;payledger-dev-fault-iam&lt;/code&gt;, which only has &lt;code&gt;AWSLambdaBasicExecutionRole&lt;/code&gt;. Every DynamoDB query returned &lt;code&gt;AccessDeniedException&lt;/code&gt;. The function handled the exception gracefully, so the Lambda Errors metric showed 0. API Gateway still recorded the 500s. The agent caught this by looking at both metrics separately rather than relying on either one alone.&lt;/p&gt;
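
&lt;p&gt;That metric split is easy to reproduce: if the handler catches the DynamoDB error and returns a 500 itself, the invocation completes cleanly from Lambda's point of view, so only API Gateway's 5xx count moves. An illustration, assuming a handler shaped roughly like this (table name and key schema are assumptions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

# Assumed table name; the real function resolves it from configuration
table = boto3.resource("dynamodb").Table("payledger-transactions")

def handler(event, context):
    try:
        result = table.query(KeyConditionExpression=Key("pk").eq("BALANCE"))
    except ClientError:
        # AccessDeniedException is handled here, so the Lambda Errors metric
        # stays at 0 while API Gateway records the 500 returned below.
        return {"statusCode": 500, "body": json.dumps({"error": "internal error"})}
    return {"statusCode": 200, "body": json.dumps(result["Items"], default=str)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
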

&lt;p&gt;&lt;strong&gt;Finding 3: health function throttled to zero&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reserved concurrency had been set to 0, blocking all invocations before execution. 11 throttles at 13:27, 79 throttles at 13:28. Invocation count at 13:28 dropped to only 20 from the normal 90-100 per minute. The function had zero errors when it did execute, confirming it was a concurrency limit, not a code problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The accounting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent reconciled the numbers before writing the final report:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Errors&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;health&lt;/code&gt; (reserved concurrency = 0)&lt;/td&gt;
&lt;td&gt;90 (11 + 79)&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;listTransactions&lt;/code&gt; (missing &lt;code&gt;TABLE_NAME&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;getBalance&lt;/code&gt; (wrong IAM role)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;100 5xx errors, all accounted for.&lt;/p&gt;


&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;CloudTrail confirmed the trigger. All three configuration changes happened within a 2-second window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;PutFunctionConcurrency&lt;/code&gt; on &lt;code&gt;payledger-dev-health&lt;/code&gt;. Reserved concurrency set to 0 (13:27:54Z)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UpdateFunctionConfiguration&lt;/code&gt; on &lt;code&gt;payledger-dev-listTransactions&lt;/code&gt;. All environment variables cleared (13:27:55Z)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UpdateFunctionConfiguration&lt;/code&gt; on &lt;code&gt;payledger-dev-getBalance&lt;/code&gt;. Execution role changed to &lt;code&gt;payledger-dev-fault-iam&lt;/code&gt;, env vars cleared (13:27:56Z)&lt;/li&gt;
&lt;/ol&gt;
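
&lt;p&gt;Reconstructed as boto3 calls, that burst looks roughly like this (a sketch of what &lt;code&gt;fault.py&lt;/code&gt; presumably runs; the actual script may differ):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

lam = boto3.client("lambda", region_name="ap-southeast-1")
FAULT_ROLE_ARN = "arn:aws:iam::123456789012:role/payledger-dev-fault-iam"  # placeholder account ID

# 1. Throttle: reserved concurrency 0 blocks every invocation of health
lam.put_function_concurrency(
    FunctionName="payledger-dev-health",
    ReservedConcurrentExecutions=0,
)

# 2. Clear the environment on listTransactions, dropping TABLE_NAME
lam.update_function_configuration(
    FunctionName="payledger-dev-listTransactions",
    Environment={"Variables": {}},
)

# 3. Swap getBalance onto the fault role with no DynamoDB access
lam.update_function_configuration(
    FunctionName="payledger-dev-getBalance",
    Role=FAULT_ROLE_ARN,
    Environment={"Variables": {}},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
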

&lt;p&gt;The root cause statement from the agent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The role name 'payledger-dev-fault-iam', the use of Boto3 scripting, and the rapid self-recovery at 13:29:00Z strongly indicate this was a deliberate chaos engineering / fault injection exercise rather than an accidental misconfiguration."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last line is the notable part: the agent identified the &lt;code&gt;devopsAgentTrigger&lt;/code&gt; Lambda in the stack and flagged the fault as intentional. It was right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67k5zc9vr90jrkca6nk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67k5zc9vr90jrkca6nk8.png" alt="Screenshot: Root Cause" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Mitigation Plan
&lt;/h3&gt;

&lt;p&gt;The agent returned: &lt;strong&gt;no mitigation action required.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two things happened in parallel during this incident. Route 53 detected the failing health checks and automatically failed over to ap-northeast-1 within 20 seconds, so the service kept running throughout. That part required no intervention. On the primary region side, the faults were reversed at 13:29:00 UTC when &lt;code&gt;fault.py restore&lt;/code&gt; ran, 2 minutes after injection. The agent saw the 5xx errors drop to 0, matched it against the CloudTrail restore events, and concluded there was nothing left to fix.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This was a controlled chaos engineering exercise to test system resilience. The incident self-recovered at 13:29:00 UTC, indicating the configurations were reverted as part of the planned test. Since this was intentional testing and the system has already recovered, no immediate operational mitigation is required."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A system that generates restore commands for changes that have already been reverted would be wrong. The agent recognized self-recovery and didn't produce output that didn't apply.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymicm5cbl18szzfffjj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymicm5cbl18szzfffjj7.png" alt="Screenshot: Mitigation plan tab" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;Here is the full AWS DevOps Agent investigation in action:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/4qBFwdP4gNQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;


&lt;h2&gt;
  
  
  Observations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The agent built its own context before touching a single API.&lt;/strong&gt; It started by reading the Agent Space Understanding skill, which auto-generates from your connected account and maps resources, request paths, and service relationships. Before any CloudWatch or CloudTrail queries had returned, it already had the architecture context to make sense of what it was about to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three root causes from one alarm.&lt;/strong&gt; A single 5xx alarm triggered. The agent identified three distinct failure mechanisms, attributed the exact error count to each (90 throttles, 5 init crashes, 5 IAM errors), and traced all three to the same 2-second injection window in CloudTrail. That correlation is not obvious when a throttle, a &lt;code&gt;KeyError&lt;/code&gt;, and an &lt;code&gt;AccessDeniedException&lt;/code&gt; don't look like they came from the same event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The empty mitigation plan was the correct answer.&lt;/strong&gt; My expectation was restore commands. Instead the agent returned "no mitigation action required." Route 53 had already kept the service running via automatic failover. The primary region faults were reversed by &lt;code&gt;fault.py restore&lt;/code&gt;. The agent recognized both facts in the metrics and CloudTrail, and declined to produce output that didn't apply. Knowing when not to act is more useful than generating work that doesn't need doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It identified the test as intentional.&lt;/strong&gt; Not just "three things broke." The agent concluded this was fault injection, named the evidence (role name, Boto3 scripting, 2-minute self-recovery), and assessed it correctly. That was not something I scripted or hinted at.&lt;/p&gt;


&lt;h2&gt;
  
  
  Restoring the Stack
&lt;/h2&gt;

&lt;p&gt;After the demo, restore all faults:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Restore all faults at once&lt;/span&gt;
python scripts/fault.py restore

&lt;span class="c"&gt;# Or restore individually&lt;/span&gt;
python scripts/restore_fault_iam.py &lt;span class="nt"&gt;--stage&lt;/span&gt; dev
python scripts/restore_fault_throttle.py &lt;span class="nt"&gt;--stage&lt;/span&gt; dev
python scripts/restore_fault_envvar.py &lt;span class="nt"&gt;--stage&lt;/span&gt; dev

&lt;span class="c"&gt;# Wait around 60s for health checks to pass&lt;/span&gt;
curl https://api-payledger.yourdomain.com/health
&lt;span class="c"&gt;# {"status": "healthy", "region": "ap-southeast-1"}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Once the health checks recover, Route 53 routes traffic back to ap-southeast-1. The primary region is restored.&lt;/p&gt;


&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;DR Toolkit&lt;/strong&gt; series covered Prepare. This series covered the middle: a multi-region demo app with real failover, three simultaneous faults, and &lt;strong&gt;AWS DevOps Agent&lt;/strong&gt; investigating all of them from a single alarm trigger. The agent identified the root cause, recognized the service had already recovered, and correctly concluded no action was needed, because the evidence from logs, metrics, and CloudTrail told it this was an injected fault, not a real incident.&lt;/p&gt;

&lt;p&gt;Route 53 kept the service running by routing to the healthy region. DevOps Agent used that time to find exactly what broke in the primary region. That is the relationship between the two: one buys you time, the other uses it.&lt;/p&gt;

&lt;p&gt;The Agent Space Understanding skill was the most visible differentiator in this investigation. It auto-generated from the connected account and gave the agent architecture context before the first API call. No manual input required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS DevOps Agent&lt;/strong&gt; handles the full investigation loop on its own: topology discovery, root cause analysis, and Slack notification. If you have a previous DR Toolkit runbook, you can optionally load it as a Custom Skill to give the agent extra context. If you haven't seen the DR Toolkit series: &lt;a href="https://dev.to/romarcablao/series/38086"&gt;BuildWithAI: DR Toolkit on AWS&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Try it / Fork it:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PayLedger Repo:&lt;/strong&gt; &lt;a href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;github.com/romarcablao/payledger-aws-devops-agent&lt;/a&gt; &lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/romarcablao" rel="noopener noreferrer"&gt;
        romarcablao
      &lt;/a&gt; / &lt;a href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;
        payledger-aws-devops-agent
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      DevOpsAgent: Beyond the Runbook
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;PayLedger — Multi-Region Serverless Payment Ledger&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/romarcablao/payledger-aws-devops-agent/docs/assets/aws-devops-agent-topology.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fromarcablao%2Fpayledger-aws-devops-agent%2FHEAD%2Fdocs%2Fassets%2Faws-devops-agent-topology.png" alt="AWS DevOps Agent Topology"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multi-region serverless payment ledger for recording transactions and viewing balances with active-passive failover. Deployed across &lt;strong&gt;ap-southeast-1&lt;/strong&gt; (Singapore, primary) and &lt;strong&gt;ap-northeast-1&lt;/strong&gt; (Tokyo, secondary) using AWS Lambda, DynamoDB Global Tables, and Route 53 failover routing.&lt;/p&gt;

&lt;p&gt;Built as a demonstration platform for disaster recovery testing with AWS DevOps Agent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; PayLedger is a demo project. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://kiro.dev" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3696d1e6677c4f16e33e8c23c69699d94c48d7d0a78a7627118a47c2a9e2fd7f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4b69726f2d4944452d626c75653f6c6f676f3d646174613a696d6167652f7376672b786d6c3b6261736536342c50484e325a79423361575230614430694d6a51694947686c6157646f644430694d6a516949485a705a58644362336739496a41674d4341794e4341794e4349675a6d6c7362443069626d39755a53496765473173626e4d39496d6830644841364c79393364336375647a4d7562334a6e4c7a49774d44417663335a6e496a3438634746306143426b50534a4e4d5449674d6b7730494464574d54644d4d5449674d6a4a4d4d6a41674d5464574e3077784d694179576949675a6d6c736244306964326870644755694c7a34384c334e325a7a343d267374796c653d666f722d7468652d6261646765" alt="Kiro"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/devops-agent/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2cccc7fc811a2c85bb42de7adb48f816cc220c1cf8ab2dd894cbddb938c96ab1/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532532304465764f70732532304167656e742d4175746f6e6f6d6f75732532304f70732d4646393930303f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="AWS DevOps Agent"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/lambda/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/27ec8ce949c39eca034ccd1684eb245e35b3642da7bbd83463606d6ccd5750f1/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532532304c616d6264612d5365727665726c6573732d4646393930303f6c6f676f3d6177736c616d626461266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="AWS Lambda"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/dynamodb/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f1930ecbfe81f1c17aef44b24e89c80a2c64f358d93584fe4a36d8340cc168db/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f416d617a6f6e25323044796e616d6f44422d476c6f62616c2532305461626c65732d3430353344363f6c6f676f3d616d617a6f6e64796e616d6f6462266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon DynamoDB"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/route53/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8cdb0d62d6a60fe2fc48c87a4ad1af01db17d63f7d07b6d040a035a8adc1fe5e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f526f75746525323035332d4661696c6f766572253230444e532d3843344646463f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon Route 53"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/cloudfront/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d0e0694e3b1ad9971a43bc03cc671f6a2c3035a8d713f412ec34e968c1b4f7d7/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c6f756446726f6e742d43444e2d3843344646463f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon CloudFront"&gt;&lt;/a&gt;
&lt;a href="https://nextjs.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e4987d8ec5523bda97f9a5862a7f29156a391f89d7fad452858e051a64179762/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e6578742e6a732d46726f6e74656e642d3030303030303f6c6f676f3d6e657874646f746a73266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Next.js"&gt;&lt;/a&gt;
&lt;a href="https://www.python.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/344c953e0c4edc545a7acd96ef5e5f28277afd590b1f140ea99144b12de64f31/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e253230332e31322d52756e74696d652d3337373641423f6c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Python"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;                    payledger.yourdomain.com (CloudFront + S3)
                              │
                         Next.js static UI (balance, transactions, region indicator)
                              │
                              ▼
                    api-payledger.yourdomain.com
                              │
                    Route 53 failover routing
                    ├── PRIMARY   ──▶ apse1-api-payledger.yourdomain.com  ← health check
                    └── SECONDARY ──▶ apne1-api-payledger.yourdomain.com  ← health check
                    TTL: 60s | health check: 10s interval, 2 failures to trip
                              │
               ┌──────────────┴──────────────┐
               │                             │
    ap-southeast-1 (Singapore)     ap-northeast-1 (Tokyo)
    ├── API Gateway (regional)     ├── API Gateway (regional)
    ├── Lambda: createTransaction  ├── Lambda: createTransaction
    ├── Lambda: listTransactions   ├── Lambda:&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-agent/features/" rel="noopener noreferrer"&gt;AWS DevOps Agent features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-agent/pricing/" rel="noopener noreferrer"&gt;AWS DevOps Agent Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-devops-agent-skills.html" rel="noopener noreferrer"&gt;DevOps Agent Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-learned-skills.html" rel="noopener noreferrer"&gt;Learned Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;Amazon DynamoDB Global Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html" rel="noopener noreferrer"&gt;Amazon Route 53 Failover Routing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>aiops</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Runbooks Don't Investigate. AWS DevOps Agent Does.</title>
      <dc:creator>Romar Cablao</dc:creator>
      <pubDate>Sun, 03 May 2026 13:14:15 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/runbooks-dont-investigate-aws-devops-agent-does-44p8</link>
      <guid>https://vibe.forem.com/aws-builders/runbooks-dont-investigate-aws-devops-agent-does-44p8</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;I finished the DR Toolkit thinking I had covered the important parts of disaster recovery: runbooks, RTO/RPO targets, post-mortems. Then I mapped out the actual incident lifecycle and realized everything I built sits at the edges. The middle part (detecting the incident, correlating signals across regions, finding the root cause while the primary region is actively failing) was not covered. That gap is what this series is about.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/romarcablao/series/38086"&gt;BuildWithAI: DR Toolkit on AWS&lt;/a&gt; series, I ran through how you can build six AI-powered tools that automate the tedious parts of DR planning, all running on serverless AWS in ap-southeast-1. Those tools handle what you do before an incident and what you do after. But the part in between, the actual incident response, none of them touch.&lt;/p&gt;

&lt;p&gt;This series covers that middle phase using &lt;strong&gt;AWS DevOps Agent&lt;/strong&gt;. The demo app is &lt;strong&gt;PayLedger&lt;/strong&gt;, a multi-region serverless payment ledger built specifically for this blog. It is not a real product and contains no real user data. Part 1 maps out the gap, introduces DevOps Agent, and walks through the architecture. Part 2 covers the full setup and the actual demo, including what the agent's investigation looked like when I ran three real faults against it.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xtiF5KeZdSs"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  The DR Lifecycle, Mapped Out
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;th&gt;Covered by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prepare&lt;/td&gt;
&lt;td&gt;Runbooks, RTO/RPO targets, DR strategy, checklists&lt;/td&gt;
&lt;td&gt;DR Toolkit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detect&lt;/td&gt;
&lt;td&gt;Alarm fires, SNS notifies DevOps Agent, health check fails, DNS fails over&lt;/td&gt;
&lt;td&gt;CloudWatch + Route 53 + SNS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Investigate&lt;/td&gt;
&lt;td&gt;Root cause analysis, cross-region signal correlation&lt;/td&gt;
&lt;td&gt;AWS DevOps Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recover&lt;/td&gt;
&lt;td&gt;Apply fix, bring the unhealthy region back up, validate failback&lt;/td&gt;
&lt;td&gt;Human + runbook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learn&lt;/td&gt;
&lt;td&gt;Prevention recommendations, operational improvements&lt;/td&gt;
&lt;td&gt;DevOps Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The DR Toolkit is solid for Prepare. CloudWatch and Route 53 handle Detect: alarms fire and Route 53 failover routes traffic to the healthy region automatically. But Investigate is the phase with no real tooling unless someone built it themselves: figuring out why a service running in the primary region is down, correlating signals across services, and giving the team the information needed to bring that region back up.&lt;/p&gt;

&lt;p&gt;That is what AWS DevOps Agent targets.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is AWS DevOps Agent?
&lt;/h2&gt;

&lt;p&gt;AWS DevOps Agent is a frontier agent for cloud operations. "Frontier agent" is AWS's term for autonomous systems that work independently, scale across concurrent tasks, and run persistently without constant human oversight. It starts working the moment an alarm fires, no manual trigger needed.&lt;/p&gt;

&lt;p&gt;Three capabilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous incident response.&lt;/strong&gt; When an alert comes in, the agent starts investigating immediately. It correlates signals across services and regions. If multiple alarms fire from the same root cause, it identifies them as related rather than treating each one separately. Root cause categories it investigates: system changes, input anomalies, resource limits, component failures, and dependency issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive incident prevention.&lt;/strong&gt; After an investigation, the agent recommends improvements in four areas: observability, infrastructure optimization, deployment pipeline, and application resilience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-demand SRE tasks.&lt;/strong&gt; Conversational chat against your actual infrastructure. You can ask about resource state, alarm status, or deployment history without switching consoles.&lt;/p&gt;

&lt;p&gt;The service uses a dual-console architecture. The AWS Console is for admin setup (Agent Space creation, integrations). A separate Agent Space web app is for day-to-day work (investigations, topology, prevention, chat).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;More on features: &lt;a href="https://aws.amazon.com/devops-agent/features/" rel="noopener noreferrer"&gt;AWS DevOps Agent features&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent.html" rel="noopener noreferrer"&gt;About AWS DevOps Agent&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A Note on Region Availability
&lt;/h2&gt;

&lt;p&gt;As of this writing, AWS DevOps Agent is not available in ap-southeast-1 (Singapore) at GA. Supported regions are: us-east-1, us-west-2, eu-central-1, eu-west-1, ap-southeast-2, ap-northeast-1. AWS may add support for more regions in the future, so it is worth checking the &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-supported-regions.html" rel="noopener noreferrer"&gt;supported regions page&lt;/a&gt; before you start.&lt;/p&gt;

&lt;p&gt;The two closest for SEA builders are &lt;strong&gt;ap-southeast-2 (Sydney)&lt;/strong&gt; and &lt;strong&gt;ap-northeast-1 (Tokyo)&lt;/strong&gt;. For this demo I used ap-southeast-2, but you can use any supported region you prefer. The Agent Space and its investigation data live there. Your workload stays wherever it is. Cross-region monitoring means the agent discovers and monitors resources across any linked AWS account regardless of region.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Agent Space region is where your investigation data is stored, not where your app runs. For this demo, a single Agent Space in ap-southeast-2 monitors resources in both ap-southeast-1 and ap-northeast-1.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-supported-regions.html" rel="noopener noreferrer"&gt;AWS DevOps Agent Supported Regions&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Demo App: PayLedger
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5tmo5fofi3k3tddbn8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5tmo5fofi3k3tddbn8u.png" alt="PayLedger Topology" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; PayLedger is a demo project built solely for this blog series. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information. All data is synthetic and generated by a seed script.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A payment ledger is a practical choice for a DR demo because the requirements are clear. Any outage means transactions fail and balances go stale. The multi-region setup is the right response to that, not over-engineering.&lt;/p&gt;

&lt;p&gt;PayLedger has four endpoints: record a transaction, list recent transactions, get the current balance, and a health check. Deployed to two regions with Route 53 active-passive failover and DynamoDB Global Tables for data replication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    payledger.yourdomain.com (CloudFront + S3)
                              |
                         Next.js UI
                         (balance, transactions, region indicator)
                              | calls
                              v
                    api-payledger.yourdomain.com
                              |
                         Route 53 (failover routing)
                         |-- PRIMARY  -&amp;gt; ap-southeast-1 (Singapore)
                         +-- SECONDARY -&amp;gt; ap-northeast-1 (Tokyo)

    ap-southeast-1                         ap-northeast-1
    +-- API Gateway                        +-- API Gateway
    +-- Lambda: createTransaction          +-- Lambda: createTransaction
    +-- Lambda: listTransactions           +-- Lambda: listTransactions
    +-- Lambda: getBalance                 +-- Lambda: getBalance
    +-- Lambda: health                     +-- Lambda: health
    +-- Lambda: devopsAgentTrigger         +-- Lambda: devopsAgentTrigger
    +-- DynamoDB &amp;lt;-- Global Table --&amp;gt;      +-- DynamoDB (replica)
    +-- SNS Topic (alarm notifications)    +-- SNS Topic (alarm notifications)
    +-- CloudWatch alarms                  +-- CloudWatch alarms

                    ap-southeast-2 (Sydney)
                    +-- AWS DevOps Agent
                        +-- Agent Space
                        +-- Slack (optional)
                        +-- GitHub (optional)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Next.js (static) + S3 + CloudFront&lt;/td&gt;
&lt;td&gt;payledger.yourdomain.com&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS&lt;/td&gt;
&lt;td&gt;Route 53&lt;/td&gt;
&lt;td&gt;Failover routing + health checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;Lambda (Python 3.12)&lt;/td&gt;
&lt;td&gt;5 functions per region&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;API Gateway (HTTP API, regional)&lt;/td&gt;
&lt;td&gt;Custom domain per region&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;DynamoDB Global Tables&lt;/td&gt;
&lt;td&gt;Multi-region replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;CloudWatch&lt;/td&gt;
&lt;td&gt;Alarms in both regions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Route 53 checks &lt;code&gt;/health&lt;/code&gt; every 10 seconds. If the health check fails twice (around 20 seconds), DNS fails over to Tokyo automatically. Traffic routes to the healthy region while the team investigates and works to restore the primary. The frontend polls &lt;code&gt;/health&lt;/code&gt; every 5 seconds and shows which region is serving: green for Singapore (PRIMARY), amber for Tokyo (FAILOVER).&lt;/p&gt;

&lt;p&gt;DynamoDB Global Tables replicate data between both regions. After failover, the balance and transaction history are intact in Tokyo. Same data, just a different region serving it. That is the whole point of the architecture.&lt;/p&gt;
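
&lt;p&gt;You can verify the replication yourself with two boto3 clients: write in Singapore, then read the same item back from Tokyo a moment later (table name, key schema, and the short sleep are assumptions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import boto3

TABLE = "payledger-transactions"  # assumed table name

sin = boto3.resource("dynamodb", region_name="ap-southeast-1").Table(TABLE)
tyo = boto3.resource("dynamodb", region_name="ap-northeast-1").Table(TABLE)

# Write a synthetic transaction to the primary region
sin.put_item(Item={"pk": "TXN#demo", "sk": "2026-05-02T13:00:00Z", "amount": 42})

# Global Tables replication is typically sub-second; give it a moment
time.sleep(2)
item = tyo.get_item(Key={"pk": "TXN#demo", "sk": "2026-05-02T13:00:00Z"}).get("Item")
print(item)  # the same record, served from the Tokyo replica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
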


&lt;h2&gt;
  
  
  How the Demo Works
&lt;/h2&gt;

&lt;p&gt;When faults are injected into ap-southeast-1, the health check starts failing. Route 53 detects the failure and routes traffic to ap-northeast-1 within around 20 seconds. Users continue to be served from Tokyo while DevOps Agent investigates in the background. Once the agent identifies the root causes and the team applies the fixes, the primary region recovers and Route 53 fails back.&lt;/p&gt;

&lt;p&gt;This is the core of the DR story: &lt;strong&gt;failover keeps the service running; the investigation tells you what broke so you can fix it.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Three Fault Scenarios
&lt;/h2&gt;

&lt;p&gt;In Part 2, I inject three faults against the primary region using &lt;code&gt;fault.py&lt;/code&gt;, a Python script for fault injection and restoration. Each represents a common real-world serverless incident.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Fault&lt;/th&gt;
&lt;th&gt;How it breaks&lt;/th&gt;
&lt;th&gt;Root cause category&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;IAM permission denied&lt;/td&gt;
&lt;td&gt;Role swapped to fault role with no DynamoDB access&lt;/td&gt;
&lt;td&gt;System change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Lambda throttling&lt;/td&gt;
&lt;td&gt;Reserved concurrency = 0, 429 before function runs&lt;/td&gt;
&lt;td&gt;Resource limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Missing environment variable&lt;/td&gt;
&lt;td&gt;TABLE_NAME removed, KeyError at module load&lt;/td&gt;
&lt;td&gt;Code/config change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What makes this interesting: all three run simultaneously using &lt;code&gt;python scripts/fault.py inject&lt;/code&gt; (the default mode assigns one distinct fault per service). One alarm fires in ap-southeast-1, three different root causes show up in the investigation, and DevOps Agent has to untangle all of them in a single run. That is a harder test than running each fault separately.&lt;/p&gt;


&lt;h2&gt;
  
  
  Where This Fits in the DR Lifecycle
&lt;/h2&gt;

&lt;p&gt;The DR Toolkit covered the Prepare phase. This series covers Investigate and Recover. The part that happens after the alarm fires.&lt;/p&gt;

&lt;p&gt;DevOps Agent does not need the DR Toolkit to investigate. It reads your topology, correlates signals across services, identifies root causes, and posts findings to Slack on its own. AWS DevOps Agent is capable enough to detect, investigate, root cause, and even generate post-mortem inputs without any external tool.&lt;/p&gt;

&lt;p&gt;The connection here is context: if you want to give the agent extra architecture knowledge upfront, you can optionally load a runbook generated by the DR Toolkit as a Custom Skill.&lt;/p&gt;


&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;In Part 2, we'll get our hands dirty with the full setup and the demo: deploying PayLedger to both regions, configuring Route 53 failover, setting up the Agent Space, and then running the faults. I'll walk through the actual investigation the agent ran: the timeline, the findings, the root cause, and what it concluded about mitigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhuojko7zuk1f63supvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhuojko7zuk1f63supvo.png" alt="Up Next" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Try it / Fork it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PayLedger Repo:&lt;/strong&gt; &lt;a href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;github.com/romarcablao/payledger-aws-devops-agent&lt;/a&gt; &lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/romarcablao" rel="noopener noreferrer"&gt;
        romarcablao
      &lt;/a&gt; / &lt;a href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;
        payledger-aws-devops-agent
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      DevOpsAgent: Beyond the Runbook
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;PayLedger — Multi-Region Serverless Payment Ledger&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/romarcablao/payledger-aws-devops-agent/docs/assets/aws-devops-agent-topology.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fromarcablao%2Fpayledger-aws-devops-agent%2FHEAD%2Fdocs%2Fassets%2Faws-devops-agent-topology.png" alt="AWS DevOps Agent Topology"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multi-region serverless payment ledger for recording transactions and viewing balances with active-passive failover. Deployed across &lt;strong&gt;ap-southeast-1&lt;/strong&gt; (Singapore, primary) and &lt;strong&gt;ap-northeast-1&lt;/strong&gt; (Tokyo, secondary) using AWS Lambda, DynamoDB Global Tables, and Route 53 failover routing.&lt;/p&gt;

&lt;p&gt;Built as a demonstration platform for disaster recovery testing with AWS DevOps Agent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; PayLedger is a demo project. It is not affiliated with any real business, does not process real transactions, and contains no personally identifiable information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://kiro.dev" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3696d1e6677c4f16e33e8c23c69699d94c48d7d0a78a7627118a47c2a9e2fd7f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4b69726f2d4944452d626c75653f6c6f676f3d646174613a696d6167652f7376672b786d6c3b6261736536342c50484e325a79423361575230614430694d6a51694947686c6157646f644430694d6a516949485a705a58644362336739496a41674d4341794e4341794e4349675a6d6c7362443069626d39755a53496765473173626e4d39496d6830644841364c79393364336375647a4d7562334a6e4c7a49774d44417663335a6e496a3438634746306143426b50534a4e4d5449674d6b7730494464574d54644d4d5449674d6a4a4d4d6a41674d5464574e3077784d694179576949675a6d6c736244306964326870644755694c7a34384c334e325a7a343d267374796c653d666f722d7468652d6261646765" alt="Kiro"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/devops-agent/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2cccc7fc811a2c85bb42de7adb48f816cc220c1cf8ab2dd894cbddb938c96ab1/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532532304465764f70732532304167656e742d4175746f6e6f6d6f75732532304f70732d4646393930303f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="AWS DevOps Agent"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/lambda/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/27ec8ce949c39eca034ccd1684eb245e35b3642da7bbd83463606d6ccd5750f1/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532532304c616d6264612d5365727665726c6573732d4646393930303f6c6f676f3d6177736c616d626461266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="AWS Lambda"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/dynamodb/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f1930ecbfe81f1c17aef44b24e89c80a2c64f358d93584fe4a36d8340cc168db/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f416d617a6f6e25323044796e616d6f44422d476c6f62616c2532305461626c65732d3430353344363f6c6f676f3d616d617a6f6e64796e616d6f6462266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon DynamoDB"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/route53/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8cdb0d62d6a60fe2fc48c87a4ad1af01db17d63f7d07b6d040a035a8adc1fe5e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f526f75746525323035332d4661696c6f766572253230444e532d3843344646463f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon Route 53"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/cloudfront/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d0e0694e3b1ad9971a43bc03cc671f6a2c3035a8d713f412ec34e968c1b4f7d7/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c6f756446726f6e742d43444e2d3843344646463f6c6f676f3d616d617a6f6e617773266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Amazon CloudFront"&gt;&lt;/a&gt;
&lt;a href="https://nextjs.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e4987d8ec5523bda97f9a5862a7f29156a391f89d7fad452858e051a64179762/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e6578742e6a732d46726f6e74656e642d3030303030303f6c6f676f3d6e657874646f746a73266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Next.js"&gt;&lt;/a&gt;
&lt;a href="https://www.python.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/344c953e0c4edc545a7acd96ef5e5f28277afd590b1f140ea99144b12de64f31/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e253230332e31322d52756e74696d652d3337373641423f6c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465267374796c653d666f722d7468652d6261646765" alt="Python"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;                    payledger.yourdomain.com (CloudFront + S3)
                              │
                         Next.js static UI (balance, transactions, region indicator)
                              │
                              ▼
                    api-payledger.yourdomain.com
                              │
                    Route 53 failover routing
                    ├── PRIMARY   ──▶ apse1-api-payledger.yourdomain.com  ← health check
                    └── SECONDARY ──▶ apne1-api-payledger.yourdomain.com  ← health check
                    TTL: 60s | health check: 10s interval, 2 failures to trip
                              │
               ┌──────────────┴──────────────┐
               │                             │
    ap-southeast-1 (Singapore)     ap-northeast-1 (Tokyo)
    ├── API Gateway (regional)     ├── API Gateway (regional)
    ├── Lambda: createTransaction  ├── Lambda: createTransaction
    ├── Lambda: listTransactions   ├── Lambda:&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/romarcablao/payledger-aws-devops-agent" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-agent/features/" rel="noopener noreferrer"&gt;AWS DevOps Agent features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent.html" rel="noopener noreferrer"&gt;About AWS DevOps Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-supported-regions.html" rel="noopener noreferrer"&gt;AWS DevOps Agent Supported Regions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;Amazon DynamoDB Global Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html" rel="noopener noreferrer"&gt;Amazon Route 53 Failover Routing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html" rel="noopener noreferrer"&gt;Disaster Recovery of Workloads on AWS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>aiops</category>
      <category>disasterrecovery</category>
    </item>
    <item>
      <title>Turn WebSockets into Async/Await Requests (AWS WebSocket API Gateway + Lambda)</title>
      <dc:creator>Rishi</dc:creator>
      <pubDate>Sun, 03 May 2026 09:46:46 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/turn-websockets-into-asyncawait-requests-aws-api-websocket-gateway-lambda-1cpl</link>
      <guid>https://vibe.forem.com/aws-builders/turn-websockets-into-asyncawait-requests-aws-api-websocket-gateway-lambda-1cpl</guid>
      <description>&lt;p&gt;Some time ago, I was building a chat application using AWS Websocket API gateway. Things were going smoothly. I created a WebSocket API Gateway, added $connect, $disconnect, and sendMessage/addGroup routes. From the frontend (React) side, everything was fire-and-forget. You send a message, and the onMessageHandler takes care of it 💪🏼&lt;/p&gt;

&lt;p&gt;But then a new requirement of uploading files using S3 signed URLs came up. That's where I needed the Async/Await promise pattern. Now, one option was to create an HTTP API gateway and use it. But that meant a new connection, a new authorizer, and more setup. At that moment, I wished there was a way to use this existing WebSocket connection to get the signed URL ⭐&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwo0x4c82wxa6vfx76we.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwo0x4c82wxa6vfx76we.gif" alt="wish-movie-refernce" width="400" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s how this library "&lt;a href="https://www.npmjs.com/package/@tricksumo/ws-await" rel="noopener noreferrer"&gt;ws-await&lt;/a&gt;" was born!&lt;/p&gt;

&lt;p&gt;It lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;establish a WebSocket connection&lt;/li&gt;
&lt;li&gt;send normal fire-and-forget messages&lt;/li&gt;
&lt;li&gt;send messages and wait for the response using async/await&lt;/li&gt;
&lt;li&gt;handle reconnection with exponential backoff&lt;/li&gt;
&lt;li&gt;auto-send heartbeat messages to keep the connection alive&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;For the async/await pattern, every message from the frontend is sent with a unique &lt;code&gt;requestId&lt;/code&gt;. The client keeps a map of pending promises keyed by that id. On the Lambda side, the backend reads the &lt;code&gt;requestId&lt;/code&gt; and sends it back in the response.&lt;/p&gt;

&lt;p&gt;Each received message is checked for a &lt;code&gt;requestId&lt;/code&gt;; if it matches the id of a pending promise in the map, that promise is resolved. If a promise sits idle in the map for more than ~30 seconds, it is rejected.&lt;/p&gt;
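&lt;p&gt;To make the correlation concrete, here is a minimal conceptual sketch of that idea (written in Python purely for illustration; the library itself is JavaScript, and these names are made up — &lt;code&gt;ws&lt;/code&gt; stands in for any object with an async &lt;code&gt;send&lt;/code&gt; method):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio
import json
import uuid

# Pending map: requestId mapped to the Future that will resolve with the reply.
pending = {}

async def send_and_wait(ws, action, payload, timeout=30.0):
    """Send a message tagged with a unique requestId and wait for the matching reply."""
    request_id = str(uuid.uuid4())
    future = asyncio.get_running_loop().create_future()
    pending[request_id] = future
    await ws.send(json.dumps({"action": action, "requestId": request_id, **payload}))
    try:
        # Rejected (asyncio.TimeoutError) if the reply never arrives within the timeout.
        return await asyncio.wait_for(future, timeout)
    finally:
        pending.pop(request_id, None)

def on_message(raw):
    """Resolve the pending Future whose requestId matches the incoming message."""
    message = json.loads(raw)
    future = pending.get(message.get("requestId"))
    if future is not None and not future.done():
        future.set_result(message)
&lt;/code&gt;&lt;/pre&gt;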

&lt;h2&gt;
  
  
  Steps to use:
&lt;/h2&gt;

&lt;p&gt;Step 1: Install the library in your React project&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm install @tricksumo/ws-await zustand&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Step 2: Import &lt;code&gt;createSocket()&lt;/code&gt; and establish the connection. Then call &lt;code&gt;ws.send("action")&lt;/code&gt; for fire-and-forget messages and &lt;code&gt;await ws.request("action")&lt;/code&gt; for the async/await pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createSocket&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@tricksumo/ws-await&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useEffect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createSocket&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wss://id.execute-api.us-east-1.amazonaws.com/prod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;App&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disconnect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleGetSignedURL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;getSignedURL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;fileType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image/png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Signed URL:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Request failed:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleGetSignedURL&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="nx"&gt;Click&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="kd"&gt;get&lt;/span&gt; &lt;span class="nx"&gt;signed&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;
        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;export default App&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Step 3: Your Lambda must echo the &lt;code&gt;requestId&lt;/code&gt; back in its response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileType&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;signedUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getPresignedUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="nx"&gt;signedUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// ← REQUIRED: echo it back or the Promise never resolves&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To read the connection state in the frontend, use the &lt;code&gt;useSocket&lt;/code&gt; hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useSocket&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@tricksumo/ws-await&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;StatusBar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;isConnected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isConnecting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useSocket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isConnecting&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Connecting&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;isConnected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Disconnected&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Connected&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initially, this logic was part of my chat application named &lt;a href="https://tricksumo.com/serverless-chat-app/" rel="noopener noreferrer"&gt;Chatlings&lt;/a&gt;. But I thought it might help others, so I extracted it to create my first ever library 🙌🏼&lt;/p&gt;

</description>
      <category>aws</category>
      <category>websocket</category>
      <category>lambda</category>
      <category>javascriptlibraries</category>
    </item>
    <item>
      <title>Stateful MCP Servers on ECS Fargate: What Happens When You Deploy</title>
      <dc:creator>Avinash Dalvi</dc:creator>
      <pubDate>Sun, 03 May 2026 02:46:00 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/stateful-mcp-servers-on-ecs-fargate-what-happens-when-you-deploy-12l9</link>
      <guid>https://vibe.forem.com/aws-builders/stateful-mcp-servers-on-ecs-fargate-what-happens-when-you-deploy-12l9</guid>
      <description>&lt;p&gt;A few weeks back I was working on a PoC with Bedrock AgentCore Runtime. While doing that I came across multiple blogs and discussions around MCP server hosting on AWS. Most of them were pointing to either Bedrock AgentCore or Lambda. Very few talked about ECS Fargate.&lt;/p&gt;

&lt;p&gt;That got me thinking. I have been using Fargate for containerised workloads for a while now. It is my go-to when a team needs containers without managing the underlying infrastructure. So the question came naturally — can Fargate host a stateful MCP server? And more importantly, what happens when you actually deploy it in a real scenario?&lt;/p&gt;

&lt;p&gt;As an architect I believe you should know all the options before recommending one. Not just what the docs say — what actually happens when you run it. So I decided to test it myself.&lt;/p&gt;

&lt;p&gt;This blog is what I found. Specifically what happens when you run a stateful MCP server on ECS Fargate and then do a rolling deployment while a session is active. The results were not what I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP hosting on AWS — what are your options?
&lt;/h2&gt;

&lt;p&gt;Before jumping into the experiment, let me give some context on why Fargate and not the other options.&lt;/p&gt;

&lt;p&gt;When it comes to hosting MCP servers on AWS you have three realistic paths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bedrock AgentCore Runtime&lt;/strong&gt; is AWS's managed MCP hosting service. You write your MCP server, deploy it, and AgentCore handles session isolation at the platform level. It supports both stateless and stateful MCP servers. By default stateless mode is recommended — AgentCore automatically adds an &lt;code&gt;Mcp-Session-Id&lt;/code&gt; header and manages connection continuity at the platform level. For multi-turn interactions that need session state preserved across requests, stateful mode (&lt;code&gt;stateless_http=False&lt;/code&gt;) is available and the runtime handles session preservation within the same invocation. The key difference from running stateful MCP on Fargate yourself: AgentCore manages the session layer for you regardless of mode. You are not responsible for sticky sessions, deregistration delays, or what happens to your session during a platform update. That operational burden stays with AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt; comes in two modes now and the difference matters for MCP.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Standard Lambda is stateless by nature. Cold starts are a latency concern — and since August 2025 also a cost concern, as AWS now bills the INIT phase the same as invocation duration. For lightweight or infrequent MCP tool calls this is still simple and cost-effective. But for agent workloads where a session expects low latency tool calls, standard Lambda cold starts can be disruptive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda Managed Instances (LMI)&lt;/strong&gt; changes the picture. LMI runs your Lambda functions on EC2 instances in your own account — AWS still manages the instance lifecycle, patching and scaling, but your functions run on longer-lived compute. The result: no cold starts at all, multi-concurrency support where each execution environment handles multiple invocations simultaneously, and EC2-based pricing which can be significantly cheaper for steady-state workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For MCP specifically, LMI is an interesting option for lighter workloads that need low latency tool calls without cold start risk, while keeping the serverless programming model. The constraint is the same as standard Lambda — stateless by nature, so session context still has to live somewhere else. But the cold start objection largely disappears with LMI.&lt;/p&gt;

&lt;p&gt;LMI is designed for steady-state predictable workloads — it scales more gradually than standard Lambda and does not burst instantly. If your MCP workload has very spiky or unpredictable traffic, standard Lambda or Fargate may still be better suited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ECS Fargate&lt;/strong&gt; gives you your own container, your own session model, your own trade-offs. Fits teams already running Fargate workloads, teams with compliance or data residency requirements, or teams building something the managed service does not support yet. More control, more responsibility.&lt;/p&gt;

&lt;p&gt;I chose Fargate because I already use it and wanted to understand what it actually does with stateful MCP under real conditions — not a happy path demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the experiment — with help from Kiro
&lt;/h2&gt;

&lt;p&gt;When I started looking at the AWS sample repository for stateful MCP on ECS — &lt;a href="https://github.com/aws-samples/sample-serverless-mcp-servers/tree/main/stateful-mcp-on-ecs-python" rel="noopener noreferrer"&gt;aws-samples/sample-serverless-mcp-servers&lt;/a&gt; — I found it was SAM based. It also expected VPC, CIDR, ALB and other networking prerequisites to be in place before running &lt;code&gt;sam deploy&lt;/code&gt;. That meant doing a lot of manual setup before I could even start the experiment.&lt;/p&gt;

&lt;p&gt;I did not want to spend my weekend debugging SAM prerequisites. I wanted to get to the actual experiment.&lt;/p&gt;

&lt;p&gt;So I decided to build the infrastructure from scratch. And this is where &lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; helped. I used Kiro — AWS's agentic IDE — to scaffold the entire experiment setup: the FastMCP server, the CDK infrastructure including VPC, ALB, ECS cluster and Fargate task definition, and the test client.&lt;/p&gt;

&lt;p&gt;Here is what I built:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A stateful FastMCP server in Python holding session state in memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ALB with sticky sessions enabled — &lt;code&gt;lb_cookie&lt;/code&gt; type, 1 day duration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ECS Fargate service with 2 tasks and rolling deployment configured&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A test client using &lt;code&gt;httpx&lt;/code&gt; with a persistent cookie jar, making continuous tool calls every 5 seconds&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Task ID instrumented in every tool response by fetching from the ECS container metadata endpoint&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ECS_CONTAINER_METADATA_URI_V4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;TASK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TaskARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# fetched once at server startup
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I deliberately did not add a &lt;code&gt;SIGTERM&lt;/code&gt; handler, did not externalise session state, and did not add any retry logic. I wanted to observe the default — what the pattern actually does out of the box before any hardening. The test client ran two operations per cycle: &lt;code&gt;set_session_value&lt;/code&gt; to write state, followed immediately by &lt;code&gt;get_session_state&lt;/code&gt; to read it back and confirm. Session state accumulated across calls — &lt;code&gt;seq_1&lt;/code&gt;, &lt;code&gt;seq_2&lt;/code&gt;, &lt;code&gt;seq_3&lt;/code&gt; and so on — so any loss of state would be immediately visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I observed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The setup confirmed
&lt;/h3&gt;

&lt;p&gt;Before triggering any deployment I confirmed the ALB configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetiwinmdpbexyjso9j8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetiwinmdpbexyjso9j8m.png" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything correctly configured as the documentation recommends. If you are new to sticky sessions and why they matter for stateful workloads, the &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/load-balancer-stickiness/welcome.html" rel="noopener noreferrer"&gt;AWS Prescriptive Guidance on load balancer stickiness&lt;/a&gt; is a good starting point.&lt;/p&gt;

&lt;p&gt;I started the test client and let it run. Mid-session, I triggered a forced rolling deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecs update-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; YOUR_CLUSTER &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service&lt;/span&gt; YOUR_MCP_SERVICE &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--force-new-deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what the logs showed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding 1: The cookie rotation red herring
&lt;/h3&gt;

&lt;p&gt;The first thing I noticed in the logs was the &lt;code&gt;AWSALB&lt;/code&gt; cookie changing on every single response — from call 1, before any deployment was triggered.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"call_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"set_session_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"507bf31b..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"session_cookie_changed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cookie_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWSALB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"old_cookie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OWXI55yd..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"new_cookie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"XItyHvSg..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"call_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"set_session_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"507bf31b..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"session_cookie_changed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cookie_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWSALB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"old_cookie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"XItyHvSg..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"new_cookie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"md8WEI7W..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cookie changing every call. Naturally the first instinct is — stickiness is broken. Requests are bouncing between tasks.&lt;/p&gt;

&lt;p&gt;But look at the task ID. &lt;code&gt;507bf31b&lt;/code&gt; — same on every single successful call across all 39 calls before failure. The ALB was routing to the same task the entire time despite the cookie changing.&lt;/p&gt;

&lt;p&gt;What is actually happening: the ALB re-encrypts the sticky cookie token on every response even when routing to the same target. The cookie value rotates but the target it encodes stays the same. This is normal ALB behaviour — it is not routing instability.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineering judgment:&lt;/strong&gt; If you see cookie rotation in your logs and start debugging stickiness, you will spend days on the wrong problem. The cookie value is irrelevant. The target it encodes is what matters. Verify using task ID in your responses, not by watching the cookie.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This also has an important implication: any MCP client that captures the sticky cookie once at session initialisation and reuses it without updating — which is the natural implementation — will break stickiness the moment it sends a stale cookie value. The ALB will treat it as a new session and route via round-robin. With 2 tasks running that means a 50% chance of landing on the wrong task on every call.&lt;/p&gt;

&lt;p&gt;My test client used &lt;code&gt;httpx.Client&lt;/code&gt; with a persistent cookie jar that automatically updates on every response. That is what kept the session alive across 39 calls. The aws-samples repo mentions patching for cookie handling — but does not explain why updating on every response is critical, not just at session init.&lt;/p&gt;
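&lt;p&gt;To make that concrete, here is a stripped-down sketch of the cookie behaviour (placeholder URL, simplified request body, and none of the MCP session handshake): &lt;code&gt;httpx.Client&lt;/code&gt; keeps one cookie jar for its whole lifetime, so the rotated &lt;code&gt;AWSALB&lt;/code&gt; value from each response is stored and sent automatically on the next call.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import httpx

MCP_URL = "https://YOUR_ALB_DNS/mcp"  # placeholder, not the real deployment

# One client = one shared cookie jar. Every Set-Cookie from the ALB replaces the
# stored AWSALB value, so the next request always carries the freshest token.
# Capturing the cookie once and replaying it by hand is what breaks stickiness.
with httpx.Client(timeout=10.0) as client:
    for call_number in range(1, 6):
        resp = client.post(MCP_URL, json={"probe": call_number})  # simplified body
        print(call_number, resp.status_code, resp.cookies.get("AWSALB"))
&lt;/code&gt;&lt;/pre&gt;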

&lt;h3&gt;
  
  
  Finding 2: The atomic failure
&lt;/h3&gt;

&lt;p&gt;This is the central finding.&lt;/p&gt;

&lt;p&gt;At &lt;code&gt;15:20:12 UTC&lt;/code&gt;, call number 39's &lt;code&gt;set_session_value&lt;/code&gt; succeeded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-26T15:20:12.067393+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"call_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"set_session_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"507bf31beb2f41abae593f5cfd023b5e"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"seq_1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"seq_39"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_39"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five seconds later, call number 39's &lt;code&gt;get_session_state&lt;/code&gt; failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-26T15:20:17.134593+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"call_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_session_state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-32600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Session not found"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same call number. Same MCP session ID. Same logical operation — write then read. The write succeeded on task &lt;code&gt;507bf31b&lt;/code&gt;. The read landed on the new task. The new task had no knowledge of that session. 404.&lt;/p&gt;

&lt;p&gt;The gap was 5 seconds. In those 5 seconds the deregistration delay expired, the old task was removed from the ALB target group, and the next request was routed to the replacement task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not an eventual consistency problem. This is an atomic operation split across a task boundary.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An AI agent that writes state and immediately reads it back to confirm — which is the natural pattern for any tool that modifies and verifies — cannot do so safely across a deployment boundary. The write may have landed on a task that no longer exists by the time the read arrives. The agent cannot tell the difference between "session not found because I have a bug" and "session not found because my task was replaced 5 seconds ago." It cannot retry safely. It cannot roll back. The state is in an unknown condition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding 3: Your monitoring will show nothing
&lt;/h3&gt;

&lt;p&gt;This is what makes this failure mode operationally dangerous.&lt;/p&gt;

&lt;p&gt;During the entire failure sequence — calls 39 through 50, all returning 404 — here is what your monitoring shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;ALB: healthy targets, no 5xx errors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ECS service: desired count met, tasks running&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CloudWatch alarms: nothing triggered&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ECS service events: deployment completed successfully&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The failure is &lt;code&gt;code: -32600 "Session not found"&lt;/code&gt; — a JSON-RPC application error, not an HTTP error. Your ALB access logs show 404 responses but 404 is not typically alarmed in most setups. And even if it is, the error message is indistinguishable from a bug in your tool implementation.&lt;/p&gt;

&lt;p&gt;Your on-call engineer will look at the infrastructure dashboard and see green. Your application engineer will look at the error and check their code. Both will find nothing wrong. The failure lives in the gap between the deployment event and the application layer — and nothing connects them automatically.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineering judgment:&lt;/strong&gt; If you are running stateful MCP on Fargate you need an application-level alarm specifically on &lt;code&gt;-32600&lt;/code&gt; errors correlated with deployment events. Infrastructure health checks will not catch this.&lt;/p&gt;
&lt;/blockquote&gt;
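&lt;p&gt;One way to wire that up, sketched with boto3 and assuming your tasks log the JSON-RPC error body to a CloudWatch log group (the log group, namespace, and metric names below are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

LOG_GROUP = "/ecs/mcp-server"  # placeholder: wherever your task logs land

# Count occurrences of the JSON-RPC "Session not found" code in application logs.
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="mcp-session-not-found",
    filterPattern='"-32600"',
    metricTransformations=[{
        "metricName": "McpSessionNotFound",
        "metricNamespace": "Custom/MCP",
        "metricValue": "1",
        "defaultValue": 0,
    }],
)

# Alarm on any occurrence; correlate alarm timestamps with ECS deployment events.
cloudwatch.put_metric_alarm(
    AlarmName="mcp-session-loss",
    Namespace="Custom/MCP",
    MetricName="McpSessionNotFound",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
&lt;/code&gt;&lt;/pre&gt;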

&lt;p&gt;One more safety net that will not help here: the &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-circuit-breaker.html" rel="noopener noreferrer"&gt;ECS deployment circuit breaker&lt;/a&gt;. The circuit breaker triggers on tasks that fail to reach RUNNING state or fail health checks. In this failure mode your new task is RUNNING, your health check passes, and ECS considers the deployment successful. The circuit breaker has no visibility into whether active MCP sessions were lost during the transition. The failure passes every gate AWS provides automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding 4: The deregistration delay is your session cliff timer
&lt;/h3&gt;

&lt;p&gt;AWS documents the deregistration delay as a connection draining setting. For stateful MCP on Fargate it is actually your session survival window — the countdown timer between when a deployment starts and when your session dies.&lt;/p&gt;

&lt;p&gt;Across my runs with different configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Tasks&lt;/th&gt;
&lt;th&gt;Deregistration delay&lt;/th&gt;
&lt;th&gt;Session survived until&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;300s (default)&lt;/td&gt;
&lt;td&gt;Call 47 — ~61s after trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Changed&lt;/td&gt;
&lt;td&gt;Call 48 — ~4 min after trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;300s&lt;/td&gt;
&lt;td&gt;Call 39 — ~3.5 min after trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;300s&lt;/td&gt;
&lt;td&gt;Call 39 — atomic failure at 5s gap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The deregistration delay controlled the survival window in every run. Not the stickiness duration (86400 seconds — that number is fiction during deployments). Not the task count. The deregistration delay alone.&lt;/p&gt;

&lt;p&gt;But here is the honest conclusion: no value of deregistration delay removes the failure. It only changes when the cliff arrives. A 30 second delay means your session cliff is 30 seconds after deployment. A 900 second delay means your session survives longer but your old tasks linger for 15 minutes, slowing rollbacks and increasing cost. You are not solving the problem — you are choosing when to accept the loss.&lt;/p&gt;

&lt;p&gt;One more thing worth noting here: Fargate's default &lt;code&gt;stopTimeout&lt;/code&gt; is 30 seconds (&lt;a href="https://aws.amazon.com/blogs/compute/deep-dive-into-fargate-spot-to-run-your-ecs-tasks-for-up-to-70-less/" rel="noopener noreferrer"&gt;AWS reference&lt;/a&gt;). If you do not set a SIGTERM handler and raise this value, ECS will SIGKILL the container within 30 seconds of sending SIGTERM — regardless of your deregistration delay. So even if you set a 300 second deregistration delay, an unhandled SIGTERM means your session gets a hard kill within 30 seconds. The deregistration delay and stopTimeout work together — both need to be tuned, not just one.&lt;/p&gt;
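&lt;p&gt;For orientation, here is roughly where those two knobs live in CDK. This is a sketch with placeholder names, ports, and a public base image, not the exact stack from the repo:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from aws_cdk import App, Stack, Duration
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_elasticloadbalancingv2 as elbv2

app = App()
stack = Stack(app, "McpDeploySketch")

vpc = ec2.Vpc(stack, "Vpc", max_azs=2)
cluster = ecs.Cluster(stack, "Cluster", vpc=vpc)

task_definition = ecs.FargateTaskDefinition(stack, "TaskDef", cpu=256, memory_limit_mib=512)
task_definition.add_container(
    "mcp",
    image=ecs.ContainerImage.from_registry("public.ecr.aws/docker/library/python:3.12-slim"),
    port_mappings=[ecs.PortMapping(container_port=8000)],
    stop_timeout=Duration.seconds(120),  # how long ECS waits after SIGTERM before SIGKILL
    logging=ecs.LogDrivers.aws_logs(stream_prefix="mcp"),
)

service = ecs.FargateService(
    stack, "Service", cluster=cluster, task_definition=task_definition, desired_count=2
)

alb = elbv2.ApplicationLoadBalancer(stack, "Alb", vpc=vpc, internet_facing=True)
listener = alb.add_listener("Http", port=80)
listener.add_targets(
    "McpTargets",
    port=8000,
    protocol=elbv2.ApplicationProtocol.HTTP,
    targets=[service],
    stickiness_cookie_duration=Duration.days(1),  # ALB lb_cookie stickiness
    deregistration_delay=Duration.seconds(300),   # the session cliff timer discussed above
    health_check=elbv2.HealthCheck(path="/health"),
)

app.synth()
&lt;/code&gt;&lt;/pre&gt;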

&lt;p&gt;A minimal SIGTERM handler in FastMCP looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_sigterm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SIGTERM received — draining active sessions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# stay alive within stopTimeout window before exit
&lt;/span&gt;    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGTERM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_sigterm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sleep value must be less than your &lt;code&gt;stopTimeout&lt;/code&gt; setting. If &lt;code&gt;stopTimeout&lt;/code&gt; is 30 seconds (default) and you sleep 25, the handler completes cleanly. If you forget to raise &lt;code&gt;stopTimeout&lt;/code&gt; above 30 seconds and sleep longer, SIGKILL fires before the handler finishes.&lt;/p&gt;

&lt;p&gt;One related consideration worth flagging: if your health check endpoint and MCP handler run in separate processes or on different ports, a new task can pass the ALB health check before the MCP handler is fully initialised — ECS has no native readiness probe separation the way Kubernetes does. In my implementation both run in the same uvicorn process on port 8000, so if the health check passes the MCP handler is already up. But if your setup is different, design for this explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means architecturally
&lt;/h2&gt;

&lt;p&gt;You have three honest options. I will be clear about which ones I have tested and which are architectural paths for a follow-up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — Design for the failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Make your MCP tools idempotent. If a write-then-read pair fails, the client can retry the full operation safely without risk of duplicate side effects. This works for tools that are naturally idempotent — read-heavy tools, query tools, lookup tools. It fails for tools that modify external state once — sending a message, creating a record, triggering a payment. If your agent workflow has side effects, idempotency alone is not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B — Externalise session state&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Move session storage to ElastiCache Redis or DynamoDB. The session is no longer tied to a specific task — any task can serve any session. Rolling deployments become safe because the new task can find the session in the external store. This eliminates the failure mode entirely.&lt;/p&gt;

&lt;p&gt;The cost: the MCP SDK does not support external session persistence natively. You need to patch the session layer. Every tool call now has an external store read/write on the hot path — latency increases. Operational complexity increases. This is the right answer for multi-turn agent workflows that genuinely cannot tolerate session loss. I have not built this yet — it is the subject of a follow-up experiment.&lt;/p&gt;
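&lt;p&gt;As a rough sketch of the storage side only (the session-layer patch in the MCP SDK is the harder part and out of scope here), with a hypothetical DynamoDB table keyed on the MCP session id:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

# Hypothetical table "mcp-session-state" with partition key "session_id".
# Any task can load or persist a session, so a rolling deployment no longer
# destroys state that lived only in one container's memory.
table = boto3.resource("dynamodb").Table("mcp-session-state")

def load_state(session_id):
    item = table.get_item(Key={"session_id": session_id}, ConsistentRead=True).get("Item")
    return item.get("state", {}) if item else {}

def save_state(session_id, state):
    table.put_item(Item={"session_id": session_id, "state": state})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A TTL attribute on that table is the natural way to expire abandoned sessions, and the extra read/write per tool call is the latency cost mentioned above.&lt;/p&gt;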

&lt;p&gt;&lt;strong&gt;Option C — Go stateless, let the platform handle sessions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is what Bedrock AgentCore chose. Stateless MCP server, session isolation managed at the platform layer. The application never owns session state — the infrastructure does. Zero risk of the failure mode I described above.&lt;/p&gt;

&lt;p&gt;The cost: you give up control over the session model. You take on the constraints of the managed service. If you have compliance requirements around data residency or need session behaviour the platform does not support, this path is not available to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  So is Fargate a good fit for stateful MCP?
&lt;/h2&gt;

&lt;p&gt;It depends — but not in the vague way that phrase usually means. Here is a more specific answer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fargate is a good fit if&lt;/strong&gt; your MCP tools are idempotent and session loss during deployments is acceptable or recoverable in your workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fargate with externalised session state is a good fit if&lt;/strong&gt; you need stateful multi-turn sessions, have compliance or control requirements that rule out managed services, and are willing to own the additional complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fargate with in-memory stateful sessions and the default configuration is not production-ready&lt;/strong&gt; for agent workloads that cannot tolerate session loss. The AWS sample pattern works. Until you deploy. And in production, you deploy all the time.&lt;/p&gt;

&lt;p&gt;If you are building something lighter — a few tools, mostly stateless, occasional multi-turn — Fargate is capable and operationally straightforward. If you are building something larger — long-running agent sessions, complex state, frequent deployments — you need to solve the session persistence problem before you go to production.&lt;/p&gt;

&lt;p&gt;That is the answer I was looking for when I started this. Now I have it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;p&gt;The experiment is not finished. The next step is to actually build Option B — externalise session state to Redis, run the same deployment experiment, and show whether the atomic failure disappears. That blog will have the same structure: real logs, real task IDs, real failure or real fix.&lt;/p&gt;

&lt;p&gt;If you are trying to make this decision for a real workload and want to talk through it, find me on X or LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All experiment code is available on&lt;/em&gt; &lt;a href="https://github.com/AvinashDalvi89/stateful-mcp-on-ecs-fargate-example" rel="noopener noreferrer"&gt;Stateful MCP Server on ECS Fargate - GitHub&lt;/a&gt;&lt;em&gt;. The test client, CDK infrastructure, and FastMCP server with task ID instrumentation are all in that repository.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ecs</category>
      <category>aws</category>
      <category>fargate</category>
    </item>
    <item>
      <title>The Council has Decided</title>
      <dc:creator>mgbec</dc:creator>
      <pubDate>Sat, 02 May 2026 23:14:11 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/the-council-has-decided-11jh</link>
      <guid>https://vibe.forem.com/aws-builders/the-council-has-decided-11jh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp0anzjdy4xsrr0o0w3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp0anzjdy4xsrr0o0w3c.png" width="596" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of the most interesting developments in recent generative AI work are the different ways we can ask models and agents to work together to solve our tasks. We have orchestration, choreography, and every permutation we can think of.&lt;/p&gt;

&lt;p&gt;One of the concepts that many of us have experimented with is the LLM Council pattern from Andrej Karpathy at &lt;a href="https://github.com/karpathy/llm-council" rel="noopener noreferrer"&gt;https://github.com/karpathy/llm-council&lt;/a&gt;. This project sets up three configurable models and asks each of them the user’s question. The answers from each model go through peer review and ranking. Finally, the chairman of the LLM Council compiles the responses into a final judgement.&lt;/p&gt;

&lt;p&gt;Why would we choose this framework? Each model has its own combination of strengths and weaknesses. We can produce more accurate, more diverse, and more complete answers by combining the best of each.&lt;/p&gt;

&lt;p&gt;I built a variant of this using AWS AgentCore &lt;a href="https://github.com/mgbec/Council-agents" rel="noopener noreferrer"&gt;https://github.com/mgbec/Council-agents&lt;/a&gt;. I substituted a few of Andrej Karpathy’s components with AgentCore elements:&lt;/p&gt;

&lt;p&gt;Instead of OpenRouter + FastAPI + JSON files, this version uses:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Bedrock for multi-model access (Claude, Llama, Mistral, etc.)
&lt;/li&gt;
&lt;li&gt;AgentCore Runtime for serverless hosting with session management
&lt;/li&gt;
&lt;li&gt;AgentCore Memory for conversation persistence across sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I did substitute some of the models with different versions (easy to change in &lt;code&gt;config.py&lt;/code&gt;):&lt;br&gt;&lt;br&gt;
 COUNCIL_MODELS =&lt;br&gt;&lt;br&gt;
&lt;code&gt;us.anthropic.claude-sonnet-4-20250514-v1:0&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;us.meta.llama4-maverick-17b-instruct-v1:0&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;mistral.mistral-large-2411-v1:0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;CHAIRMAN_MODEL = &lt;code&gt;us.anthropic.claude-sonnet-4-20250514-v1:0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The basic functions are still the same, however:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ask question and receive individual responses:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn6ceqe0eect2yure0ix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn6ceqe0eect2yure0ix.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Peer ranking:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap0ounicv5e2dk2w4hmj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap0ounicv5e2dk2w4hmj.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd5kw9opoy5akjov5mzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd5kw9opoy5akjov5mzf.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Then a final Council decision is made:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfx6d41ixwjrbf1lvgf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfx6d41ixwjrbf1lvgf1.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is all hosted on AWS with a React frontend. The workflow keeps credentials server-side, authenticates users via Cognito, and serves the React app from CloudFront.&lt;/p&gt;

&lt;p&gt;Some of the learning opportunities I had:&lt;/p&gt;

&lt;p&gt;* API Gateway REST APIs have a hard 29-second timeout, but the council takes 30–90 seconds. To work around this, the system uses an async pattern: the frontend submits a request (instant response with a request ID), then polls for the result every 5 seconds. The heavy work runs in a separate SQS-triggered Lambda that is not bound by the 29-second limit.&lt;/p&gt;
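&lt;p&gt;A rough sketch of the submit side of that pattern (table, queue, and field names here are placeholders, not the exact project code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import os
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

def handler(event, context):
    """API Gateway-backed Lambda: record the request, queue the heavy work,
    and return a request ID immediately so the frontend can start polling."""
    request_id = str(uuid.uuid4())
    body = json.loads(event.get("body") or "{}")

    dynamodb.Table(os.environ["REQUESTS_TABLE"]).put_item(
        Item={"request_id": request_id, "status": "PENDING", "question": body.get("question", "")}
    )
    sqs.send_message(
        QueueUrl=os.environ["COUNCIL_QUEUE_URL"],
        MessageBody=json.dumps({"request_id": request_id, "question": body.get("question", "")}),
    )

    # The SQS-triggered worker runs the council (30-90 seconds) and writes the
    # result back to the table; a separate GET endpoint serves it once status is DONE.
    return {"statusCode": 202, "body": json.dumps({"requestId": request_id})}
&lt;/code&gt;&lt;/pre&gt;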

&lt;p&gt;* I originally tried a Lambda Function URL to work around the API Gateway timeout. It would have worked, but the way I had it implemented was not very secure. First, the Lambda function was set up as public, which was not safe at all. My second attempt was having the Lambda itself validate the Cognito JWT on every request. Validation checked token structure, expiration, issuer, app client ID, and that the key ID (kid) exists in the Cognito JSON Web Key Set. It did not do RSA signature verification, however, so I scrapped that plan for an async pattern with API Gateway, Lambdas, DynamoDB, and SQS. The full architecture is here: &lt;a href="https://github.com/mgbec/Council-agents/blob/master/architecture.md" rel="noopener noreferrer"&gt;https://github.com/mgbec/Council-agents/blob/master/architecture.md&lt;/a&gt;, but here is a quick synopsis of the part in question:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd1e63xmgu9s0ct96ql3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd1e63xmgu9s0ct96ql3.png" width="739" height="696"&gt;&lt;/a&gt;&lt;/p&gt;
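&lt;p&gt;For reference, the RSA signature check is the piece my Function URL attempt skipped. With PyJWT it looks roughly like this (the region, user pool ID, and app client ID below are placeholders; this validates a Cognito ID token, whose &lt;code&gt;aud&lt;/code&gt; claim is the app client ID):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import jwt  # PyJWT
from jwt import PyJWKClient

REGION = "us-east-1"                     # placeholder
USER_POOL_ID = "us-east-1_EXAMPLE"       # placeholder
APP_CLIENT_ID = "example-app-client-id"  # placeholder

ISSUER = f"https://cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}"
jwks_client = PyJWKClient(f"{ISSUER}/.well-known/jwks.json")


def verify_id_token(token):
    """Verify a Cognito ID token including its RS256 signature, not just its claims."""
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=APP_CLIENT_ID,
        issuer=ISSUER,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;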

&lt;p&gt;* For the AgentCore deployment, we can either use CodeZip and upload it to S3, or build a Docker image and push it to ECR. In the past I have used the Docker/ECR method, but Kiro told me that the best option for this project is the CodeZip method. “For this project, CodeZip (S3) is the right choice — it’s pure Python with pip-installable dependencies, nothing exotic in the runtime. Container mode is more useful when you need system-level packages, custom binaries, or a specific OS setup.”&lt;/p&gt;

&lt;p&gt;* The Lambda is used as a thin proxy that calls InvokeAgentRuntime, keeping the AgentCore ARN and AWS credentials server-side, never exposed to the browser. The Lambda then uses the Cognito sub claim to namespace AgentCore sessions, so each user’s memory stays isolated.&lt;/p&gt;
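&lt;p&gt;A hedged sketch of that proxy is below; the environment variable, the request body fields, and the InvokeAgentRuntime parameter names are my assumptions from the boto3 &lt;code&gt;bedrock-agentcore&lt;/code&gt; client and may need adjusting:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os

import boto3

agentcore = boto3.client("bedrock-agentcore")


def proxy_handler(event, context):
    """Thin proxy: the browser never sees the AgentCore ARN or AWS credentials."""
    claims = event["requestContext"]["authorizer"]["claims"]
    user_sub = claims["sub"]
    body = json.loads(event["body"])

    # Namespace the AgentCore session by the Cognito sub so each user's
    # conversation memory stays isolated from everyone else's.
    session_id = f"user-{user_sub}-session-{body['conversationId']}"

    response = agentcore.invoke_agent_runtime(
        agentRuntimeArn=os.environ["AGENT_RUNTIME_ARN"],
        runtimeSessionId=session_id,
        payload=json.dumps({"prompt": body["prompt"]}),
    )
    # The response payload is typically a streaming body; adjust the read for your case.
    return {"statusCode": 200, "body": response["response"].read().decode("utf-8")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;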

&lt;p&gt;* I really enjoy AgentCore Observability in all of its detail. For this project I didn’t see any sessions captured in the Observability Dashboard. I saw plenty of traces but no sessions at all. I asked Kiro about that, and the answer was “The issue is that our agent code (main.py) uses raw boto3 calls via bedrock_client.py rather than the Strands Agent framework. When you use a Strands Agent() with the BedrockAgentCoreApp, the framework automatically propagates session context into the OTEL spans. Our code bypasses that — it just calls boto3.client(“bedrock-runtime”).converse() directly, so the traces show the Bedrock calls but don’t associate them with the AgentCore session.”&lt;/p&gt;

&lt;p&gt;Kiro suggested two possible fixes to see the sessions in AgentCore Observability. The agent code would need to use one of these options:&lt;br&gt;&lt;br&gt;
 - Use a Strands Agent with session management (the framework handles OTEL context automatically)&lt;br&gt;&lt;br&gt;
 - Manually inject the session ID into the OTEL span attributes&lt;/p&gt;
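&lt;p&gt;For context, the second option amounts to something like the snippet below with the OpenTelemetry SDK. The open question (and the part I never got working) is which attribute or baggage key the Observability dashboard actually groups sessions on:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("council-agent")


def converse_with_session(session_id, prompt):
    # Attach the AgentCore session ID both as OTEL baggage and as a span attribute.
    ctx = baggage.set_baggage("session.id", session_id)
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("council.converse") as span:
            span.set_attribute("session.id", session_id)
            # ... the raw boto3 bedrock-runtime converse() call goes here ...
            pass
    finally:
        context.detach(token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;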

&lt;p&gt;I did attempt to refactor to use the Strands Agent session management, but this created a metastasizing string of errors. I also tried manually injecting the session ID into the span attributes, which did not work. Finally, I tried something I saw in this document about OTEL baggage:&lt;br&gt;&lt;br&gt;
 &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;
 I had no luck with that either, so sessions in the Observability Dashboard are a problem for another day.&lt;/p&gt;

&lt;p&gt;* Kiro was great at fixing the Dependabot vulnerabilities when asked to:&lt;br&gt;&lt;br&gt;
 “All 9 vulnerabilities fixed — npm audit fix updated 13 packages and now shows 0 vulnerabilities. Let me verify the build still works, then commit:”&lt;/p&gt;

&lt;p&gt;This was a fun way to implement Andrej Karpathy’s LLM Council idea. The next steps for me might be fixing the session observability, speeding up the responses, or trying a cheaper model. I asked my council to recommend a cost-effective model for the chairman role, and the response was actually quite snappy. It recommended Claude 3 Haiku for the reasons shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzwvlxvz2kdm9ecx6xx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzwvlxvz2kdm9ecx6xx4.png" width="634" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftharpme6r2d3lnnbvqva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftharpme6r2d3lnnbvqva.png" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m looking forward to all the creativity, new arrangements and workflows we will see in the future. Thanks for reading!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>awscognito</category>
      <category>agents</category>
      <category>bedrockagentcore</category>
    </item>
    <item>
      <title>AGENTS.md, SKILL.md, DESIGN.md: How AI Instructions Split into Three Layers</title>
      <dc:creator>Kento IKEDA</dc:creator>
      <pubDate>Sat, 02 May 2026 21:35:11 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/agentsmd-skillmd-designmd-how-ai-instructions-split-into-three-layers-d0g</link>
      <guid>https://vibe.forem.com/aws-builders/agentsmd-skillmd-designmd-how-ai-instructions-split-into-three-layers-d0g</guid>
      <description>&lt;p&gt;In April 2026, Google Labs released a spec called &lt;code&gt;DESIGN.md&lt;/code&gt;. It's a design system specification readable by AI agents, packaged with a CLI validator: &lt;code&gt;npx @google/design.md lint&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;DESIGN.md&lt;/code&gt; in the picture, we now have three different file types for instructing AI agents. &lt;code&gt;AGENTS.md&lt;/code&gt; has been spreading as an industry standard since 2025 (jointly developed by OpenAI, Google, Sourcegraph, Cursor, and Factory; donated to the Linux Foundation in December 2025). &lt;code&gt;SKILL.md&lt;/code&gt; sits at the core of Anthropic's Claude Skills. And now &lt;code&gt;DESIGN.md&lt;/code&gt;. The three handle different concerns and don't overlap.&lt;/p&gt;

&lt;p&gt;This article is for developers using coding agents like Claude Code, Cursor, or Codex in their work, and for tech leads operating natural-language instruction files like CLAUDE.md and style guides. If your team is doing Spec-Driven Development (SDD), this should be relevant to you as well.&lt;/p&gt;

&lt;p&gt;What I want to lay out is two things: how AI instructions are starting to split across three layers — behavior, individual tasks, and visual appearance — and how that connects with SDD as a parallel movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Pattern: Natural-Language Documents
&lt;/h2&gt;

&lt;p&gt;A few years into the ChatGPT era, most engineers have written some form of "rules I want the AI to follow" in a Markdown file. CLAUDE.md, styleguide.md, CONTRIBUTING.md, internal coding conventions. The locations vary, but the format is roughly the same: unstructured natural language.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;writing-style-guide.md&lt;/code&gt; file I've been building over the past few months is a typical example. It's a style guide I use when writing technical articles with Claude — a list of patterns common in AI-generated text, written down as forbidden phrases. By making Claude Desktop read it every session, the tone of my output stays consistent. It's part of a personal repository (&lt;code&gt;ikenyal-ai-agents&lt;/code&gt;) I use as the harness for my business automation agents — the one I covered in my previous post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/harness-engineering-with-nothing-but-markdown-g6b"&gt;https://dev.to/aws-builders/harness-engineering-with-nothing-but-markdown-g6b&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The file contains roughly 150 lines: rules like "don't use em dashes," "avoid invitations like 'let's try…!'," "drop AI-style preambles like 'what's interesting is…'." The same repository has 15 instruction files under &lt;code&gt;agents/&lt;/code&gt;, organized by team and role: &lt;code&gt;executive-assistant.md&lt;/code&gt;, &lt;code&gt;sre-support.md&lt;/code&gt;, &lt;code&gt;qa-support.md&lt;/code&gt;, &lt;code&gt;accounting.md&lt;/code&gt;. Each describes "the assumptions to operate under as this role" in plain natural language.&lt;/p&gt;

&lt;p&gt;This approach has clear benefits. You can articulate tone, stance, and implicit rules. New team members can read the files and pick up the expectations. With CLAUDE.md, Claude Code reads it every session, so persona-level instructions land consistently.&lt;/p&gt;

&lt;p&gt;There are limits, too. First, validation falls on humans. Whether a rule was followed or not gets decided by a human reading the output. Second, individual judgment leaks in. "Write politely" means different things to different reviewers.&lt;/p&gt;

&lt;p&gt;The third limit is the actual subject of this article. Rules that are formally verifiable (forbidden phrases, em-dash usage, specific pattern matches) and rules that require judgment (tone, structural choices, how to open with empathy) sit in the same file. So even the verifiable parts end up depending on human review. That's the problem the three new file types are addressing.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Type 1: How DESIGN.md (Google Labs) Specifies Visual Appearance
&lt;/h2&gt;

&lt;p&gt;On April 10, 2026, Google Labs published the &lt;code&gt;DESIGN.md&lt;/code&gt; specification at &lt;code&gt;google-labs-code/design.md&lt;/code&gt;. As of early May, the repo has over 11,000 stars. It's the reference implementation for Google Stitch (&lt;code&gt;stitch.withgoogle.com&lt;/code&gt;), an AI-driven UI generation product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/google-labs-code/design.md" rel="noopener noreferrer"&gt;https://github.com/google-labs-code/design.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The specification doc lives on the Stitch side.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stitch.withgoogle.com/docs/design-md/specification" rel="noopener noreferrer"&gt;https://stitch.withgoogle.com/docs/design-md/specification&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What &lt;code&gt;DESIGN.md&lt;/code&gt; covers is the design system specification. You write machine-readable design tokens in YAML at the top of the file (colors, typography, spacing, components), and human-readable design intent in the Markdown body underneath. Both live in the same file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Heritage&lt;/span&gt;
&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#1A1C1E"&lt;/span&gt;
  &lt;span class="na"&gt;tertiary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#B8422E"&lt;/span&gt;
&lt;span class="na"&gt;typography&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;h1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;fontFamily&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Public Sans&lt;/span&gt;
    &lt;span class="na"&gt;fontSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3rem&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## Overview&lt;/span&gt;

Architectural Minimalism meets Journalistic Gravitas.

&lt;span class="gu"&gt;## Colors&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Primary (#1A1C1E): Deep ink for headlines and core text.
&lt;span class="p"&gt;-&lt;/span&gt; Tertiary (#B8422E): "Boston Clay", the sole driver for interaction.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The headline feature of this format is the CLI validator that ships with it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @google/design.md lint DESIGN.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This checks token reference integrity, WCAG contrast ratios, and structural rule compliance, returning the result as JSON. Wire it into CI and you can verify design system consistency on every pull request. There's also a &lt;code&gt;diff&lt;/code&gt; command that compares two &lt;code&gt;DESIGN.md&lt;/code&gt; files and returns token-level changes in a structured form. Design system version control — historically a manual process — gains a verifiable layer.&lt;/p&gt;
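&lt;p&gt;A CI hook for this can stay very small. Here is a hedged sketch in Python that just shells out to the validator and fails the build on a non-zero exit code; I have not relied on any particular shape for the JSON it prints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import sys


def lint_design_md(path="DESIGN.md"):
    """Run the DESIGN.md validator and fail the build on violations."""
    result = subprocess.run(
        ["npx", "@google/design.md", "lint", path],
        capture_output=True,
        text=True,
    )
    # Surface whatever the validator reported (JSON on success or failure).
    print(result.stdout or result.stderr)
    return result.returncode


if __name__ == "__main__":
    sys.exit(lint_design_md())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;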

&lt;p&gt;For Japanese UIs, the Google Labs spec alone falls short. It doesn't define the typography requirements specific to Japanese (CJK font fallback chains, line height, letter-spacing, kinsoku shori line-breaking rules, mixed Japanese and Western typesetting). The gap is filled by &lt;code&gt;kzhrknt/awesome-design-md-jp&lt;/code&gt;, which publishes Japan-localized &lt;code&gt;DESIGN.md&lt;/code&gt; files for over 10 services including Apple Japan, SmartHR, freee, note, MUJI, Mercari, LINE, and Toyota. For Japanese products, using both the Google Labs spec and the Japan edition together is the practical approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kzhrknt/awesome-design-md-jp" rel="noopener noreferrer"&gt;https://github.com/kzhrknt/awesome-design-md-jp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What &lt;code&gt;DESIGN.md&lt;/code&gt; carries is the design system that used to be scattered across Figma files and style guide PDFs, now consolidated into a single file with both machine-readable and human-readable parts. Think of it as the spec foundation that lets AI agents generate UIs with a consistent look every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Type 2: How SKILL.md (Anthropic) and AGENTS.md Specify Behavior
&lt;/h2&gt;

&lt;p&gt;While &lt;code&gt;DESIGN.md&lt;/code&gt; covers "appearance," &lt;code&gt;SKILL.md&lt;/code&gt; and &lt;code&gt;AGENTS.md&lt;/code&gt; cover "behavior" — defining what the agent is trying to do, how it should proceed, and what it must not do.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SKILL.md&lt;/code&gt; is the file format standardized by agentskills.io as part of the Agent Skills open standard. Anthropic's Claude Skills is one implementation of this standard; the same &lt;code&gt;SKILL.md&lt;/code&gt; works across Claude Code, Claude.ai, and the Agent SDK. Because it's standards-compliant, the same file is also readable by other agents like OpenClaw and Hermes. The structure: declare metadata (skill name, description, allowed tools) in the YAML at the top of the file, and write the task procedure or domain knowledge in the Markdown body below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://agentskills.io/home" rel="noopener noreferrer"&gt;https://agentskills.io/home&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A clear example of &lt;code&gt;SKILL.md&lt;/code&gt; is &lt;code&gt;conorbronsdon/avoid-ai-writing&lt;/code&gt;. It's an English-only skill that detects and rewrites AI patterns in English text — transition phrases like "Moreover," significance inflation like "watershed moment," and roundabout verb constructions like "serves as." It uses a 100+ word replacement table organized into 3 tiers (Tier 1 always replaces, Tier 2 flags when 2+ words appear in the same paragraph, Tier 3 flags only at high density), and audits 36 pattern categories. Two modes: &lt;code&gt;detect&lt;/code&gt; and &lt;code&gt;rewrite&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/conorbronsdon/avoid-ai-writing" rel="noopener noreferrer"&gt;https://github.com/conorbronsdon/avoid-ai-writing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What sets it apart from a one-shot prompt is the structured audit it returns. In &lt;code&gt;rewrite&lt;/code&gt; mode, you get four discrete sections: identified issues, the rewritten text, a summary of changes, and a second-pass audit. What changed and why becomes transparent.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; covers the agent's overall behavior. Project assumptions, roles, prohibitions, escalation rules. As I mentioned at the top, it started with the Amp team at Sourcegraph; today OpenAI, Google, Cursor, and Factory jointly drive it, and it was donated to the Linux Foundation in December 2025. Think of &lt;code&gt;CLAUDE.md&lt;/code&gt; as the Claude-specific version of &lt;code&gt;AGENTS.md&lt;/code&gt;. Claude Code reads &lt;code&gt;CLAUDE.md&lt;/code&gt; rather than &lt;code&gt;AGENTS.md&lt;/code&gt; in its spec, but the pattern recommended by &lt;code&gt;agents.md&lt;/code&gt; is to make &lt;code&gt;AGENTS.md&lt;/code&gt; the actual file and symlink &lt;code&gt;CLAUDE.md&lt;/code&gt; to it. In the personal repository I introduced earlier, the files under &lt;code&gt;agents/&lt;/code&gt; belong to this layer.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SKILL.md&lt;/code&gt; and &lt;code&gt;AGENTS.md&lt;/code&gt; cover different ranges. &lt;code&gt;AGENTS.md&lt;/code&gt; handles "overall context and boundaries." &lt;code&gt;SKILL.md&lt;/code&gt; handles "an executable unit for a specific task."&lt;/p&gt;

&lt;p&gt;The avoid-ai-writing English style auditor I mentioned is a specific task, so it ships as &lt;code&gt;SKILL.md&lt;/code&gt;. A file like &lt;code&gt;agents/genda/qa-support.md&lt;/code&gt;, which describes the assumptions and engagement style of a QA role, defines the agent's boundary — that goes on the &lt;code&gt;AGENTS.md&lt;/code&gt; side.&lt;/p&gt;

&lt;p&gt;The shared concern of these formats is "behavior and procedure," not visual appearance. What the agent knows, what it's tasked with, what it must avoid. That's a movement to fix these in a verifiable form.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Split
&lt;/h2&gt;

&lt;p&gt;Lining up the three file types, the layers each one handles become clear.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;What it carries&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Behavior&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AGENTS.md&lt;/code&gt; / &lt;code&gt;CLAUDE.md&lt;/code&gt; (natural language + rules)&lt;/td&gt;
&lt;td&gt;Overall context, roles, prohibitions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt;, role-specific files like &lt;code&gt;agents/genda/qa-support.md&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual task&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SKILL.md&lt;/code&gt; (YAML at top + Markdown body)&lt;/td&gt;
&lt;td&gt;Reusable tasks, procedures, domain knowledge&lt;/td&gt;
&lt;td&gt;avoid-ai-writing, in-house procedure skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Appearance&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DESIGN.md&lt;/code&gt; (YAML at top + Markdown body)&lt;/td&gt;
&lt;td&gt;Design system spec, verifiable visual rules&lt;/td&gt;
&lt;td&gt;The Google Labs reference, individual service files in &lt;code&gt;kzhrknt/awesome-design-md-jp&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The three are complementary, not competing. CLIs like &lt;code&gt;bergside/typeui&lt;/code&gt; are emerging as tools that can generate or update either &lt;code&gt;SKILL.md&lt;/code&gt; or &lt;code&gt;DESIGN.md&lt;/code&gt;, depending on what you choose — a sign of tooling that assumes the division of labor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/bergside/typeui" rel="noopener noreferrer"&gt;https://github.com/bergside/typeui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's actually different across the layers is "where to place the balance between machine-readable and human-readable." &lt;code&gt;AGENTS.md&lt;/code&gt; skews almost entirely human-readable; over-structuring it would block the contextual judgment and nuance it needs to convey. &lt;code&gt;SKILL.md&lt;/code&gt; is partially structured by the YAML at the top, but the body stays human-readable — task granularity has to be readable by humans before it can be instructed. &lt;code&gt;DESIGN.md&lt;/code&gt; puts machine-readable design tokens in the top YAML and human-readable design intent in the body, with the two cleanly separated.&lt;/p&gt;

&lt;p&gt;The center of gravity between "machine-readable" and "human-readable" sits in different places per layer. That's just the standard structuring principle — "manage things at different layers in different files" — applied to AI agents. The file names themselves spell out the division: &lt;code&gt;AGENTS.md&lt;/code&gt; ("instructions to the agent"), &lt;code&gt;SKILL.md&lt;/code&gt; ("a reusable skill"), &lt;code&gt;DESIGN.md&lt;/code&gt; ("the design system"). The names match what each one carries.&lt;/p&gt;

&lt;p&gt;Teams that have been packing all their "AI rules" into a single &lt;code&gt;CLAUDE.md&lt;/code&gt; now face a split decision. Open up your &lt;code&gt;CLAUDE.md&lt;/code&gt; and run these questions against it — splits start to surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is there a section writing design system rules? → If yes, that goes to &lt;code&gt;DESIGN.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Are specific task procedures in there (monthly aggregation, test review, contract review)? → If yes, those go to &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;What's left is overall agent context and boundaries (roles, prohibitions, escalation criteria) → that's the &lt;code&gt;AGENTS.md&lt;/code&gt; equivalent that stays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The three-layer split works as a framework for splitting your file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting with SDD
&lt;/h2&gt;

&lt;p&gt;Stepping back to look at the bigger picture: how does the three-layer split relate to the broader movement of "specs for AI"?&lt;/p&gt;

&lt;p&gt;SDD is a development style where you write the spec — requirements, design, tasks — before generating the implementation. The underlying idea: "specs aren't disposable scaffolding, they're executable artifacts that produce code." AWS's Kiro provides a workflow that generates &lt;code&gt;requirements.md&lt;/code&gt;, &lt;code&gt;design.md&lt;/code&gt;, and &lt;code&gt;tasks.md&lt;/code&gt; in order under &lt;code&gt;.kiro/specs/{feature}/&lt;/code&gt;. GitHub's Spec Kit (over 90,000 stars) supports the same flow with slash commands like &lt;code&gt;/specify&lt;/code&gt;, &lt;code&gt;/plan&lt;/code&gt;, &lt;code&gt;/tasks&lt;/code&gt;, &lt;code&gt;/implement&lt;/code&gt;. The EARS notation (Easy Approach to Requirements Syntax) used by Kiro reduces ambiguity by formatting requirements into 5 fixed templates. SDD has spread quickly between 2025 and 2026.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;https://kiro.dev/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;https://github.com/github/spec-kit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The three-layer split (&lt;code&gt;AGENTS.md&lt;/code&gt; / &lt;code&gt;SKILL.md&lt;/code&gt; / &lt;code&gt;DESIGN.md&lt;/code&gt;) and SDD look like separate movements on the surface. The SDD community concentrates on Kiro and spec-kit usage; the &lt;code&gt;DESIGN.md&lt;/code&gt; side concentrates on formal specs and validation tooling. You don't see many articles bridging the two.&lt;/p&gt;

&lt;p&gt;But put their philosophies side by side and the overlap is striking.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Shared philosophy&lt;/th&gt;
&lt;th&gt;SDD (Kiro etc.)&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;DESIGN.md&lt;/code&gt; / &lt;code&gt;SKILL.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Specify before implementing&lt;/td&gt;
&lt;td&gt;requirements → design → tasks → implementation&lt;/td&gt;
&lt;td&gt;behavior → implementation, appearance → implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Mix machine-readable + human-readable&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;requirements.md&lt;/code&gt; (EARS notation) + natural language&lt;/td&gt;
&lt;td&gt;YAML at top + Markdown body&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Persistent context for the AI&lt;/td&gt;
&lt;td&gt;reference &lt;code&gt;.kiro/specs/{feature}/&lt;/code&gt; every time&lt;/td&gt;
&lt;td&gt;reference &lt;code&gt;DESIGN.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt; every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Reduce ambiguity through structured syntax&lt;/td&gt;
&lt;td&gt;EARS notation structures requirements (5 templates)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;lint&lt;/code&gt; validates WCAG contrast ratios and structural rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Fix "decisions made" as a place&lt;/td&gt;
&lt;td&gt;spec files are where decisions live&lt;/td&gt;
&lt;td&gt;spec files are where decisions live&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both sit inside the larger "specs for AI" movement and share the same underlying philosophy.&lt;/p&gt;

&lt;p&gt;That said, they're not the same thing. The biggest difference, in one phrase: time horizon.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;SDD&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;DESIGN.md&lt;/code&gt; / &lt;code&gt;SKILL.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Time horizon&lt;/td&gt;
&lt;td&gt;Describes "what to build next"&lt;/td&gt;
&lt;td&gt;Describes "rules that already exist"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Single feature / project lifecycle&lt;/td&gt;
&lt;td&gt;Persistent rules and styles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Update rhythm&lt;/td&gt;
&lt;td&gt;New per feature → consume → archive&lt;/td&gt;
&lt;td&gt;Long-term maintenance, gradual growth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Subject&lt;/td&gt;
&lt;td&gt;Requirements, design, tasks (procedure for action)&lt;/td&gt;
&lt;td&gt;Rules for behavior, individual tasks, appearance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SDD specs describe "what we're going to build." &lt;code&gt;requirements.md&lt;/code&gt; is "what this feature needs to satisfy"; &lt;code&gt;design.md&lt;/code&gt; is "how to implement this feature"; &lt;code&gt;tasks.md&lt;/code&gt; is "how to break the feature into work." Once the feature ships, they finish their job and get archived.&lt;/p&gt;

&lt;p&gt;The three-layer specs describe "what should always hold." &lt;code&gt;DESIGN.md&lt;/code&gt; provides the color and typography rules every time you generate a UI; &lt;code&gt;AGENTS.md&lt;/code&gt; provides the agent's assumptions across every session. They get maintained long-term and grow incrementally.&lt;/p&gt;

&lt;p&gt;This time-horizon difference is why the two don't compete. Transient specs and persistent specs coexist in the same project. They can also reference each other. Imagine writing "use &lt;code&gt;{colors.tertiary}&lt;/code&gt; for the button" inside &lt;code&gt;.kiro/specs/checkout-feature/design.md&lt;/code&gt; — that lets a transient feature spec reference a color token from a persistent &lt;code&gt;DESIGN.md&lt;/code&gt;. The pattern isn't widely established yet, but the structure fits cleanly.&lt;/p&gt;

&lt;p&gt;One thing worth noting: as of May 2026, the active areas of SDD (the Kiro community and similar) and the active areas of &lt;code&gt;DESIGN.md&lt;/code&gt; / &lt;code&gt;SKILL.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt; haven't really crossed paths. The SDD side concentrates on "how to build a feature"; the three-layer side concentrates on "how to deliver the rules."&lt;/p&gt;

&lt;p&gt;You don't have to be doing SDD to start with the three-layer split — the split alone gets you to the door of "specs for AI." If your team is already on SDD, start referencing &lt;code&gt;DESIGN.md&lt;/code&gt; tokens from inside your feature specs and you avoid maintaining the same rules in two places. The two movements look set to converge in the next phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not Everything Becomes a Spec
&lt;/h2&gt;

&lt;p&gt;The discussion of the three-layer split tends to drift toward "shouldn't we just spec everything," but in practice, that doesn't happen.&lt;/p&gt;

&lt;p&gt;Rules that can't be formally verified stay as natural-language documents. Tone, structural choices, cultural nuance. Things like "how to open an article with empathy" or "how to give an ending the right amount of resonance" — judgment-based qualities. The cost of speccing them isn't the issue; the essence gets lost when you try.&lt;/p&gt;

&lt;p&gt;The judgment is straightforward: "is this formally verifiable?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Color contrast ratios (verifiable) → &lt;code&gt;DESIGN.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Word substitutions like "leverage → use" (verifiable) → &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tone (soft assertions, not textbook-sounding), overall stance (not teaching, just organizing) and similar (not verifiable) → stays in &lt;code&gt;AGENTS.md&lt;/code&gt; / &lt;code&gt;CLAUDE.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For small teams, "one natural-language file" is often enough. If &lt;code&gt;CLAUDE.md&lt;/code&gt; alone is keeping things running, there's no need to force a split. The trade-off between the cost of speccing and the load of operating it depends on team size and how long the operation has to last.&lt;/p&gt;

&lt;p&gt;The three-layer split is something you adopt incrementally, just like SDD — you don't need to spec everything at once. Start with the complex areas, the areas where verification helps most.&lt;/p&gt;

&lt;p&gt;In other words, the three-layer split isn't a goal. It's an option you adopt when the situation calls for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;A few options come into view from this overview.&lt;/p&gt;

&lt;p&gt;A reasonable first move is to open your &lt;code&gt;CLAUDE.md&lt;/code&gt; or style guide and sort it into "formally verifiable" and "judgment-based" sections. Color and typography rules, word substitution lists, structural rules. If a useful amount of verifiable content sits there, pick one to break out into either &lt;code&gt;DESIGN.md&lt;/code&gt; (appearance) or &lt;code&gt;SKILL.md&lt;/code&gt; (task). Don't try to split everything at once — start with the most independent piece.&lt;/p&gt;

&lt;p&gt;Pulling in external skills is another route. Drop a ready-made &lt;code&gt;SKILL.md&lt;/code&gt; like &lt;code&gt;avoid-ai-writing&lt;/code&gt; into &lt;code&gt;~/.claude/skills/&lt;/code&gt; and your stance as a writer doesn't change — only the verification gets handed off to the machine.&lt;/p&gt;

&lt;p&gt;Teams already running Kiro or spec-kit are probably at the stage where they could try referencing &lt;code&gt;DESIGN.md&lt;/code&gt; tokens from inside &lt;code&gt;.kiro/specs/{feature}/design.md&lt;/code&gt;. The cross-reference between feature specs and persistent specs is still a thin area in terms of public examples.&lt;/p&gt;

&lt;p&gt;The shared stance: don't try to spec everything at once. Document split → operational trial → speccing — staged migration is the realistic path. The three-layer split isn't a finished form. It's a movement still in progress, and that's the safer way to read it.&lt;/p&gt;

&lt;p&gt;AI rules started splitting from a single natural-language document into three spec formats. That's another side of the same movement as SDD.&lt;/p&gt;

&lt;p&gt;Not everything becomes a spec, but managing different roles in different files — that ordinary structuring is starting to apply to AI agents, too.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>designsystem</category>
    </item>
    <item>
      <title>What Is Apache Polaris? Why Open Data Catalogs Matter and How to Use Them with AWS</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Sat, 02 May 2026 06:27:16 +0000</pubDate>
      <link>https://vibe.forem.com/aws-builders/what-is-apache-polaris-why-open-data-catalogs-matter-and-how-to-use-them-with-aws-5gal</link>
      <guid>https://vibe.forem.com/aws-builders/what-is-apache-polaris-why-open-data-catalogs-matter-and-how-to-use-them-with-aws-5gal</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/28aa29c2f9fbeb" rel="noopener noreferrer"&gt;Apache Polarisとは何か？オープンなデータカタログが求められる理由とAWSとの組み合わせ方を整理する&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In recent years, lakehouse architectures centered around Apache Iceberg have been rapidly expanding.&lt;/p&gt;

&lt;p&gt;By placing Iceberg tables on object storage such as S3, it has become possible to query the same data from multiple engines such as Athena, Snowflake, Spark, Trino, and Dremio.&lt;br&gt;
As a result, the discussion has shifted from &lt;em&gt;“Where should data be placed, and which engine should be used for analysis?”&lt;/em&gt; to &lt;em&gt;“Where should data ownership reside, and which catalog should be used to unify governance?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amid this trend, &lt;strong&gt;Apache Polaris&lt;/strong&gt; has been attracting attention.&lt;br&gt;
Apache Polaris is an open-source implementation of the Iceberg REST Catalog, led by Snowflake and donated to the Apache Software Foundation.&lt;/p&gt;

&lt;p&gt;Multiple vendors—including Dremio, AWS, Google, Microsoft, and Confluent—are contributing to it, and it is positioned as an &lt;strong&gt;“open catalog”&lt;/strong&gt; that enables cross-platform management of Iceberg tables while avoiding vendor lock-in.&lt;/p&gt;

&lt;p&gt;In this article, I would like to think through the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Apache Polaris is&lt;/li&gt;
&lt;li&gt;Why open data catalogs are required&lt;/li&gt;
&lt;li&gt;Differences from AWS Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Differences from Snowflake Horizon Catalog&lt;/li&gt;
&lt;li&gt;How responsibilities should be divided when combining with AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In conclusion, Apache Polaris is not something that &lt;em&gt;competes&lt;/em&gt; with AWS Glue Catalog or Snowflake Horizon Catalog; rather, they are catalogs that operate at different layers.&lt;/p&gt;

&lt;p&gt;It may be easier to understand Apache Polaris as a component that enables an architecture such as:&lt;br&gt;
&lt;strong&gt;“The data itself resides in AWS, the catalog is open, and analysis engines are selected based on use cases.”&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What is Apache Polaris?
&lt;/h2&gt;

&lt;p&gt;Apache Polaris is an open-source catalog implementation compliant with the Apache Iceberg REST Catalog specification.&lt;br&gt;
It was announced by Snowflake in 2024 and later became an incubation project under the Apache Software Foundation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The project has now graduated from incubation and has been promoted to a top-level Apache project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Official site:&lt;br&gt;
&lt;a href="https://polaris.apache.org/" rel="noopener noreferrer"&gt;https://polaris.apache.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What Polaris aims to achieve is a &lt;strong&gt;common metadata and governance foundation in a lakehouse centered around Iceberg tables&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A major characteristic is that it is not tied to any specific query engine or cloud vendor, and anyone can access it using the same specification via REST APIs.&lt;/p&gt;
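&lt;p&gt;To make that concrete, here is a hedged sketch of talking to a Polaris catalog from Python with pyiceberg; the endpoint, catalog name, namespace, table, and client credentials are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyiceberg.catalog import load_catalog

# Placeholders: your Polaris endpoint, catalog name, and OAuth client credentials.
catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "https://polaris.example.com/api/catalog",
        "warehouse": "my_polaris_catalog",
        "credential": "example-client-id:example-client-secret",
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)

# Any client or engine that speaks the Iceberg REST spec sees the same tables.
print(catalog.list_namespaces())
table = catalog.load_table("analytics.events")
print(table.schema())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;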


&lt;h3&gt;
  
  
  Key Features of Apache Polaris
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Implementation of Iceberg REST Catalog&lt;/td&gt;
&lt;td&gt;Accessible via standardized REST APIs. Can be directly used from engines such as Spark, Trino, Flink, Snowflake, and Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-catalog architecture&lt;/td&gt;
&lt;td&gt;Multiple catalogs can be defined within a single Polaris instance. Enables separation and management by team or business domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RBAC (Role-Based Access Control)&lt;/td&gt;
&lt;td&gt;Provides a permission model combining principals, principal roles, and catalog roles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External catalog integration&lt;/td&gt;
&lt;td&gt;Can connect to other catalogs compliant with the Iceberg REST specification (e.g., Nessie, Gravitino)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS / Managed support&lt;/td&gt;
&lt;td&gt;Can be self-hosted as OSS, or used as managed offerings such as Snowflake Open Catalog or Dremio Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  What Apache Polaris Solves
&lt;/h3&gt;

&lt;p&gt;As Apache Iceberg has become more widely adopted, multiple Iceberg-compatible catalogs have emerged, including Hive Metastore, JDBC, Nessie, AWS Glue, and Snowflake.&lt;/p&gt;

&lt;p&gt;Since each has its own client libraries and interfaces, the following challenges have arisen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The need to implement catalog clients for each programming language&lt;/li&gt;
&lt;li&gt;Inconsistent access control specifications across catalogs&lt;/li&gt;
&lt;li&gt;Difficulty enforcing governance across multiple catalogs&lt;/li&gt;
&lt;li&gt;As a result, the overall architecture becomes constrained by the chosen catalog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve these challenges, the Iceberg REST Catalog specification was introduced.&lt;br&gt;
Apache Polaris is an open-source implementation of that specification, further enhanced with multi-catalog support and RBAC.&lt;/p&gt;

&lt;p&gt;In other words, you can think of it as an &lt;strong&gt;open catalog for Apache Iceberg&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Polaris Security Model
&lt;/h3&gt;

&lt;p&gt;The Polaris security model can be organized into the following three concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Principal&lt;/strong&gt;: An entity representing a user or service. Accesses Polaris via client ID/secret, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Principal Role&lt;/strong&gt;: A grouping of multiple catalog roles. Assigned to principals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog Role&lt;/strong&gt;: A set of permissions within a specific catalog. Includes permissions such as &lt;code&gt;TABLE_READ_DATA&lt;/code&gt;, &lt;code&gt;TABLE_CREATE&lt;/code&gt;, and &lt;code&gt;NAMESPACE_LIST&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, you can design it such that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;data_engineer&lt;/code&gt; principal role is assigned both &lt;em&gt;write access to prod_catalog&lt;/em&gt; and &lt;em&gt;administrative access to dev_catalog&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;data_analyst&lt;/code&gt; principal role is assigned only &lt;em&gt;read access to prod_catalog&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An important point is that RBAC is centralized on the catalog side, eliminating the need to implement access control separately for each engine.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Open Data Catalogs Are Required
&lt;/h2&gt;

&lt;p&gt;Let us first consider why open data catalogs are required in the first place.&lt;/p&gt;


&lt;h3&gt;
  
  
  Separation of Data and Engines Has Become a Premise
&lt;/h3&gt;

&lt;p&gt;The greatest value of open table formats such as Apache Iceberg is the ability to separate data storage from query engines.&lt;/p&gt;

&lt;p&gt;When querying Iceberg tables on S3, it has become possible to freely choose engines such as Athena, Glue, Spark, Snowflake, Dremio, and DuckDB depending on the use case.&lt;/p&gt;

&lt;p&gt;As a result, the key question in data platforms has shifted from &lt;em&gt;“Which product should we use?”&lt;/em&gt; to &lt;em&gt;“Where should data ownership reside, and who should be responsible for governance at which layer?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, while engines can now be freely selected, the remaining challenge is the &lt;strong&gt;catalog&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  What Happens When Catalogs Are Tied to Engines
&lt;/h3&gt;

&lt;p&gt;When using catalogs tightly coupled with query engines, the following situations tend to occur:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data itself is open (S3 + Iceberg), but the catalog is tied to a specific engine&lt;/li&gt;
&lt;li&gt;You want to reference the same table from another engine, but the catalog does not support it&lt;/li&gt;
&lt;li&gt;Access control is fragmented across engines, making governance difficult&lt;/li&gt;
&lt;li&gt;Every time the catalog is changed, all engine-side configurations must be redone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, even if storage and formats are open, a closed catalog significantly reduces the benefits of a lakehouse.&lt;/p&gt;

&lt;p&gt;Especially in today’s environments where multi-cloud, multiple products, and multiple engines are commonly combined, how to unify catalogs becomes a key challenge.&lt;/p&gt;


&lt;h3&gt;
  
  
  Requirements for an Open Catalog
&lt;/h3&gt;

&lt;p&gt;Based on this background, lakehouse catalogs are expected to meet the following requirements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compliance with standard APIs&lt;/td&gt;
&lt;td&gt;Support vendor-neutral APIs such as the Iceberg REST Catalog specification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-engine support&lt;/td&gt;
&lt;td&gt;Usable across engines such as Spark, Trino, Flink, Snowflake, and Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Centralized RBAC&lt;/td&gt;
&lt;td&gt;Define permissions at the catalog level and apply consistent governance across all engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud / hybrid&lt;/td&gt;
&lt;td&gt;Not dependent on a specific cloud and capable of running on-premises when necessary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS sustainability&lt;/td&gt;
&lt;td&gt;Not discontinued based on vendor decisions; continuously developed in a community-driven manner&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Apache Polaris is a catalog designed to satisfy these requirements.&lt;/p&gt;


&lt;h2&gt;
  
  
  Differences from AWS Glue Data Catalog
&lt;/h2&gt;

&lt;p&gt;When building on AWS, AWS Glue Data Catalog is often positioned as the central data catalog.&lt;br&gt;
Here, we will organize the differences between AWS Glue Data Catalog and Apache Polaris.&lt;/p&gt;


&lt;h3&gt;
  
  
  Positioning of AWS Glue Data Catalog
&lt;/h3&gt;

&lt;p&gt;AWS Glue Data Catalog is a core metadata management service in AWS.&lt;/p&gt;

&lt;p&gt;It is natively integrated with AWS analytics services such as Athena, Glue, Redshift Spectrum, and EMR, and plays the role of managing data on S3 as a catalog.&lt;/p&gt;

&lt;p&gt;As discussed in previous articles, Glue Data Catalog is an excellent &lt;strong&gt;technical catalog used by data platforms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/is-aws-glue-data-catalog-sufficient-as-a-data-catalog-organizing-its-design-limitations-and-kih"&gt;Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Functional Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;AWS Glue Data Catalog&lt;/th&gt;
&lt;th&gt;Apache Polaris&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Offering&lt;/td&gt;
&lt;td&gt;AWS-managed (closed)&lt;/td&gt;
&lt;td&gt;OSS / Managed (Snowflake Open Catalog, Dremio Catalog, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;AWS proprietary API (recently also provides Iceberg REST compatibility)&lt;/td&gt;
&lt;td&gt;Iceberg REST Catalog specification (open)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud support&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Multi-cloud / on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engines&lt;/td&gt;
&lt;td&gt;Athena, Glue, Redshift, EMR, Spark&lt;/td&gt;
&lt;td&gt;Spark, Trino, Flink, Snowflake, Dremio, StarRocks, DuckDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-catalog&lt;/td&gt;
&lt;td&gt;Account-level (logical separation via Lake Formation)&lt;/td&gt;
&lt;td&gt;Native support for multiple catalogs within a single instance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access control&lt;/td&gt;
&lt;td&gt;IAM + Lake Formation&lt;/td&gt;
&lt;td&gt;Built-in RBAC (Principal / Principal Role / Catalog Role)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External catalog integration&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Can integrate with Iceberg REST-compliant catalogs (Nessie, Gravitino, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-Iceberg formats&lt;/td&gt;
&lt;td&gt;Supports Hive, JSON, CSV, Parquet, etc.&lt;/td&gt;
&lt;td&gt;Currently Iceberg-centric (Generic Table support on roadmap)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  How to Interpret the Difference
&lt;/h3&gt;

&lt;p&gt;Rather than being in a competitive relationship, it is easier to understand them as catalogs with different roles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;: Strong integration with AWS services, making it the primary choice for workloads completed within AWS. It supports a wide range of data lake formats beyond Iceberg and features such as S3 crawling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Polaris&lt;/strong&gt;: A catalog that enables governance across multiple engines and clouds based on the industry-standard Iceberg REST API. It is effective when you want to enforce consistent RBAC across engines outside AWS (e.g., Snowflake, Dremio).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your use case is &lt;strong&gt;AWS-contained and includes formats beyond Iceberg&lt;/strong&gt;, Glue Data Catalog is a practical choice&lt;/li&gt;
&lt;li&gt;If you want &lt;strong&gt;common management of Iceberg across multiple engines and a vendor-neutral catalog layer&lt;/strong&gt;, Polaris is suitable&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Differences from Snowflake Horizon Catalog
&lt;/h2&gt;

&lt;p&gt;This is often confused, so let’s clarify the difference between Snowflake Horizon Catalog and Apache Polaris.&lt;br&gt;
Note that it is different from “Snowflake Open Catalog,” despite the similar name.&lt;/p&gt;


&lt;h3&gt;
  
  
  What is Snowflake Horizon Catalog?
&lt;/h3&gt;

&lt;p&gt;Snowflake Horizon Catalog is a data governance and discovery suite provided by Snowflake.&lt;/p&gt;

&lt;p&gt;For data managed within Snowflake (Snowflake-managed tables, stages, views, shared data, etc.), it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data discovery (search, tagging, descriptions)&lt;/li&gt;
&lt;li&gt;Lineage&lt;/li&gt;
&lt;li&gt;Data quality monitoring&lt;/li&gt;
&lt;li&gt;Masking policies and row access policies&lt;/li&gt;
&lt;li&gt;Automatic classification of sensitive data&lt;/li&gt;
&lt;li&gt;Compliance management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In terms of positioning, it is similar to Amazon DataZone + Lake Formation + Glue Data Quality in AWS.&lt;/p&gt;

&lt;p&gt;In other words, it is the layer responsible for &lt;strong&gt;cataloging and governance so that people can discover, understand, and trust data&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  What is Snowflake Open Catalog (Relation to Polaris)
&lt;/h3&gt;

&lt;p&gt;On the other hand, Snowflake Open Catalog is a managed offering of Apache Polaris.&lt;/p&gt;

&lt;p&gt;Although the name is confusing, this is the lakehouse catalog that serves as an Iceberg REST Catalog.&lt;/p&gt;

&lt;p&gt;In Snowflake’s model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake-managed data&lt;/li&gt;
&lt;li&gt;Snowflake Open Catalog (= Apache Polaris): Lakehouse catalog layer for open table formats such as Iceberg&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Functional Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Snowflake Horizon Catalog&lt;/th&gt;
&lt;th&gt;Apache Polaris&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary target&lt;/td&gt;
&lt;td&gt;Data in Snowflake (internal tables, shared data, etc.)&lt;/td&gt;
&lt;td&gt;Iceberg (Generic Table support for other formats is planned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer&lt;/td&gt;
&lt;td&gt;Business catalog / governance layer&lt;/td&gt;
&lt;td&gt;Lakehouse catalog layer (technical catalog)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offering&lt;/td&gt;
&lt;td&gt;Built into Snowflake (closed)&lt;/td&gt;
&lt;td&gt;OSS / Managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Snowflake proprietary&lt;/td&gt;
&lt;td&gt;Iceberg REST Catalog specification (open)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data location&lt;/td&gt;
&lt;td&gt;Snowflake internal storage or recognized external data&lt;/td&gt;
&lt;td&gt;Iceberg tables on cloud storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Within Snowflake organizations&lt;/td&gt;
&lt;td&gt;Across multiple engines and clouds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  How to Interpret the Difference
&lt;/h3&gt;

&lt;p&gt;Again, these are not in opposition but complementary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake Horizon Catalog&lt;/strong&gt;: Upper layer that provides data to business users, handling discovery, quality, masking, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Polaris&lt;/strong&gt;: Lower layer (metadata foundation) that exposes Iceberg tables to multiple engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually, the structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────┐
│  Business Catalog / Governance Layer         │ ← Snowflake Horizon Catalog
│  (Discovery / Lineage / Quality / Masking)   │   Amazon DataZone, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Lakehouse Catalog Layer                     │ ← Apache Polaris
│  (Iceberg REST Catalog / RBAC)               │   AWS Glue Data Catalog, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Data Lake (S3 / GCS / Azure Blob)           │
│  Iceberg / Parquet                           │
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you think of Snowflake Horizon Catalog and Apache Polaris as “choosing one or the other,” it feels unnatural, but when organized as different layers, the division of responsibilities becomes clear.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Combine with AWS
&lt;/h2&gt;

&lt;p&gt;From here, we will consider cases where Apache Polaris is introduced into an AWS environment.&lt;br&gt;
Since AWS already has a powerful catalog called Glue Data Catalog, it is important to clarify &lt;strong&gt;how Polaris should be positioned&lt;/strong&gt; and &lt;strong&gt;who is responsible for what&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Expected Architecture
&lt;/h3&gt;

&lt;p&gt;Representative configurations can be organized into the following three patterns.&lt;/p&gt;


&lt;h4&gt;
  
  
  Pattern 1: AWS-only (Glue Data Catalog-centered)
&lt;/h4&gt;

&lt;p&gt;This is the simplest configuration.&lt;br&gt;
It is a typical setup using S3 + Iceberg + Glue Data Catalog, along with Athena / Glue / Redshift Spectrum.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catalog: AWS Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Governance: IAM + Lake Formation&lt;/li&gt;
&lt;li&gt;Query engines: Athena, Redshift Spectrum, Glue ETL, EMR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If everything stays within AWS and there is no strong need to share data with external engines, this configuration remains the most practical.&lt;br&gt;
There is no need to introduce Apache Polaris just for the sake of it.&lt;/p&gt;


&lt;h4&gt;
  
  
  Pattern 2: AWS + Snowflake (Using Polaris as a shared catalog foundation)
&lt;/h4&gt;

&lt;p&gt;This configuration is effective when you want to reference the same Iceberg tables from both AWS (e.g., Athena) and Snowflake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data storage: S3 + Iceberg&lt;/li&gt;
&lt;li&gt;Catalog: Apache Polaris (OSS self-hosted or Snowflake Open Catalog)&lt;/li&gt;
&lt;li&gt;AWS side: Reference Polaris as an Iceberg REST Catalog (via Spark or third-party tools; see the Spark sketch below)&lt;/li&gt;
&lt;li&gt;Snowflake side: Connect to Polaris using External Volume and Catalog Integration (&lt;code&gt;CATALOG_SOURCE = POLARIS&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
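&lt;p&gt;From the AWS side, for example Spark on EMR or Glue, Polaris can be wired in as a standard Iceberg REST catalog. A hedged sketch, assuming the Iceberg Spark runtime is on the classpath and using a placeholder endpoint, catalog name, table, and credentials:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession

# Placeholders: Polaris endpoint, catalog name, and OAuth client credentials.
spark = (
    SparkSession.builder.appName("polaris-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://polaris.example.com/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "my_polaris_catalog")
    .config("spark.sql.catalog.polaris.credential", "example-client-id:example-client-secret")
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    .getOrCreate()
)

# Query the same Iceberg table that Snowflake sees through the Catalog Integration below.
spark.sql("SELECT count(*) FROM polaris.analytics.events").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;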

&lt;p&gt;From the Snowflake side, Polaris can be referenced directly as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;INTEGRATION&lt;/span&gt; &lt;span class="n"&gt;polaris_catalog_int&lt;/span&gt;
  &lt;span class="n"&gt;CATALOG_SOURCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;POLARIS&lt;/span&gt;
  &lt;span class="n"&gt;TABLE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ICEBERG&lt;/span&gt;
  &lt;span class="n"&gt;REST_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CATALOG_URI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'https://&amp;lt;polaris-host&amp;gt;/api/catalog'&lt;/span&gt;
    &lt;span class="k"&gt;CATALOG_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;your_polaris_catalog&amp;gt;'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;REST_AUTHENTICATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OAUTH&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_CLIENT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;polaris_client_id&amp;gt;'&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_CLIENT_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;polaris_client_secret&amp;gt;'&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_ALLOWED_SCOPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'PRINCIPAL_ROLE:ALL'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  Pattern 3: Multi-engine / multi-cloud configuration
&lt;/h4&gt;

&lt;p&gt;This configuration goes beyond Snowflake and adds further engines such as Dremio, Databricks, Trino, and Flink.&lt;/p&gt;

&lt;p&gt;In this case, all engines reference Polaris as a common Iceberg REST Catalog.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data storage: S3 (and other cloud storage if needed)&lt;/li&gt;
&lt;li&gt;Catalog: Apache Polaris (center of governance)&lt;/li&gt;
&lt;li&gt;Query engines: Snowflake, Dremio, Spark, Trino, Flink, etc.&lt;/li&gt;
&lt;li&gt;Governance: Polaris provides unified RBAC across all engines&lt;/li&gt;
&lt;/ul&gt;
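
&lt;p&gt;To make the idea of a shared catalog concrete, this is a rough sketch of what reading the same table looks like from Spark SQL, assuming the session was started with the Iceberg runtime and a REST catalog named &lt;code&gt;polaris&lt;/code&gt; configured against the Polaris endpoint; the catalog, namespace, and table names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rough sketch, assuming a Spark session configured with the Iceberg runtime and
-- a REST catalog named 'polaris' pointing at https://&amp;lt;polaris-host&amp;gt;/api/catalog
-- (e.g. spark.sql.catalog.polaris.type = rest). Names below are placeholders.
SHOW NAMESPACES IN polaris;

-- The same table that Snowflake reaches through its catalog integration
SELECT order_id, amount
FROM polaris.analytics.orders
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Trino, Flink, and Dremio have equivalent REST catalog configuration; the point of this pattern is that every engine resolves metadata, and the table-level permissions defined in Polaris, through the same endpoint.&lt;/p&gt;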




&lt;h3&gt;
  
  
  How to Think About Responsibility Separation
&lt;/h3&gt;

&lt;p&gt;This is the key point.&lt;br&gt;
When combining Polaris, AWS, Snowflake, and others, it is important to clearly define &lt;strong&gt;who is responsible for which layer&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Primary Owner&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data storage (files)&lt;/td&gt;
&lt;td&gt;AWS (S3)&lt;/td&gt;
&lt;td&gt;Storage location of the data. Single Source of Truth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage access control&lt;/td&gt;
&lt;td&gt;AWS (IAM)&lt;/td&gt;
&lt;td&gt;Access permissions to S3 buckets/prefixes are defined on the AWS side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table metadata&lt;/td&gt;
&lt;td&gt;Apache Polaris&lt;/td&gt;
&lt;td&gt;Source of Truth for Iceberg metadata such as schema, snapshots, partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table-level RBAC&lt;/td&gt;
&lt;td&gt;Apache Polaris&lt;/td&gt;
&lt;td&gt;Applies consistent permission rules across engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL / pipelines&lt;/td&gt;
&lt;td&gt;AWS Glue / Lambda / EMR / Spark&lt;/td&gt;
&lt;td&gt;Responsible for ingestion and transformation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query execution&lt;/td&gt;
&lt;td&gt;Athena / Snowflake / Dremio / Spark&lt;/td&gt;
&lt;td&gt;Engines selected based on use case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business catalog / discovery&lt;/td&gt;
&lt;td&gt;Snowflake Horizon Catalog / Amazon DataZone&lt;/td&gt;
&lt;td&gt;Higher-layer features for end users such as search, lineage, and quality visibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data quality&lt;/td&gt;
&lt;td&gt;Glue Data Quality / Snowflake DMF&lt;/td&gt;
&lt;td&gt;Implemented at engine or quality service layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What is especially important is the &lt;strong&gt;three-layer separation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data resides in AWS, the catalog is Polaris, and usage is handled by each engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By making this separation explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS can focus on storage and IAM management&lt;/li&gt;
&lt;li&gt;Polaris can focus on metadata and access control&lt;/li&gt;
&lt;li&gt;Each query engine can focus on its strengths&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Considerations When Adopting Polaris
&lt;/h3&gt;

&lt;p&gt;Polaris is powerful, but there are also important considerations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational cost when self-hosting OSS&lt;/strong&gt;: Running on EKS or EC2 requires a metastore (e.g., PostgreSQL), authentication infrastructure, monitoring, and upgrade handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed services are often more practical&lt;/strong&gt;: Using Snowflake Open Catalog or Dremio Catalog significantly reduces operational burden&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less seamless integration with AWS services compared to Glue&lt;/strong&gt;: For AWS-native services such as Athena, Redshift, and QuickSight, using Glue Data Catalog is far more straightforward&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need to avoid double governance&lt;/strong&gt;: If IAM policies on S3 and RBAC in Polaris are inconsistent, troubleshooting becomes complex&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, when deciding whether to adopt Apache Polaris in an AWS environment, it is practical to evaluate it against the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether multi-engine requirements exist&lt;/li&gt;
&lt;li&gt;The organization’s stance on vendor lock-in&lt;/li&gt;
&lt;li&gt;Whether operational cost is acceptable (or managed services can be used)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  A Practical Approach
&lt;/h3&gt;

&lt;p&gt;Personally, when considering Polaris in an AWS environment, I find the following phased approach practical:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a lakehouse within AWS using Glue Data Catalog + Iceberg&lt;/li&gt;
&lt;li&gt;When integration with other engines such as Snowflake becomes necessary, consider introducing an Iceberg REST layer&lt;/li&gt;
&lt;li&gt;At that point, compare “Glue Iceberg REST endpoint,” “Apache Polaris OSS,” and “Snowflake Open Catalog” based on requirements&lt;/li&gt;
&lt;li&gt;If multi-engine / multi-cloud requirements become clear, redesign with Polaris (especially managed) at the center&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rather than designing with Polaris from the beginning, it is often more practical to &lt;strong&gt;replace the catalog layer with an open one when requirements mature&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we walked through the key points of Apache Polaris.&lt;/p&gt;

&lt;p&gt;In the world of data platforms, storage and formats have become open, but a closed catalog undercuts much of the benefit of a lakehouse.&lt;/p&gt;

&lt;p&gt;Therefore, there is a need for an &lt;strong&gt;open catalog&lt;/strong&gt; that complies with the Iceberg REST Catalog specification and enables unified governance across multiple engines and clouds.&lt;br&gt;
Apache Polaris is designed to fulfill exactly that role.&lt;/p&gt;

&lt;p&gt;However, it is important to think not in terms of “which one to choose” among Polaris, AWS Glue Data Catalog, and Snowflake Horizon Catalog, but rather &lt;strong&gt;which layer each is responsible for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Glue Data Catalog: Technical catalog within AWS (still the primary choice for AWS-only workloads)&lt;/li&gt;
&lt;li&gt;Apache Polaris: Lakehouse catalog centered on Iceberg, shared across multiple engines&lt;/li&gt;
&lt;li&gt;Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even when combining with AWS, by consciously separating responsibilities as&lt;br&gt;
&lt;strong&gt;“data in AWS, catalog in Polaris, analytics in engines, business catalog in another layer”&lt;/strong&gt;,&lt;br&gt;
you can design an architecture that leverages the strengths of each.&lt;/p&gt;

&lt;p&gt;Going forward, lakehouse architectures are expected to increasingly adopt vendor-neutral designs.&lt;br&gt;
Apache Polaris is likely to become an important component supporting that openness.&lt;/p&gt;

&lt;p&gt;I hope this article will be helpful for those considering Apache Polaris or designing lakehouse architectures across multiple platforms such as AWS and Snowflake.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>snowflake</category>
      <category>dataengineering</category>
      <category>iceberg</category>
    </item>
  </channel>
</rss>
