Abstract
Sentinel is iSimplifyMe's production AI operations layer — a fleet of investigate-only Claude agents running on AWS Bedrock that monitor iSM's own infrastructure for regressions, anomalies, and operational incidents. It is internal infrastructure, not a customer-facing product, and runs to keep the rest of the platform honest. The same architecture is offered to clients as a productized Sentinel-pattern monitoring retainer.
Problem
Production AI infrastructure has more silent failure modes than monitorable ones. A Bedrock model that responds with semantically wrong answers passes a 200 OK health check. A retrieval pipeline that surfaces stale data clears every uptime probe.
Manual log review does not scale across a multi-site network with thirty in-production engagements. Status pages tell you what is on; they do not tell you what is wrong.
Approach
The agent topology
Each Sentinel agent is an investigate-only Bedrock-hosted workload with a discrete surveillance scope and a defined cadence. The agent reads from a constrained set of operational signals (logs, recent error events, model-call traces), runs a Claude Sonnet 4.6 inference pass through a US-bounded inference profile (the "us." prefix) to classify the situation, and decides whether the finding warrants escalation. No agent writes to client systems; the architecture is investigate-and-notify only.
Slack as the approval gate
When an agent identifies something worth escalating, it posts a Block Kit card to the appropriate channel with the diagnosis, the recommended remediation, and a small set of action buttons. A human reviewer clicks one. Only then does any remediation fire.
The design rule came directly from a 2026 incident where an unguarded automated drip in the Retell Phone Bridge sent the same follow-up email 48 times to three leads. Sentinel's discipline since then: automated detection is fine, automated remediation requires a human in the loop.
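The approval card described above can be sketched as a Block Kit payload. This is a minimal illustration, not iSM's actual card: the function names, the action_id values, and the button labels are hypothetical, and only the general shape (diagnosis, recommended remediation, a small set of buttons) comes from the text.

```python
def build_escalation_card(ticket_id: str, diagnosis: str, remediation: str) -> dict:
    """Build a Slack message body with Block Kit blocks (illustrative layout)."""
    return {
        # Plain-text fallback shown in notifications.
        "text": f"Sentinel finding {ticket_id}",
        "blocks": [
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"*Diagnosis*\n{diagnosis}"},
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Recommended remediation*\n{remediation}",
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Approve"},
                        "style": "primary",
                        "action_id": "sentinel_approve",  # hypothetical id
                        "value": ticket_id,
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Dismiss"},
                        "style": "danger",
                        "action_id": "sentinel_dismiss",  # hypothetical id
                        "value": ticket_id,
                    },
                ],
            },
        ],
    }


def is_approved(action_id: str) -> bool:
    """Remediation may proceed only after an explicit human approval click."""
    return action_id == "sentinel_approve"
```

In this sketch nothing downstream runs unless is_approved returns True for the button a reviewer clicked, which mirrors the human-in-the-loop rule.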
Workload #1: Diagnostics Agent
The Diagnostics Agent investigates client tenant sites that have failed three consecutive uptime checks — running curl, dig, and Cloudflare 5xx-breakdown probes via custom Bedrock tools — then files a markdown bug-report ticket with timeline, root cause, evidence, and recommended fix. Verified cost: $0.06 per incident on synthetic test cases (Claude Sonnet 4.6, ~50 seconds active runtime, ~28k tokens).
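The trigger condition can be sketched as a small predicate. This assumes uptime results arrive as an oldest-to-newest sequence of pass/fail booleans and that "three consecutive" refers to the most recent checks; the function and constant names are hypothetical.

```python
from typing import Sequence

FAILURE_THRESHOLD = 3  # from the text: three consecutive failed uptime checks


def should_investigate(
    check_results: Sequence[bool], threshold: int = FAILURE_THRESHOLD
) -> bool:
    """Return True when the latest `threshold` checks all failed.

    check_results is ordered oldest to newest; True means the check passed.
    """
    if len(check_results) < threshold:
        return False
    # Any pass inside the trailing window resets the streak.
    return not any(check_results[-threshold:])
```

A site that recovers even once inside the window does not trigger an investigation, which keeps brief blips from producing tickets.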
Workload #2: GH Triage Agent
The GH Triage Agent polls iSimplifyMe org repository workflow runs every fifteen minutes, detects failures, and runs an inference pass that classifies the root cause into one of eight categories: test_flake, regression, infrastructure, auth, dependency, lint_or_typecheck, build_config, and unknown. Output is a structured ticket with a markdown body covering Failure Summary, Classification, Failed Jobs, Recent Commits, and Investigator Notes. Verified cost: $0.065 per run (Claude Sonnet 4.6, ~44 seconds active runtime).
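A minimal sketch of how the classification output might be validated and rendered, assuming the model returns a free-text label that must be coerced into the eight categories. The category and section names come from the text; the function names, the "##" heading level, and the placeholder string are assumptions.

```python
from typing import Dict

CATEGORIES = (
    "test_flake", "regression", "infrastructure", "auth",
    "dependency", "lint_or_typecheck", "build_config", "unknown",
)

SECTIONS = (
    "Failure Summary", "Classification", "Failed Jobs",
    "Recent Commits", "Investigator Notes",
)


def normalize_category(raw: str) -> str:
    """Coerce a model-produced label into a known category, else 'unknown'."""
    label = raw.strip().lower()
    return label if label in CATEGORIES else "unknown"


def render_ticket(fields: Dict[str, str]) -> str:
    """Render the ticket body with every section present, in a fixed order."""
    parts = []
    for title in SECTIONS:
        body = fields.get(title, "_not provided_")
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)
```

Falling back to unknown rather than raising keeps an unexpected label from blocking the ticket; whether the real agent does this is not stated in the source.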
The agent is idempotent: once a failed run has been investigated, a 24-hour DynamoDB (DDB) lock prevents re-investigation, so flapping CI does not produce duplicate tickets.
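The locking behavior can be illustrated with an in-memory stand-in for the DynamoDB (DDB) table. This is a sketch under stated assumptions: the class and method names are hypothetical, and the real system presumably relies on a conditional write rather than a Python dictionary.

```python
import time
from typing import Dict, Optional

LOCK_TTL_SECONDS = 24 * 60 * 60  # from the text: a 24-hour lock


class InMemoryLockTable:
    """Stand-in for a lock table.

    acquire() mimics a conditional put that succeeds only when no
    unexpired lock exists for the given workflow run.
    """

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def acquire(self, run_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expiry = self._expires_at.get(run_id)
        if expiry is not None and expiry > now:
            return False  # already investigated within the lock window
        self._expires_at[run_id] = now + LOCK_TTL_SECONDS
        return True
```

Against DynamoDB itself, the equivalent would likely be a put with a ConditionExpression such as attribute_not_exists on the key. Because DynamoDB's TTL feature deletes expired items asynchronously, a real condition would probably also compare the stored expiry against the current time, as the sketch does.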
Workload #3: Pipeline Hang Detector
The Pipeline Hang Detector watches the iSM multi-site content pipeline for anomaly states — stuck topic-proposal runs, malformed MDX rejections, frontmatter envelope drift, write-post Lambda failures — and runs an inference pass to classify the cause and identify the affected tenants. Output uses the same structured ticket format as the other agents and routes to a content-pipeline-specific Slack channel.
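One of the anomaly states, stuck runs, can be sketched as an age check over in-flight runs. Everything specific here is an assumption made for illustration: the PipelineRun fields, the "running" status value, and the two-hour cutoff do not appear in the source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class PipelineRun:
    run_id: str
    tenant: str
    status: str  # hypothetical values: "running", "succeeded", "failed"
    started_at: datetime


def find_stuck_runs(
    runs: Iterable[PipelineRun],
    now: datetime,
    max_age: timedelta = timedelta(hours=2),  # assumed cutoff
) -> List[PipelineRun]:
    """Return runs still marked running after more than max_age."""
    return [
        r for r in runs
        if r.status == "running" and now - r.started_at > max_age
    ]


def affected_tenants(stuck: Iterable[PipelineRun]) -> List[str]:
    """Distinct tenants touched by the stuck runs, sorted for stable output."""
    return sorted({r.tenant for r in stuck})
```

A deterministic pre-filter like this would let the inference pass run only on candidates that already look anomalous.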
The three workloads share infrastructure: one generic SQS-triggered runner Lambda dispatches the right agent based on a SENTINEL_AGENT_SLUG kickoff message, an atomic conditional-write lock at INCIDENT#OPEN race-protects parallel detection paths, and the same file_ticket and notify_slack tools serve all three. Adding a new Sentinel workload is a registry entry plus a detector handler; everything else is shared.
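The registry-plus-dispatcher idea can be sketched as follows. This assumes the kickoff message is a JSON body carrying a SENTINEL_AGENT_SLUG field and an optional payload; the slug strings, handler names, and payload shape are hypothetical, and the INCIDENT#OPEN lock is left out of scope.

```python
import json
from typing import Callable, Dict, List

AgentHandler = Callable[[dict], str]


def run_diagnostics(payload: dict) -> str:
    return f"diagnostics:{payload.get('target', 'n/a')}"


def run_gh_triage(payload: dict) -> str:
    return f"gh-triage:{payload.get('target', 'n/a')}"


def run_pipeline_hang(payload: dict) -> str:
    return f"pipeline-hang:{payload.get('target', 'n/a')}"


# Adding a workload means adding one entry here (slugs are hypothetical).
AGENT_REGISTRY: Dict[str, AgentHandler] = {
    "diagnostics": run_diagnostics,
    "gh-triage": run_gh_triage,
    "pipeline-hang": run_pipeline_hang,
}


def handler(event: dict, context: object = None) -> List[str]:
    """Generic SQS-triggered entry point: route each record by agent slug."""
    results = []
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        slug = message["SENTINEL_AGENT_SLUG"]
        agent = AGENT_REGISTRY.get(slug)
        if agent is None:
            raise ValueError(f"unknown agent slug: {slug}")
        results.append(agent(message.get("payload", {})))
    return results
```

Raising on an unknown slug lets the queue's retry and dead-letter handling surface a misrouted message instead of silently dropping it.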
Eat-our-own-dogfood proof point
Sentinel runs on the same AWS Bedrock substrate (BedrockRuntimeClient + ConverseStreamCommand + DynamoDB ticket store + EventBridge cron + SQS queue + IAM-scoped Bedrock perms) that iSimplifyMe deploys for client validator-architecture engagements. iSM operates Sentinel as a production proof point of the architecture it proposes for regulated-industry clients — every workload type is in production at iSM before being offered to clients.
Status
- Sentinel runs in production on AWS Bedrock as iSM's internal AI operations infrastructure. Three workloads are live: Diagnostics Agent, GH Triage Agent, and Pipeline Hang Detector.
- Total platform cost: under $50/month across all three workloads at current activity volume.
- Architecture is investigate-only by design — no agent writes to client systems, no agent fires remediation without human approval through the Slack gate.
Roadmap
Sentinel's roadmap continues across two tracks: additional workloads against iSM's own properties, and productization as a client-facing service line.
iSM property monitoring (internal expansion)
- Lighthouse regression detector — nightly Lighthouse audits across the iSM editorial atlas network (Marque Cars, Subdial, Eldercare Atlas, RoofingTechPro) and client websites; threshold-based detection of performance regressions before they affect AEO rankings.
- AEO drift and citation surveillance — schedule-driven probes against ChatGPT, Gemini, AI Overview, and Perplexity for the citation-protected substrate pages currently cited as authoritative sources; alerts on framing or citation drift.
- Cost anomaly detector — CloudWatch billing and Cost Explorer probes for AWS spend spikes across the iSM project portfolio.
- Weekly audit agent — cross-repo health rollups across the thirty-seven iSimplifyMe org repositories.
- DNS watcher — Cloudflare zone monitoring for the brand-citation infrastructure across all iSM-managed domains.
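The threshold-based detection planned for the Lighthouse regression detector could look like the sketch below. The baseline choice (median of recent nightly scores) and the 10-point drop threshold are assumptions for illustration; the roadmap does not specify either.

```python
from statistics import median
from typing import Sequence


def is_regression(
    history: Sequence[float], latest: float, max_drop: float = 10.0
) -> bool:
    """Flag a regression when the latest score falls more than max_drop
    points below the median of recent scores (0-100 scale)."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return median(history) - latest > max_drop
```

Using a median rather than the previous night's score makes the check less sensitive to a single noisy audit.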
Client engagements (productized)
The Sentinel architecture is available to clients as a productized retainer: Sentinel-pattern monitoring. iSM operates the same investigate-only agent topology on the buyer's AWS Bedrock infrastructure to provide continuous validator-gate hit/miss telemetry, drift detection, and incident response. The retainer pairs with the Validator Architecture build engagement (audit, architecture, then operate) and is priced as a custom monthly retainer sized to scope.
The pattern is repeatable per client: discovery (which validator gates does the buyer deploy?), Sentinel deployment (investigate-only Claude agents on the buyer's Bedrock account), Slack-gated escalation (findings route to a buyer-designated channel; remediation requires human approval), and quarterly reviews against iSM's reference architecture. The retainer is offered only to mid-market and enterprise clients in regulated industries.