Architecture | 17 min read | May 11, 2026

Building a Zero-Trust Detection Pipeline with Identity-First Monitoring

Map the architecture of an identity-centric detection pipeline: CloudTrail ingestion, behavioral baselining, anomaly scoring, and progressive response. Zero-trust principles apply to detection too: verify every identity action, assume breach.

The attack happened at 2:47am on a Tuesday. A DevOps engineer's laptop, compromised three days prior by infostealer malware, had harvested 1,400+ session cookies including a valid AWS SSO token. The attacker used that token to assume a Lambda execution role, attached AdministratorAccess, spun up 40 EC2 instances across four regions, and exfiltrated 2.3TB of customer data. Total elapsed time: 72 minutes. The organization's SIEM alerted on unusual EC2 volume 11 hours later. By then, the ransomware note was already in the CFO's inbox.

This wasn't a failure of infrastructure monitoring. CloudTrail captured every API call. GuardDuty flagged unusual instance types. The network security team detected the data egress. But none of those systems understood what mattered: the identity context. They saw infrastructure events. They missed the identity compromise that made everything else possible.

Zero-trust architecture applies to detection pipelines too. We've spent a decade implementing "never trust, always verify" for network access, but most detection systems still implicitly trust authenticated identities. If the API call came from a valid credential, it must be legitimate, right? Wrong. In 2025, 90% of breaches involved identity weaknesses, and 80% of MFA bypasses occurred through stolen session tokens that let attackers inherit trusted status [1]. Your detection pipeline needs to verify every identity action, assume every credential could be compromised, and baseline behavior continuously. Infrastructure-first detection is dead. Identity-first detection is the survival requirement.

Why Infrastructure-First Detection Fails at Identity Threats

Traditional SIEM architectures excel at detecting network anomalies, malware signatures, and infrastructure misconfigurations. They fail spectacularly at identity threats because they're designed to answer the wrong question. They ask "is this infrastructure behavior normal?" when they should ask "is this identity behavior normal?" A Lambda execution role invoking 200 APIs in three minutes looks identical whether it's running your legitimate CloudFormation deployment or an attacker's privilege escalation script. The infrastructure event is identical. The identity context is completely different.

Stolen session tokens account for roughly 80% of MFA bypasses because most detection pipelines validate the authentication event but ignore session behavior post-login [2]. When an attacker steals a session cookie with valid MFA attestation, your detection system sees a properly authenticated user. It doesn't notice that this "user" suddenly started calling IAM APIs from Singapore when their established baseline is CloudWatch queries from Virginia between 9am and 5pm EST. The authentication event was legitimate. The session behavior is a red flag. But if your pipeline only monitors infrastructure, you miss the anomaly entirely.

Non-human identities outnumber humans 144:1 in modern cloud environments, with some organizations reaching 500:1 ratios [3]. Yet most detection systems treat service accounts as trusted infrastructure rather than threat vectors requiring continuous verification. That Lambda execution role, that EC2 instance profile, that CI/CD pipeline's OIDC-federated identity, each represents an attack surface with god-mode permissions and zero behavioral monitoring. When attackers compromise a service account, they inherit trust that was never earned and operate inside a blind spot your detection system doesn't even know exists.

The 72-minute attack window from initial access to full exfiltration makes reactive detection worthless [4]. If your pipeline takes 17 days to detect credential misuse (the current industry average), you're not doing security; you're doing incident archaeology. By the time you notice the anomaly, the attacker has exfiltrated your data, established persistence, and moved on to the next target. You need identity-first pipelines that verify every action in near real time, not infrastructure-first systems that alert you after the damage is done.

Identity-First Detection Pipeline Architecture

Layer 1: CloudTrail Ingestion Architecture That Scales

Building an identity-centric detection pipeline starts with comprehensive CloudTrail ingestion across every AWS account in your organization. Centralized logging sounds obvious, but most enterprises discover they have 30-60% CloudTrail coverage when they actually audit all accounts. The first step is architectural: use AWS Organizations to configure an organization-wide CloudTrail trail that delivers logs from every member account to a single S3 bucket in your security account, with EventBridge rules for the events that need real-time routing. Encrypt with a dedicated KMS key, enable S3 Object Lock for compliance, and set lifecycle policies to transition logs to Glacier after 90 days.

Critical event types for identity monitoring include AssumeRole, GetSessionToken, CreateAccessKey, PutUserPolicy, AttachRolePolicy, DeleteRolePolicy, PutGroupPolicy, and any cross-account access patterns. These events reveal privilege escalation attempts, credential exfiltration, and lateral movement. But don't filter at ingestion; capture everything. You don't know what you'll need to investigate until the investigation starts. Filter downstream during analysis. The cost difference between storing all events vs. filtered events is negligible compared to the blind spot you create by dropping data.
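A minimal sketch of "capture everything, filter downstream": every event is kept and merely tagged. The event names come from the paragraph above, and the record fields (eventName, userIdentity.accountId, recipientAccountId) are standard CloudTrail fields. The function names and the output keys are illustrative assumptions.

```python
# Event names taken from the list above; extend for your environment.
IDENTITY_CRITICAL_EVENTS = frozenset({
    "AssumeRole", "GetSessionToken", "CreateAccessKey", "PutUserPolicy",
    "AttachRolePolicy", "DeleteRolePolicy", "PutGroupPolicy",
})


def is_identity_critical(event: dict) -> bool:
    """True when the CloudTrail record is one of the identity-critical calls."""
    return event.get("eventName") in IDENTITY_CRITICAL_EVENTS


def is_cross_account(event: dict) -> bool:
    """True when the caller's account differs from the account that recorded the event."""
    caller = event.get("userIdentity", {}).get("accountId")
    recipient = event.get("recipientAccountId")
    return bool(caller and recipient and caller != recipient)


def tag_event(event: dict) -> dict:
    """Annotate rather than drop: the original event is returned with two extra flags."""
    return {
        **event,
        "identity_critical": is_identity_critical(event),
        "cross_account": is_cross_account(event),
    }
```

Tagging instead of dropping keeps the full record available for later investigation while still letting analysis queries select the high-interest subset cheaply.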

Streaming ingestion via Kinesis Data Firehose versus batch processing via S3 event triggers depends on your detection SLA requirements and event volume. A typical enterprise generates 2-5 million CloudTrail events per day [5]. If you need sub-minute detection latency for high-risk actions, stream to Firehose with Lambda processing. If your SLA allows 5-10 minute latency, batch processing from S3 is simpler and cheaper. We've run both models. Streaming wins when you're automating progressive response. Batch wins when you're primarily alerting humans.

Data enrichment at ingestion time accelerates downstream analysis. Before writing events to your detection data store, append IAM principal ARN resolution (expand temporary credentials to originating identity), session context (device fingerprint, user-agent parsing), source IP geolocation, and organization unit mapping. This enrichment turns a raw CloudTrail event into an identity-contextualized signal. When you're investigating an incident at 3am, you don't want to be joining five data sources to figure out which human assumed which role from which location. Front-load that work at ingestion. Your future sleep-deprived self will thank you.
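As a rough sketch of ingestion-time enrichment, the function below copies a CloudTrail record and attaches an "enrichment" block. The userIdentity, sessionContext.sessionIssuer, sourceIPAddress, and userAgent fields are standard CloudTrail fields. The geo_lookup callable and the enrichment key names are assumptions, and the session issuer only resolves a temporary credential to its issuing role; mapping it to a human requires the lineage data described in Layer 2.

```python
from typing import Callable, Optional


def enrich_event(event: dict, geo_lookup: Callable[[str], Optional[str]]) -> dict:
    """Return a copy of the event with identity context attached.

    geo_lookup is a hypothetical callable mapping an IP string to a country
    code (or None); plug in whatever GeoIP source you actually use.
    """
    identity = event.get("userIdentity", {})
    issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
    source_ip = event.get("sourceIPAddress", "")
    enrichment = {
        "principal_arn": identity.get("arn"),
        # For temporary credentials, the issuer is the role behind the session.
        "origin_arn": issuer.get("arn") or identity.get("arn"),
        "identity_type": identity.get("type"),
        "user_agent": event.get("userAgent", ""),
        "source_country": geo_lookup(source_ip) if source_ip else None,
    }
    return {**event, "enrichment": enrichment}
```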

Layer 2: Identity Resolution Across Human and Non-Human Entities

Identity resolution is where most detection pipelines fail. CloudTrail logs contain principal ARNs, but those ARNs don't tell you who actually performed the action. When you see arn:aws:sts::123456789012:assumed-role/LambdaExecutionRole/aws-lambda-function-name, you know a Lambda function ran. You don't know which human deployed that function, which CI/CD pipeline triggered the deployment, or whether this is the same execution role used by 40 other functions across your organization. Building unified identity profiles requires correlation across IAM users, federated SSO sessions, EC2 instance profiles, Lambda execution roles, and cross-account assumed roles.
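The first step of identity resolution is simply splitting an assumed-role ARN into its parts. Here is a small sketch, assuming the usual arn:PARTITION:sts::ACCOUNT:assumed-role/ROLE/SESSION layout; the type and function names are illustrative.

```python
import re
from typing import NamedTuple, Optional

ASSUMED_ROLE_ARN = re.compile(
    r"^arn:(?P<partition>aws[\w-]*):sts::(?P<account>\d{12}):assumed-role/"
    r"(?P<role>[^/]+)/(?P<session>.+)$"
)


class AssumedRole(NamedTuple):
    account_id: str
    role_name: str
    session_name: str


def parse_assumed_role(arn: str) -> Optional[AssumedRole]:
    """Split an STS assumed-role ARN; return None for any other ARN shape."""
    match = ASSUMED_ROLE_ARN.match(arn)
    if match is None:
        return None
    return AssumedRole(match["account"], match["role"], match["session"])
```

For the ARN in the paragraph above, this yields account 123456789012, role LambdaExecutionRole, and session aws-lambda-function-name. The role name is what lets you group the 40 functions that share one execution role.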

Non-human identity tracking is the surveillance gap attackers exploit most frequently. When a single Lambda execution role is invoked from 40 different functions across 15 minutes, that's a normal deployment pattern if it's your CI/CD system. It's a compromise pattern if those invocations started five minutes after a developer's laptop was infected with infostealer malware. You need to correlate service account activity across accounts, build a baseline of normal usage patterns for each machine identity, and detect deviations that indicate credential theft. This requires maintaining an identity graph that links every temporary credential back to its originating human or machine entity.

Session lineage tracking detects session token theft where authentication logs show nothing suspicious. When an attacker steals a valid session cookie with MFA attestation, your auth logs record a legitimate login. But if you track session lineage, mapping every API call back through the temporary credential to the originating SSO session to the originating human identity, you can detect when that session suddenly exhibits behavior inconsistent with the human's established baseline. The authentication was real. The session behavior is wrong. That's your signal.
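One way to picture session lineage is as a chain of parent pointers: temporary credential, then SSO session, then human. The sketch below is an in-memory illustration only; the class name and the identifiers used as keys are assumptions, and a production system would persist this mapping.

```python
from typing import Dict


class SessionLineage:
    """Maps each credential or session to the identity that created it."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def record(self, child: str, parent: str) -> None:
        """Remember that `child` (e.g. a temporary access key) was issued by `parent`."""
        self._parent[child] = parent

    def origin(self, credential: str) -> str:
        """Walk parent links to the root identity; unknown credentials map to themselves."""
        seen = set()
        current = credential
        while current in self._parent and current not in seen:
            seen.add(current)  # guard against accidental cycles in the data
            current = self._parent[current]
        return current
```

With the root identity in hand, every API call in a session can be scored against the human's baseline rather than against the anonymous temporary credential.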

Identity graph construction reveals lateral movement patterns invisible in flat event logs. Build a graph where nodes represent identities (human IAM users, SSO principals, service accounts, execution roles) and edges represent relationships (who assumes which roles, which services call which APIs, which identities share permission boundaries). When an attacker compromises a low-privilege developer account and starts assuming roles they've never touched before, the topology of that identity's graph changes. The individual API calls might look benign. The structural shift in the identity graph screams compromise.
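The "structural shift" signal can be approximated by counting edges an identity has never had before. A minimal sketch, restricted to AssumeRole edges, with hypothetical class and function names:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class IdentityGraph:
    """Directed edges from a principal to the roles it has assumed."""

    def __init__(self) -> None:
        self._edges: Dict[str, Set[str]] = defaultdict(set)

    def observe(self, principal: str, role_arn: str) -> bool:
        """Record an edge; return True if this principal never assumed this role before."""
        is_new = role_arn not in self._edges[principal]
        self._edges[principal].add(role_arn)
        return is_new


def count_first_time_roles(graph: IdentityGraph, principal: str,
                           assumed_roles: Iterable[str]) -> int:
    """Number of never-before-seen roles in a batch of AssumeRole events."""
    return sum(graph.observe(principal, role) for role in assumed_roles)
```

A developer account that produces several first-time edges in a short window is the topology change described above, even if each AssumeRole call looks benign on its own.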

Layer 3: Behavioral Baseline Modeling (Teaching the System What Normal Looks Like)

Behavioral baselines must be identity-specific, not infrastructure-specific. A CloudFormation execution role invoking 200 APIs in three minutes is normal. The same pattern from a developer IAM user is a red flag indicating either a compromised credential or a script running with standing credentials it shouldn't have. Your baseline model needs to understand that different identity types have different normal behaviors. Trying to build a single "normal API usage" baseline across all identities is like trying to build a single "normal internet usage" baseline across your entire organization. The variance is so high the baseline becomes meaningless.

Multi-dimensional baseline vectors capture the full context of identity behavior. For each identity, track API call frequency distributions (mean, median, 95th percentile), resource access patterns (which S3 buckets, which EC2 instances, which IAM roles), time-of-day and day-of-week distributions, source IP geographies, cross-account access frequency, and privilege escalation attempts (both successful and failed). These vectors form a behavioral fingerprint unique to each identity. A deviation in any single dimension might be noise. Deviations across multiple dimensions simultaneously are signal.
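To make the frequency dimension concrete, here is a sketch that summarizes hourly call counts into mean, median, and a nearest-rank 95th percentile, plus a helper that lists which dimensions exceed their baseline. The dimension names and the choice of comparing against the 95th percentile are illustrative assumptions.

```python
import statistics
from typing import Dict, List, Sequence


def frequency_profile(calls_per_hour: Sequence[float]) -> Dict[str, float]:
    """Summarize one identity's call-frequency distribution."""
    if not calls_per_hour:
        raise ValueError("need at least one observation")
    ordered = sorted(calls_per_hour)
    rank = (95 * len(ordered) + 99) // 100  # nearest-rank p95, integer ceiling
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


def deviating_dimensions(baseline_p95: Dict[str, float],
                         observed: Dict[str, float]) -> List[str]:
    """Names of the dimensions where the observed value exceeds its baseline p95."""
    return sorted(
        name for name, value in observed.items()
        if name in baseline_p95 and value > baseline_p95[name]
    )
```

One returned name might be noise; several at once is the multi-dimensional deviation worth scoring.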

Continuous learning windows prevent your baselines from becoming stale. Use 30-day rolling windows with weekly recalibration to capture both typical patterns and seasonal variations. This approach detects sudden deviations (compromised credential used for immediate exfiltration) and slow drifts (privilege creep over months as a service account accumulates permissions it no longer needs). When a developer who typically calls CloudWatch APIs between 9am-5pm EST suddenly starts calling IAM APIs at 2am from Singapore, your baseline should flag that within minutes. When a Lambda function's API call volume gradually increases 40% over six months, your baseline should flag that as drift requiring manual review.
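A rolling window is easy to sketch with a deque that evicts samples older than the window each time a new one arrives. The class below is an assumption-laden illustration: it defaults to the 30-day window mentioned above and expects samples in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import List


class RollingBaseline:
    """Keeps only the observations that fall inside the rolling window."""

    def __init__(self, window: timedelta = timedelta(days=30)) -> None:
        self.window = window
        self._samples = deque()  # (timestamp, value), assumed to arrive in time order

    def add(self, ts: datetime, value: float) -> None:
        """Append a sample and evict everything older than ts - window."""
        self._samples.append((ts, value))
        cutoff = ts - self.window
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def values(self) -> List[float]:
        return [value for _, value in self._samples]
```

Catching slow drift needs a second, longer comparison (for example this month's mean against the mean from six months ago), since a rolling window by design absorbs gradual change.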

Separate baseline models for human identities versus non-human identities dramatically improve detection accuracy. Human identities have daily and weekly patterns (workday activity, after-hours silence, vacation gaps). Machine identities have sub-hour patterns (scheduled Lambda invocations, Auto Scaling group launches, CI/CD pipeline runs). A human identity with 95% variance in their behavior is normal. A machine identity with 20% variance is suspicious. Humans are erratic. Machines are predictable. When a machine identity starts behaving erratically, that's your highest-confidence signal for compromise.

Identity Type	Baseline Window	Key Behavioral Signals	Anomaly Threshold	Response SLA
Human IAM User	30 days rolling	Time-of-day, source IP, API frequency, resource access scope	3 standard deviations	10 minutes
Federated SSO Principal	30 days rolling	Session duration, post-auth activity, device fingerprint, impossible travel	2.5 standard deviations	5 minutes
Lambda Execution Role	7 days rolling	Invocation count, API call sequence, cross-service access, memory/duration	2 standard deviations	2 minutes
EC2 Instance Profile	7 days rolling	API volume, network egress, credential refresh frequency, metadata access	2 standard deviations	2 minutes
CI/CD Service Account	14 days rolling	Deployment frequency, resource creation patterns, IAM policy modifications	3 standard deviations	5 minutes
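The anomaly thresholds in the table translate directly into a per-identity-type z-score check. The sketch below uses the table's threshold values; the dictionary keys, function names, and the use of a population standard deviation over a single metric are assumptions.

```python
import statistics
from typing import Sequence

# Standard-deviation thresholds from the table above.
ANOMALY_THRESHOLDS = {
    "human_iam_user": 3.0,
    "federated_sso_principal": 2.5,
    "lambda_execution_role": 2.0,
    "ec2_instance_profile": 2.0,
    "cicd_service_account": 3.0,
}


def z_score(history: Sequence[float], observed: float) -> float:
    """Absolute distance of `observed` from the historical mean, in standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # A perfectly constant history: any change at all is maximally unusual.
        return 0.0 if observed == mean else float("inf")
    return abs(observed - mean) / stdev


def is_anomalous(identity_type: str, history: Sequence[float], observed: float) -> bool:
    return z_score(history, observed) > ANOMALY_THRESHOLDS[identity_type]
```

The same observation can be anomalous for a Lambda execution role and unremarkable for a human user, which is the point of keeping thresholds per identity type.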

Layer 4: Anomaly Scoring with Context-Aware Risk Signals

Composite risk scoring combines statistical deviation (how far from baseline), threat intelligence (IP reputation, known attack patterns), and environmental context (after-hours access, impossible travel, recent privilege changes). A single signal means nothing in isolation. An API call from a new IP address scores 10/100. The same API call from a new IP in a country your organization has never operated in scores 40/100. That call combined with 10x normal API volume scores 75/100. That combination occurring 10 minutes after a privilege escalation event scores 95/100. Context is everything. Your scoring model needs to multiply risk signals, not add them.
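One way to get multiplicative rather than additive behavior is a noisy-OR combination, where each signal's "probably benign" factor is multiplied together. This is our choice of technique for illustration, not the only option. The signal names are hypothetical and the weights were back-solved so the sketch reproduces the 10, 40, 75, and 95 scores from the example above; tune them on your own data.

```python
import math
from typing import Iterable

# Illustrative weights, chosen to reproduce the worked example in the text.
SIGNAL_WEIGHTS = {
    "new_source_ip": 0.10,
    "new_country": 1 / 3,
    "volume_spike_10x": 7 / 12,
    "recent_privilege_escalation": 0.80,
}


def composite_score(active_signals: Iterable[str]) -> int:
    """Noisy-OR: score = 100 * (1 - product of (1 - weight)) over the active signals."""
    benign = math.prod(1.0 - SIGNAL_WEIGHTS[name] for name in active_signals)
    return round(100 * (1.0 - benign))
```

Because the factors multiply, the score approaches 100 as signals stack up but never exceeds it, and a weak signal adds far more on top of other evidence than it does alone.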

Risk signal prioritization prevents alert fatigue by focusing human attention on the signals that actually matter. Credential exfiltration attempts (CreateAccessKey for another user, GetSessionToken for a principal that's never used temporary credentials) score 95/100. Unusual API volume spikes score 40/100. After-hours access scores 15/100 unless combined with other signals. We've run this model at scale. The 95+ scores are real incidents. The 60-80 scores are 50/50 (incident or legitimate unusual activity). The 30-50 scores are 90% false positives. Set your automated response thresholds accordingly.

Session token abuse detection requires comparing post-authentication session behavior against the identity's established baseline. When a user authenticates via SSO, your baseline knows their typical post-auth API call patterns: mostly CloudWatch queries, occasional EC2 describe calls, rare IAM read operations. If that user's session suddenly shifts to IAM write operations, S3 ListBuckets across all accounts, and cross-account AssumeRole calls, you're watching a stolen session cookie in action. The authentication was legitimate. The session behavior proves the credential was exfiltrated. This pattern is invisible to infrastructure-focused detection because every API call came from a valid credential. It's obvious to identity-focused detection because the behavioral deviation is massive.

Confidence thresholds for automated response map directly to your progressive response levels. Scores 0-30 log only (observational telemetry for future baseline refinement). Scores 31-60 alert SOC with full investigation context (identity timeline, recent privilege changes, similar historical incidents). Scores 61-80 challenge the identity (force MFA re-auth, require approval workflow for high-risk actions). Scores 81-95 isolate immediately (suspend active sessions, block new authentications, quarantine affected resources). Scores 96-100 revoke credentials automatically (disable IAM user, delete access keys, detach inline policies). The key is making these thresholds explicit and tuning them based on false positive rates in your specific environment.
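The score bands above map cleanly onto a lookup. A minimal sketch, using the band boundaries from the paragraph and the level names from Layer 5; the enum and function names are ours.

```python
from enum import IntEnum


class ResponseLevel(IntEnum):
    OBSERVE = 1
    ALERT = 2
    CHALLENGE = 3
    ISOLATE = 4
    REVOKE = 5


# Inclusive upper bound of each score band, in ascending order.
_UPPER_BOUNDS = [
    (30, ResponseLevel.OBSERVE),
    (60, ResponseLevel.ALERT),
    (80, ResponseLevel.CHALLENGE),
    (95, ResponseLevel.ISOLATE),
    (100, ResponseLevel.REVOKE),
]


def response_for(score: float) -> ResponseLevel:
    """Map a 0-100 risk score to its progressive response level."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for upper, level in _UPPER_BOUNDS:
        if score <= upper:
            return level
    raise AssertionError("unreachable: the bands cover 0-100")
```

Keeping the bounds in one table makes the thresholds explicit and easy to tune as false positive rates come in.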

Layer 5: Progressive Response Automation (Five Levels of Autonomy)

Level 1 (Observe) logs the anomaly with full context but takes no action. Use this for low-confidence signals or identities with broad legitimate access patterns where false positives would disrupt critical workflows. A principal engineer who legitimately accesses 40 different AWS accounts might trigger geographic anomalies when traveling for conferences. Log it. Don't block it. Build the behavioral baseline. Escalate to Level 2 only when multiple signals converge.

Level 2 (Alert) sends enriched notifications to your SOC with investigation runbooks attached. Include the identity timeline (last 7 days of activity), recent privilege changes (new policies attached, roles assumed for the first time), and similar historical incidents (has this identity triggered this detection before? what was the outcome?). We generate these runbooks automatically using the identity graph. When a human analyst opens the alert, they see "this is the third time this Lambda role has exhibited unusual cross-account access in the past 30 days" instead of a raw CloudTrail event dump. Context accelerates investigation from 45 minutes to 5 minutes.

Level 3 (Challenge) forces MFA re-authentication or requires approval workflows for high-risk actions. When a developer's IAM user suddenly tries to attach AdministratorAccess to itself, don't block it immediately; it might be legitimate break-glass access for an incident. Challenge them. "You're requesting admin privileges. Confirm via Slack approval workflow with your manager." This buys investigation time without breaking legitimate operations. If they can't complete the challenge within 60 seconds, escalate to Level 4. If they complete it, log the approval for audit and continue monitoring.

Level 4 (Isolate) suspends active sessions, blocks new authentications, and quarantines affected resources. When an SSO principal's session exhibits impossible travel (API calls from Virginia and Singapore within 10 minutes), immediately suspend all active sessions, block that principal from starting new sessions, and flag the identity for manual review. This is reversible within 60 seconds if a false positive is confirmed. In our experience, Level 4 triggers have a 95% true positive rate when they fire on impossible travel or session token abuse patterns. The remaining 5% are legitimate scenarios we couldn't anticipate (exec traveling internationally with VPN switching). Reverse the block, document the scenario, tune the baseline.
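Impossible travel reduces to a speed check between two geolocated API calls. The sketch below uses the haversine great-circle distance; the 900 km/h ceiling (roughly airliner cruising speed) and the function names are assumptions, and IP geolocation error means real thresholds need slack.

```python
import math
from datetime import datetime
from typing import Tuple

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumption: roughly commercial jet cruising speed

Sighting = Tuple[datetime, float, float]  # (timestamp, latitude, longitude)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_impossible_travel(first: Sighting, second: Sighting) -> bool:
    """True when the implied speed between two sightings exceeds the plausible maximum."""
    (t1, lat1, lon1), (t2, lat2, lon2) = first, second
    distance = haversine_km(lat1, lon1, lat2, lon2)
    hours = abs((t2 - t1).total_seconds()) / 3600
    if hours == 0:
        return distance > 0
    return distance / hours > MAX_PLAUSIBLE_KMH
```

The Virginia and Singapore pair from the paragraph above, ten minutes apart, implies a speed of tens of thousands of km/h, so it trips this check immediately.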

Level 5 (Revoke) immediately disables credentials, deletes access keys, and detaches inline policies. Use this only for confidence scores above 95 or when multiple high-severity signals converge with external threat intelligence confirmation. When you detect CreateAccessKey API calls from an IP address that threat intelligence ties to 15 different infostealer command-and-control domains, you don't wait for human confirmation. You revoke the credential, terminate all sessions, and alert the SOC. The investigation happens after the attacker loses access, not while they're still operating.

The 82% Unknown Identity Problem: Detecting Shadow AI Agents

Eighty-two percent of organizations discovered previously unknown AI agents in the past year [6]. These aren't identities that were invisible to security teams. These are identities that didn't exist in any inventory system because nobody knew they'd been created. A developer spins up an LLM-powered code review bot with AWS credentials, runs it for six months, then leaves the company. The bot keeps running. The credentials keep working. The permissions keep accumulating. Until an attacker discovers it.

AI agent compromise patterns differ from human or traditional machine identity compromise. Agents accumulate god-mode permissions through incremental policy attachments because each new capability requires broader access. An agent starts with S3 read-only, gets Lambda invoke permissions added for task execution, gets IAM read permissions for environment introspection, eventually gets cross-account access for multi-region orchestration. After six months, it has AdministratorAccess across four accounts. Nobody intended to grant that. It accumulated through incremental feature additions. Attackers know this. They specifically target AI agents because the permission creep is predictable and the credentials rarely rotate.

Detection heuristics for rogue agents focus on behavioral patterns that distinguish agents from other machine identities. Look for API call patterns with sub-second latency variance (agent control loops execute on tight schedules), cross-service orchestration sequences that don't match any known application architecture, and anomalous SDK user-agent strings (custom frameworks, unreleased libraries, deprecated SDKs that haven't been used in your environment for years). When you see a Lambda execution role making API calls with a user-agent string containing "langchain" and "anthropic-sdk" but you have no documented AI projects using those libraries, you've found a shadow agent.
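Two of these heuristics are simple enough to sketch: matching user-agent strings against agent-framework markers, and measuring how regular the gaps between calls are. The marker list uses the two names from the paragraph above; the function names and the idea of an approved list are assumptions.

```python
import statistics
from typing import FrozenSet, List, Optional, Sequence

# Markers named in the text; extend with whatever frameworks matter to you.
AGENT_FRAMEWORK_MARKERS = ("langchain", "anthropic-sdk")


def shadow_agent_markers(user_agent: str,
                         approved: FrozenSet[str] = frozenset()) -> List[str]:
    """Agent-framework markers found in a user-agent string that are not on the approved list."""
    lowered = user_agent.lower()
    return [m for m in AGENT_FRAMEWORK_MARKERS if m in lowered and m not in approved]


def interval_jitter_seconds(timestamps: Sequence[float]) -> Optional[float]:
    """Standard deviation of the gaps between consecutive calls (epoch seconds).

    Returns None when there are too few calls to measure. A value well under
    one second suggests a tight control loop rather than a human.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return statistics.pstdev(gaps) if len(gaps) >= 2 else None
```

Neither check proves anything alone; a role that matches an unapproved marker and also shows near-zero jitter is a strong candidate for the shadow-agent inventory.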

Lifecycle enforcement prevents credential persistence after agent decommissioning. Flag service accounts with no activity in 30 days. Detect agents making API calls without corresponding deployment events (Lambda invocations without CloudWatch logs, EC2 instance profile calls without a corresponding running EC2 instance). Cross-reference your CI/CD deployment logs with your identity activity logs. If an identity is making API calls but hasn't been deployed through your official pipeline in 60 days, it's either a shadow system or a compromised credential. Either way, it needs immediate investigation.
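The two lifecycle rules (inactive for 30 days, or active without a deployment in 60 days) can be expressed as a small check over two timestamps per identity. The flag names and function signature below are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Optional

INACTIVITY_LIMIT = timedelta(days=30)
DEPLOYMENT_LIMIT = timedelta(days=60)


def lifecycle_flags(last_activity: Optional[datetime],
                    last_deployment: Optional[datetime],
                    now: datetime) -> List[str]:
    """Lifecycle findings for one service identity.

    last_activity comes from identity activity logs and last_deployment from
    CI/CD logs; None means no record was found.
    """
    if last_activity is None or now - last_activity > INACTIVITY_LIMIT:
        return ["inactive_30d"]
    # The identity is active: it should trace back to a recent official deployment.
    if last_deployment is None or now - last_deployment > DEPLOYMENT_LIMIT:
        return ["active_without_recent_deployment"]
    return []
```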

From Detection to Response in Under 5 Minutes (Real-World SLAs)

Traditional 17-day detection windows are incompatible with 72-minute exfiltration timelines [7]. You cannot respond to an attack that completes before you even know it started. Set detection SLAs based on attack speed, not organizational response capacity. If attackers exfiltrate data in 72 minutes, your detection SLA must be measured in minutes, not days. Real-time response for credential exfiltration attempts (CreateAccessKey for another user, GetSessionToken from impossible geography) must trigger Level 4 or Level 5 response within 2 minutes. Impossible travel detection must trigger within 5 minutes. Unusual privilege escalation (AttachRolePolicy with Admin permissions) must trigger within 10 minutes.

Just-in-time access integration transforms standing privileges into ephemeral, scoped access that drastically reduces your attack surface. When your detection pipeline flags a standing admin account as high-risk, automatically migrate it to JIT provisioning: short-lived credentials on demand, scoped to specific tasks, automatically revoked after 4 hours. This creates detailed, immutable audit trails for every access request. Instead of "developer had admin access for 6 months and we don't know what they did with it," you have "developer requested admin access to troubleshoot production incident #4782, access granted for 4 hours, here are the 47 API calls they made, access auto-revoked at 2:34pm." This is the access model attackers can't exploit because the credentials expire before they finish reconnaissance.
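The shape of a JIT grant is small: who, which role, why, when, and a fixed lifetime. The dataclass below is a sketch of that record with the 4-hour lifetime from the paragraph above; the class and field names are assumptions, and actually issuing and revoking credentials is out of scope here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class JitGrant:
    """An immutable record of one just-in-time access grant."""
    principal: str
    role_arn: str
    reason: str
    granted_at: datetime
    ttl: timedelta = timedelta(hours=4)

    @property
    def expires_at(self) -> datetime:
        return self.granted_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        """True from the grant time up to, but not including, the expiry time."""
        return self.granted_at <= now < self.expires_at
```

A grant issued at 10:34am for incident #4782 expires at 2:34pm, matching the audit-trail example above. Freezing the dataclass keeps the record itself from being edited after the fact.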

Audit trail immutability is non-negotiable. Every detection event, every automated response action, and every manual override must be logged to tamper-proof storage. Use S3 Object Lock with compliance mode retention, paired with CloudTrail log file integrity validation and dedicated KMS keys. When an attacker compromises an admin account, the first thing they do is delete logs. If your detection system's own logs are mutable, you have no reliable incident timeline. We've investigated breaches where attackers deleted 80% of CloudTrail logs before we detected the intrusion. The 20% that remained was in an S3 bucket with Object Lock. That 20% was enough to reconstruct the attack timeline and prove the breach scope for compliance reporting. Immutable logs are your ground truth.

Response automation without human oversight creates compliance risk, but response delay creates security risk. The solution is progressive escalation with different approval thresholds. Level 1-3 responses run fully automated with no approval required. Level 4 responses (session isolation) require post-action review within 4 hours. Level 5 responses (credential revocation) require either 95+ confidence scores OR manual approval from on-call security lead within 5 minutes. This gives you automation for the 95% of incidents that match known patterns, human judgment for the 5% that require contextual understanding, and compliance coverage for both. Document the approval logic, test it quarterly, and adjust thresholds based on false positive rates in your environment.

References

[1] CrowdStrike, "2025 Global Threat Report," 2025. https://www.crowdstrike.com/global-threat-report/

[2] Verizon, "2025 Data Breach Investigations Report," 2025. https://www.verizon.com/business/resources/reports/dbir/

[3] ConductorOne, "State of Non-Human Identity Management Report," 2025. https://www.conductorone.com/non-human-identity-report

[4] Microsoft, "Digital Defense Report 2025," 2025. https://www.microsoft.com/en-us/security/business/microsoft-digital-defense-report

[5] AWS, "CloudTrail Best Practices and Usage Patterns," AWS re:Invent 2024. https://aws.amazon.com/blogs/mt/cloudtrail-best-practices/

[6] Axis Security, "The State of AI Agents in Enterprise Security," 2025. https://www.axissecurity.com/ai-agent-security-report

[7] IBM Security, "Cost of a Data Breach Report 2025," 2025. https://www.ibm.com/security/data-breach