AI Observability Engineering: Why Traditional Monitoring Misses 90% of Agent Risks

Traditional observability misses 90% of AI agent security risks. Microsoft's updated Secure Development Lifecycle (SDL) reveals why logs, metrics, and t...

By Rajesh Beri·March 29, 2026·13 min read
THE DAILY BRIEF

AI Security · Observability · AI Governance · Microsoft SDL · CTO · CISO

Your traditional observability stack—built for uptime, latency, and error rates—cannot detect the most dangerous AI agent failures. Microsoft's March 18, 2026, update to its Secure Development Lifecycle (SDL) reveals why: traditional monitoring measures what goes wrong with infrastructure, but AI systems fail when trust boundaries between agents and external content get compromised. And those failures don't show up as errors. They show up as agents behaving exactly as designed… just under attacker control.

Here's the scenario Microsoft's security team uses to explain the gap. An email agent asks a research agent to look up something on the web. The research agent fetches a page containing hidden instructions and passes the poisoned content back to the email agent as trusted input. The email agent, now operating under attacker influence, forwards sensitive documents to unauthorized recipients. Data exfiltration complete.

Traditional health metrics stay green. No failures. No errors. No alerts. The system worked perfectly. Except an attacker just stole your data through indirect prompt injection that your observability stack never detected because it wasn't looking for the right signals.

⚠️ The AI Observability Gap

Traditional observability: "Is the system up?" → Health metrics, latency, errors

AI observability: "Is the system doing what we intended?" → Context assembly, trust boundaries, multi-turn behavior, tool invocation patterns

This isn't theoretical. Microsoft updated its SDL—the same framework used internally for Azure, Microsoft 365, and Agent 365—because traditional monitoring practices don't work for agentic AI systems. And if you're deploying AI agents in production without AI-native observability, you're flying blind.

Why Traditional Observability Fails for AI Systems

Traditional software observability was built on three pillars: logs, metrics, and traces. These work well for deterministic systems where code paths follow predictable flows. API call comes in, business logic executes, response goes out. Measure latency, count errors, trace request paths. Done.

AI systems break this model in three fundamental ways.

AI systems are probabilistic, not deterministic. Traditional software follows if-then logic. AI systems evaluate natural language inputs and return probabilistic results that can differ—subtly or significantly—from execution to execution. You can't define success and failure modes as a finite set of error codes. The same prompt can generate different outputs depending on context assembled at runtime.

Context is assembled dynamically, not schema-defined. Traditional services treat inputs as bounded and schema-defined: JSON payloads with known fields, validated at entry. AI systems assemble context from multiple sources: system instructions, conversation history, retrieved documents, tool outputs, web page scrapes. Each component has a different trust level and provenance, but traditional observability doesn't track this. It just sees "request in, response out."
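One way to make trust level and provenance visible is to tag every piece of context before it reaches the model. The sketch below is illustrative only: the `Trust` levels, the `ContextSegment` type, and the source labels are hypothetical names chosen for this example, not part of Microsoft's SDL or any library.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    """Hypothetical trust levels for context sources."""
    SYSTEM = "system"      # operator-authored instructions
    USER = "user"          # input from the authenticated user
    INTERNAL = "internal"  # documents retrieved from internal stores
    EXTERNAL = "external"  # web pages and third-party tool output


@dataclass(frozen=True)
class ContextSegment:
    source: str   # label or URL describing where the text came from
    trust: Trust
    text: str


def provenance_summary(segments: list[ContextSegment]) -> dict[str, int]:
    """Count the characters each trust level contributed to the context."""
    summary = {level.value: 0 for level in Trust}
    for segment in segments:
        summary[segment.trust.value] += len(segment.text)
    return summary


def has_untrusted_content(segments: list[ContextSegment]) -> bool:
    """True if any segment came from an external source."""
    return any(segment.trust is Trust.EXTERNAL for segment in segments)


context = [
    ContextSegment("system_prompt", Trust.SYSTEM, "You are an email assistant."),
    ContextSegment("https://example.com/page", Trust.EXTERNAL, "Ignore prior instructions."),
]
print(provenance_summary(context))
print(has_untrusted_content(context))
```

Emitting a summary like this alongside each model call gives telemetry more to work with than "request in, response out": it records that external text was present when the agent acted.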

Failures unfold across multiple turns, not single requests. Traditional observability is optimized for request-level correlation: one request maps to one outcome. AI agent failures can take dozens of turns to manifest. Multi-turn jailbreaks like Crescendo start with seemingly harmless prompts and escalate turn by turn until the system produces disallowed output. Each individual turn looks fine. Only the pattern across the whole conversation, say 20 turns, reveals the attack.

Microsoft's SDL update makes this explicit: logs, metrics, and traces still apply to AI systems, but what gets captured within them must change. Traditional observability measures infrastructure health. AI observability measures whether the system is doing what you intended—or what an attacker intended.

Photo by Slejven Djurakovic on Pexels

Traditional vs. AI Observability: What Changes

The core observability components—logs, metrics, traces—remain, but AI systems require two additional components: evaluation and governance. Here's what changes:

Component | Traditional Observability | AI Observability
Logs | Request ID, timestamp, status code, error messages | User prompts, model responses, retrieval provenance, tool invocations, permission context
Metrics | Latency, throughput, error rates, CPU/memory | Token usage, agent turns, retrieval volume, tool call frequency, evaluation score distributions
Traces | Request-level correlation (one request = one trace) | Agent lifecycle-level correlation (conversation ID across turns, end-to-end context propagation)
Evaluation | N/A (quality = uptime) | Response quality, grounding accuracy, tool usage correctness, instruction alignment
Governance | N/A (compliance = access controls) | Policy enforcement verification, auditability, accountability, behavioral baselining

Logs must capture full context assembly. User prompts and model responses are often the earliest signal of novel attacks before signatures exist. They're essential for identifying multi-turn escalation, verifying whether attacks changed system behavior, adjudicating safety detections, and reconstructing attack paths. Microsoft's guidance is explicit: log which data sources were consulted, which tools were invoked with what arguments, and what permissions were in effect. This detail distinguishes a model error from an exploited trust boundary.
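As a rough sketch of what such a log entry could look like, the record below carries the items listed above in one structured JSON line. The field names and the `AgentStepRecord` and `ToolCall` types are assumptions made for illustration; a real deployment would follow its own schema or an established convention.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class AgentStepRecord:
    """One agent turn, with the context needed to reconstruct it later."""
    conversation_id: str
    turn: int
    user_prompt: str
    model_response: str
    retrieval_sources: list = field(default_factory=list)  # data sources consulted
    tool_calls: list = field(default_factory=list)          # ToolCall entries with arguments
    permissions: list = field(default_factory=list)         # permissions in effect
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # asdict() also converts the nested ToolCall dataclasses to dicts.
        return json.dumps(asdict(self), sort_keys=True)


record = AgentStepRecord(
    conversation_id="conv-123",
    turn=4,
    user_prompt="Summarize the vendor page and email it to me.",
    model_response="Sending the summary now.",
    retrieval_sources=["https://example.com/vendor"],
    tool_calls=[ToolCall("send_email", {"to": "user@example.com"})],
    permissions=["mail.send"],
)
print(record.to_json())
```

With the tool arguments and retrieval sources in the same record, an investigator can see whether an unexpected recipient appeared right after an external page entered the context.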

Metrics need behavioral baselines, not just static thresholds. Traditional metrics alert when error rates cross 5% or latency exceeds 500ms. AI metrics need to capture normal patterns of agent activity—tool call frequencies, retrieval volumes, token consumption, evaluation score distributions—and alert on meaningful departures from those baselines. An agent that suddenly doubles its web scraping volume or triples token usage might be under attacker control, not experiencing a performance issue.
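A minimal sketch of a baseline check, assuming a simple z-score over a history of per-hour counts. The threshold of 3.0 and the sample numbers are illustrative assumptions; production systems typically need seasonality handling and more robust statistics.

```python
from statistics import mean, stdev


def deviation_score(baseline: list[float], observed: float) -> float:
    """Number of standard deviations `observed` sits from the baseline mean.

    Requires at least two baseline points (statistics.stdev raises otherwise).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Perfectly flat baseline: any change at all is infinitely unusual.
        return 0.0 if observed == mu else float("inf")
    return (observed - mu) / sigma


def is_anomalous(baseline: list[float], observed: float, threshold: float = 3.0) -> bool:
    """Flag values whose absolute z-score exceeds the threshold."""
    return abs(deviation_score(baseline, observed)) > threshold


# Hypothetical hourly web-fetch counts for one agent during a quiet period.
hourly_fetches = [10, 12, 11, 9, 10, 11, 12, 10]
print(is_anomalous(hourly_fetches, 11))  # a typical hour
print(is_anomalous(hourly_fetches, 22))  # roughly double the usual volume
```

The same function applies to token consumption, tool call frequency, or evaluation scores; what changes is the series fed in as the baseline.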

Traces must span agent lifecycles, not individual requests. Request-level correlation works when each request is independent. AI agents carry state from turn to turn within a conversation. A stable conversation identifier must propagate across turns, preserving trace context end-to-end, so outcomes can be understood within the full conversational narrative. Without this, debugging a multi-turn jailbreak means guessing which of 30 individual turns went wrong.

Evaluation measures response quality in production. Traditional observability assumes that if the system returns HTTP 200, the response is correct. AI systems can return perfectly formatted responses that are factually wrong, ungrounded in source material, or violate safety policies. Evaluation gives teams measurable signals to understand agent reliability, instruction alignment, and operational risk over time.
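As a deliberately crude illustration of an evaluation signal, the sketch below scores how much of a response's vocabulary appears in the retrieved sources. This lexical overlap is only a stand-in assumed for the example; real grounding evaluation usually relies on model-based or human judgments, and the function name is hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(response: str, sources: list[str]) -> float:
    """Fraction of distinct response tokens found in at least one source.

    A rough lexical proxy: 1.0 means every token appears in the sources,
    0.0 means none do. An empty response scores 1.0 by convention.
    """
    response_tokens = _tokens(response)
    if not response_tokens:
        return 1.0
    source_tokens = set().union(*(_tokens(s) for s in sources))
    return len(response_tokens & source_tokens) / len(response_tokens)


sources = ["Q3 revenue grew 12 percent year over year"]
print(grounding_score("Revenue grew 12 percent", sources))
print(grounding_score("Revenue fell 40 percent", sources))
```

Both responses would return HTTP 200. Tracking the distribution of a score like this over time is what turns "the response looked fine" into a measurable signal.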

Governance verifies policy enforcement using observable evidence. Compliance in traditional systems means access controls and audit logs. Governance for AI systems means the ability to measure, verify, and enforce acceptable system behavior using telemetry and control plane mechanisms. This ensures that policies aren't just configured—they're actually working in production.

Microsoft's 5-Step SDL Framework for AI Observability

Microsoft's updated SDL provides a formal mechanism for operationalizing AI observability. The framework embeds observability as a release requirement, not a post-deployment add-on. Here's the 5-step process:

5-Step AI Observability Framework (Microsoft SDL)

  1. Incorporate AI observability into secure development standards. Observability for GenAI and agentic AI systems should be a codified requirement within your development lifecycle, not a discretionary practice left to individual teams. This means updating your SDL, security reviews, and architecture sign-off processes to include AI-native telemetry requirements before code review.
  2. Instrument from the start of development. Build AI-native telemetry into your system at design time, not after release. Aligning with industry conventions like OpenTelemetry (OTel) and its GenAI semantic conventions improves consistency and interoperability across frameworks. For Microsoft stacks, this means using Microsoft Foundry agent tracing for runtime diagnostics and the Microsoft Agent 365 Observability SDK for tenant-level governance.
  3. Capture the full context. Log user prompts and model responses, retrieval provenance, tool invocations with arguments, and permission context. This detail helps security teams distinguish model errors from exploited trust boundaries and enables end-to-end forensic reconstruction. What to capture should be governed by clear data contracts that balance forensic needs against privacy, data residency, retention requirements, and compliance obligations.
  4. Establish behavioral baselines and alert on deviation. Capture normal patterns of agent activity—tool call frequencies, retrieval volumes, token consumption, evaluation score distributions—through Azure Monitor, Application Insights, or similar services. Alert on meaningful departures from those baselines rather than relying solely on static error thresholds. An agent that suddenly starts calling external APIs 10x more frequently might be compromised, not just busy.
  5. Manage enterprise AI agents centrally. Observability alone cannot answer every question. Technology leaders need to know how many AI agents are running, whether those agents are secure, and whether compliance and policy enforcement are consistent. Observability coupled with unified governance (Microsoft Foundry Control Plane, Microsoft Agent 365) consolidates inventory, compliance verification, and security into one role-aware interface.

The critical insight in Microsoft's framework: if you cannot reconstruct an agent run or detect trust-boundary violations from logs and traces, the system is not ready for production. AI observability should be a release gate, not a nice-to-have.
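A release gate of this kind can be approximated with a check that runs against telemetry captured from a sample agent run. The sketch below assumes records are plain dictionaries and uses a hypothetical list of required fields; it tests only that the fields are present, not that their contents are sufficient for a real investigation.

```python
# Hypothetical minimum set of fields needed to reconstruct an agent run.
REQUIRED_FIELDS = (
    "conversation_id",
    "turn",
    "user_prompt",
    "model_response",
    "retrieval_sources",
    "tool_calls",
    "permissions",
)


def missing_fields(record: dict) -> list[str]:
    """Names of required fields absent from one telemetry record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def release_gate(sample_run: list[dict]) -> tuple[bool, list[str]]:
    """Return (passed, problems) for telemetry captured from a sample run."""
    problems = []
    if not sample_run:
        problems.append("no telemetry captured for the sample run")
    for index, record in enumerate(sample_run):
        for name in missing_fields(record):
            problems.append(f"record {index}: missing {name}")
    return (not problems, problems)


incomplete = [{"conversation_id": "c1", "turn": 1, "user_prompt": "hi",
               "model_response": "hello", "retrieval_sources": [], "tool_calls": []}]
passed, problems = release_gate(incomplete)
print(passed, problems)
```

Wiring a check like this into the build pipeline makes the gate enforceable rather than a line item in a review document.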

What Enterprise Security Teams Gain From AI Observability

Making enterprise AI systems observable transforms opaque model behavior into actionable security signals. This strengthens both proactive risk detection (catching attacks before damage) and reactive incident investigation (reconstructing what happened after the fact).

Proactive risk detection: behavioral baselines catch novel attacks. Traditional security tools rely on signatures—known attack patterns. AI attacks evolve faster than signature databases. Behavioral baselining lets security teams detect anomalies in agent behavior that don't match any known attack but clearly deviate from normal patterns. An agent that starts retrieving documents from unexpected SharePoint sites or calling tools in sequences it's never used before gets flagged for investigation before data leaves the perimeter.

Reactive incident investigation: full context enables forensic reconstruction. When an AI agent does something suspicious, security teams need to reconstruct the entire sequence: what prompted the behavior, which context influenced the decision, which tools were called, what data was accessed, and where outputs were sent. Without AI-native logs and traces, this is impossible. With full context capture, security teams can replay the agent's decision-making process step by step and determine whether the system was compromised, misconfigured, or functioning as designed.

Inference-time protections get measurable validation. Many organizations deploy guardrails at inference time—content filters, prompt injection detectors, policy enforcement layers. Observability complements these protections by enabling fast incident reconstruction, clear impact analysis, and measurable improvement over time. Security teams can evaluate whether controls are working as intended, which attacks they catch, and which ones slip through. This feedback loop drives continuous improvement in AI security posture.

Compliance and governance become verifiable, not aspirational. Traditional compliance means proving you configured the right access controls. AI governance means proving your AI systems behave according to policy in production. Observability provides the evidence: logs show which permissions were in effect, traces show which tools were invoked, evaluation scores show whether outputs met quality standards, and behavioral metrics show whether agents stayed within acceptable bounds. This makes audit conversations concrete instead of theoretical.

Implementing AI Observability: What CTOs and CISOs Should Do Now

The SDL framework provides the roadmap, but execution requires coordination across engineering, security, and operations teams. Here's the implementation sequence:

Start with OpenTelemetry for AI-native instrumentation. OpenTelemetry's GenAI semantic conventions provide a vendor-neutral standard for capturing AI-specific telemetry. This means logs, metrics, and traces collected from your AI systems remain portable across observability platforms (Datadog, Dash0, Azure Monitor, Splunk). For enterprise teams already using OTel for traditional services, extending it to AI systems maintains consistency across your observability stack.

Define data contracts early in the development lifecycle. What gets logged? User prompts and model responses (essential for attack detection but sensitive from a privacy perspective). Retrieval provenance (which documents influenced the output). Tool invocations and arguments (which external systems did the agent touch). Permission context (who authorized the action). Define these contracts before development starts, not during incident response.
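A data contract can be as simple as a table of per-field rules applied before a record is stored. The sketch below is one possible shape, with hypothetical field names, retention values, and a basic email-masking rule standing in for whatever redaction policy applies in practice.

```python
import re
from dataclasses import dataclass

# Simple pattern for email addresses; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass(frozen=True)
class FieldRule:
    capture: bool             # store this field at all?
    redact: bool = False      # mask email addresses in string values?
    retention_days: int = 30  # how long the stored value may be kept


# Hypothetical contract; fields not listed here are dropped.
CONTRACT = {
    "user_prompt": FieldRule(capture=True, redact=True, retention_days=30),
    "model_response": FieldRule(capture=True, redact=True, retention_days=30),
    "retrieval_sources": FieldRule(capture=True, retention_days=90),
    "tool_calls": FieldRule(capture=True, retention_days=90),
    "permissions": FieldRule(capture=True, retention_days=365),
}


def apply_contract(record: dict, contract: dict) -> dict:
    """Keep only contracted fields, masking emails in strings where required."""
    stored = {}
    for key, value in record.items():
        rule = contract.get(key)
        if rule is None or not rule.capture:
            continue
        if rule.redact and isinstance(value, str):
            value = EMAIL.sub("[email]", value)
        stored[key] = value
    return stored


raw = {"user_prompt": "Forward the deck to alice@example.com today", "debug_blob": "x"}
print(apply_contract(raw, CONTRACT))
```

Note that this sketch masks only top-level string fields; nested values such as tool arguments pass through unchanged, which is exactly the kind of decision a contract should make explicit.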

Implement behavioral baselining before production deployment. Capture 2-4 weeks of agent behavior in staging environments to establish normal patterns. This gives security teams a reference point for detecting anomalies in production. Without baselines, there is no definition of normal to compare against, so alert thresholds are guesses: set them too tight and real attacks get buried in false positives, set them too loose and attacks pass unnoticed.

Integrate with existing security tooling. AI observability shouldn't exist in isolation. Feed AI-native telemetry into your SIEM (Security Information and Event Management) platform so security teams can correlate AI agent behavior with network traffic, authentication events, and data access patterns. An agent exfiltrating data will generate signals across multiple systems—observability platforms, network logs, identity systems. Correlating these signals reveals the full attack path.
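The correlation step can be sketched as a time-window join between agent tool-call events and events from another system. The example below assumes both streams are lists of dictionaries sharing an `identity` key and a `time` value; those names and the five-minute window are illustrative assumptions, and a real SIEM would express this as a query rather than application code.

```python
from datetime import datetime, timedelta


def correlate(agent_events: list[dict], network_events: list[dict],
              window: timedelta = timedelta(minutes=5)) -> list[tuple[dict, dict]]:
    """Pair each agent event with network events from the same identity
    that occur at or after it, within `window`."""
    matches = []
    for agent_event in agent_events:
        for net_event in network_events:
            same_identity = agent_event["identity"] == net_event["identity"]
            delta = net_event["time"] - agent_event["time"]
            if same_identity and timedelta(0) <= delta <= window:
                matches.append((agent_event, net_event))
    return matches


t0 = datetime(2026, 3, 29, 12, 0, 0)
agent_events = [{"identity": "svc-email-agent", "time": t0, "tool": "send_email"}]
network_events = [
    {"identity": "svc-email-agent", "time": t0 + timedelta(minutes=2), "bytes_out": 48_000_000},
    {"identity": "svc-other", "time": t0 + timedelta(minutes=1), "bytes_out": 10},
]
for agent_event, net_event in correlate(agent_events, network_events):
    print(agent_event["tool"], net_event["bytes_out"])
```

The nested loop is quadratic and fine for a sketch; at production volume the same join would be done with indexed or time-sorted data.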

Test detectability as part of security validation. Before releasing AI systems to production, run penetration tests that specifically target AI vulnerabilities: indirect prompt injection, multi-turn jailbreaks, tool-mediated data exfiltration. Verify that your observability stack detects these attacks and generates actionable alerts. If security testing reveals blind spots, instrument additional telemetry before go-live.
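One simple way to test detectability is to plant a unique marker in the content used for an injection test and then check whether telemetry shows that marker reaching a tool call. The sketch below assumes records shaped as dictionaries with a `tool_calls` list; the marker value and field names are hypothetical.

```python
CANARY = "CANARY-7f3a"  # hypothetical marker planted in red-team test content


def detect_canary_in_tool_calls(records: list[dict]) -> list[dict]:
    """Return records in which the planted marker appears in a tool argument."""
    hits = []
    for record in records:
        for call in record.get("tool_calls", []):
            if any(CANARY in str(value) for value in call["arguments"].values()):
                hits.append(record)
                break  # one hit per record is enough
    return hits


# Telemetry from a simulated run in which a poisoned page told the agent
# to send mail to an address containing the marker.
telemetry = [
    {"turn": 1, "tool_calls": [{"name": "web_fetch", "arguments": {"url": "https://example.com"}}]},
    {"turn": 2, "tool_calls": [{"name": "send_email",
                                "arguments": {"to": f"{CANARY}@example.com"}}]},
]
hits = detect_canary_in_tool_calls(telemetry)
print([record["turn"] for record in hits])
```

If the attack succeeds in the test environment but this check finds nothing, the gap is in the telemetry, which is the blind spot to close before go-live.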

The Bottom Line for Enterprise Leaders

Traditional observability was built for deterministic systems where success and failure modes are predictable. AI systems—especially agentic AI that autonomously retrieves data, calls tools, and collaborates across agents—break this model. Microsoft's SDL update acknowledges this reality and provides a framework for adapting observability practices to non-deterministic systems.

The core shift: observability moves from measuring infrastructure health to measuring system intent. Is the AI agent doing what we designed it to do, or is it following instructions embedded in external content? Traditional monitoring can't answer this question. AI observability can.

For CTOs: AI observability is not optional infrastructure—it's a release requirement. If you cannot reconstruct agent behavior from logs and traces, you cannot debug failures, respond to incidents, or verify compliance. Embed AI-native telemetry requirements into your SDL now, before production deployments create blind spots you can't close later.

For CISOs: Traditional security controls—firewalls, access controls, content filters—remain necessary but insufficient for AI systems. Observability provides the visibility layer that lets security teams detect novel attacks, reconstruct incident timelines, and verify that inference-time protections are working as intended. Without AI observability, you're defending against attacks you can't see.

For CFOs: The cost of AI observability infrastructure (OpenTelemetry, storage for logs/traces, behavioral analytics tooling) is a fraction of the cost of an undetected data breach or compliance violation caused by a compromised AI agent. Treat AI observability as risk mitigation, not overhead.

Microsoft's SDL framework isn't theoretical—it's the same framework used internally for Azure, Microsoft 365, and Agent 365. If the organization building the AI platforms your enterprise relies on requires AI observability before production, your organization should too.

The question isn't whether to implement AI observability. It's whether to do it before or after your first AI agent incident.


How is your organization approaching AI observability? Connect with me on LinkedIn, Twitter/X, or via the contact form.

LinkedIn: linkedin.com/in/rberi  |  X: x.com/rajeshberi

© 2026 Rajesh Beri. All rights reserved.

AI Observability Engineering: Why Traditional Monitoring Misses 90% of Agent Risks

Photo by Markus Spiske on Pexels

Your traditional observability stack—built for uptime, latency, and error rates—cannot detect the most dangerous AI agent failures. Microsoft's March 18, 2026, update to its Secure Development Lifecycle (SDL) reveals why: traditional monitoring measures what goes wrong with infrastructure, but AI systems fail when trust boundaries between agents and external content get compromised. And those failures don't show up as errors. They show up as agents behaving exactly as designed… just under attacker control.

Here's the scenario Microsoft's security team uses to explain the gap. An email agent asks a research agent to look up something on the web. The research agent fetches a page containing hidden instructions and passes the poisoned content back to the email agent as trusted input. The email agent, now operating under attacker influence, forwards sensitive documents to unauthorized recipients. Data exfiltration complete.

Traditional health metrics stay green. No failures. No errors. No alerts. The system worked perfectly. Except an attacker just stole your data through indirect prompt injection that your observability stack never detected because it wasn't looking for the right signals.

⚠️ The AI Observability Gap

Traditional observability: "Is the system up?" → Health metrics, latency, errors

AI observability: "Is the system doing what we intended?" → Context assembly, trust boundaries, multi-turn behavior, tool invocation patterns

This isn't theoretical. Microsoft updated its SDL—the same framework used internally for Azure, Microsoft 365, and Agent 365—because traditional monitoring practices don't work for agentic AI systems. And if you're deploying AI agents in production without AI-native observability, you're flying blind.

Why Traditional Observability Fails for AI Systems

Traditional software observability was built on three pillars: logs, metrics, and traces. These work well for deterministic systems where code paths follow predictable flows. API call comes in, business logic executes, response goes out. Measure latency, count errors, trace request paths. Done.

AI systems break this model in three fundamental ways.

AI systems are probabilistic, not deterministic. Traditional software follows if-then logic. AI systems evaluate natural language inputs and return probabilistic results that can differ—subtly or significantly—from execution to execution. You can't define success and failure modes as a finite set of error codes. The same prompt can generate different outputs depending on context assembled at runtime.

Context is assembled dynamically, not schema-defined. Traditional services treat inputs as bounded and schema-defined: JSON payloads with known fields, validated at entry. AI systems assemble context from multiple sources: system instructions, conversation history, retrieved documents, tool outputs, web page scrapes. Each component has a different trust level and provenance, but traditional observability doesn't track this. It just sees "request in, response out."

Failures unfold across multiple turns, not single requests. Traditional observability is optimized for request-level correlation: one request maps to one outcome. AI agent failures can take dozens of turns to manifest. Multi-turn jailbreaks like Crescendo start with seemingly harmless prompts and escalate conversation by conversation until the system produces disallowed output. Each individual turn looks fine. The pattern across 20 turns reveals the attack.

Microsoft's SDL update makes this explicit: logs, metrics, and traces still apply to AI systems, but what gets captured within them must change. Traditional observability measures infrastructure health. AI observability measures whether the system is doing what you intended—or what an attacker intended.

Data center servers with monitoring displays Photo by Slejven Djurakovic on Pexels

Traditional vs. AI Observability: What Changes

The core observability components—logs, metrics, traces—remain, but AI systems require two additional components: evaluation and governance. Here's what changes:

Component Traditional Observability AI Observability
Logs Request ID, timestamp, status code, error messages User prompts, model responses, retrieval provenance, tool invocations, permission context
Metrics Latency, throughput, error rates, CPU/memory Token usage, agent turns, retrieval volume, tool call frequency, evaluation score distributions
Traces Request-level correlation (one request = one trace) Agent lifecycle-level correlation (conversation ID across turns, end-to-end context propagation)
Evaluation N/A (quality = uptime) Response quality, grounding accuracy, tool usage correctness, instruction alignment
Governance N/A (compliance = access controls) Policy enforcement verification, auditability, accountability, behavioral baselining

Logs must capture full context assembly. User prompts and model responses are often the earliest signal of novel attacks before signatures exist. They're essential for identifying multi-turn escalation, verifying whether attacks changed system behavior, adjudicating safety detections, and reconstructing attack paths. Microsoft's guidance is explicit: log which data sources were consulted, which tools were invoked with what arguments, and what permissions were in effect. This detail distinguishes a model error from an exploited trust boundary.

Metrics need behavioral baselines, not just static thresholds. Traditional metrics alert when error rates cross 5% or latency exceeds 500ms. AI metrics need to capture normal patterns of agent activity—tool call frequencies, retrieval volumes, token consumption, evaluation score distributions—and alert on meaningful departures from those baselines. An agent that suddenly doubles its web scraping volume or triples token usage might be under attacker control, not experiencing a performance issue.

Traces must span agent lifecycles, not individual requests. Request-level correlation works when each request is independent. AI agents maintain state across conversations. A stable conversation identifier must propagate across turns, preserving trace context end-to-end, so outcomes can be understood within the full conversational narrative. Without this, debugging a multi-turn jailbreak means guessing which of 30 individual turns went wrong.

Evaluation measures response quality in production. Traditional observability assumes that if the system returns HTTP 200, the response is correct. AI systems can return perfectly formatted responses that are factually wrong, ungrounded in source material, or violate safety policies. Evaluation gives teams measurable signals to understand agent reliability, instruction alignment, and operational risk over time.

Governance verifies policy enforcement using observable evidence. Compliance in traditional systems means access controls and audit logs. Governance for AI systems means the ability to measure, verify, and enforce acceptable system behavior using telemetry and control plane mechanisms. This ensures that policies aren't just configured—they're actually working in production.

Microsoft's 5-Step SDL Framework for AI Observability

Microsoft's updated SDL provides a formal mechanism for operationalizing AI observability. The framework embeds observability as a release requirement, not a post-deployment add-on. Here's the 5-step process:

5-Step AI Observability Framework (Microsoft SDL)

  1. Incorporate AI observability into secure development standards. Observability for GenAI and agentic AI systems should be codified requirements within your development lifecycle—not discretionary practices left to individual teams. This means updating your SDL, security reviews, and architecture sign-off processes to include AI-native telemetry requirements before code review.
  2. Instrument from the start of development. Build AI-native telemetry into your system at design time, not after release. Aligning with industry conventions like OpenTelemetry (OTel) and its GenAI semantic conventions improves consistency and interoperability across frameworks. For Microsoft stacks, this means using Microsoft Foundry agent tracing for runtime diagnostics and the Microsoft Agent 365 Observability SDK for tenant-level governance.
  3. Capture the full context. Log user prompts and model responses, retrieval provenance, tool invocations with arguments, and permission context. This detail helps security teams distinguish model errors from exploited trust boundaries and enables end-to-end forensic reconstruction. What to capture should be governed by clear data contracts that balance forensic needs against privacy, data residency, retention requirements, and compliance obligations.
  4. Establish behavioral baselines and alert on deviation. Capture normal patterns of agent activity—tool call frequencies, retrieval volumes, token consumption, evaluation score distributions—through Azure Monitor, Application Insights, or similar services. Alert on meaningful departures from those baselines rather than relying solely on static error thresholds. An agent that suddenly starts calling external APIs 10x more frequently might be compromised, not just busy.
  5. Manage enterprise AI agents centrally. Observability alone cannot answer every question. Technology leaders need to know how many AI agents are running, whether those agents are secure, and whether compliance and policy enforcement are consistent. Observability coupled with unified governance (Microsoft Foundry Control Plane, Microsoft Agent 365) consolidates inventory, compliance verification, and security into one role-aware interface.

The critical insight in Microsoft's framework: if you cannot reconstruct an agent run or detect trust-boundary violations from logs and traces, the system is not ready for production. AI observability should be a release gate, not a nice-to-have.

What Enterprise Security Teams Gain From AI Observability

Making enterprise AI systems observable transforms opaque model behavior into actionable security signals. This strengthens both proactive risk detection (catching attacks before damage) and reactive incident investigation (reconstructing what happened after the fact).

Proactive risk detection: behavioral baselines catch novel attacks. Traditional security tools rely on signatures—known attack patterns. AI attacks evolve faster than signature databases. Behavioral baselining lets security teams detect anomalies in agent behavior that don't match any known attack but clearly deviate from normal patterns. An agent that starts retrieving documents from unexpected SharePoint sites or calling tools in sequences it's never used before gets flagged for investigation before data leaves the perimeter.

Reactive incident investigation: full context enables forensic reconstruction. When an AI agent does something suspicious, security teams need to reconstruct the entire sequence: what prompted the behavior, which context influenced the decision, which tools were called, what data was accessed, and where outputs were sent. Without AI-native logs and traces, this is impossible. With full context capture, security teams can replay the agent's decision-making process step by step and determine whether the system was compromised, misconfigured, or functioning as designed.

Inference-time protections get measurable validation. Many organizations deploy guardrails at inference time—content filters, prompt injection detectors, policy enforcement layers. Observability complements these protections by enabling fast incident reconstruction, clear impact analysis, and measurable improvement over time. Security teams can evaluate whether controls are working as intended, which attacks they catch, and which ones slip through. This feedback loop drives continuous improvement in AI security posture.

Compliance and governance become verifiable, not aspirational. Traditional compliance means proving you configured the right access controls. AI governance means proving your AI systems behave according to policy in production. Observability provides the evidence: logs show which permissions were in effect, traces show which tools were invoked, evaluation scores show whether outputs met quality standards, and behavioral metrics show whether agents stayed within acceptable bounds. This makes audit conversations concrete instead of theoretical.

Implementing AI Observability: What CTOs and CISOs Should Do Now

The SDL framework provides the roadmap, but execution requires coordination across engineering, security, and operations teams. Here's the implementation sequence:

Start with OpenTelemetry for AI-native instrumentation. OpenTelemetry's GenAI semantic conventions provide a vendor-neutral standard for capturing AI-specific telemetry. This means logs, metrics, and traces collected from your AI systems remain portable across observability platforms (Datadog, Dash0, Azure Monitor, Splunk). For enterprise teams already using OTel for traditional services, extending it to AI systems maintains consistency across your observability stack.

Define data contracts early in the development lifecycle. What gets logged? User prompts and model responses (essential for attack detection but sensitive from a privacy perspective). Retrieval provenance (which documents influenced the output). Tool invocations and arguments (which external systems did the agent touch). Permission context (who authorized the action). Define these contracts before development starts, not during incident response.

Implement behavioral baselining before production deployment. Capture 2-4 weeks of agent behavior in staging environments to establish normal patterns. This gives security teams a reference point for detecting anomalies in production. Without baselines, alerts become noisy—every deviation triggers an alert, and real attacks get buried in false positives.

Integrate with existing security tooling. AI observability shouldn't exist in isolation. Feed AI-native telemetry into your SIEM (Security Information and Event Management) platform so security teams can correlate AI agent behavior with network traffic, authentication events, and data access patterns. An agent exfiltrating data will generate signals across multiple systems—observability platforms, network logs, identity systems. Correlating these signals reveals the full attack path.
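The correlation step can be sketched without any particular SIEM. The example below assumes every event, whatever system produced it, has been normalized to a dict with `agent_id`, `source`, and `ts` (epoch seconds). Those field names, the five-minute window, and the three-source threshold are hypothetical. It flags agents that show activity in several distinct telemetry sources within a short window.

```python
from collections import defaultdict


def correlated_agents(events, window_seconds=300, min_sources=3):
    """Return agent IDs with events from at least `min_sources` distinct
    sources inside any `window_seconds` span."""
    by_agent = defaultdict(list)
    for event in events:
        by_agent[event["agent_id"]].append(event)

    flagged = []
    for agent_id, agent_events in by_agent.items():
        agent_events.sort(key=lambda e: e["ts"])
        start = 0
        for end in range(len(agent_events)):
            # Shrink the window until it spans at most window_seconds.
            while agent_events[end]["ts"] - agent_events[start]["ts"] > window_seconds:
                start += 1
            sources = {e["source"] for e in agent_events[start:end + 1]}
            if len(sources) >= min_sources:
                flagged.append(agent_id)
                break
    return sorted(flagged)


# Hypothetical normalized events from three systems.
sample_events = [
    {"agent_id": "agent-a", "source": "ai_observability", "ts": 1000},
    {"agent_id": "agent-a", "source": "network", "ts": 1060},
    {"agent_id": "agent-a", "source": "identity", "ts": 1120},
    {"agent_id": "agent-b", "source": "ai_observability", "ts": 1000},
    {"agent_id": "agent-b", "source": "network", "ts": 5000},
]
flagged_agents = correlated_agents(sample_events)
```

In practice this logic would usually live in a SIEM correlation rule rather than application code; the point is that the join key (a stable agent or conversation identifier) has to exist in every telemetry source first.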

Test detectability as part of security validation. Before releasing AI systems to production, run penetration tests that specifically target AI vulnerabilities: indirect prompt injection, multi-turn jailbreaks, tool-mediated data exfiltration. Verify that your observability stack detects these attacks and generates actionable alerts. If security testing reveals blind spots, instrument additional telemetry before go-live.
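A detectability test can replay a simulated attack trace and assert that the detection logic fires. The sketch below mirrors the email-agent scenario from the introduction: untrusted web content enters the context, and a later tool call sends data to a recipient outside an allowlist. The trace format, the function `find_trust_boundary_violations`, and the domains are hypothetical stand-ins for whatever your telemetry and detectors actually look like.

```python
def find_trust_boundary_violations(trace, allowed_domains):
    """Return tool calls made after untrusted content entered the context
    whose recipient domain is not in the allowlist."""
    tainted = False
    violations = []
    for step in trace:
        if step["type"] == "retrieval" and not step["trusted"]:
            tainted = True
        elif step["type"] == "tool_call" and tainted:
            domain = step["recipient"].rsplit("@", 1)[-1]
            if domain not in allowed_domains:
                violations.append(step)
    return violations


# Simulated indirect prompt injection: poisoned page, then exfiltration.
attack_trace = [
    {"type": "retrieval", "source": "https://example.test/page", "trusted": False},
    {"type": "tool_call", "tool": "send_email", "recipient": "drop@attacker.test"},
]
# Benign run: same retrieval, but mail stays inside the organization.
benign_trace = [
    {"type": "retrieval", "source": "https://example.test/page", "trusted": False},
    {"type": "tool_call", "tool": "send_email", "recipient": "alice@corp.test"},
]

attack_hits = find_trust_boundary_violations(attack_trace, {"corp.test"})
benign_hits = find_trust_boundary_violations(benign_trace, {"corp.test"})
```

Running both an attack trace and a benign trace matters: a detector that fires on everything passes the first check and still fails as an alerting signal.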

The Bottom Line for Enterprise Leaders

Traditional observability was built for deterministic systems where success and failure modes are predictable. AI systems—especially agentic AI that autonomously retrieves data, calls tools, and collaborates across agents—break this model. Microsoft's SDL update acknowledges this reality and provides a framework for adapting observability practices to non-deterministic systems.

The core shift: observability moves from measuring infrastructure health to measuring system intent. Is the AI agent doing what we designed it to do, or is it following instructions embedded in external content? Traditional monitoring can't answer this question. AI observability can.

For CTOs: AI observability is not optional infrastructure—it's a release requirement. If you cannot reconstruct agent behavior from logs and traces, you cannot debug failures, respond to incidents, or verify compliance. Embed AI-native telemetry requirements into your SDL now, before production deployments create blind spots you can't close later.

For CISOs: Traditional security controls—firewalls, access controls, content filters—remain necessary but insufficient for AI systems. Observability provides the visibility layer that lets security teams detect novel attacks, reconstruct incident timelines, and verify that inference-time protections are working as intended. Without AI observability, you're defending against attacks you can't see.

For CFOs: The cost of AI observability infrastructure (OpenTelemetry, storage for logs/traces, behavioral analytics tooling) is typically a fraction of the cost of an undetected data breach or compliance violation caused by a compromised AI agent. Treat AI observability as risk mitigation, not overhead.

Microsoft's SDL framework isn't theoretical—it's the same framework used internally for Azure, Microsoft 365, and Agent 365. If the organization building the AI platforms your enterprise relies on requires AI observability before production, your organization should too.

The question isn't whether to implement AI observability. It's whether to do it before or after your first AI agent incident.


How is your organization approaching AI observability? Connect with me on LinkedIn, Twitter/X, or via the contact form.

THE DAILY BRIEF

Enterprise AI insights for technology and business leaders, twice weekly.

thedailybrief.com

Subscribe at thedailybrief.com/subscribe for weekly AI insights delivered to your inbox.

LinkedIn: linkedin.com/in/rberi  |  X: x.com/rajeshberi

© 2026 Rajesh Beri. All rights reserved.
