From Compute to Congress: March Was All About Agents. And So Are We.

Why evaluating non-deterministic systems requires a fundamentally different approach.

March was a month of agents. The White House released its National Policy Framework for AI, pushing for innovation and American AI dominance, announced on the heels of the Cyber Strategy for America and an Executive Order on Combating Cybercrime. And NIST joined in with an AI Agent Security RFI (NIST-2025-0035) on the security challenges unique to agentic systems. All of this led up to the RSA Conference, where AI-enabled cyber developments dominated the agenda. Dreadnode met the moment with its launch of Platform 2.0, a complete infrastructure platform for building, evaluating, and deploying security agents, and with a response to the NIST RFI laying out what agent security actually requires.

Read our NIST RFI Response [PDF]

Agentic capabilities are actively changing the security landscape faster than the policy infrastructure can adapt. AI agents now solve complex security challenges in minutes, where humans might need hours or days. And the frontier is moving faster than that statistic suggests. Consider Anthropic’s Claude Mythos Preview. Mythos, a frontier model not specifically trained for cybersecurity, autonomously discovered thousands of critical zero-day vulnerabilities across every major operating system and browser, including flaws that survived decades of human review. The boundary between bounded enterprise deployments and open-internet exposure is dissolving, and our security frameworks are not keeping pace.

Dreadnode has been building in this space since our inception. 2.0, the NIST RFI response, and the research we publish all come from a single conviction: agent security can’t be bolted on after the fact, and the evaluation frameworks we use to measure it need to move at the speed of the systems themselves. What follows is our read on where the policy conversation stands, and what it will take to close the gap.

What Exactly is an Agent, and Why Does it Matter for Cyber?

An AI agent is a goal-driven system where a generative AI model plans, reasons, and takes actions by using tools within an environment, operating with partial autonomy. It can interact with external systems such as APIs, code execution environments, and other agents to accomplish tasks.

This is a fundamentally different security proposition than a chatbot. As the Dreadnode team articulated in our NIST RFI response, a core challenge in agentic AI is that the language models process instructions and data within the same channel, without a reliable separation boundary. When these systems are connected to real tools and data, a successful adversarial attack can lead to unauthorized actions. This type of architecture demands comprehensive defenses, which most current deployments lack.

What makes this particularly acute is non-determinism. Unlike traditional software, AI agents are probabilistic: the same prompt can yield different behaviors across runs. As a result, agents can’t be patched in the same way as traditional software vulnerabilities. Mitigations can reduce undesired behaviors but do not eliminate them completely, and emergent behaviors arising from the combination of reasoning autonomy and tool access are difficult to model, sandbox, or fully constrain using conventional security controls.

The Policy Gap: Static Frameworks for a Non-Deterministic World

Historically, cyber has been governed by static, procedural frameworks subject to manual and compliance-based review. In practice, that means the Federal Information Security Modernization Act or the Cybersecurity Maturity Model Certification, which demand mountains of paperwork and expensive review cycles before anything is actually secured. Even FedRAMP, before 20x, took over a year to authorize a single cloud service. These frameworks assume deterministic systems with predictable behaviors and reproducible test cases.

Agents break that assumption at a foundational level. In some ways, they exercise something closer to judgment than traditional software does: weighing context, making trade-offs, adapting to novel situations. Non-determinism requires a kind of governance that operates on a spectrum rather than a checklist, and it’s only logical to evaluate non-deterministic capabilities on their own terms rather than forcing them into binary pass or fail models designed for a different era.

This is easier said than done. Cyber and AI frameworks are largely developed in isolation: the cybersecurity community builds its frameworks, the AI safety community builds its frameworks, and the two rarely talk to each other. One exception is the NIST Cyber AI Profile, which attempts to bridge the gap. But without a way to filter which aspects are relevant to your role, organization, or deployment context, it’s an overwhelming 107 page document that tries to do everything at once.

What’s missing is an evaluation model that can operate at the speed and scale of the systems it’s meant to govern. We already know that speed is possible, and the FedRAMP 20x pilot is proof: machine-readable compliance, automated evidence, continuous monitoring. Now that the technology barrier has been eliminated, the remaining question is whether we apply that same model to agent security, or let the precedent sit unused.

AI-Powered Governance: Evaluating Agents with Agents

If AI-enabled cyber policy were to take shape, it would look like AI-powered governance. Our team has pursued this thesis directly. In PentestJudge, we demonstrated that large language models can independently measure and assess the quality of cyber operations. The system uses rubric-based evaluation of agent trajectories to judge whether a penetration testing agent met its operational requirements. The results suggest LLM-based judges can evaluate complex, multi-step security tasks with reliability competitive with human expert assessors, at a fraction of the cost and at meaningfully greater scale.

This matters for policy because it points toward a future where evaluations aren’t a bottleneck. If AI systems can assess other AI systems rigorously, continuously, and at scale, then the compliance model can function as a living feedback loop. But that requires a clearer delineation of which capabilities and thresholds we’re seeking to measure, and a shared framework for expressing those requirements in machine-readable terms.

This capability evaluation gap is particularly urgent in the current policy environment. The White House’s National AI Framework pushes for regulatory sandboxes and innovation acceleration. The NIST RFI signals that the standards community recognizes the need for agent-specific security guidance. The window to shape these frameworks is open, but it won’t stay open indefinitely. Capabilities are advancing on a timeline measured in months: Cybench CTF scores doubled in six months and the gap between frontier and open-weight model capabilities is approximately eight months, according to METR. Evaluations that can’t keep pace with that velocity risk becoming historical artifacts.

What Comes Next

Dreadnode built 2.0 to close the gap between proof-of-concept agents and production-ready security tools, with hosted evaluations, AI red teaming, optimizations, training, and capability registries designed specifically for the security domain. That infrastructure exists because we believe the same systems driving the threat landscape forward should be the ones we use to defend it.

Building on 2.0 and the agent security work we’ve been digging into, Dreadnode is convening security practitioners, AI developers, and policymakers. The goal is to demonstrate what agents can actually do and identify minimally viable security baselines for digital resilience. What can AI agents be tasked to do, and how do we measure success or failure when setting cybersecurity standards? We’ll publish our policy takeaways from that session in a follow-up piece.

In the meantime, our full response to the NIST AI Agent Security RFI is available here. It covers the threat landscape, security controls, evaluation methodology, deployment guidance, and specific policy recommendations in detail. Whether you’re building agents, deploying agents, or writing the rules for agents, it’s worth the read.