Research, write-ups, talks, and code from the crew.
Worlds: A Simulation Engine for Agentic Pentesting
Feb 11, 2026
An 8B model went from blindly loading Metasploit modules to achieving Domain Admin on GOAD, trained entirely on synthetic data from our world model system.
Shane Caldwell and Max Harley
186 Jailbreaks: Applying MLOps to AI Red Teaming
Dec 11, 2025
Our thoughts on AI red teaming, the tools, and why using MLOps processes is the way forward. Plus, a demonstration of how we leveraged algorithmic red teaming to assess Llama Maverick.
Raja Sekhar Rao Dheekonda
LLM-Powered AMSI Provider vs. Red Team Agent
Dec 03, 2025
We built an LLM-powered AMSI provider and paired it against a red team agent, generating a unique dataset and a blueprint for detecting malicious code at execution time.
Max Harley
From Compute to Congress: The Cyber Layer Beneath the Genesis Mission
Dec 01, 2025
As the Genesis Mission accelerates AI development across critical scientific domains, robust cybersecurity and adversarial testing must be foundational, not bolted on later.
Daria Bahrami
Dreadnode Response to the 2025 Regulatory Reform for Artificial Intelligence
Oct 30, 2025
Dreadnode’s response to the RFI focuses on optimizing AI-enabled cybersecurity through strategic, machine-readable automations.
Daria Bahrami
LOLMIL: Living Off the Land Models and Inference Libraries
Oct 14, 2025
Can we eliminate the C2 server entirely and create truly autonomous malware? It’s not only possible, but fairly straightforward to implement, as demonstrated in our latest experiment.
Max Harley
From Benchmarks to Breaches: Scaling Offensive Security
Oct 06, 2025
Dreadnode's talk at Offensive AI Con (OAIC) 2025 on scaling offensive security from benchmarks to real-world breaches.
From Compute to Congress: To Address CISA's Authority Gap, Reauthorize CISA 2015 and SLCGP
Sep 30, 2025
Two critical cybersecurity programs, CISA 2015 and SLCGP, expire September 30, 2025. Learn why Congress must act now to preserve voluntary information sharing, fund state/local security, and operationalize cyber defense as AI-powered threats advance.
Daria Bahrami
PentestJudge: Judging Agent Behavior Against Operational Requirements
Aug 06, 2025
Evals are simple, but penetration testing is complicated. Using human-made rubrics, we compare LLMs and humans at judging the performance of a penetration testing agent.
Shane Caldwell
Evaluating Offensive Cyber Agents: Kerberoasting
Aug 04, 2025
In this blog, we break down a Kerberoasting agent eval: its design, how it is implemented in Dreadnode’s Strikes SDK and Platform, and how various LLMs perform when tested against it.
Michael Kouremetis
PentestJudge: Judging Agent Behavior Against Operational Requirements
Aug 04, 2025
An LLM-as-judge system for evaluating the operations of penetration testing agents
Five Takeaways from the AI Action Plan
Jul 31, 2025
The AI community has been buzzing since the AI Action Plan's release last week, and for good reason. Here's what we’re most excited to see implemented.
Daria Bahrami
Evals: The Foundation for Autonomous Offensive Security
Jul 30, 2025
Learn how to build robust evaluations for autonomous red team agents that can perform Windows Active Directory operations. This blog covers action space design, programmatic verification, and measuring model performance using GOAD.
Shane Caldwell
From Compute to Congress: Setting the Global Standard for AI Security
Jun 26, 2025
Daria explores how the TEST AI Act and red teaming standards can establish American leadership in AI security—a winning policy roadmap from Critical Effect DC 2025.
Daria Bahrami
Do LLM Agents Have AI Red Team Capabilities? We Built a Benchmark to Find Out
Jun 18, 2025
We're excited to introduce AIRTBench, an AI red teaming framework that tests LLMs against AI/ML black-box CTF challenges to see how they perform when attacking other AI systems.
Ads Dawson
AI Red Teaming Case Study: Claude 3.7 Sonnet Solves the Turtle Challenge
Jun 18, 2025
See how Claude solved a notoriously difficult AI/ML CTF challenge, going beyond pattern matching to genuine problem-solving under adversarial conditions.
Ads Dawson
AIRTBench: Measuring AI Red Teaming Capabilities in LLMs
Jun 17, 2025
An AI red teaming benchmark for evaluating language models' ability to exploit AI/ML security vulnerabilities
Dreadnode Response to the 2025 National AI R&D Strategic Plan
Jun 03, 2025
Read Dreadnode’s AI policy recommendations for the R&D Strategic Plan, which prioritize strengthening AI security through data science and adversarial testing.
Daria Bahrami
From Compute to Congress: Decoding AI Policy
May 15, 2025
Read “From Compute to Congress: Decoding AI Policy,” a blog series where we break down cyber and AI policy updates through the lens of security engineers and researchers.
Daria Bahrami
Building with AI Rigging Workshop
May 07, 2025
Rigging workshop at Pivot Con 2025 with Martin Wendiggensen.
The Automation Advantage in AI Red Teaming
Apr 29, 2025
A quantitative comparison of manual and automated attack methods in AI red teaming.
Rob Mulla
The Automation Advantage in AI Red Teaming
Apr 28, 2025
A large-scale, quantitative comparison between manual and automated attack approaches against LLMs
Dreadnode’s Policy Recommendations for the U.S. AI Action Plan
Mar 26, 2025
Read Dreadnode’s AI policy recommendations for the U.S. AI Action Plan, which focus on leveraging AI to protect America and attacking AI to find its limits.
Daria Bahrami
Ghosts on the Node
Mar 11, 2024
SOCON 2024 conference talk on AI security threats.
Agent Lens
Agent observability and replay tooling for AI safety & interpretability research.
burpference
Add LLM inference capabilities to BurpSuite for AI-powered security testing.
Charcuterie
Collection of code execution techniques for ML systems.
Counterfit
Command-line AI red teaming tool for assessing the security of ML systems.
Dreadnode Strikes SDK
The official Dreadnode Strikes SDK for building and running AI security challenges.
dyana
Sandbox environment for loading, running, and profiling a range of model files.
Deep Drop
Machine learning-enabled dropper for offensive security research.
Koppeling
Adaptive DLL hijacking and dynamic export forwarding.
Marque
Experimental Python workflows for AI agent development.
Parley
TAP (Tree of Attacks with Pruning) jailbreaking implementation.
nerve
Create LLM agents without writing code.
Proof Pudding
Proofpoint model extraction attack research tool.
Research
General research code and experiments from the Dreadnode team.
Rigging
LLM interaction framework for building AI-powered applications.
sRDI
Convert DLLs to position independent shellcode.