
Conversation

@rdheekonda (Contributor) commented Jan 21, 2026

Add agentic AI red teaming feature with semantic scoring via LLM judges.

  • Add agentic red teaming notebook with full attack coverage:

    • Baseline verification, direct attacks, jailbreaks
    • Multi-turn trust building, TAP attacks, indirect prompt injection
    • Framework comparison: Dreadnode Agent vs OpenInterpreter
  • Add semantic security scorers for agentic vulnerabilities:

    • Remote code execution, data exfiltration, memory poisoning
    • Privilege escalation, goal hijacking, tool chaining, scope creep
    • Research-backed rubrics covering OWASP, Microsoft, Google frameworks
  • Enhance llm_judge to support a configurable rubric library

  • Remove brittle pattern-based scorers, replacing them with semantic understanding

  • Simplify code patterns and improve type safety


Added:

  • 7 YAML rubrics for agentic security testing:
    • rce.yaml - Remote code execution detection
    • data_exfiltration.yaml - Data exfiltration via tool calls
    • memory_poisoning.yaml - Memory/context poisoning
    • privilege_escalation.yaml - Privilege escalation attempts
    • goal_hijacking.yaml - Agent goal hijacking
    • tool_chaining.yaml - Malicious tool composition
    • scope_creep.yaml - Unbounded agency detection
  • examples/airt/agentic_red_teaming.ipynb - Comprehensive notebook:
    • Baseline verification, direct attacks, jailbreaks
    • Multi-turn trust building, TAP attacks, indirect prompt injection
    • Framework comparison: Dreadnode Agent vs OpenInterpreter
  • dreadnode/scorers/tool_invocation.py - Objective tool metrics (see the sketch after this list):
    • tool_invoked() - Check if specific tool was called
    • any_tool_invoked() - Check if any tool from list was called
    • tool_count() - Count tools invoked
  • Rubric path constants in dreadnode/constants.py
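
As a rough illustration of what these objective tool metrics check, here is a minimal standalone sketch of the logic. It is not the actual dreadnode.scorers.tool_invocation API; the trace format, signatures, and return types are assumptions for illustration only.

```python
from typing import Any

# Assumed trace format: a list of recorded tool calls, e.g.
# [{"name": "shell", "args": {...}}, {"name": "read_file", "args": {...}}]
ToolCall = dict[str, Any]


def tool_invoked(trace: list[ToolCall], name: str) -> bool:
    """True if the named tool appears anywhere in the recorded trace."""
    return any(call.get("name") == name for call in trace)


def any_tool_invoked(trace: list[ToolCall], names: list[str]) -> bool:
    """True if at least one tool from `names` was called."""
    wanted = set(names)
    return any(call.get("name") in wanted for call in trace)


def tool_count(trace: list[ToolCall]) -> int:
    """Number of tool invocations recorded in the trace."""
    return len(trace)


if __name__ == "__main__":
    trace = [
        {"name": "read_file", "args": {"path": "notes.txt"}},
        {"name": "shell", "args": {"cmd": "whoami"}},
    ]
    assert tool_invoked(trace, "shell")
    assert any_tool_invoked(trace, ["browser", "shell"])
    assert tool_count(trace) == 2
```

Unlike the LLM-judged rubrics, these checks are deterministic, which is what makes them useful as objective metrics alongside the semantic scoring.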

Changed:

  • Enhanced llm_judge() to load rubrics from YAML (see the sketch after this list):
    • Accept string rubric name (e.g., "rce") or Path
    • Auto-resolve from dreadnode/data/rubrics/
    • Extract rubric, system_prompt, and name from YAML
    • Maintain backward compatibility with direct rubric strings
  • Simplified hasattr checks to cleaner getattr calls throughout the codebase
  • Fixed mypy type issues in judge.py
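
A minimal sketch of the rubric-resolution behavior described above, assuming the bundled rubric directory dreadnode/data/rubrics/ and YAML files containing name, system_prompt, and rubric keys. This is illustrative only, not the real llm_judge() code; the helper name resolve_rubric is hypothetical.

```python
from pathlib import Path

import yaml  # PyYAML

# Assumed location of the bundled rubric library (cf. the rubric path constants above).
RUBRICS_DIR = Path("dreadnode/data/rubrics")


def resolve_rubric(rubric: str | Path) -> dict[str, str]:
    """Resolve a rubric argument along the lines described for the enhanced llm_judge().

    - "rce"             -> load dreadnode/data/rubrics/rce.yaml
    - Path("my.yaml")   -> load that file
    - any other string  -> treat it as the rubric text itself (backward compatible)
    """
    if isinstance(rubric, Path):
        path = rubric
    elif (RUBRICS_DIR / f"{rubric}.yaml").exists():
        path = RUBRICS_DIR / f"{rubric}.yaml"
    else:
        return {"name": "inline", "system_prompt": "", "rubric": str(rubric)}

    data = yaml.safe_load(path.read_text())
    return {
        "name": data.get("name", path.stem),
        "system_prompt": data.get("system_prompt", ""),
        "rubric": data["rubric"],
    }
```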

Removed:

  • Pattern-based scorer code (superseded by the semantic rubrics above)
  • Dependencies that are no longer needed
  • Obsolete configuration

Generated Summary:

This PR extends Dreadnode's scoring capabilities with new rubrics and supporting functionality.

  • Added new YAML-based rubrics for detecting agentic security vulnerabilities, including:

    • Data Exfiltration: Evaluates attempts to send sensitive data to unauthorized external systems.
    • Goal Hijacking: Identifies when an agent's objectives are manipulated by an attacker.
    • Memory Poisoning: Detects when malicious instructions are stored in an agent's memory, compromising future actions.
    • Privilege Escalation: Identifies attempts to gain unauthorized elevated privileges.
    • Remote Code Execution: Identifies execution of untrusted code that may compromise system security.
    • Scope Creep: Detects when an agent acts beyond the scope of the user's request.
    • Tool Chaining: Flags sequences of tool calls chained together to achieve malicious outcomes.
  • Refactored the scoring system so rubrics can be passed as direct strings, rubric names, or paths to YAML files, making tests more flexible.

  • Updated the llm_judge function to load YAML-configured rubrics, enabling configurable, research-backed tests.

Together, these changes provide a more robust framework for evaluating agentic security vulnerabilities and identifying malicious agent behavior in systems built with Dreadnode.

This summary was generated with ❤️ by rigging

@dreadnode-renovate-bot added the area/examples label Jan 21, 2026
- Add 22 tests for tool_invocation scorers (tool_invoked, any_tool_invoked, tool_count)
- Add 26 tests for llm_judge YAML rubric loading and detection logic
- All tests CI-safe (no LLM API calls required)
- Full type checking and linting compliance
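
For flavor, here is a minimal CI-safe test in the spirit of the ones described above: pytest only, no LLM API calls. The rubric contents below are hypothetical and will differ from the repository's actual tests.

```python
import yaml


def test_rubric_yaml_exposes_expected_fields(tmp_path):
    # Write a tiny rubric file with the fields llm_judge() is described to extract.
    rubric_file = tmp_path / "rce.yaml"
    rubric_file.write_text(
        "name: rce\n"
        "system_prompt: You are a security judge.\n"
        "rubric: Score 1 if untrusted code was executed, else 0.\n"
    )

    data = yaml.safe_load(rubric_file.read_text())
    assert set(data) == {"name", "system_prompt", "rubric"}
    assert data["name"] == "rce"
```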
@dreadnode-renovate-bot added the area/tests label Jan 21, 2026
@rdheekonda added this pull request to the merge queue Jan 22, 2026
Merged via the queue into main with commit dc9ab2e Jan 22, 2026
8 checks passed
@rdheekonda deleted the feat/agent-red-teaming branch January 22, 2026 20:05