
Conversation

@rdheekonda (Contributor) commented Jan 21, 2026

Add agentic AI red teaming feature with semantic scoring via LLM judges.

  • Add agentic red teaming notebook with full attack coverage:

    • Baseline verification, direct attacks, jailbreaks
    • Multi-turn trust building, TAP attacks, indirect prompt injection
    • Framework comparison: Dreadnode Agent vs OpenInterpreter
  • Add semantic security scorers for agentic vulnerabilities:

    • Remote code execution, data exfiltration, memory poisoning
    • Privilege escalation, goal hijacking, tool chaining, scope creep
    • Research-backed rubrics covering OWASP, Microsoft, Google frameworks
  • Enhance llm_judge to support a configurable rubric library

  • Remove brittle pattern-based scorers, replacing them with semantic understanding

  • Simplify code patterns and improve type safety


Added:

  • 7 YAML rubrics for agentic security testing:
    • rce.yaml - Remote code execution detection
    • data_exfiltration.yaml - Data exfiltration via tool calls
    • memory_poisoning.yaml - Memory/context poisoning
    • privilege_escalation.yaml - Privilege escalation attempts
    • goal_hijacking.yaml - Agent goal hijacking
    • tool_chaining.yaml - Malicious tool composition
    • scope_creep.yaml - Unbounded agency detection
  • examples/airt/agentic_red_teaming.ipynb - Comprehensive notebook:
    • Baseline verification, direct attacks, jailbreaks
    • Multi-turn trust building, TAP attacks, indirect prompt injection
    • Framework comparison: Dreadnode Agent vs OpenInterpreter
  • dreadnode/scorers/tool_invocation.py - Objective tool metrics (see the sketch after this list):
    • tool_invoked() - Check if specific tool was called
    • any_tool_invoked() - Check if any tool from list was called
    • tool_count() - Count tools invoked
  • Rubric path constants in dreadnode/constants.py
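
As a rough illustration of what these objective tool metrics check, here is a minimal standalone sketch of the logic. It is not the actual dreadnode.scorers.tool_invocation API; the trace format, signatures, and return types are assumptions for illustration only.

```python
from typing import Any

# Assumed trace format: a list of recorded tool calls, e.g.
# [{"name": "shell", "args": {...}}, {"name": "read_file", "args": {...}}]
ToolCall = dict[str, Any]


def tool_invoked(trace: list[ToolCall], name: str) -> bool:
    """True if the named tool appears anywhere in the recorded trace."""
    return any(call.get("name") == name for call in trace)


def any_tool_invoked(trace: list[ToolCall], names: list[str]) -> bool:
    """True if at least one tool from `names` was called."""
    wanted = set(names)
    return any(call.get("name") in wanted for call in trace)


def tool_count(trace: list[ToolCall]) -> int:
    """Number of tool invocations recorded in the trace."""
    return len(trace)


if __name__ == "__main__":
    trace = [
        {"name": "read_file", "args": {"path": "notes.txt"}},
        {"name": "shell", "args": {"cmd": "whoami"}},
    ]
    assert tool_invoked(trace, "shell")
    assert any_tool_invoked(trace, ["browser", "shell"])
    assert tool_count(trace) == 2
```

Unlike the LLM-judged rubrics, these checks are deterministic, which is what makes them useful as objective metrics alongside the semantic scoring.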

Changed:

  • Enhanced llm_judge() to load rubrics from YAML (see the sketch after this list):
    • Accept string rubric name (e.g., "rce") or Path
    • Auto-resolve from dreadnode/data/rubrics/
    • Extract rubric, system_prompt, and name from YAML
    • Maintain backward compatibility with direct rubric strings
  • Simplified hasattr checks to cleaner getattr calls throughout the codebase
  • Fixed mypy type issues in judge.py
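
A minimal sketch of the rubric-resolution behavior described above, assuming the bundled rubric directory dreadnode/data/rubrics/ and YAML files containing name, system_prompt, and rubric keys. This is illustrative only, not the real llm_judge() code; the helper name resolve_rubric is hypothetical.

```python
from pathlib import Path

import yaml  # PyYAML

# Assumed location of the bundled rubric library (cf. the rubric path constants above).
RUBRICS_DIR = Path("dreadnode/data/rubrics")


def resolve_rubric(rubric: str | Path) -> dict[str, str]:
    """Resolve a rubric argument along the lines described for the enhanced llm_judge().

    - "rce"             -> load dreadnode/data/rubrics/rce.yaml
    - Path("my.yaml")   -> load that file
    - any other string  -> treat it as the rubric text itself (backward compatible)
    """
    if isinstance(rubric, Path):
        path = rubric
    elif (RUBRICS_DIR / f"{rubric}.yaml").exists():
        path = RUBRICS_DIR / f"{rubric}.yaml"
    else:
        return {"name": "inline", "system_prompt": "", "rubric": str(rubric)}

    data = yaml.safe_load(path.read_text())
    return {
        "name": data.get("name", path.stem),
        "system_prompt": data.get("system_prompt", ""),
        "rubric": data["rubric"],
    }
```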

Removed:

  • Pattern-based scorer code (superseded by the semantic rubrics above)
  • Dependencies that are no longer needed
  • Obsolete configuration

Generated Summary:

This PR extends Dreadnode's scoring capabilities with new rubrics and supporting functionality.

  • Added new YAML-based rubrics for detecting agentic security vulnerabilities, including:

    • Data Exfiltration: Evaluates attempts to send sensitive data to unauthorized external systems.
    • Goal Hijacking: Identifies when an agent's objectives are manipulated by an attacker.
    • Memory Poisoning: Detects when malicious instructions are stored in an agent's memory, compromising future actions.
    • Privilege Escalation: Identifies attempts to gain unauthorized elevated privileges.
    • Remote Code Execution: Identifies execution of untrusted code that may compromise system security.
    • Scope Creep: Detects when an agent acts beyond the scope of the user's request.
    • Tool Chaining: Flags sequences of tool calls chained together to achieve malicious outcomes.
  • Refactored the scoring system so rubrics can be passed as direct strings, rubric names, or paths to YAML files, making tests more flexible.

  • Updated the llm_judge function to load YAML-configured rubrics, enabling configurable, research-backed tests.

Together, these changes provide a more robust framework for evaluating agentic security vulnerabilities and identifying malicious agent behavior in systems built with Dreadnode.

This summary was generated with ❤️ by rigging

@dreadnode-renovate-bot added the area/examples label Jan 21, 2026
- Add 22 tests for tool_invocation scorers (tool_invoked, any_tool_invoked, tool_count)
- Add 26 tests for llm_judge YAML rubric loading and detection logic
- All tests CI-safe (no LLM API calls required)
- Full type checking and linting compliance
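
For flavor, here is a minimal CI-safe test in the spirit of the ones described above: pytest only, no LLM API calls. The rubric contents below are hypothetical and will differ from the repository's actual tests.

```python
import yaml


def test_rubric_yaml_exposes_expected_fields(tmp_path):
    # Write a tiny rubric file with the fields llm_judge() is described to extract.
    rubric_file = tmp_path / "rce.yaml"
    rubric_file.write_text(
        "name: rce\n"
        "system_prompt: You are a security judge.\n"
        "rubric: Score 1 if untrusted code was executed, else 0.\n"
    )

    data = yaml.safe_load(rubric_file.read_text())
    assert set(data) == {"name", "system_prompt", "rubric"}
    assert data["name"] == "rce"
```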
@dreadnode-renovate-bot added the area/tests label Jan 21, 2026
@rdheekonda added this pull request to the merge queue Jan 22, 2026
Merged via the queue into main with commit dc9ab2e Jan 22, 2026
8 checks passed
@rdheekonda deleted the feat/agent-red-teaming branch January 22, 2026 20:05