ValbuenaVC (Contributor) commented Jan 26, 2026

Description

Addition of a jailbreak scenario to PyRIT, which applies jailbreak templates to a set of test prompts and sends them to the target. Credit to @fdubut for developing the scenario. Also made a minor change to pyrit.datasets.jailbreak.text_jailbreak.TextJailBreak to add a class method allowing for discovery of all jailbreak template files.
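To illustrate the kind of discovery helper the description mentions, here is a minimal sketch of a class method that lists template files under a directory. It is not the code from this PR: `TemplateCatalog`, `list_template_files`, the `root` argument, and the `.yaml` extension filter are all assumptions made for the example.

```python
from pathlib import Path
from typing import List


class TemplateCatalog:
    """Hypothetical stand-in for a class that owns jailbreak templates."""

    @classmethod
    def list_template_files(cls, root: Path, recursive: bool = True) -> List[str]:
        """Return the sorted file names of YAML templates found under root.

        Assumption: templates are stored as *.yaml files. With
        recursive=False only files directly under root are listed.
        """
        pattern = "**/*.yaml" if recursive else "*.yaml"
        return sorted(p.name for p in root.glob(pattern) if p.is_file())
```

A class method fits here because callers can enumerate templates without first constructing an instance bound to one specific template.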

Tests and Documentation

Adding test_jailbreak.py under the unit tests.

Victor Valbuena added 2 commits January 26, 2026 20:06
fdubut (Contributor) commented Jan 26, 2026

Thanks @ValbuenaVC for picking this up! One improvement I had in mind was to create more strategies by running the different groups of jailbreaks we have in PyRIT. Right now I have only the one at the root of the directory, but we added quite a few more recently, and it would make sense to have one strategy per folder (and ALL to run them all).
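One way the "one strategy per folder, plus ALL" idea could be sketched is to group template files by their top-level folder and add an aggregate group. The function name `group_templates_by_folder`, the `"root"` label for files at the top of the directory, and the `"all"` key are hypothetical choices for this example, not names from PyRIT.

```python
from pathlib import Path
from typing import Dict, List

ALL = "all"    # hypothetical name for the aggregate strategy
ROOT = "root"  # hypothetical label for templates at the top of the directory


def group_templates_by_folder(root: Path) -> Dict[str, List[Path]]:
    """Map each top-level folder name to the YAML templates beneath it.

    Assumption: no template folder is itself named "all" or "root".
    """
    groups: Dict[str, List[Path]] = {}
    for path in sorted(root.rglob("*.yaml")):
        rel = path.relative_to(root)
        # A path with more than one part lives inside a subfolder.
        name = rel.parts[0] if len(rel.parts) > 1 else ROOT
        groups.setdefault(name, []).append(path)
    # The aggregate group is the concatenation of every per-folder group.
    groups[ALL] = [p for paths in list(groups.values()) for p in paths]
    return groups
```

Each key could then back one scenario strategy, with `ALL` selecting every template.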

@ValbuenaVC ValbuenaVC marked this pull request as ready for review January 28, 2026 19:26
@ValbuenaVC ValbuenaVC changed the title from "[DRAFT] FEAT: Jailbreak Scenario" to "FEAT: Jailbreak Scenario" Jan 28, 2026
fdubut (Contributor) left a comment

LGTM!

# Will be resolved in _get_atomic_attacks_async
self._seed_groups: Optional[List[SeedAttackGroup]] = None

def _get_default_objective_scorer(self) -> TrueFalseScorer:

rlundeen2 (Contributor) commented Jan 29, 2026

Not for this PR, but I wonder whether we should make _get_default_objective_scorer a non-abstract method on the base class.

Contributor

@rlundeen2 are you suggesting we should move this function to the base class and let subclasses override it?

Contributor

yea exactly.
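The suggestion in this thread (a concrete default on the base class that subclasses may override) can be sketched as follows. The classes `FakeScorer`, `BaseScenarioSketch`, `PlainScenarioSketch`, and `JailbreakScenarioSketch` are hypothetical stand-ins, not PyRIT's real classes; only the method name `_get_default_objective_scorer` comes from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeScorer:
    """Hypothetical scorer; only carries a name for the demo."""
    name: str


class BaseScenarioSketch:
    def __init__(self, objective_scorer: Optional[FakeScorer] = None) -> None:
        # Fall back to the (overridable) default when no scorer is given.
        if objective_scorer is None:
            objective_scorer = self._get_default_objective_scorer()
        self._objective_scorer = objective_scorer

    def _get_default_objective_scorer(self) -> FakeScorer:
        # Concrete rather than abstract: subclasses inherit this as-is.
        return FakeScorer("base-default")


class PlainScenarioSketch(BaseScenarioSketch):
    pass  # Inherits the base default without writing any code.


class JailbreakScenarioSketch(BaseScenarioSketch):
    def _get_default_objective_scorer(self) -> FakeScorer:
        # Overrides only because it needs a different default.
        return FakeScorer("jailbreak-default")
```

The trade-off is that a shared default removes boilerplate from scenarios that do not care, while an abstract method forces every scenario author to make an explicit choice.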


return list(seed_groups)

def _get_all_jailbreak_templates(self) -> List[str]:

Contributor

I recommend using/extending the TextJailBreak class instead of looking for the yaml directly.

rlundeen2 (Contributor) commented Jan 29, 2026

I also wonder if the set of jailbreaks could be filtered further by the scenario strategy, so it's not necessarily always "all". It could be a random N, a subcategory, or something else.

This is probably important so we can have shorter or more targeted runs.

Contributor

I was toying with the idea when writing the draft of this, and I believe the most reasonable option is random N in this version. Separately, I would like us to review all the jailbreak templates and reorganize/recategorize them (not just for scenario strategy breakdown but general usability) but I feel this is more of a v2 thing.

Contributor Author

Added random jailbreak selection in latest commit
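For readers following the "random N" discussion, a minimal sketch of that selection is below. It does not reproduce the commit referenced above; `pick_templates`, its `n` and `seed` parameters, and the choice to cap `n` at the number of available templates are assumptions for illustration.

```python
import random
from typing import List, Optional, Sequence


def pick_templates(
    templates: Sequence[str],
    n: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Return all templates, or a random subset of at most n of them.

    n=None keeps the original "run everything" behavior. A seed makes the
    subset reproducible, which helps when comparing runs.
    """
    if n is None:
        return list(templates)
    if n < 1:
        raise ValueError("n must be a positive integer")
    rng = random.Random(seed)  # Local RNG, so global random state is untouched.
    # Sample without replacement; cap n so sample() never raises.
    return rng.sample(list(templates), min(n, len(templates)))
```

Sampling without replacement avoids sending the same template twice in one run, which keeps short runs as diverse as possible.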

)

# Create the attack
attack = PromptSendingAttack(

Contributor

(not required) Wonder if we should send multiple times as an option
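The optional "send multiple times" idea could look roughly like the sketch below, which repeats one prompt and records whether each response counts as a success. It is deliberately synchronous and library-free: `send_repeatedly`, the `send` and `is_success` callables, and the `attempts` parameter are hypothetical and do not reflect how PromptSendingAttack is configured.

```python
from typing import Callable, List


def send_repeatedly(
    send: Callable[[str], str],
    prompt: str,
    is_success: Callable[[str], bool],
    attempts: int = 1,
) -> List[bool]:
    """Send the same prompt `attempts` times and score every response.

    Model output is often non-deterministic, so a jailbreak that fails
    once may succeed on a later try; the per-attempt results expose that.
    """
    if attempts < 1:
        raise ValueError("attempts must be a positive integer")
    return [is_success(send(prompt)) for _ in range(attempts)]
```

A caller could treat `any(results)` as "the jailbreak worked at least once" or use `sum(results) / len(results)` as a success rate.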

@rlundeen2 rlundeen2 self-assigned this Jan 29, 2026