FEAT: Jailbreak Scenario #1329
Thanks @ValbuenaVC for picking this up! One improvement I had in mind was to create more strategies by running the different groups of jailbreaks we have in PyRIT. Right now I have only the one at the root of the directory, but we added quite a few more recently, and it would make sense to have one strategy per folder (and ALL to run them all).
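The "one strategy per folder, plus ALL" idea could look roughly like the sketch below. It is not PyRIT code: `group_templates_by_folder`, the `ROOT` key, and the file and folder names in the usage part are all hypothetical, and it assumes templates are `.yaml` files under a single root directory.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def group_templates_by_folder(root: Path) -> Dict[str, List[str]]:
    """Map each top-level folder under `root` to its YAML templates, plus ROOT and ALL.

    Hypothetical helper for illustration; not part of PyRIT.
    """
    groups: Dict[str, List[str]] = {"ALL": []}
    for path in sorted(root.rglob("*.yaml")):
        relative = path.relative_to(root)
        # Files directly under the root go to "ROOT"; nested files are keyed
        # by the upper-cased name of their top-level folder.
        key = relative.parts[0].upper() if len(relative.parts) > 1 else "ROOT"
        groups.setdefault(key, []).append(relative.as_posix())
        groups["ALL"].append(relative.as_posix())
    return groups


# Usage on a throwaway directory tree (file and folder names are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "multi_parameter").mkdir()
    (root / "top_level.yaml").write_text("value: placeholder\n")
    (root / "multi_parameter" / "nested.yaml").write_text("value: placeholder\n")
    demo_groups = group_templates_by_folder(root)
```

Each key of the returned mapping could then back one strategy, with `ALL` covering every template that was found.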
fdubut left a comment
LGTM!
    # Will be resolved in _get_atomic_attacks_async
    self._seed_groups: Optional[List[SeedAttackGroup]] = None

    def _get_default_objective_scorer(self) -> TrueFalseScorer:
Not for this PR, but wondering if we should just make _get_default_objective_scorer a non-abstract base class
@rlundeen2 are you suggesting we should move this function to the base class and let subclasses override it?
yea exactly.
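A minimal sketch of the suggestion above, assuming it means a concrete default on the base class that subclasses may override. The class names here (`ScenarioBase`, `InheritingScenario`, `OverridingScenario`, `TrueFalseScorerStub`) are stand-ins invented for illustration and do not mirror PyRIT's actual classes.

```python
from abc import ABC, abstractmethod
from typing import List


class TrueFalseScorerStub:
    """Stand-in for a true/false scorer; not PyRIT's actual class."""

    def __init__(self, name: str) -> None:
        self.name = name


class ScenarioBase(ABC):
    """Hypothetical base class: the scorer hook is concrete, the attack hook stays abstract."""

    def _get_default_objective_scorer(self) -> TrueFalseScorerStub:
        # Shared default that every subclass inherits unless it overrides it.
        return TrueFalseScorerStub("base-default")

    @abstractmethod
    def _get_atomic_attacks(self) -> List[str]:
        """Subclasses must still say which attacks they run."""


class InheritingScenario(ScenarioBase):
    """Relies on the base-class scorer."""

    def _get_atomic_attacks(self) -> List[str]:
        return ["attack-a"]


class OverridingScenario(ScenarioBase):
    """Swaps in its own scorer by overriding the hook."""

    def _get_atomic_attacks(self) -> List[str]:
        return ["attack-b"]

    def _get_default_objective_scorer(self) -> TrueFalseScorerStub:
        return TrueFalseScorerStub("scenario-specific")
```

With this shape, a scenario only needs to define the scorer hook when the shared default does not fit.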
        return list(seed_groups)

    def _get_all_jailbreak_templates(self) -> List[str]:
I recommend using/extending the TextJailBreak class instead of looking for the yaml directly.
I also wonder if the set of jailbreaks could be filtered further by the scenario strategy, so it's not necessarily always "all". It could be a random N, a subcategory, or something else.
This is probably important so we can have shorter or more targeted runs.
I was toying with the idea when writing the draft of this, and I believe the most reasonable option is random N in this version. Separately, I would like us to review all the jailbreak templates and reorganize/recategorize them (not just for scenario strategy breakdown but general usability) but I feel this is more of a v2 thing.
Added random jailbreak selection in latest commit
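For readers following the thread, random-N selection can be sketched as below. This is not the code from the commit: `select_jailbreak_templates` and its `n` and `seed` parameters are hypothetical names, and the sketch assumes templates are identified by plain strings.

```python
import random
from typing import List, Optional


def select_jailbreak_templates(
    templates: List[str],
    n: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Return all templates, or a random sample of at most `n` of them.

    Hypothetical helper for illustration; not part of PyRIT.
    """
    if n is None:
        # No limit requested: keep the "run everything" behaviour.
        return list(templates)
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A dedicated Random instance keeps a seeded run reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    # Sampling is without replacement, so no template is picked twice.
    return rng.sample(templates, min(n, len(templates)))
```

Passing a seed is optional; it only matters when a shorter run needs to be repeated with the same subset.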
    )

    # Create the attack
    attack = PromptSendingAttack(
(not required) Wonder if we should send multiple times as an option
Description
Addition of a jailbreak scenario to PyRIT, which applies jailbreak templates to a set of test prompts and sends them to the target. Credit to @fdubut for developing the scenario. Also made a minor change to pyrit.datasets.jailbreak.text_jailbreak.TextJailBreak to add a class method allowing for discovery of all jailbreak template files.
Tests and Documentation
Adding test_jailbreak.py under the unit tests.