In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Durner, Nils

arXiv:2510.01259v1 (cs)

[Submitted on 25 Sep 2025]

Title: In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Authors: Nils Durner

Abstract: We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run.

Comments:	27 pages, 1 figure
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Cite as:	arXiv:2510.01259 [cs.CL]
	(or arXiv:2510.01259v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2510.01259

Submission history

From: Nils Durner
[v1] Thu, 25 Sep 2025 07:00:12 UTC (66 KB)
