Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts

Downey-Webb, Tiarnaigh; Jogunola, Olamide; Ajao, Oluwaseun

Computer Science > Cryptography and Security

arXiv:2510.15973v1 (cs)

[Submitted on 12 Oct 2025]

Authors: Tiarnaigh Downey-Webb, Olamide Jogunola, Oluwaseun Ajao

Abstract: This paper presents a systematic security assessment of four prominent Large Language Models (LLMs) against diverse adversarial attack vectors. We evaluate Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4 across four distinct attack categories: human-written prompts, AutoDAN, Greedy Coordinate Gradient (GCG), and Tree of Attacks with Pruning (TAP). Our evaluation employs 1,200 stratified prompts from the SALAD-Bench dataset, spanning six harm categories. Results demonstrate significant variations in model robustness, with Llama-2 achieving the highest overall security (3.4% average attack success rate) while Phi-2 exhibits the greatest vulnerability (7.0% average attack success rate). We identify critical transferability patterns: GCG and TAP attacks, though ineffective against their target model (Llama-2), achieve substantially higher success rates when transferred to other models (up to 17% for GPT-4). Statistical analysis using Friedman tests reveals significant differences in vulnerability across harm categories ($p < 0.001$), with malicious use prompts showing the highest attack success rates (10.71% average). Our findings contribute to understanding cross-model security vulnerabilities and provide actionable insights for developing targeted defense mechanisms.
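The two quantities the abstract relies on, the average attack success rate (ASR) and the Friedman test statistic, can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not say how the average is formed or which units serve as blocks in the Friedman test, so the sketch assumes a simple mean of per-attack ASRs and takes blocks and treatments as plain rows and columns. All function names and the toy records are hypothetical, and the statistic omits the tie correction.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Percentage of attempts judged successful (True)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_asr_by_model(records):
    """records: iterable of (model, attack, success) tuples.

    Assumption: a model's average ASR is the unweighted mean of its
    per-attack ASRs. The paper may aggregate differently.
    """
    per_cell = defaultdict(list)
    for model, attack, success in records:
        per_cell[(model, attack)].append(bool(success))
    per_model = defaultdict(list)
    for (model, _attack), outcomes in per_cell.items():
        per_model[model].append(attack_success_rate(outcomes))
    return {m: sum(v) / len(v) for m, v in per_model.items()}


def average_ranks(values):
    """1-based ranks within one block; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square statistic, without tie correction.

    blocks: n rows (blocks), each holding k measurements (treatments).
    Under the null hypothesis the statistic is approximately
    chi-square distributed with k - 1 degrees of freedom.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3.0 * n * (k + 1)


# Hypothetical toy records, NOT results from the paper.
records = [
    ("model_a", "human", True), ("model_a", "human", False),
    ("model_a", "gcg", False), ("model_a", "gcg", False),
    ("model_b", "human", True), ("model_b", "human", True),
    ("model_b", "gcg", True), ("model_b", "gcg", False),
]
summary = mean_asr_by_model(records)
toy_q = friedman_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
```

In practice a library routine such as `scipy.stats.friedmanchisquare` would normally be used, since it also handles ties and returns a p-value; the hand-written version here only makes the ranking step explicit.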

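Comments:	10 pages; 4-page manuscript submitted to the Language Resources and Evaluation Conference (LREC 2026)
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
ACM classes:	I.2.7
Cite as:	arXiv:2510.15973 [cs.CR]
	(or arXiv:2510.15973v1 [cs.CR] for this version)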
	https://doi.org/10.48550/arXiv.2510.15973

Submission history

From: Oluwaseun Ajao
[v1] Sun, 12 Oct 2025 21:48:34 UTC (429 KB)
