Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

Gupta, Aayush

Computer Science > Cryptography and Security

arXiv:2508.09288 (cs)

[Submitted on 12 Aug 2025]

Title: Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

Authors: Aayush Gupta

Abstract: Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch -- no fine-tuning required -- we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
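The mechanism the abstract describes (signed per-token provenance labels plus a pre-softmax hard attention mask over a trust lattice) can be illustrated with a small sketch. This is not the paper's reference implementation. The trust levels, the function names, and the use of HMAC-SHA256 as a stand-in for the "cryptographically signed" labels are all assumptions made for illustration, and the lattice is simplified to a total order.

```python
import hashlib
import hmac
import math

# Hypothetical trust levels (higher = more trusted). The abstract only
# states that a source-trust lattice exists; these names are assumptions.
TRUST = {"system": 3, "user": 2, "tool": 1, "web": 0}


def sign_label(key: bytes, position: int, token: str, source: str) -> bytes:
    """Bind a token and its position to its source with an HMAC tag.

    HMAC-SHA256 is an assumed stand-in for the paper's signature scheme.
    """
    msg = f"{position}|{token}|{source}".encode()
    return hmac.new(key, msg, hashlib.sha256).digest()


def verify_label(key: bytes, position: int, token: str, source: str,
                 tag: bytes) -> bool:
    """Constant-time check that the provenance label was not altered."""
    return hmac.compare_digest(sign_label(key, position, token, source), tag)


def trust_mask(levels: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when query token i may attend to key token j.

    A key is visible only if it is causal (j <= i) and at least as trusted
    as the query, so lower-trust tokens cannot influence higher-trust ones.
    """
    n = len(levels)
    return [[j <= i and levels[j] >= levels[i] for j in range(n)]
            for i in range(n)]


def masked_softmax(scores: list[float], allowed: list[bool]) -> list[float]:
    """Softmax with disallowed positions set to -inf before normalizing."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite: every token may attend to itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Example: two system tokens, one web token, then one user token.
levels = [TRUST["system"], TRUST["system"], TRUST["web"], TRUST["user"]]
mask = trust_mask(levels)
# The user token's raw score strongly favors the web token (9.0),
# but the mask removes that position before the softmax.
weights = masked_softmax([0.1, 0.2, 9.0, 0.3], mask[3])
```

In this sketch the user-level query at position 3 gives exactly zero attention weight to the web-level token at position 2, whatever its raw score. This is the per-token property the abstract calls non-interference, shown here for a single attention row only.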

Comments:	2 figures, 3 tables; code and certification harness: https://github.com/ayushgupta4897/Contextual-Integrity-Verification ; Elite-Attack dataset: https://huggingface.co/datasets/zyushg/elite-attack
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
MSC classes:	68T07, 94A60
ACM classes:	D.4.6; K.6.5; E.3; I.2.6; I.2.7
Cite as:	arXiv:2508.09288 [cs.CR]
	(or arXiv:2508.09288v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2508.09288

Submission history

From: Aayush Gupta
[v1] Tue, 12 Aug 2025 18:47:30 UTC (13 KB)
