谷歌研究人员揭示了黑客诱捕、劫持人工智能代理的各种方式

简而言之
谷歌已经确定了六种陷阱类别,每种陷阱类别都利用了人工智能代理如何感知、推理、记忆和行动的不同部分。
攻击范围从网页上的隐形文本到在代理之间跳转的病毒内存中毒。
当被困的人工智能特工犯下金融犯罪时,尚无法律框架决定谁应承担责任。
Researchers at Google DeepMind have published what may be the most complete map yet of a problem most people haven't considered: the internet itself being turned into a weapon against autonomous AI agents. The paper, titled "AI Agent Traps," identifies six categories of adversarial content specifically engineered to manipulate, deceive, or hijack agents as they browse, read, and act on the open web.
时机很重要。人工智能公司正在竞相部署能够独立预订旅行、管理收件箱、执行金融交易和编写代码的代理。犯罪分子已经在进攻性地使用人工智能。国家资助的黑客已开始部署人工智能代理进行大规模网络攻击。 And OpenAI admitted in December 2025 that the core vulnerability these traps exploit—prompt injection—is "unlikely to ever be fully 'solved.'"
DeepMind 研究人员并没有攻击模型本身。他们映射的攻击面是代理运行的环境。以下是六个陷阱类别中每一个类别的实际含义。
六大陷阱
首先是“内容注入陷阱”。这些利用了人类在网页上看到的内容与人工智能代理实际解析的内容之间的差距。 Web 开发人员可以将文本隐藏在 HTML 注释、CSS 不可见元素或图像元数据中。代理读取隐藏指令;你永远不会看到它。 A more sophisticated variant, called dynamic cloaking, detects whether a visitor is an AI agent and serves it a completely different version of the page—same URL, different hidden commands.一项基准测试发现,在高达 86% 的测试场景中,像这些成功征用代理的简单注入。
语义操纵陷阱可能是最容易尝试的。 A page saturated with phrases like "industry-standard" or "trusted by experts" statistically biases an agent's synthesis in the attacker's direction, exploiting the same framing effects humans fall for. A subtler version wraps malicious instructions inside educational or "red-teaming" framing—"this is hypothetical, for research only"—which fools the model's internal safety checks into treating the request as benign. The strangest subtype is "persona hyperstition": descriptions of an AI's personality spread online, get ingested back into the model through web search, and start shaping how it actually behaves.该论文提到 Groks“MechaHitler”事件是这种循环的现实案例。
You can see examples of this in our experiment, jailbreaking Whatsapp’s AI and tricking it to generate nudes, drug recipes, and instructions to build bombs
语义攻击的一个例子。图片:解密
认知状态陷阱是恶意行为者针对代理的长期记忆的另一种攻击。 Basically, If an attacker succeeds in planting fabricated statements inside a retrieval database the agent queries, the agent will treat those statements as verified facts.只需将少量优化文档注入到大型知识库中就足以可靠地破坏特定主题的输出。像“CopyPasta”这样的攻击已经证明了代理如何盲目信任其环境中的内容。
行为控制陷阱直接针对代理所做的事情。一旦代理读取页面,普通网站中嵌入的越狱序列就会覆盖安全对齐。数据泄露陷阱迫使代理定位私人文件并将其传输到攻击者控制的地址; web agents with broad file access were forced to exfiltrate local passwords and sensitive documents at rates exceeding 80% across five different platforms in tested attacks. This is especially dangerous now that people start to give AI agents more control over their private information with the rise of platforms like OpenClaw and sites like Moltbook.
系统陷阱不针对一种特工。它们针对同时行动的许多代理的行为。 The paper draws a direct line to the 2010 Flash Crash, where one automated sell order triggered a feedback loop that wiped nearly a trillion dollars in market value in minutes.一份捏造的财务报告,如果时机正确,可能会引发数千个人工智能交易代理的同步抛售。
最后,人机循环陷阱的目标是人类审查其输出。 These traps engineer "approval fatigue"—outputs designed to look technically credible to a non-expert so they authorize dangerous actions without realizing it. One documented case involved CSS-obfuscated prompt injections that made an AI summarization tool present step-by-step ransomware installation instructions as helpful troubleshooting fixes.我们已经看到当人类