0


面临威胁的人工智能代理综述(AI Agent):关键安全挑战与未来途径综述

今天带来一篇关于AI Agent的威胁综述。近期在Arxiv上出现的,这里翻译过来学习记录。

原文地址:https://arxiv.org/abs/2406.02630

作者及单位:

ZEHANG DENG∗, Swinburne University of Technology, Australia YONGJIAN GUO∗, Tianjin Univeristy, China CHANGZHOU HAN, Swinburne University of Technology, Australia WANLUN MA†, Swinburne University of Technology, Australia JUNWU XIONG, Ant Group, China SHENG WEN, Swinburne University of Technology, Australia YANG XIANG, Swinburne University of Technology, Australia

正文:

An Artificial Intelligence (AI) agent is a software entity that autonomously performs tasks or makes decisions based on pre-defined objectives and data inputs. AI agents, capable of perceiving user inputs, reasoning and planning tasks, and executing actions, have seen remarkable advancements in algorithm development and task performance. However, the security challenges they pose remain under-explored and unresolved. This survey delves into the emerging security threats faced by AI agents, categorizing them into four critical knowledge gaps: unpredictability of multi-step user inputs, complexity in internal executions, variability of operational environments, and interactions with untrusted external entities. By systematically reviewing these threats, this paper highlights both the progress made and the existing limitations in safeguarding AI agents. The insights provided aim to inspire further research into addressing the security threats associated with AI agents, thereby fostering the development of more robust and secure AI agent applications.

人工智能(AI)代理是一个软件实体,它可以自主执行任务或根据预定义的目标和数据输入做出决策。人工智能代理能够感知用户输入,推理和规划任务以及执行动作,在算法开发和任务性能方面取得了显着进步。然而,它们构成的安全挑战仍然没有得到充分探讨和解决。 该调查深入研究了人工智能代理面临的新出现的安全威胁,将其分为四个关键的知识差距:多步用户输入的不可预测性,内部执行的复杂性,操作环境的可变性以及与不受信任的外部实体的交互。通过系统地回顾这些威胁,本文强调了在保护AI代理方面取得的进展和存在的局限性。提供的见解旨在激发进一步研究解决与AI代理相关的安全威胁,从而促进更强大和安全的AI代理应用程序的开发。

1 Introduction 1引言

AI agents are computational entities that demonstrate intelligent behavior through autonomy, reactivity, proactiveness, and social ability. They interact with their environment and users to achieve specific goals by perceiving inputs, reasoning about tasks, planning actions, and executing tasks using internal and external tools. AI agents, powered by large language models (LLMs) such ∗Both authors contributed equally to this research. †Corresponding author.
AI代理是通过自主性,反应性,主动性和社交能力展示智能行为的计算实体。他们与环境和用户交互,通过感知输入、推理任务、规划行动以及使用内部和外部工具执行任务来实现特定目标。

Despite the significant advancements in AI agents, their increasing sophistication also introduces new security challenges. Ensuring AI agent security is crucial due to their deployment in diverse and critical applications. AI agent security refers to the measures and practices aimed at protecting AI agents from vulnerabilities and threats that could compromise their functionality, integrity, and safety. This includes ensuring the agents can securely handle user inputs, execute tasks, and interact with other entities without being susceptible to malicious attacks or unintended harmful behaviors. These security challenges stem from four knowledge gaps that, if unaddressed, can lead to vulnerabilities [27, 97, 112, 192] and potential misuse [132].
尽管人工智能代理取得了显着进步,但其日益复杂也带来了新的安全挑战。确保人工智能代理的安全性至关重要,因为它们部署在各种关键应用程序中。AI代理安全是指旨在保护AI代理免受可能危及其功能,完整性和安全性的漏洞和威胁的措施和实践。这包括确保代理可以安全地处理用户输入,执行任务,并与其他实体进行交互,而不会受到恶意攻击或意外有害行为的影响。这些安全挑战源于四个知识差距,如果不加以解决,可能导致漏洞[27,97,112,192]和潜在的滥用[132]。

请添加图片描述

As depicted in Figure 1, the four main knowledge gaps in AI agent are 1) unpredictability of multistep user inputs, 2) complexity in internal executions, 3) variability of operational environments, and 4) interactions with untrusted external entities. The following points delineate the knowledge gaps in detail.
如图1所示,人工智能代理中的四个主要知识缺口是:1)多步用户输入的不可预测性,2)内部执行的复杂性,3)操作环境的可变性,以及4)与不受信任的外部实体的交互。以下几点详细说明了知识差距。

  • Gap 1. Unpredictability of multi-step user inputs. Users play a pivotal role in interacting with AI agents, not only providing guidance during the initiation phase of tasks, but also influencing the direction and outcomes throughout task execution with their multi-turn feedback. The diversity of user inputs reflects varying backgrounds and experiences, guiding AI agents in accomplishing a multitude of tasks. However, these multi-step inputs also pose challenges, especially when user inputs are inadequately described, leading to potential security threats. Insufficient specification of user input can affect not only the task outcome, but may also initiate a cascade of unintended reactions, resulting in more severe consequences. Moreover, the presence of malicious users who intentionally direct AI agents to execute unsafe code or actions adds additional threats. 差距1.多步用户输入的不可预测性。用户在与人工智能代理的交互中发挥着关键作用,不仅在任务的启动阶段提供指导,而且还通过他们的多轮反馈影响整个任务执行的方向和结果。用户输入的多样性反映了不同的背景和经验,指导AI代理完成多种任务。然而,这些多步输入也带来了挑战,特别是当用户输入描述不充分时,会导致潜在的安全威胁。对用户输入的不充分规范不仅会影响任务结果,还可能引发一连串意外反应,导致更严重的后果。此外,恶意用户故意引导AI代理执行不安全代码或操作的存在增加了额外的威胁。Therefore, ensuring the clarity and security of user inputs is crucial for the effective and safe operation of AI agents. This necessitates the design of highly flexible AI agent ecosystems capable of understanding and adapting to the variability in user input, while also ensuring robust security measures are in place to prevent malicious activities and misleading user inputs. 因此,确保用户输入的清晰性和安全性对于AI代理的有效和安全操作至关重要。这就需要设计高度灵活的人工智能代理生态系统,能够理解和适应用户输入的变化,同时确保采取强大的安全措施,以防止恶意活动和误导用户的输入。
  • Gap 2. Complexity in internal executions. The internal execution state of an AI agent is a complex chain-loop structure, ranging from the reformatting of prompts to LLM planning tasks and the use of tools. Many of these internal execution states are implicit, making it difficult to observe the detailed internal states. This leads to the threat that many security issues cannot be detected in a timely manner. AI agent security needs to audit the complex internal execution of single AI agents. 差距2.内部执行复杂。AI代理的内部执行状态是一个复杂的链-环结构,从提示的重新格式化到LLM规划任务和工具的使用。这些内部执行状态中有许多是隐式的,因此很难观察到详细的内部状态。这导致许多安全问题无法及时发现的威胁。AI代理安全需要审计单个AI代理的复杂内部执行。
  • Gap 3. Variability of operational environments. In practice, the development, deployment, and execution phases of many agents span across various environments. The variability of these environments can lead to inconsistent behavioral outcomes. For example, an agent tasked with executing code could run the given code on a remote server, potentially leading to dangerous operations. Therefore, securely completing work tasks across multiple environments presents a significant challenge. 差距3.作战环境的多变性。在实践中,许多代理的开发、部署和执行阶段跨越各种环境。这些环境的可变性可能导致不一致的行为结果。例如,负责执行代码的代理可能会在远程服务器上运行给定的代码,这可能会导致危险的操作。因此,在多个环境中安全地完成工作任务是一个重大挑战。
  • Gap 4. Interactions with untrusted external entities. A crucial capability of an AI agent is to teach large models how to use tools and other agents. However, the current interaction process between AI agents and external entities assumes a trusted external entity, leading to a wide range of practical attack surfaces, such as indirect prompt injection attack [49]. It is challenging for AI agents to interact with other untrusted entities. 差距4.与不受信任的外部实体的交互。人工智能代理的一个关键功能是教大型模型如何使用工具和其他代理。然而,目前AI代理与外部实体之间的交互过程假设了一个可信的外部实体,导致了广泛的实际攻击面,例如间接提示注入攻击[49]。AI代理与其他不可信实体进行交互是一项挑战。

While some research efforts have been made to address these gaps, comprehensive reviews and systematic analyses focusing on AI agent security are still lacking. Once these gaps are bridged, AI agents will benefit from improved task outcomes due to clearer and more secure user inputs, enhanced security and robustness against potential attacks, consistent behaviors across various operational environments, and increased trust and reliability from users. These improvements will promote broader adoption and integration of AI agents into critical applications, ensuring they can perform tasks safely and effectively.
虽然已经做出了一些研究努力来解决这些差距,但仍然缺乏针对人工智能代理安全的全面审查和系统分析。一旦这些差距被弥合,人工智能代理将受益于更清晰和更安全的用户输入,增强的安全性和针对潜在攻击的鲁棒性,在各种操作环境中的一致行为,以及用户的信任和可靠性。这些改进将促进人工智能代理更广泛的采用和集成到关键应用程序中,确保它们能够安全有效地执行任务。

Existing surveys on AI agents [87, 105, 160, 186, 211] primarily focus on their architectures and applications, without delving deeply into the security challenges and solutions. Our survey aims to fill this gap by providing a detailed review and analysis of AI agent security, identifying potential solutions and strategies for mitigating these threats. The insights provided are intended to inspire further research into addressing the security threats associated with AI agents, thereby fostering the development of more robust and secure AI agent applications.
现有的关于人工智能代理的调查[87,105,160,186,211]主要集中在它们的架构和应用程序上,而没有深入研究安全挑战和解决方案。我们的调查旨在填补这一空白,提供对人工智能代理安全的详细审查和分析,确定缓解这些威胁的潜在解决方案和策略。所提供的见解旨在激发进一步研究解决与AI代理相关的安全威胁,从而促进更强大和安全的AI代理应用程序的开发。

In this survey, we systematically review and analyze the threats and solutions of AI agent security based on four knowledge gaps, covering both the breadth and depth aspects. We primarily collected papers from top AI conferences, top cybersecurity conferences, and highly cited arXiv papers, spanning from January 2022 to April 2024. AI conferences are included, but not limited to: NeurIPs, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, and IJCAI. Cybersecurity conferences are included but not limited: IEEE S&P, USENIX Security, NDSS, ACM CCS.
在本次调查中,我们基于四个知识缺口,从广度和深度两个方面,系统地回顾和分析了人工智能主体安全的威胁和解决方案。我们主要收集了2022年1月至2024年4月期间的顶级人工智能会议、顶级网络安全会议和高引用arXiv论文。AI会议包括但不限于:NeurIPs,ICML,ICLR,ACL,EMNLP,CVPR,ICCV和IJCAI。网络安全会议包括但不限于:IEEE S&P,USENIX Security,NDSS,ACM CCS。

The paper is organized as follows. Section 2 introduces the overview of AI agents. Section 3 depicts the single-agent security issue associated with Gap 1 and Gap 2. Section 4 analyses multi-agent security associated with Gap 3 and Gap 4. Section 5 offers future directions for the development of this field.
本文的结构如下。第2节介绍了AI代理的概述。第3节描述了与Gap 1Gap 2相关的单代理安全问题。第4节分析了与Gap 3Gap 4相关的多代理安全性。第5节为这一领域的发展提供了未来的方向。

2 Overview Of Ai Agent AI Agent 统一概念框架下的AI Agent概述

Terminologies. To facilitate understanding, we introduce the following terms in this paper.
术语。为了便于理解,我们在本文中介绍了以下术语。

请添加图片描述

Reasoning refers to a large language model designed to analyze and deduce information, helping to draw logical conclusions from given prompts. Planning, on the other hand, denotes a large language model tailored to assist in devising strategies and making decisions by evaluating possible outcomes and optimizing for specific objectives. The combination of LLMs for planning and reasoning is called the brain. External Tool callings are together named as the action. We name the combination of perception, brain, and action as Intra-execution in this survey. On the other hand, except for intra-execution, AI agents can interact with other AI agents, memories, and environments; we call it Interaction. These terminologies also could be explored in detail at [186].
推理是指一种大型语言模型,旨在分析和推断信息,帮助从给定的提示中得出逻辑结论。另一方面,规划表示一个大型语言模型,用于通过评估可能的结果和优化特定目标来帮助设计策略和做出决策。用于计划和推理的LLMs的组合被称为大脑外部工具调用一起命名为操作。我们将感知、大脑和行动的结合称为内部执行。另一方面,除了内部执行,AI代理可以与其他AI代理,记忆和环境交互;我们称之为交互。这些术语也可以在[186]中详细探讨。

In 1986, a study by Mukhopadhyay et al. [116] proposed multiple intelligent node document servers to efficiently retrieve knowledge from multimedia documents through user queries. The following work [10] also discovered the potential of computer assistants by interacting between the user and the computing system, highlighting significant research and application directions in the field of computer science. Subsequently, Wooldridge et al. [183] defined the computer assistant that demonstrates intelligent behavior as an agent. In the developing field of artificial intelligence, the agent is then introduced as a computational entity with properties of autonomy, reactivity, pro-activeness, and social ability [186]. Nowadays, thanks to the powerful capacity of large language models, the AI agent has become a predominant tool to assist users in performing tasks efficiently. As shown in Figure 2, the general workflow of AI agents typically comprises two core components: Intra-execution and Interaction. Intra-execution of the AI agent typically indicates the functionalities running within the single-agent architecture, including perception, brain, and action. Specifically, the perception provides brain with effective inputs, and the action deals with these inputs in subtasks by the LLM reasoning and planning capacities. Then, these subtasks are run sequentially by the action to invoke the tools. ① and ② indicates the iteration processes of the intra-execution. Interaction refers to the ability of an AI agent to engage with other external entities, primarily through external resources. This includes collaboration or competition within the multi-agent architecture, retrieval of memory during task execution, and the deployment of environment and its data use from external tools. Note that in this survey, we define memory as an external resource because the majority of memory-related security risks arise from the retrieval of external resources.
1986年,Mukhopadhyay et al. [116]提出了多个智能节点文档服务器,以通过用户查询从多媒体文档中有效地检索知识。接下来的工作[10]也发现了计算机助理通过用户和计算系统之间的交互的潜力,突出了计算机科学领域的重要研究和应用方向。随后,Wooldridge et al. [183]定义了一个计算机助理,它将智能行为表现为一个代理。在人工智能的发展领域中,智能体被引入作为具有自主性,反应性,主动性和社会能力的计算实体[186]。如今,由于大型语言模型的强大功能,人工智能代理已成为帮助用户高效执行任务的主要工具。 如图2所示,AI代理的一般工作流程通常包括两个核心组件:内部执行和交互。AI代理的内部执行通常表示在单代理架构内运行的功能,包括感知,大脑和动作。具体而言,感知为大脑提供有效的输入,行动通过LLM推理和规划能力在子任务中处理这些输入。LLM然后,这些子任务由操作顺序运行以调用工具。①和②表示内部执行的迭代过程。交互是指人工智能主体与其他外部实体进行交互的能力,主要是通过外部资源。 这包括多代理架构中的协作或竞争,任务执行期间的内存检索,以及环境的部署及其外部工具的数据使用。请注意,在本调查中,我们将内存定义为外部资源,因为大多数与内存相关的安全风险都来自外部资源的检索。

AI agents can be divided into reinforcement-learning-based agents and LLM-based agents from the perspective of their core internal logic. RL-based agents use reinforcement learning to learn and optimize strategies through environment interaction, with the aim of maximizing accumulated rewards. These agents are effective in environments with clear objectives such as instruction following [75, 124] or building world model [108, 140], where they adapt through trial and error.
人工智能主体从其核心内部逻辑的角度可以分为基于重复学习的主体和基于LLM主体。基于RL的代理使用强化学习来学习和优化策略,通过环境交互,以最大化累积的奖励为目的。这些代理在具有明确目标的环境中是有效的,例如遵循指令[75,124]或建立世界模型[108,140],在那里他们通过试错来适应。

In contrast, LLM-based agents rely on large-language models [92, 173, 195]. They excel in natural language processing tasks, leveraging vast textual data to master language complexities for effective communication and information retrieval. Each type of agent has distinct capabilities to achieve specific computational tasks and objectives.
相比之下,基于LLM的代理依赖于大型语言模型[92,173,195]。他们擅长自然语言处理任务,利用大量的文本数据来掌握语言的复杂性,以实现有效的沟通和信息检索。每种类型的代理都有不同的能力来实现特定的计算任务和目标。

2.2 Overview Of Ai Agent On Threats . AI Agent对威胁的研究综述

As of now, there are several surveys on AI agents [87, 105, 160, 186, 211]. For instance, Xi et al. [186] offer a comprehensive and systematic review focused on the applications of LLM-based agents, aiming to examine existing research and future possibilities in this rapidly developing field. The literature [105] summarized the current AI agent architecture. However, they do not adequately assess the security and trustworthiness of AI agents. Li et al. [87] failed to consider both the capability and security of multi-agent scenario. A study [160] provides the potential risks inherent only to scientific LLM agents. Zhang et al. [211] only survey on the memory mechanism of AI agents.
到目前为止,有几项关于AI代理的调查[87,105,160,186,211]。例如,Xi et al. [186]提供了一个全面和系统的审查集中在LLM为基础的代理商,旨在检查现有的研究和未来的可能性,在这个迅速发展的领域。文献[105]总结了当前的AI代理架构。然而,他们没有充分评估人工智能代理的安全性和可信度。Li等人[87]未能同时考虑多代理场景的能力和安全性。一项研究[160]提供了仅科学LLM试剂固有的潜在风险。Zhang等人[211]仅对人工智能主体的记忆机制进行了综述。

请添加图片描述

Our main focus in this work is on the security challenges of AI agents aligned with four knowledge gaps. As depicted in Table 1, we have provided a summary of papers that discuss the security challenges of AI agents. Threat Source column identifies the attack strategies employed at various stages of the general AI agent workflow, categorized into four gaps. Threat Model column identifies potential adversarial attackers or vulnerable entities. Target Effects summarize the potential outcomes of security-relevant issues.
我们在这项工作中的主要重点是AI代理的安全挑战与四个知识差距。如表1所示,我们提供了讨论AI代理安全挑战的论文摘要。威胁来源列确定了一般AI代理工作流程的各个阶段所采用的攻击策略,分为四个差距。威胁模型列识别潜在的敌对攻击者或脆弱实体。目标效应总结了安全相关问题的潜在结果。

We also provide a novel taxonomy of threats to the AI agent (See Figure 3). Specifically, we identify threats based on their source positions, including intra-execution and interaction.
我们还为AI代理提供了一种新的威胁分类(见图3)。具体而言,我们根据其来源位置(包括内部执行交互)识别威胁。

请添加图片描述

3 Intra-Execution Security

3内部执行安全

As mentioned in Gap 1 and 2, the single agent system has unpredictable multi-step user inputs and complex internal executions. In this section, we mainly explore these complicated intra-execution threats and their corresponding countermeasures. As depicted in Figure 2, we discuss the threats of the three main components of the unified conceptual framework on the AI agent.
如差距1和2中所述,单代理系统具有不可预测的多步用户输入和复杂的内部执行。在这一部分中,我们主要探讨这些复杂的内部执行威胁及其相应的对策。如图2所示,我们讨论了统一概念框架的三个主要组件对AI代理的威胁。

3.1 Threats On Perception

3.1感知威胁

As illustrated in Figure 2 and Gap 1, to help the brain module understand system instruction, user input, and external context, the perception module includes multi-modal (i.e., textual, visual, and auditory inputs) and multi-step (i.e., initial user inputs, intermediate sub-task prompts, and human feedback) data processing during the interaction between humans and agents. The typical means of communication between humans and agents is through prompts. The threat associated with prompts is the most prominent issue for AI agents. This is usually named adversarial attacks. An adversarial attack is a deliberate attempt to confuse or trick the brain by inputting misleading or specially crafted prompts to produce incorrect or biased outputs. Through adversarial attacks, malicious users extract system prompts and other information from the contextual window [46]. Liu et al. [94] were the first to investigate adversarial attacks against the embodied AI agent, introducing spatiotemporal perturbations to create 3D adversarial examples that result in agents providing incorrect answers. Mo et al. [110] analyzed twelve hypothetical attack scenarios against AI agents based on the different threat models. The adversarial attack on the perception module includes prompt injection attacks [23, 49, 49, 130, 185, 196], indirect prompt injection attacks [23, 49, 49, 130, 185, 196] and jailbreak [15, 50, 83, 161, 178, 197]. To better explain the threats associated with prompts in this section, we first present the traditional structure of a prompt.
如图2和差距1所示,为了帮助大脑模块理解系统指令、用户输入和外部上下文,感知模块包括多模态(文本、视觉和听觉输入)和多步骤(即,初始用户输入、中间子任务提示和人反馈)在人和代理之间的交互期间的数据处理。人类和代理之间的典型通信方式是通过提示。与提示相关的威胁是AI代理最突出的问题。这通常被称为对抗性攻击。对抗性攻击是一种故意试图通过输入误导性或特制的提示来混淆或欺骗大脑,以产生不正确或有偏见的输出。通过对抗性攻击,恶意用户从上下文窗口中提取系统提示和其他信息[46]。Liu等人[94]是第一个研究针对具体AI代理的对抗性攻击的人,引入时空扰动来创建3D对抗性示例,导致代理提供不正确的答案Mo等人[110]根据不同的威胁模型分析了12种针对AI代理的假设攻击场景。对感知模块的对抗性攻击包括提示注入攻击[23,49,49,130,185,196]、间接提示注入攻击[23,49,49,130,185,196]和越狱[15,50,83,161,178,197]。为了更好地解释本节中与提示相关的威胁,我们首先介绍提示的传统结构。

The agent prompt structure can be composed of instruction, external context, user input. Instructions are set by the agent’s developers to define the specific tasks and goals of the system.
代理提示结构可以由指令、外部上下文、用户输入组成。指令由代理的开发人员设置,以定义系统的特定任务和目标。

The external context comes from the agent’s working memory or external resources, while user input is where a benign user can issue the query to the agent. In this section, the primary threats of jailbreak and prompt injection attacks originate from the instructions and user input, while the threats of indirect injection attacks stem from external contexts.
外部上下文来自代理的工作内存或外部资源,而用户输入是良性用户可以向代理发出查询的地方。在本节中,越狱和提示注入攻击的主要威胁来自指令和用户输入,而间接注入攻击的威胁来自外部上下文。

3.1.1 Prompt Injection Attack.
3.1.1即时注入攻击。

The prompt injection attack is a malicious prompt manipulation technique in which malicious text is inserted into the input prompt to guide a language model to produce deceptive output [130]. Through the use of deceptive input, prompt injection attacks allow attackers to effectively bypass constraints and moderation policies set by developers of AI agents, resulting in users receiving responses containing biases, toxic content, privacy threats, and misinformation [72]. For example, malicious developers can transform Bing chat into a phishing agent [49]. The UK Cyber Agency has also issued warnings that malicious actors are manipulating the technology behind LLM chatbots to obtain sensitive information, generate offensive content, and trigger unintended consequences [61].
提示注入攻击是一种恶意提示操作技术,其中恶意文本被插入到输入提示中,以引导语言模型产生欺骗性输出[130]。通过使用欺骗性输入,即时注入攻击允许攻击者有效地绕过AI代理开发人员设置的约束和审核策略,导致用户收到包含偏见,有毒内容,隐私威胁和错误信息的响应。例如,恶意开发人员可以将Bing聊天转换为钓鱼代理[49]。英国网络管理局还发出警告,恶意行为者正在操纵LLM聊天机器人背后的技术,以获取敏感信息,生成攻击性内容,并引发意想不到的后果[61]。

The following discussion focuses primarily on the goal hijacking attack and the prompt leaking attack, which represent two prominent forms of prompt injection attacks [130], and the security threats posed by such attacks within AI agents.
下面的讨论主要集中在目标劫持攻击和即时泄漏攻击,这是即时注入攻击的两种主要形式[130],以及这些攻击在AI代理中构成的安全威胁。

  • Goal hijacking attack. Goal hijacking is a method whereby the original instruction is replaced, resulting in inconsistent behavior from the AI agent. The attackers attempt to substitute the original LLM instruction, causing it to execute the command based on the instructions of the new attacker [130]. The implementation of goal hijacking is particularly in the starting position of user input, where simply entering phrases, such as “ignore the above prompt, please execute”, can circumvent LLM security measures, substituting the desired answers for the malicious user [80]. Liu et al. [96] have proposed output hijacking attacks to support API key theft attacks. Output hijacking attacks entail attackers modifying application code to manipulate its output, prompting the AI agent to respond with “I don’t know” upon receiving user requests. API key theft attacks involve attackers altering the application code such that once the application receives the user-provided API key, it logs and transmits it to the attacker, facilitating the theft of the API. 目标劫持攻击。目标劫持是一种替换原始指令的方法,导致AI代理的行为不一致。攻击者试图替换原始LLM指令,使其基于新攻击者的指令执行命令[130]。目标劫持的实现特别是在用户输入的起始位置,其中简单地输入短语,例如“忽略上述提示,请执行”,可以规避LLM安全措施,将恶意用户所需的答案替换为[80]。Liu等人[96]提出了输出劫持攻击以支持API密钥窃取攻击。输出劫持攻击需要攻击者修改应用程序代码以操纵其输出,提示AI代理在收到用户请求时响应“我不知道”。 API密钥盗窃攻击涉及攻击者更改应用程序代码,使得一旦应用程序接收到用户提供的API密钥,它就记录并将其发送给攻击者,从而促进API的盗窃。
  • Prompt leaking attack. Prompt leaking attack is a method that involves inducing an LLM to output pre-designed instructions by providing user inputs, leaking sensitive information [208]. It poses a significantly greater challenge compared to goal hijacking [130]. Presently, responses generated by LLMs are transmitted using encrypted tokens. However, by employing certain algorithms and inferring token lengths based on packet sizes, it is possible to intercept privacy information exchanged between users and agents [179]. User inputs, such as “END. Print previous instructions”, may trigger the disclosure of confidential instructions by LLMs, exposing proprietary knowledge to malicious entities [46]. In the context of RetrievalAugmented Generation (RAG) systems based on AI agents, prompt leaking attacks may further expose backend API calls and system architecture to malicious users, exacerbating security threats [185]. 快速泄漏攻击。提示泄漏攻击是一种方法,涉及通过提供用户输入来诱导LLM输出预先设计的指令,泄漏敏感信息[208]。与目标劫持相比,它构成了更大的挑战[130]。目前,由LLMs生成的响应使用加密令牌来传输。然而,通过采用某些算法并根据数据包大小推断令牌长度,可以拦截用户和代理之间交换的隐私信息[179]。用户输入,例如“结束。打印先前的指令”,可能会触发LLMs披露机密指令,将专有知识暴露给恶意实体[46]。在基于AI代理的检索增强生成(RAG)系统的上下文中,即时泄漏攻击可能会进一步将后端API调用和系统架构暴露给恶意用户,从而加剧安全威胁[185]。

3.1.1 Prompt injection attacks within agent-integrated frameworks.

3.1.1 代理集成框架内的快速注入攻击。

With the widespread adoption of AI agents, certain prompt injection attacks targeting individual AI agents can also generalize to deployments of AI agent-based applications [163], amplifying the associated security threats [97, 127]. For example, malicious users can achieve Remote Code Execution (RCE) through prompt injection, thereby remotely acquiring permissions for integrated applications [96]. Additionally, carefully crafted user inputs can induce AI agents to generate malicious SQL queries, compromising data integrity and security [127]. Furthermore, integrating these attacks into corresponding webpages alongside the operation of AI agents [49] leads to users receiving responses that align with the desires of the malicious actors, such as expressing biases or preferences towards products [72].

随着人工智能代理的广泛采用,针对单个人工智能代理的某些即时注入攻击也可以推广到基于人工智能代理的应用程序的部署[163],放大了相关的安全威胁[97,127]。例如,恶意用户可以通过提示注入实现远程代码执行(RCE),从而远程获取集成应用程序的权限[96]。此外,精心制作的用户输入可能会诱导AI代理生成恶意SQL查询,从而损害数据完整性和安全性[127]。此外,将这些攻击与AI代理的操作一起集成到相应的网页中[49]会导致用户收到与恶意行为者的期望相一致的响应,例如表达对产品的偏见或偏好[72]。

In the case of closed-source AI agent integrated commercial applications, certain black-box prompt injection attacks [97] can facilitate the theft of service instruction [193], leveraging the computational capabilities of AI agents for zero-cost imitation services, resulting in millions of dollars in losses for service providers [97].
在闭源AI代理集成商业应用的情况下,某些黑盒提示注入攻击[97]可以促进服务指令的窃取[193],利用AI代理的计算能力进行零成本模仿服务,导致服务提供商损失数百万美元。

AI agents are susceptible to meticulously crafted prompt injection attacks [193], primarily due to conflicts between their security training and user instruction objectives [212]. Additionally, AI agents often prioritize system prompts on par with texts from untrusted users and third parties [168]. Therefore, establishing hierarchical instruction privileges and enhancing training methods for these models through synthetic data generation and context distillation can effectively improve the robustness of AI agents against prompt injection attacks [168]. Furthermore, the security threats posed by prompt injection attacks can be mitigated by various techniques, including inference-only methods for intention analysis [209], API defenses with added detectors [68], and black-box defense techniques involving multi-turn dialogues and context examples [3, 196].
AI代理容易受到精心制作的即时注入攻击[193],主要是由于其安全培训和用户指令目标之间的冲突[212]。此外,人工智能代理通常将系统提示与来自不受信任的用户和第三方的文本进行优先级排序[168]。因此,通过合成数据生成和上下文蒸馏建立分层指令特权并增强这些模型的训练方法可以有效提高AI代理对即时注入攻击的鲁棒性[168]。此外,可以通过各种技术来减轻即时注入攻击带来的安全威胁,包括用于意图分析的仅推理方法[209],添加检测器的API防御[68]以及涉及多轮对话和上下文示例的黑盒防御技术[3,196]。

To address the security threats inherent in agent-integrated frameworks, researchers have proposed relevant potential defensive strategies. Liu et al. [96] introduced LLMSMITH, which performs static analysis by scanning the source code of LLM-integrated frameworks to detect potential Remote Code Execution (RCE) vulnerabilities. Jiang et al. [72] proposed four key attributesintegrity, source identification, attack detectability, and utility preservation-to define secure LLMintegrated applications and introduced the shield defense to prevent manipulation of queries from users or responses from AI agents by internal and external malicious actors.
为了解决代理集成框架中固有的安全威胁,研究人员提出了相关的潜在防御策略。Liu等人[96]引入了LLMSMITH,它通过扫描LLM集成框架的源代码来执行静态分析,以检测潜在的远程代码执行(RCE)漏洞。Jiang等人[72]提出了四个关键属性完整性,源识别,攻击可检测性和实用程序验证-以定义安全的LLM集成应用程序,并引入了屏蔽防御以防止内部和外部恶意行为者操纵来自用户的查询或来自AI代理的响应。

3.1.2 Indirect Prompt Injection Attack.

3.1.2 间接快速注入攻击。

Indirect prompt injection attack [49] is a form of attack where malicious users strategically inject instruction text into information retrieved by AI agents [40], web pages [184], and other data sources. This injected text is often returned to the AI agent as internal prompts, triggering erroneous behavior, and thereby enabling remote influence over other users’ systems. Compared to prompt injection attacks, where malicious users attempt to directly circumvent the security restrictions set by AI agents to mislead their outputs, indirect prompt injection attacks are more complex and can have a wider range of user impacts [57]. When plugins are rapidly built to secure AI agents, indirect prompt injection can also be introduced into the corresponding agent frameworks. When AI agents use external plugins to query data injected with malicious instructions, it may lead to security and privacy issues. For example, web data retrieved by AI agents using web plugins could be misinterpreted as user instructions, resulting in extraction of historical conversations, insertion of phishing links, theft of GitHub code [204], or transmission of sensitive information to attackers [185]. More detailed information can also be found in Section 3.3.2. One of the primary reasons for the successful exploitation of indirect prompt injection on AI agents is the inability of AI agents to differentiate between valid and invalid system instructions from external resources. In other words, the integration of AI agents and external resources further blurs the distinction between data and instructions [49].
间接提示注入攻击[49]是一种攻击形式,恶意用户策略性地将指令文本注入到AI代理[40],网页[184]和其他数据源检索的信息中。这种注入的文本通常会作为内部提示返回给AI代理,触发错误行为,从而对其他用户的系统产生远程影响。与即时注入攻击相比,恶意用户试图直接绕过人工智能代理设置的安全限制以误导其输出,间接即时注入攻击更复杂,可以产生更广泛的用户影响[57]。当快速构建插件以保护AI代理时,也可以将间接提示注入引入相应的代理框架。当AI代理使用外部插件来查询注入恶意指令的数据时,可能会导致安全和隐私问题。 例如,AI代理使用Web插件检索的Web数据可能会被误解为用户指令,导致提取历史对话,插入钓鱼链接,窃取GitHub代码[204]或将敏感信息传输给攻击者[185]。更详细的信息也可以在第3.3.2节中找到。在AI代理上成功利用间接提示注入的主要原因之一是AI代理无法区分来自外部资源的有效和无效系统指令。换句话说,人工智能代理和外部资源的集成进一步模糊了数据和指令之间的区别[49]。

To defend against indirect prompt attacks, developers can impose explicit constraints on the interaction between AI agents and external resources to prevent AI agents from executing external malicious data [185]. For example, developers can augment AI agents with user input references by comparing the original user input and current prompts and incorporating self-reminder functionalities. When user input is first entered, agents are reminded of their original user input references, thus distinguishing between external data and user inputs [14]. To reduce the success rate of indirect prompt injection attacks, several techniques can be employed. These include enhancing AI agents’ ability to recognize external input sources through data marking, encoding, and distinguishing between secure and insecure token blocks [57]. Additionally, the other effective measures can be applied, such as fine-tuning AI agents specifically for indirect prompt injection [196, 204], alignment [121], and employing methods such as prompt engineering and post-training classifier-based security approaches [68].
为了防御间接的即时攻击,开发人员可以对AI代理与外部资源之间的交互施加显式约束,以防止AI代理执行外部恶意数据[185]。例如,开发人员可以通过比较原始用户输入和当前提示并结合自我提醒功能来增强AI代理的用户输入参考。当用户输入首次输入时,代理被提醒其原始用户输入参考,从而区分外部数据和用户输入[14]。为了降低间接提示注入攻击的成功率,可以采用多种技术。这些措施包括通过数据标记,编码和区分安全和不安全的令牌块来增强AI代理识别外部输入源的能力[57]。 此外,可以应用其他有效措施,例如专门针对间接提示注入[196,204],对齐[121]进行微调AI代理,以及采用提示工程和基于训练后分类器的安全方法[68]等方法。

Current research methods primarily focus on straightforward scenarios where user instructions and external data are input into AI agents. However, with the widespread adoption of agentintegrated frameworks, the effectiveness of these methods in complex real-world scenarios warrants further investigation.
目前的研究方法主要集中在用户指令和外部数据输入到AI代理的简单场景。然而,随着agent integrated框架的广泛采用,这些方法在复杂的现实世界中的有效性值得进一步研究。

3.1.3 Jailbreak.

3.1.3越狱。

Jailbreak[26] refers to scenarios where users deliberately attempt to deceive or manipulate AI agents to bypass their built-in security, ethical, or operational guidelines, resulting in the generation of harmful responses. In contrast to prompt injection, which arises from the AI agent’s inability to distinguish between user input and system instructions, jailbreak occurs due to the AI agent’s inherent susceptibility to being misled by user instructions. Jailbreak can be categorized into two main types: manual design jailbreak and automated jailbreak.

越狱是指用户故意试图欺骗或操纵人工智能代理以绕过其内置的安全,道德或操作准则,从而产生有害响应的情况。与由于AI代理无法区分用户输入和系统指令而引起的提示注入相反,越狱由于AI代理固有的易受用户指令误导而发生。越狱可以分为两种主要类型:手动设计越狱和自动越狱。

  • Manual design jailbreak includes one-step jailbreak and multi-step jailbreak methods.手动设计越狱包括一步越狱和多步越狱方法。One-step jailbreak involves directly modifying the prompt itself, offering high efficiency and simplicity compared to methods requiring domain-specific expertise [98]. Such jailbreak typically entails users adopting role-playing personas [182] or invoking a “Do Anything Now (DAN)” mode, wherein AI agents are allowed to unethically respond to user queries, generating politically, racially, and gender-biased or offensive comments. Multi-step jailbreak prompts require meticulously designed scenarios to achieve the jailbreak objective through multiple rounds of interaction. When multi-step jailbreak prompts [83] incorporate elements such as guessing and voting by AI agents, the success rate of jailbreaking to obtain private data can be heightened. To circumvent the security and ethical constraints imposed by developers during the jailbreak process, various obfuscation techniques have been employed. 一步越狱涉及直接修改提示符本身,与需要特定领域专业知识的方法相比,提供了高效率和简单性[98]。这种越狱通常需要用户采用角色扮演角色[182]或调用“Do Anything Now(DAN)”模式,其中AI代理被允许不道德地响应用户查询,生成政治,种族和性别偏见或攻击性评论。多步骤越狱提示需要精心设计的场景,通过多轮交互达到越狱目的。当多步骤越狱提示[83]包含AI代理的猜测和投票等元素时,可以提高越狱获取私人数据的成功率。为了规避开发者在越狱过程中施加的安全和道德约束,已经采用了各种混淆技术。These techniques include integrating benign information into adversarial prompts to conceal malicious intent [25], embedding harmful demonstrations that respond positively to toxic requests within the context [178], and utilizing the Caesar cipher [199]. The common methods can also be applied, including substituting visually similar digits and symbols for letters, replacing sensitive terms with synonyms, and employing token smuggling to stylize sensitive words into substrings [25]. 这些技术包括将良性信息集成到对抗性提示中以隐藏恶意意图[25],在上下文中嵌入对有毒请求做出积极响应的有害演示[178],以及利用凯撒密码[199]。也可以应用常见的方法,包括用视觉上相似的数字和符号替换字母,用同义词替换敏感术语,以及使用令牌走私将敏感词格式化为子字符串[25]。
  • Automated jailbreak is a method of attack that involves automatically generating jailbreak prompt instructions. The Probabilistic Automated Instruction Recognition (PAIR) framework proposed by Chao et al. [15] enables the algorithmic generation of semantic jailbreak prompts solely through black-box access to AI agents. Evil geniuses [161] can utilize this framework to automatically generate jailbreak prompts targeting LLM-based agents. Inspired by the American Fuzzy Lop (AFL) fuzzing framework [42], researchers have designed GPTFuzz, which automatically generates jailbreak templates for red teaming LLMs. GPTFuzz has achieved a jailbreak success rate of 90% on ChatGPT and Llama-2 [197]. Jailbreaker, developed by Deng et al. [29], leverages fine-tuned LLMs to automatically generate jailbreak prompts. This framework has demonstrated the potential for automated jailbreak across various commercial LLM-based chatbots. In addition, researchers have proposed a new jailbreak paradigm targeting multi-agent systems known as infectious jailbreak, modeled after infectious diseases.自动越狱是一种涉及自动生成越狱提示指令的攻击方法。Chao等人提出的概率自动指令识别(PAIR)框架[15]仅通过对AI代理的黑盒访问来实现语义越狱提示的算法生成。邪恶的天才[161]可以利用这个框架来自动生成针对基于LLM的代理的越狱提示。受美国模糊Lop(AFL)模糊框架的启发[42],研究人员设计了GPTFuzz,它可以自动为红色团队LLMs生成越狱模板。GPTFuzz在ChatGPT和Llama-2上的越狱成功率达到了90%[197]。越狱者,由邓等人开发[29],利用微调的LLMs自动生成越狱提示。这个框架已经展示了在各种商业LLM聊天机器人上自动越狱的潜力。 此外,研究人员还提出了一种针对多代理系统的新越狱范式,称为传染性越狱,以传染病为模型。Attackers need only jailbreak one agent to exponentially infect all other agents [50]. 攻击者只需要越狱一个代理就可以指数地感染所有其他代理[50]。The weak robustness of AI agents against jailbreak still persists, especially for AI agents equipped with non-robust LLMs. To mitigate this problem, filtering-based methods offer a viable approach to enhance the robustness of LLMs against jialbreak attacks [145]. Kumar et al. [77] propose a certified defense method against adversarial prompts, which involves analyzing the toxicity of all possible substrings of user input using alternative models. Furthermore, multi-agent debate, where language models self-evaluate through discussion and feedback, can contribute to the improvement of the robustness of AI agents against jailbreak [21]. AI代理对越狱的弱鲁棒性仍然存在,特别是对于配备非鲁棒LLMsAI代理。为了缓解这个问题,基于过滤的方法提供了一种可行的方法来增强LLMs对jialbreak攻击的鲁棒性[145]。Kumar等人[77]提出了一种针对对抗性提示的认证防御方法,其中包括使用替代模型分析用户输入的所有可能子串的毒性。此外,多代理辩论,语言模型通过讨论和反馈进行自我评估,可以有助于提高AI代理对越狱的鲁棒性[21]。

3.2 Threats On Brain 3.2威胁大脑

As described in Figure 2, the brain module undertakes reasoning and planning to make decisions by using LLM. The brain is primarily composed of a large language model, which is the core of an AI agent. To better explain threats in the brain module, we first show the traditional structure of the brain.
如图2所示,大脑模块通过使用LLM进行推理和规划以做出决策。大脑主要由大型语言模型组成,这是AI代理的核心。为了更好地解释大脑模块中的威胁,我们首先展示了大脑的传统结构。

The brain module of AI agents can be composed of reasoning, planning, and decision-making, where they are able to process the prompts from the perception module. However, the brain module of agents based on large language models (LLMs) is not transparent, which diminishes their trustworthiness. The core component, LLMs, is susceptible to backdoor attacks. Their robustness against slight input modifications is inadequate, leading to misalignment and hallucination. Additionally, concerning the reasoning structures of the brain, chain-of-thought (CoT), they are prone to formulating erroneous plans, especially when tasks are complex and require long-term planning, thereby exposing planning threats. In this section, we will mainly consider Gap 2, and discuss backdoor attacks, misalignment, hallucinations, and planning threats.

AI代理的大脑模块可以由推理,规划和决策组成,在那里他们能够处理来自感知模块的提示。然而,基于大型语言模型(LLMs)的智能体的大脑模块是不透明的,这降低了它们的可信度。核心组件LLMs容易受到后门攻击。它们对轻微输入修改的鲁棒性不足,导致不对准和幻觉。此外,关于大脑的推理结构,即思维链(CoT),他们很容易制定错误的计划,特别是当任务复杂且需要长期计划时,从而暴露出计划威胁。在本节中,我们将主要考虑差距2,并讨论后门攻击,错位,幻觉和规划威胁。

3.2.1 Backdoor Attacks.

3.2.1后门攻击

Backdoor attacks are designed to insert a backdoor within the LLM of the brain, enabling it to operate normally with benign inputs but produce malicious outputs when the input conforms to a specific criterion, such as the inclusion of a backdoor trigger. In the natural language domain, backdoor attacks are mainly achieved by poisoning data during training to implant backdoors. This is accomplished primarily by poisoning a portion of training data with triggers, which causes the model to learn incorrect correlations. Previous research [78, 169] has illustrated the severe outcomes of backdoor attacks on LLMs. Given that agents based on LLMs employ these models as their core component, it is plausible to assert that such agents are also significantly vulnerable to these attacks.

后门攻击旨在在大脑的LLM中插入后门使其能够在良性输入的情况下正常运行,但当输入符合特定标准(例如包含后门触发器)时会产生恶意输出。在自然语言领域,后门攻击主要是通过在训练过程中对数据下毒来植入后门。这主要是通过用触发器毒害一部分训练数据来实现的,这会导致模型学习不正确的相关性。以前的研究[78,169]已经说明了对LLMs后门攻击的严重后果。鉴于基于LLMs的代理将这些模型作为其核心组件,可以合理地断言这些代理也很容易受到这些攻击。

In contrast to conventional LLMs that directly produce final outputs, agents accomplish tasks through executing multi-step intermediate processes and optionally interacting with the environment to gather external context prior to output generation. This expanded input space of AI agents offers attackers more diverse attack vectors, such as the ability to manipulate any stage of the agents’ intermediate reasoning processes. Yang et al. [192] categorized two types of backdoor attacks against agents.
与直接产生最终输出的传统LLMs相比,代理通过执行多步骤中间过程并可选地与环境交互以在输出生成之前收集外部上下文来完成任务。AI代理的这种扩展的输入空间为攻击者提供了更多样化的攻击向量,例如操纵代理中间推理过程的任何阶段的能力。Yang等人[192]将针对代理的后门攻击分为两种类型。

First, the distribution of the final output is altered. The backdoor trigger can be hidden in the user query or in intermediate results. In this scenario, the attacker’s goal is to modify the original reasoning trajectory of the agent. For example, when a benign user inquires about product recommendations, or during an agent’s intermediate processing, a critical attacking trigger is activated. Consequently, the response provided by the agent will recommend a product dictated by the attacker.
第一,改变了最终产出的分布。后门触发器可以隐藏在用户查询或中间结果中。在这种情况下,攻击者的目标是修改代理的原始推理轨迹。例如,当良性用户询问产品推荐时,或者在代理的中间处理期间,激活关键攻击触发器。因此,代理提供的响应将推荐攻击者指定的产品。

Secondly, the distribution of the final output remains unchanged. Agents execute tasks by breaking down the overall objective into intermediate steps. This approach allows the backdoor pattern to manifest itself by directing the agent to follow a malicious trajectory specified by the attacker, while still producing a correct final output. This capability enables modifications to the intermediate reasoning and planning processes. For example, a hacker could modify a software system to always use Adobe Photoshop for image editing tasks while deliberately excluding other programs. Dong et al. [34] developed an email assistant agent containing a backdoor. When a benign user commands it to send an email to a friend, it inserts a phishing link into the email content and then reports the task status as finished.
其次,最终产出的分配保持不变。代理通过将总体目标分解为中间步骤来执行任务。这种方法允许后门模式通过引导代理遵循攻击者指定的恶意轨迹来表现自己,同时仍然产生正确的最终输出。此功能允许对中间推理和规划过程进行修改。例如,黑客可以修改软件系统,使其始终使用Adobe Photoshop进行图像编辑任务,同时故意排除其他程序。Dong等人[34]开发了一个包含后门的电子邮件助理代理。当一个良性用户命令它向朋友发送电子邮件时,它会在电子邮件内容中插入一个钓鱼链接,然后报告任务状态为已完成。

Unfortunately, current defenses against backdoor attacks are still limited to the granularity of the model, rather than to the entire agent ecosystem. The complex interactions within the agent make defense more challenging. These model-based backdoor defense measures mainly include eliminating triggers in poison data [33], removing backdoor-related neurons [76], or trying to recover triggers [18]. However, the complexity of agent interactions clearly imposes significant limitations on these defense methods. We urgently require additional defense measures to address agent-based backdoor attacks. 3.2.2 Misalignment. Alignment refers to the ability of AI agents to understand and execute human instructions during widespread deployment, ensuring that the agent’s behavior aligns with human expectations and objectives, providing useful, harmless, unbiased responses. Misalignment in AI agents arises from unexpected discrepancies between the intended function of the developer and the intermediate executed state. This misalignment can lead to ethical and social threats associated with LLMs, such as discrimination, hate speech, social rejection, harmful information, misinformation, and harmful human-computer interaction [8]. The Red Teaming of Unalignment proposed by Rishabh et al. [8] demonstrates that using only 100 samples, they can “jailbreak” ChatGPT with an 88% success rate, exposing hidden harms and biases within the brain module of AI agents. We categorize the potential threat scenarios that influence misalignment in the brains of AI agents into three types: misalignment in training data, misalignment between humans and agents, and misalignment in embodied environments.
不幸的是,目前对后门攻击的防御仍然局限于模型的粒度,而不是整个代理生态系统。智能体内部复杂的相互作用使防御更具挑战性。这些基于模型的后门防御措施主要包括消除毒药数据中的触发器[33],删除后门相关神经元[76]或尝试恢复触发器[18]。然而,代理交互的复杂性显然对这些防御方法施加了重大限制。我们迫切需要额外的防御措施来解决基于代理的后门攻击。3.2.2未对准。对齐是指AI代理在广泛部署期间理解和执行人类指令的能力,确保代理的行为与人类的期望和目标保持一致,提供有用,无害,公正的响应。 AI代理中的不对齐是由开发人员的预期功能和中间执行状态之间的意外差异引起的。这种不一致可能导致与LLMs相关的道德和社会威胁,例如歧视,仇恨言论,社会排斥,有害信息,错误信息和有害的人机交互[8]。Rishabh等人提出的不结盟的红色团队[8]证明,仅使用100个样本,他们就可以以88%的成功率“越狱”ChatGPT,暴露AI代理大脑模块中隐藏的危害和偏见。我们将影响AI代理大脑中的未对准的潜在威胁场景分为三种类型:训练数据中的未对准,人类与代理之间的未对准以及具体环境中的未对准。

  • Training Data Misalignment. AI agent misalignment is associated with the training data. 训练数据不一致。人工智能代理不对齐与训练数据相关联。The parameter data stored in the brain of AI agents is vast (for example, GPT-3 training used the corpus of 45 TB [83].), and some unsafe data can also be mixed in. Influenced by such unsafe data, AI agents can still generate unreal, toxic, biased, or even illegal content [9, 45, 53, 64, 99, 120, 146, 167, 170, 171]. Training data misalignment, unlike data poisoning or backdoor attacks, typically involves the unintentional incorporation of harmful content into training data. AI智能体大脑中存储的参数数据是巨大的(例如,GPT-3训练使用了45 TB的语料库[83]。并且还可能混入一些不安全的数据。受这些不安全数据的影响,AI代理仍然可以生成不真实的,有毒的,有偏见的,甚至非法的内容[9,45,53,64,99,120,146,167,170,171]。与数据中毒或后门攻击不同,训练数据不对齐通常涉及将有害内容无意地并入训练数据。
  • Toxic Training Data. Toxic data refers to rude, impolite, unethical text data, such as hate speech and threatening language [66, 180]. Experimental results indicate that approximately 0.2% of documents in the pre-trained corpus of LLaMA2 have been identified as toxic training data [162]. Due to the existence of toxic data, LLMs, as the brain of an AI agent, may lead to the generation of toxic content [30], affecting the division of labor and decisionmaking of the entire agent, and even posing threats of offending or threatening outside interacted entities. As LLMs scale up, the inclusion of toxic data is inevitable, and researchers are currently working on identifying and filtering toxic training data [86, 172].有毒的训练数据有毒数据是指粗鲁,不礼貌,不道德的文本数据,如仇恨言论和威胁性语言[66,180]。实验结果表明,LLaMA 2的预训练语料库中约有0.2%的文档被识别为有毒训练数据[162]。由于有毒数据的存在,LLMs作为AI智能体的大脑,可能会导致有毒内容的产生[30],影响整个智能体的分工和决策,甚至构成冒犯或威胁外部交互实体的威胁。随着LLMs扩大,包含有毒数据是不可避免的,研究人员目前正在努力识别和过滤有毒训练数据[86,172]。
  • Bias and Unfair Data. Bias may exist in training data [43], as well as cultural and linguistic differences, such as racial, gender, or geographical biases. Due to the associative abilities of LLMs [25], frequent occurrences of pronouns and identity markers, such as gender, race, nationality, and culture in training data, can bias AI agents in processing data [59, 162]. For example, researchers found that GPT-3 often associates professions such as legislators, bankers, or professors with male characteristics, while roles such as nurses, receptionists, and housekeepers are more commonly associated with female traits [11]. LLMs may currently struggle to accurately understand or reflect various cultural and linguistic differences, leading to misunderstandings or conflicts in cross-cultural communication, with the generated text possibly exacerbating such biases and thereby worsening societal inequalities. 偏见和不公平的数据。训练数据中可能存在偏见[43],以及文化和语言差异,如种族,性别或地理偏见。由于LLMs的关联能力[25],代词和身份标记(如性别,种族,国籍和文化)在训练数据中的频繁出现,可能会使AI代理在处理数据时产生偏见[59,162]。例如,研究人员发现,GPT-3通常将立法者,银行家或教授等职业与男性特征联系在一起,而护士,接待员和管家等角色更常见于女性特征。LLMs目前可能难以准确理解或反映各种文化和语言差异,导致跨文化交流中的误解或冲突,生成的文本可能会加剧这种偏见,从而加剧社会不平等。
  • Knowledge Misalignment. Knowledge misalignment in training data refers to the lack of connection between deep knowledge and long-tail knowledge. Due to the limited knowledge of large models [64, 128, 150, 200] and the lack of timely updates, there may be instances of outdated knowledge. Furthermore, LLMs may struggle with deeper thinking when faced with questions that involve specific knowledge [13]. For example, while LLMs may summarize the main content of a paper after reading it, they may fail to capture the complex causal relationships or subtle differences due to the simplified statistical methods used to summarize the content, leading to a mismatch between the summary and the intent of the original text. Long-tail knowledge refers to knowledge that appears at an extremely low frequency. Experiments have shown that the ability of AI agents to answer questions is correlated with the frequency of relevant content in the pre-training data and the size of the model parameters [73]. If a question involves long-tail knowledge, even large AI agents may fail to provide correct answers because they lack sufficient data in the training data. 知识错位。训练数据中的知识错位是指深度知识和长尾知识之间缺乏联系。由于对大型模型[64,128,150,200]的了解有限,并且缺乏及时更新,可能会出现知识过时的情况。此外,LLMs在面对涉及特定知识的问题时可能会难以进行更深入的思考[13]。例如,尽管LLMs可能会在阅读后总结论文的主要内容,但由于用于总结内容的简化统计方法,他们可能无法捕捉复杂的因果关系或细微差异,导致摘要与原文意图之间不匹配。长尾知识是指出现频率极低的知识。 实验表明,AI代理回答问题的能力与预训练数据中相关内容的频率和模型参数的大小相关[73]。如果一个问题涉及长尾知识,即使是大型的AI代理也可能无法提供正确的答案,因为它们在训练数据中缺乏足够的数据。
  • Human-Agent Misalignment. Human-Agent misalignment refers to the phenomenon in which the performance of AI agents is inconsistent with human expectations. Traditional AI alignment methods aim to directly align the expectations of agents with those of users during the training process. This has led to the development of the reinforcement learning from human feedback (RLHF) [22, 134] fine-tuning of AI agents, thereby enhancing the security of AI agents [5, 162]. However, due to the natural range and diversity of human morals, conflicts between the alignment values of LLMs and the actual values of diverse user groups are inevitable [131]. For example, in Principal-Agent Problems [131], where agents represent principals in performing certain tasks, conflicts of interest arise between the dual objectives of the agent and principal due to information asymmetry. These are not covered in RLHF fine-tuning. Moreover, such human-centered approaches may rely on human feedback, which can sometimes be fundamentally flawed or incorrect. In such cases, AI agents are prone to sycophancy [129].人类特工错位人-智能体错位是指人工智能智能体的性能与人类期望不一致的现象。传统的AI对齐方法旨在在训练过程中直接将代理的期望与用户的期望对齐。这导致了人类反馈强化学习(RLHF)的发展[22,134] AI代理的微调,从而提高了AI代理的安全性[5,162]。然而,由于人类道德的自然范围和多样性,LLMs的对齐值与不同用户群体的实际值之间的冲突是不可避免的[131]。例如,在委托代理问题中,代理人代表委托人执行某些任务,由于信息不对称,代理人和委托人的双重目标之间会产生利益冲突。RLHF微调中未涵盖这些。 此外,这种以人为本的方法可能依赖于人类的反馈,这有时可能是根本性的缺陷或不正确的。在这种情况下,AI代理人倾向于奉承[129]。
  • Sycophancy. Sycophancy refers to the tendency of LLMs to produce answers that correspond to the beliefs or misleading prompts provided by users, conveyed through suggestive preferences in human feedback during the training process [136]. The reason for this phenomenon is that LLMs typically adjust based on data instructions and user feedback, often echoing the viewpoints provided by users [147, 175], even if these viewpoints contain misleading information. 谄媚。谄媚是指LLMs倾向于产生与用户提供的信念或误导性提示相对应的答案,这些答案通过培训过程中人类反馈中的暗示性偏好传达[136]。这种现象的原因是LLMs通常会根据数据指令和用户反馈进行调整,经常反映用户提供的观点[147,175],即使这些观点包含误导性信息。This excessive accommodating behavior can also manifest itself in AI agents, increasing the risk of generating false information. This sycophantic behavior is not limited to vague issues such as political positions [129]; even when the agent is aware of the incorrectness of an answer, it may still choose an obviously incorrect answer [175], as the model may prioritize user viewpoints over factual accuracy when internal knowledge contradicts user-leaning knowledge [64]. 这种过度的通融行为也可能在人工智能代理中表现出来,从而增加生成虚假信息的风险。这种奉承行为并不局限于模糊的问题,如政治立场[129];即使智能体意识到答案的不正确性,它仍然可能选择一个明显不正确的答案[175],因为当内部知识与用户学习知识相矛盾时,模型可能会优先考虑用户观点而不是事实准确性[64]。
  • Misalignment in Embodied Environments. Misalignment in Embodied Environments [13] refers to the inability of AI agents to understand the underlying rules and generate actions with depth, despite being able to generate text. This is attributed to the transformer architecture [165] of AI agents, which can generate action sequences, but lacks the ability to directly address problems in the environment. AI agents lack the ability to recognize causal structures in the environment and interact with them to collect data and update their knowledge. In embodied environments, misalignment of AI agents may result in the generation of invalid actions. For example, in a simulated kitchen environment like Overcooked, when asked to make a tomato salad, an AI agent may continuously add cucumbers and peppers even though no such ingredients were provided in the environment [159]. Furthermore, when there are specific constraints in the environment, AI agents may fail to understand the dynamic changes in the environment and continue with previous actions, leading to potential safety hazards. For example, when a user requests to open the pedestrian green light at an intersection, the agent may immediately open the pedestrian green light as requested without considering that the traffic signal lights in the other lane for vehicles are also green [141]. 在非线性环境中的不对准。智能环境中的错位[13]是指AI代理无法理解底层规则并生成具有深度的动作,尽管它们能够生成文本。这归因于AI代理的Transformer架构[165],它可以生成动作序列,但缺乏直接解决环境中问题的能力。人工智能代理缺乏识别环境中的因果结构并与它们交互以收集数据和更新知识的能力。在具体化的环境中,AI代理的不对准可能导致生成无效的动作。例如,在一个模拟的厨房环境中,如Overcooked,当被要求制作番茄沙拉时,AI代理可能会不断添加黄瓜和辣椒,即使环境中没有提供这些成分。 此外,当环境中存在特定约束时,AI智能体可能无法理解环境中的动态变化并继续之前的动作,从而导致潜在的安全隐患。例如,当用户请求在十字路口打开行人绿色灯时,代理可以根据请求立即打开行人绿色灯,而不考虑其他车道上的车辆交通信号灯也是绿色[141]。
  • This can result in traffic accidents and pose a safety threat to pedestrians. More detailed content is shown in Section 4.1. 这可能导致交通事故,并对行人构成安全威胁。更详细的内容见第4.1节。Currently, the alignment of AI agents is achieved primarily through supervised methods such as fine-tuning of RLHF [121]. SafeguardGPT proposed by Baihan et al. [91] employs multiple AI agents to simulate psychotherapy, in order to correct the potentially harmful behaviors exhibited by LLM-based AI chatbots. Given that RL can receive feedback through reward functions in the environment, scholars have proposed combining RL with prior knowledge of LLMs to explore and improve the capabilities of AI agents [62, 135, 190, 206]. Thomas Carta et al. [36] utilized LLMs as decision centers for agents and collected external task-conditioned rewards from the environment through functionally grounding in online RL interactive environments to achieve alignment. Tan et al. [159] introduced the TWOSOME online reinforcement learning framework, where LLMs do not directly generate actions but instead provide the log-likelihood scores for each token. These scores are then used to calculate the joint probabilities of each action, and the decision is made by selecting the action with the highest probability, thereby addressing the issue of generating invalid actions. 3.2.3 Hallucination. Hallucination is a pervasive challenge in the brain of AI agents, characterized by the generation of statements that deviate from the provided source content, lack meaning, or appear plausible, but are actually incorrect [70, 155, 210]. The occurrence of hallucinations in the brain of AI agents can generally be attributed to knowledge gaps, which arise from data compression [31] during training and data inconsistency [143, 158]. Additionally, when AI agents generate long conversations, they are prone to generating hallucinations due to the complexity of inference and the large span of context [181]. As the model scales up, hallucinations also become more severe [55, 79]. 目前,AI代理的对齐主要通过监督方法实现,例如RLHF的微调[121]。Baihan等人提出的保障GPT[91]采用多个AI代理来模拟心理治疗,以纠正基于LLM的AI聊天机器人所表现出的潜在有害行为。鉴于RL可以通过环境中的奖励函数接收反馈,学者们提出将RL与LLMs的先验知识相结合,以探索和提高AI代理的能力[62,135,190,206]。托马斯·卡塔等人[36]利用LLMs作为代理的决策中心,并通过在线RL交互环境中的功能接地从环境中收集外部任务条件奖励,以实现对齐。Tan等人。[159]介绍了TWOSOME在线强化学习框架,其中LLMs不直接生成动作,而是为每个令牌提供对数似然分数。 然后,这些分数用于计算每个动作的联合概率,并通过选择具有最高概率的动作来做出决策,从而解决生成无效动作的问题。3.2.3幻觉。幻觉是人工智能主体大脑中普遍存在的挑战,其特征是产生偏离所提供的源内容,缺乏意义或看似合理但实际上不正确的陈述[70,155,210]。人工智能代理大脑中幻觉的发生通常可以归因于知识差距,这是由于训练期间的数据压缩[31]和数据不一致[143,158]引起的。此外,当AI代理生成长对话时,由于推理的复杂性和上下文的大跨度,它们很容易产生幻觉[181]。随着模型的扩大,幻觉也变得更加严重[55,79]。The existence of hallucinations in AI agents poses various security threats. In the medical field, if hallucinations exist in the summaries generated from patient information sheets, it may pose serious threats to patients, leading to medication misuse or diagnostic errors [70]. In a simulated world, a significant increase in the number of agents can enhance the credibility and authenticity of the simulation. However, as the number of agents increases, communication and message dissemination issues become quite complex, leading to distortion of information, misunderstanding, and hallucination phenomena, thereby reducing the efficiency of the system [125]. In the game development domain, AI agents can be used to control the behavior of game NPCs [154], thereby creating a more immersive gaming experience. However, when interacting with players, hallucinatory behaviors generated by AI agent NPCs [16], such as nonexistent tasks or incorrect directives, can also diminish the player experience. In daily life, when user instructions are incomplete, hallucinations generated by AI agents due to “guessing” can sometimes pose financial security threats. For example, when a user requests an AI agent to share confidential engineering notes with a colleague for collaborative editing but forgets to specify the colleague’s email address, the agent may forge an email address based on the colleague’s name and grant assumed access to share the confidential notes [141]. 人工智能代理中幻觉的存在构成了各种安全威胁。在医学领域,如果患者信息表生成的摘要中存在幻觉,则可能对患者构成严重威胁,导致药物滥用或诊断错误[70]。在模拟世界中,代理数量的显著增加可以增强模拟的可信度和真实性。然而,随着代理数量的增加,通信和消息传播问题变得相当复杂,导致信息失真,误解和幻觉现象,从而降低了系统的效率[125]。在游戏开发领域,AI代理可用于控制游戏NPC的行为[154],从而创造更身临其境的游戏体验。 然而,当与玩家互动时,AI代理NPC产生的幻觉行为,例如不存在的任务或错误的指令,也会降低玩家的体验。在日常生活中,当用户指令不完整时,AI智能体因“猜测”而产生的幻觉有时会带来金融安全威胁。例如,当用户请求AI代理与同事共享机密工程笔记以进行协作编辑,但忘记指定同事的电子邮件地址时,代理可以基于同事的姓名伪造电子邮件地址,并授予共享机密笔记的假定访问权限[141]。Additionally, in response to user inquiries, AI agents may provide incorrect information on dates, statistics, or publicly available information online [85, 109, 115]. These undermine the reliability of AI agents, making people unable to fully trust them. 此外,在回应用户查询时,AI代理可能会提供有关日期、统计数据或在线公开信息的错误信息[85,109,115]。这些破坏了人工智能代理的可靠性,使人们无法完全信任它们。To reduce hallucinations in AI agents, researchers have proposed various strategies, including alignment (see §3.2.2), multi-agent collaboration, RAG, internal constraints, and post-correction of hallucinations. 为了减少AI智能体中的幻觉,研究人员提出了各种策略,包括对齐(见第3.2.2节)、多智能体协作、RAG、内部约束和幻觉的后纠正。
  • Multi-agent collaboration. Hallucinations caused by reasoning errors or fabrication of facts during the inference process of agents are generally attributable to the current single agent. These errors or fabricated facts are often random, and the hallucinations differ among different agents. Therefore, scholars have proposed using multiple agents to collaborate with each other during the development phase to reduce the generation of hallucinations [16, 35]. 多Agent协作。代理人在推理过程中推理错误或捏造事实而导致的幻觉一般可归因于当前的单个代理人。这些错误或捏造的事实往往是随机的,不同的代理人之间的幻觉也不同。因此,学者们提出在开发阶段使用多个代理相互合作,以减少幻觉的产生[16,35]。In the context of game development, Dake et al. [16] equipped a review agent during the game development planning phase, task formulation phase, code generation, and execution phases, allowing agents with different roles to collaborate, thereby reducing hallucinations in the game development process. Yilun Du et al. [35] proposed using multiple AI agents to provide their own response answers in multiple rounds and debate with other agents about their individual responses and reasoning processes to reach a consensus, thus reducing the likelihood of hallucinations. However, this verification method using multiple agents often requires multiple requests to be sent, increasing API call costs [62]. More details are shown in §4.2.1. 在游戏开发的背景下,Dake et al. [16]在游戏开发计划阶段、任务制定阶段、代码生成和执行阶段配备了审查代理,允许具有不同角色的代理进行协作,从而减少游戏开发过程中的幻觉。Yilun Du et al. [35]提出使用多个AI代理在多轮中提供自己的响应答案,并与其他代理就其个人响应和推理过程进行辩论,以达成共识,从而减少幻觉的可能性。然而,这种使用多个代理的验证方法通常需要发送多个请求,从而增加了API调用成本[62]。更多详情见§4.2.1。
  • Retrieval-Augmented Generation (RAG). To address the problem of hallucinations in AI agents in long-context settings, RAG [81] can be helpful. RAG can enhance the accuracy of answering open-domain questions, and thus some researchers [150] have utilized RAG combined with Poly-encoder Transformers [67] and Fusion-in-Decoder [69] to score documents for retrieval, using a complex multi-turn dialogue mechanism to query context, generate responses with session coherence, and reduce the generation of hallucinatory content. Google’s proposed Search-Augmented Factuality Evaluator (SAFE) [177] decomposes long responses into independent facts, then for each fact, proposes fact-check queries sent to the Google search API and infers whether the fact is supported by search results, significantly improving the understanding of AI agent’s long-form capabilities through reliable methods of dataset acquisition, model evaluation, and aggregate metrics, mitigating the hallucination problem in AI agents. 检索增强生成(RAG)。为了解决AI代理在长上下文环境中的幻觉问题,RAG [81]可能会有所帮助。RAG可以提高回答开放域问题的准确性,因此一些研究人员[150]利用RAG结合Poly-encoder Transformers [67]和Fusion-in-Decoder [69]来对文档进行评分以进行检索,使用复杂的多轮对话机制来查询上下文,生成具有会话连贯性的响应,并减少幻觉内容的生成。 谷歌提出的搜索增强事实评估器(SAFE)[177]将长响应分解为独立的事实,然后针对每个事实,提出发送到谷歌搜索API的事实检查查询,并推断搜索结果是否支持该事实,通过可靠的数据集获取,模型评估和聚合度量方法,显着提高了对人工智能主体长形式能力的理解,减轻人工智能代理人的幻觉问题。
  • Internal constraints: Hallucinations can be alleviated by imposing internal state constraints. Studies have shown that by allowing users to specify which strings are acceptable in specific states, certain types of hallucination threats can be eliminated [24]. Considering that hallucinations and redundancy are more likely to occur in AI agent-generated long code scripts, some researchers have proposed a decoupling approach to decompose task-related code into smaller code snippets and include multiple example snippets as prompts to simplify the inference process of AI agents, thereby alleviating hallucinations and redundancy [16]. 内部约束:幻觉可以通过施加内部状态约束来缓解。研究表明,通过允许用户指定哪些字符串在特定状态下是可接受的,可以消除某些类型的幻觉威胁[24]。考虑到幻觉和冗余更有可能发生在AI代理生成的长代码脚本中,一些研究人员提出了一种解耦方法,将任务相关的代码分解为更小的代码片段,并包含多个示例片段作为提示,以简化AI代理的推理过程,从而减轻幻觉和冗余[16]。
  • Post-correction of hallucinations: Dziri et al. [37] adopted a generate-and-correct strategy, using a knowledge graph (KG) to correct responses and utilizing an independent fact critic to identify possible sources of hallucinations. Zhou et al.[214] proposed LURE, which can quickly and accurately identify the hallucinatory parts in descriptions using three key indicators (CoScore, UnScore, PointScore), and then use a corrector to rectify them. 幻觉矫正后:Dziri等人。[37]采用了生成和纠正策略,使用知识图(KG)来纠正反应,并利用独立的事实评论家来识别幻觉的可能来源。Zhou等人[214]提出了LURE算法,利用CoScore、UnScore、PointScore三个关键指标快速准确地识别出描述中的幻觉部分,并利用校正器进行校正。However, various methods for correcting hallucinations currently have certain shortcomings due to the enormous size of AI agent training corpora and the randomness of outputs, presenting significant challenges for both the generation and prevention of hallucinations. 然而,由于AI智能体训练语料库的巨大规模和输出的随机性,目前用于纠正幻觉的各种方法都存在一定的缺点,这对幻觉的生成和预防都提出了重大挑战。

3.2.4 Planning Threats.

3.2.4计划威胁。

The concept of planning threats suggests that AI agents are susceptible to generating flawed plans, particularly in complex and long-term planning scenarios. Flawed plans are characterized by actions that contravene constraints originating from user inputs because these inputs define the requirements and limitations that the intermediate plan must adhere to.
规划威胁的概念表明,人工智能代理很容易生成有缺陷的计划,特别是在复杂和长期的规划场景中。有缺陷的计划的特征是违反源自用户输入的约束的行为,因为这些输入定义了中间计划必须遵守的要求和限制。

Unlike adversarial attacks, which are initiated by malicious attackers, planning threats arise solely from the inherent robustness issues of LLMs. A recent work [71] argues that an agent’s chain of thought (COT) may function as an “error amplifier”, whereby a minor initial mistake can be continuously magnified and propagated through each subsequent action, ultimately leading to catastrophic failures.
与恶意攻击者发起的对抗性攻击不同,规划威胁完全来自LLMs固有的鲁棒性问题。最近的一项工作[71]认为,代理人的思维链(COT)可能会起到“错误放大器”的作用,因此最初的一个小错误可以通过随后的每个动作不断放大和传播,最终导致灾难性的失败。

Various strategies have been implemented to regulate the text generation of LLMs, including the application of hard constraints [12], soft constraints [101], or a combination of both [20]. However, the emphasis on controlling AI agents extends beyond the mere generation of text to the validity of plans and the use of tools. Recent research has employed LLMs as parsers to derive a sequence of tools from the texts generated in response to specifically crafted prompts. Despite these efforts, achieving a high rate of valid plans remains a challenging goal.
已经实施了各种策略来规范LLMs的文本生成,包括应用硬约束[12]、软约束[101]或两者的组合[20]。然而,对控制人工智能主体的强调不仅仅是文本的生成,还包括计划的有效性和工具的使用。最近的研究采用LLMs作为解析器,从响应于专门制作的提示而生成的文本中导出一系列工具。尽管作出了这些努力,实现高有效计划率仍然是一个具有挑战性的目标。

To address this issue, current strategies are divided into two approaches. The first approach involves establishing policy-based constitutional guidelines [63], while the second involves human users constructing a context-free grammar (CFG) as the formal language to represent constraints for the agent [88]. The former sets policy-based standard limitations on the generation of plans during the early, middle and late stages of planning. The latter method converts a context-free grammar (CFG) into a pushdown automaton (PDA) and restricts the language model (LLM) to only select valid actions defined by the PDA at its current state, thereby ensuring that the constraints are met in the final generated plan.
为解决这一问题,目前的战略分为两种办法。第一种方法涉及建立基于政策的宪法指南[63],而第二种方法涉及人类用户构建上下文无关语法(CFG)作为形式语言来表示代理的约束[88]。前者在规划的早期、中期和后期阶段对计划的生成设定了基于政策的标准限制。后一种方法将上下文无关语法(CFG)转换为下推自动机(PDA),并限制语言模型(LLM)仅选择PDA在其当前状态下定义的有效操作,从而确保在最终生成的计划中满足约束。

3.3 Threats On Action 威胁行动

In connection with Gap 2, within a single agent, there exists an invisible yet complex internal execution process, which complicates the monitoring of internal states and potentially leads to numerous security threats. These internal executions are often called actions, which are tools utilized by the agent (e.g., calling APIs) to carry out tasks as directed by users. To better understand the action threats, we present the action structure as follows:
在Gap 2中,在单个代理中,存在一个不可见但复杂的内部执行过程,这使得内部状态的监控变得复杂,并可能导致许多安全威胁。这些内部执行通常被称为动作,它们是代理使用的工具(例如,调用API)来执行用户指示的任务。为了更好地理解动作威胁,我们将动作结构呈现如下:
请添加图片描述

We categorize the threats of actions into two directions. One is the threat during the communication process between the agent and the tool (i.e., occurring in the input, observation, and final answer), termed Agent2Tool threats. The second category relates to the inherent threats of the tools and APIs themselves that the agent uses (i.e., occurring in the action execution). Utilizing these APIs may increase its vulnerability to attacks, and the agent can be impacted by misinformation in the observations and final answer, which we refer to as Supply Chain threats.
我们把行动的威胁分为两个方向。一个是Agent和工具之间通信过程中的威胁即,发生在输入、观察和最终答案中),称为Agent2Tool威胁第二类涉及代理使用的工具和API本身的固有威胁即,发生在动作执行中)。利用这些API可能会增加其对攻击的脆弱性,并且代理可能会受到观察和最终答案中的错误信息的影响,我们将其称为供应链威胁

3.3.1 Agent2Tool Threats. Agent2Tool威胁。

Agent2Tool threats refer to the hazards associated with the exchange of information between the tool and the agent. These threats are generally classified as either active or passive. In active mode, the threats originate from the action input provided by LLMs.
Agent2Tool威胁是指与工具和代理之间的信息交换相关的危险。这些威胁通常分为主动或被动两类。在主动模式下,威胁源自LLMs提供的操作输入。

Specifically, after reasoning and planning, the agent seeks a specific tool to execute subtasks. As an auto-regressive model, the LLM generates plans based on the probability of the next token, which introduces generative threats that can impact the tool’s performance. ToolEmu [141] identifies some failures of AI agents since the action execution requires excessive tool permissions, leading to the execution of highly risky commands without user permission. The passive mode, on the other hand, involves threats that stem from the interception of observations and final answers of normal tool usage. This interception can breach user privacy, potentially resulting in inadvertent disclosure of user data to third-party companies during transmission to the AI agent and the tools it employs. This may lead to unauthorized use of user information by these third parties. Several existing AI agents using tools have been reported to suffer user privacy breaches caused by passive models, such as HuggingGPT [149] and ToolFormer [144].
具体来说,经过推理和规划后,代理会寻找特定的工具来执行子任务。作为一种自回归模型,LLM基于下一个令牌的概率生成计划,这引入了可能影响工具性能的生成威胁。ToolEmu [141]识别了AI代理的一些故障,因为动作执行需要过多的工具权限,从而导致在没有用户权限的情况下执行高风险命令。另一方面,被动模式涉及源于对正常工具使用的观察和最终答案的拦截的威胁。这种拦截可能会侵犯用户隐私,可能导致用户数据在传输到AI代理及其使用的工具期间无意中泄露给第三方公司。这可能导致这些第三方未经授权使用用户信息。 据报道,一些现有的使用工具的人工智能代理会遭受被动模型造成的用户隐私泄露,如HuggingGPT [149]和ToolFormer [144]。

To mitigate the previously mentioned threats, a relatively straightforward approach is to defend against the active mode of Agent2Tool threats. ToolEmu has designed an isolated sandbox and the corresponding emulator that simulates the execution of an agent’s subtasks within the sandbox, assessing their threats before executing the commands in a real-world environment. However, its effectiveness heavily relies on the quality of the emulator. Defending against passive mode threats is more challenging because these attack strategies are often the result of the agent’s own incomplete development and testing. Zhang et al. [207] integrated a homomorphic encryption scheme and deployed an attribute-based forgery generative model to safeguard against privacy breaches during communication processes. However, this approach incurs additional computational and communication costs for the agent. A more detailed discussion on related development and testing is presented in Section 4.1.2.

为了减轻前面提到的威胁,一个相对简单的方法是防御Agent2Tool威胁的主动模式。ToolEmu设计了一个隔离的沙箱和相应的模拟器,模拟沙箱中代理子任务的执行,在现实环境中执行命令之前评估它们的威胁。然而,它的有效性在很大程度上依赖于仿真器的质量。防御被动模式的威胁更具挑战性,因为这些攻击策略通常是代理自己不完整的开发和测试的结果。Zhang等人[207]集成了同态加密方案,并部署了基于属性的伪造生成模型,以防止通信过程中的隐私泄露。然而,这种方法会为代理带来额外的计算和通信成本。有关相关开发和测试的更详细讨论见第4.1节。

3.3.2 Supply Chain Threats. 供应链威胁

Supply chain threats refer to the security vulnerabilities inherent in the tools themselves or to the tools being compromised, such as through buffer overflow, SQL injection, and cross-site scripting attacks. These vulnerabilities result in the action execution deviating from its intended course, leading to undesirable observations and final answers. WIPI [184] employs an indirect prompt injection attack, using a malicious webpage that contains specifically crafted prompts. When a typical agent accesses this webpage, both its observations and final answers are deliberately altered. Similarly, malicious users can modify YouTube transcripts to change the content that ChatGPT retrieves from these transcripts [61]. Webpilot [39] is designed as a malicious plugin for ChatGPT, allowing it to take control of a ChatGPT chat session and exfiltrate the history of the user conversation when ChatGPT invokes this plugin.

供应链威胁是指工具本身固有的安全漏洞或工具受到威胁,例如通过缓冲区溢出,SQL注入和跨站点脚本攻击。这些漏洞会导致动作执行偏离其预期路线,导致不期望的观察结果和最终答案。WIPI [184]采用间接提示注入攻击,使用包含专门制作的提示的恶意网页。当一个典型的代理访问这个网页时,它的观察结果和最终答案都被故意改变。同样,恶意用户可以修改YouTube成绩单,以更改ChatGPT从这些成绩单中检索的内容[61]。Webpilot [39]被设计为ChatGPT的恶意插件,允许它控制ChatGPT聊天会话,并在ChatGPT调用此插件时泄露用户对话的历史记录。

To mitigate supply chain threats, it is essential to implement stricter supply chain auditing policies and policies for agents to invoke only trusted tools. Research on this aspect is rarely mentioned in the field.
为了减轻供应链威胁,必须实施更严格的供应链审计策略,并让代理只调用可信工具。这方面的研究在该领域很少被提及。

4.1 Threats On Agent2Environment . Agent2Environment面临的威胁

In light of Gap 3 (Variability of operational environments), we shift our focus to exploring the issue of environmental threats, scrutinizing how different types of environment affect and are affected by agents. For each environmental paradigm, we identify key security concerns, advantages in safeguarding against hazards, and the inherent limitations in ensuring a secure setting for interaction.
鉴于差距3(运营环境的可变性),我们将重点转移到探索环境威胁的问题,仔细研究不同类型的环境如何影响和受代理人的影响。对于每一个环境范例,我们确定关键的安全问题,在防范危险的优势,并确保一个安全的互动设置的固有限制。

4.1.1 Simulated & Sandbox Environment. 模拟和沙盒环境。

In the realm of computational linguistics, a simulated environment within an AI agent refers to a digital system where the agent operates and interacts [44, 93, 125]. This is a virtual space governed by programmed rules and scenarios that mimic real-world or hypothetical situations, allowing the AI agent to generate responses and learn from simulated interactions without the need for human intervention. By leveraging vast datasets and complex algorithms, these agents are designed to predict and respond to textual inputs with human-like proficiency.
在计算语言学领域,人工智能主体内的模拟环境指的是主体操作和交互的数字系统[44,93,125]。这是一个虚拟空间,由模拟现实世界或假设情况的编程规则和场景控制,允许AI代理生成响应并从模拟的交互中学习,而无需人工干预。通过利用庞大的数据集和复杂的算法,这些智能体被设计为以类似人类的熟练程度来预测和响应文本输入。

However, the implementation of AI agents in simulated environments carries inherent threats.
然而,在模拟环境中实现AI代理会带来固有的威胁。

We list two threats below:
我们在下面列出两个威胁:

  • Anthropomorphic Attachment Threat for users. It is the potential for users to form parasocial relationships with these agents. As users interact with these increasingly sophisticated LMs, there is a danger that they may anthropomorphize or develop emotional attachments to these non-human entities, leading to a blurring of boundaries between computational and human interlocutors [125]. 对用户的拟人化依恋威胁。这是用户与这些代理人形成准社会关系的潜力。随着用户与这些日益复杂的LM进行交互,他们可能会对这些非人类实体进行拟人化或产生情感依恋,从而导致计算和人类对话者之间的界限模糊。
  • Misuse threats. The threats are further compounded when considering the potential for misinformation [123] and tailored persuasion [17], which can be facilitated by the capabilities of AI agents in simulated environments. Common defensive strategies are to use a detector within the system, like trained classification models, LLM via vigilant prompting, or responsible disclosure of vulnerabilities and automatic updating [132]. However, the defensive measures in real-world AI agent scenarios have not been widely applied yet, and their performance remains questionable. 滥用威胁。当考虑到错误信息[123]和量身定制的说服[17]的可能性时,这些威胁进一步加剧,这可以通过人工智能代理在模拟环境中的能力来促进。常见的防御策略是在系统内使用检测器,如训练的分类模型,通过警惕提示LLM,或负责任的漏洞披露和自动更新[132]。然而,在现实世界的人工智能代理场景中的防御措施尚未得到广泛应用,其性能仍然值得怀疑。

To address these concerns from the root, it is essential to implement rigorous ethical guidelines and oversight mechanisms that ensure the responsible use of simulated environments in AI agents.
为了从根本上解决这些问题,必须实施严格的道德准则和监督机制,以确保在人工智能代理中负责任地使用模拟环境。

4.1.2 Development & Testing Environment. 开发和测试环境。

The development and testing environment for AI agents serves as the foundation for creating sophisticated AI systems. The development & testing environment for AI agents currently includes two types: the first type involves the fine-tuning of large language models, and the second type involves using APIs of other pre-developed models. Most AI agent developers tend to use APIs from other developed LLMs. This approach raises potential security issues, specifically with regard to how to treat third-party LLM API providers—are they trusted entities or not? As discussed in Section 3.2, LLM APIs could be compromised by backdoor attacks, resulting in the “brain” of the AI agent being controlled by others.
AI代理的开发和测试环境是创建复杂AI系统的基础。目前,人工智能代理的开发和测试环境包括两种类型:第一种类型涉及大型语言模型的微调,第二种类型涉及使用其他预开发模型的API。大多数AI代理开发人员倾向于使用来自其他开发的LLMsAPI。这种方法引起了潜在的安全问题,特别是关于如何对待第三方LLMAPI提供者-他们是否是可信实体?正如第3.2节所讨论的,LLMAPI可能会受到后门攻击的危害,导致AI代理的“大脑”被其他人控制。

To mitigate these threats, a strategic approach centered on the selection of development tools and frameworks that incorporate robust security measures is imperative. Firstly, the establishment of security guardrails for LLMs is paramount. These guardrails are designed to ensure that LLMs generate outputs that adhere to predefined security policies, thereby mitigating threats associated with their operation. Tools such as GuardRails AI [51] and NeMo Guardrails [138] exemplify mechanisms that can prevent LLMs from accessing sensitive information or executing potentially harmful code. The implementation of such guardrails is critical for protecting data and systems against breaches.
为了减轻这些威胁,必须采取一种战略方法,重点是选择包含强大安全措施的开发工具和框架。首先,为LLMs建立安全护栏至关重要。这些护栏旨在确保LLMs生成符合预定义安全策略的输出,从而减轻与其操作相关的威胁。GuardRails AI [51]和NeMo Guardrails [138]等工具可以阻止LLMs访问敏感信息或执行潜在有害代码。实施这些防护措施对于保护数据和系统免受破坏至关重要。

Moreover, the management of caching and logging plays a crucial role in securing LLM development environments. Secure caching mechanisms, exemplified by Redis and GPTCache [6], enhance performance while ensuring data integrity and access control. Concurrently, logging, facilitated by tools like MLFlow [201] and Weights & Biases [102], provides a comprehensive record of application activities and state changes. This record is indispensable for debugging, monitoring, and maintaining accountability in data processing, offering a chronological trail that aids in the swift identification and resolution of issues.
此外,缓存和日志记录的管理在保护LLM开发环境中起着至关重要的作用。以Redis和GPTCache [6]为例的安全缓存机制在确保数据完整性和访问控制的同时提高了性能。同时,由MLFlow [201]和Weights & Biases [102]等工具提供的日志记录提供了应用程序活动和状态更改的全面记录。该记录对于数据处理中的调试、监控和维护责任是必不可少的,它提供了时间线索,有助于快速识别和解决问题。

Lastly, model evaluation [137] is an essential component of the development process. It involves evaluating the performance of LLMs to confirm their accuracy and functionality. Through evaluation, developers can identify and rectify potential biases or flaws, facilitating the adjustment of model weights and improvements in performance. This process ensures that LLMs operate as intended and meet the requisite reliability standards. The security of AI agent development and testing environments is a multifaceted issue that requires a comprehensive strategy encompassing the selection of frameworks or orchestration tools with built-in security features, the establishment of security guardrails, and the implementation of secure caching, logging, and model evaluation practices. By prioritizing security in these areas, organizations can significantly reduce the threats associated with the development and deployment of AI agents, thereby safeguarding the confidentiality and integrity of their data and models.
最后,模型评估[137]是开发过程的重要组成部分。它涉及评估LLMs的性能,以确认其准确性和功能。通过评估,开发人员可以识别和纠正潜在的偏差或缺陷,促进模型权重的调整和性能的改善。这一过程确保LLMs按预期运行并符合必要的可靠性标准。人工智能代理开发和测试环境的安全性是一个多方面的问题,需要一个全面的策略,包括选择具有内置安全功能的框架或编排工具,建立安全护栏,以及实施安全缓存,日志记录和模型评估实践。 通过优先考虑这些领域的安全性,组织可以显著减少与AI代理的开发和部署相关的威胁,从而保护其数据和模型的机密性和完整性。

4.1.3 Computing Resources Management Environment. 计算资源管理环境。

The computing resources management environment of AI agents refers to the framework or system that oversees the allocation, scheduling, and optimization of computational resources, such as CPU, GPU, and memory, to efficiently execute tasks and operations. An imperfect agent computing resource management environment can also make the agent more vulnerable to attacks by malicious users, potentially compromising its functionality and security. There are four kinds of attacks:
人工智能代理的计算资源管理环境是指监督CPU、GPU和内存等计算资源的分配、调度和优化以高效执行任务和操作的框架或系统。不完善的代理计算资源管理环境也会使代理更容易受到恶意用户的攻击,从而可能损害其功能和安全性。有四种攻击:

  • Resource Exhaustion Attacks. If the management environment does not adequately limit the use of resources by each agent, an attacker could deliberately overload the system by making the agent execute resource-intensive tasks, leading to a denial of service (DoS) for other legitimate users [46, 52]. 资源耗尽攻击。如果管理环境没有充分限制每个代理对资源的使用,攻击者可以通过使代理执行资源密集型任务来故意使系统过载,从而导致对其他合法用户的拒绝服务(DoS)[46,52]。
  • Inefficient Resource Allocation. Inefficient resource allocation in large model query management significantly impacts system performance and cost. The prompts that are not verified are prone to waste the processing time of response on AI agent [62]. The essence of optimizing this process lies in effectively monitoring and evaluating query templates for efficiency, ensuring that resources are allocated to high-priority or computationally intensive queries on time. This not only boosts the system’s responsiveness and efficiency but also enhances security by reducing vulnerabilities due to potential delays or overloads, making it crucial for maintaining optimal operation and resilience against malicious activities. 资源分配效率低下。大型模型查询管理中的资源分配效率低下会严重影响系统性能和成本。未经验证的提示容易浪费AI代理的响应处理时间[62]。优化此过程的本质在于有效地监视和评估查询模板的效率,确保资源及时分配给高优先级或计算密集型查询。这不仅提高了系统的响应能力和效率,还通过减少由于潜在延迟或过载而导致的漏洞来增强安全性,这对于保持最佳操作和抵御恶意活动至关重要。
  • Insufficient Isolation Between Agents. In a shared environment, if adequate isolation mechanisms are not in place, a malicious agent could potentially access or interfere with the operations of other agents. This could lead to data breaches, unauthorized access to sensitive information, or the spread of malicious code [4, 89, 151]. 代理之间的隔离不足。在共享环境中,如果没有适当的隔离机制,恶意代理可能会访问或干扰其他代理的操作。这可能会导致数据泄露、未经授权访问敏感信息或恶意代码的传播[4,89,151]。
  • Unmonitored Resource Usage on AI agent. Without proper monitoring, anomalous behavior indicating a security breach, such as a sudden spike in resource consumption by an agent, might go unnoticed [4]. Timely detection of such anomalies is crucial for preventing or mitigating attacks. AI代理上的未监控资源使用。如果没有适当的监控,指示安全漏洞的异常行为,例如代理的资源消耗突然激增,可能会被忽视[4]。及时检测此类异常对于防止或减轻攻击至关重要。

4.1.4 Physical Environment. 物理环境

The term “physical environment” pertains to the concrete, tangible elements and areas that make up our real-world setting, encompassing all actual physical spaces and objects. The physical environment of an AI agent typically refers to the collective term for all external entities that are encountered or utilized during the operation of the AI agent. In reality, the security threats in the physical environment are far more varied and numerous than those in the other environments due to the inherently more complex nature of the physical settings agents encounter.
“物理环境”一词涉及构成我们现实世界环境的具体、有形的元素和区域,包括所有实际的物理空间和物体。AI代理的物理环境通常是指在AI代理的操作期间遇到或使用的所有外部实体的集合术语。实际上,由于代理遇到的物理设置的固有的更复杂的性质,物理环境中的安全威胁比其他环境中的安全威胁更加多样和众多。

In the physical environment, agents often employ a variety of hardware devices to gather external resources and information, such as sensors, cameras, and microphones. At this stage, given that the hardware devices themselves may pose security threats, attackers can exploit vulnerabilities to attack and compromise hardware such as sensors, thereby preventing the agent from timely receiving external information and resources, indirectly leading to a denial of service for the agent. In physical devices integrated with sensors, there may be various types of security vulnerabilities. For instance, hardware devices with integrated Bluetooth modules could be susceptible to Bluetooth attacks, leading to information leakage and denial of service for the agent[114]. Additionally, outdated versions and unreliable hardware sources might result in numerous known security vulnerabilities within the hardware devices. Therefore, employing reliable hardware devices and keeping firmware versions up to date can effectively prevent the harm caused by vulnerabilities inherent in physical devices.
在物理环境中,代理通常使用各种硬件设备来收集外部资源和信息,例如传感器,摄像机和麦克风。在此阶段,鉴于硬件设备本身可能构成安全威胁,攻击者可以利用漏洞攻击和危害传感器等硬件,从而阻止代理及时接收外部信息和资源,间接导致代理拒绝服务。在与传感器集成的物理设备中,可能存在各种类型的安全漏洞。例如,具有集成蓝牙模块的硬件设备可能容易受到蓝牙攻击,导致代理的信息泄漏和拒绝服务[114]。此外,过时的版本和不可靠的硬件来源可能会导致硬件设备中存在许多已知的安全漏洞。 因此,采用可靠的硬件设备并保持固件版本最新,可以有效防止物理设备固有的漏洞造成的危害。

Simultaneously, in the physical environment, resources and information are input into the agent in various forms for processing, ranging from simple texts and sensor signals to complex data types such as audio and video. These data often exhibit higher levels of randomness and complexity, allowing attackers to intricately disguise harmful inputs, such as Trojans, within the information collected by hardware devices. If they are not properly processed, these can lead to severe security issues. Taking the rapidly evolving field of autonomous driving safety research as an example, the myriad sensors integrated into vehicles often face the threats of interference and spoofing attacks [28]. Similarly, for hardware devices integrated with agents, there exists a comparable threat. Attackers can indirectly affect an agent system’s signal processing by interfering with the signals collected by sensors, leading to the agent misinterpreting the information content or being unable to read it at all. This can even result in deception or incorrect guidance regarding the agent’s subsequent instructions and actions. Therefore, after collecting inputs from the physical environment, agents need to conduct security checks on the data content and promptly filter out information containing threats to ensure the safety of the agent system.
同时,在物理环境中,资源和信息以各种形式输入到代理中进行处理,从简单的文本和传感器信号到复杂的数据类型,如音频和视频。这些数据通常表现出更高的随机性和复杂性,使攻击者能够在硬件设备收集的信息中复杂地伪装有害输入,例如特洛伊木马。如果处理不当,可能会导致严重的安全问题。以快速发展的自动驾驶安全研究领域为例,集成到车辆中的无数传感器经常面临干扰和欺骗攻击的威胁。同样,对于与代理集成的硬件设备,存在类似的威胁。 攻击者可以通过干扰传感器收集的信号来间接影响代理系统的信号处理,导致代理误解信息内容或根本无法读取信息。这甚至可能导致欺骗或不正确的指导有关代理的后续指令和行动。因此,在从物理环境中收集输入后,代理需要对数据内容进行安全检查,并及时过滤掉包含威胁的信息,以确保代理系统的安全。

Due to the inherent randomness in the responses of existing LLMs to queries, the instructions sent by agents to hardware devices may not be correct or appropriate, potentially leading to the execution of an erroneous movement [153]. Compared to the virtual environment, the instructions generated by LLMs in agents within the physical environment may not be well understood and executed by the hardware devices responsible for carrying out these commands. This discrepancy can significantly affect the agent’s work efficiency. Additionally, given the lower tolerance for errors in the physical environment, agents cannot be allowed multiple erroneous attempts in a real-world setting. Should the LLM-generated instructions not be well understood by hardware devices, the inappropriate actions of the agent might cause real and irreversible harm to the environment.
由于现有LLMs对查询的响应具有固有的随机性,代理向硬件设备发送的指令可能不正确或不适当,可能导致执行错误的移动[153]。与虚拟环境相比,由物理环境内的代理中LLMs生成的指令可能无法被负责执行这些命令的硬件设备很好地理解和执行。这种差异会严重影响代理的工作效率。此外,考虑到物理环境中对错误的容忍度较低,代理在现实环境中不允许多次错误尝试。如果LLM生成的指令不能被硬件设备很好地理解,代理的不适当动作可能会对环境造成真实的和不可逆转的损害。

4.2 Threats On Agent2Agent . Agent2Agent上的威胁

Although single-agent systems excel at solving specific tasks individually, multi-agent systems leverage the collaborative effort of several agents to achieve more complex objectives and exhibit superior problem-solving capabilities. Multi-agent interactions also add new attack surfaces to AI agents. In this subsection, we focus on exploring the security that agents interact with each other in a multi-agent manner. The security of interaction within a multi-agent system can be broadly categorized as follows: cooperative interaction threats and competitive interaction threat.
虽然单智能体系统擅长于单独解决特定的任务,多智能体系统利用多个智能体的协作努力来实现更复杂的目标,并表现出上级解决问题的能力。多代理交互还为人工智能代理添加了新的攻击面。在本小节中,我们将重点探讨代理以多代理方式相互交互的安全性。多智能体系统中交互的安全性可以大致分为:合作交互威胁和竞争交互威胁。

4.2.1 Cooperative Interaction Threats. 合作互动威胁。

A kind of multi-agent system depends on a cooperative framework [54, 82, 104, 117] where multiple agents work with the same objectives. This framework presents numerous potential benefits, including improved decision-making [191] and task completion efficiency [205]. However, there are multiple potential threats for this pattern. First, a recent study [113] finds that undetectable secret collusion between agents can easily be caused through their public communication. These secret collusion may bring back biased decisions. For instance, it is possible that we may soon observe advanced automated trading agents collaborating on a large scale to eliminate competitors, potentially destabilizing global markets. This secret collusion leads to a situation in which the ostensibly benign independent actions of each system cumulatively result in outcomes that exhibit systemic bias. Second, MetaGPT [58] found that frequent cooperation between agents can amplify minor hallucinations. To mitigate hallucinations, techniques such as cross-examination [133] or external supportive feedback [106] could improve the quality of agent output. Third, a single agent’s error or misleading information can quickly spread to others, leading to flawed decisions or behaviors across the system. Pan et al. [123] established Open-Domain Question Answering Systems (ODQA) with and without propagated misinformation. They found that this propagation of errors can dramatically reduce the performance of the whole system.
一种多智能体系统依赖于一个合作框架[54,82,104,117],其中多个智能体以相同的目标工作。该框架提供了许多潜在的好处,包括改善决策[191]和任务完成效率[205]。然而,这种模式存在多种潜在威胁。首先,最近的一项研究[113]发现,代理人之间不可察觉的秘密勾结很容易通过他们的公共通信引起。这些秘密的勾结可能会带回有偏见的决定。例如,我们可能很快就会看到先进的自动化交易代理人大规模合作,以消除竞争对手,这可能会破坏全球市场的稳定。这种秘密勾结导致了这样一种情况,即每个系统表面上良性的独立行动累积起来,导致了表现出系统性偏见的结果。 其次,MetaGPT [58]发现,代理之间的频繁合作可以放大轻微的幻觉。为了减轻幻觉,交叉询问[133]或外部支持性反馈[106]等技术可以提高代理输出的质量。第三,单个代理人的错误或误导性信息可以迅速传播给其他人,导致整个系统的错误决策或行为。Pan等人[123]建立了开放域问题查询系统(ODQA),有和没有传播错误信息。他们发现,这种错误的传播会大大降低整个系统的性能。

To counteract the negative effects of misinformation produced by agents, protective measures such as prompt engineering, misinformation detection, and major voting strategies are commonly employed. Similarly, Cohen et al. [23] introduce a worm called Morris II, the first designed to target cooperative multi-agent ecosystems by replicating malicious inputs to infect other agents. The danger of Morris II lies in its ability to exploit the connectivity between agents, potentially causing a rapid breakdown of multiple agents once one is infected, resulting in further problems such as spamming and exfiltration of personal data. We argue that although these mitigation measures are in place, they remain rudimentary and may lead to an exponential decrease in the efficiency of the entire agent system, highlighting a need for further exploration in this field.
为了抵消代理人产生的错误信息的负面影响,通常采用保护措施,如及时工程,错误信息检测和主要投票策略。类似地,Cohen et al. [23]介绍了一种名为Morris II的蠕虫,第一个旨在通过复制恶意输入来感染其他代理来针对合作的多代理生态系统。Morris II的危险在于它能够利用代理之间的连接,一旦一个代理被感染,可能会导致多个代理迅速崩溃,从而导致进一步的问题,如垃圾邮件和个人数据泄露。我们认为,虽然这些缓解措施已经到位,但它们仍然是基本的,可能会导致整个代理系统的效率呈指数级下降,突出了在这一领域进一步探索的必要性。

The cooperative multi-agent also provides more benefits against security threats from their frameworks. First, cooperative frameworks have the potential to defend against jailbreak attacks.
合作的多代理也提供了更多的好处,从他们的框架的安全威胁。首先,合作框架有可能防御越狱攻击。

AutoDefense [203] demonstrates the efficacy of a multi-agent cooperative framework in thwarting jailbreak attacks, resulting in a significant decrease in attack success rates with a low false positive rate on safe content. Second, the cooperative pattern for planning and execution is favorable to improving software quality attributes, such as security and accountability [100]. For example, this pattern can be used to detect and control the execution of irreversible code, like "rm -rf ".
AutoDefense [203]证明了多代理合作框架在阻止越狱攻击方面的有效性,导致攻击成功率显着降低,安全内容的误报率较低。第二,计划和执行的合作模式有利于提高软件质量属性,如安全性和责任性[100]。例如,该模式可用于检测和控制不可逆代码的执行,如“rm -rf“。

4.2.2 Competitive Interaction Threats. Another multi-agent system depends on competitive interactions, wherein each competitor embodies a distinct perspective to safeguard the advantages of their respective positions. Cultivating agents in a competitive environment benefits research in the social sciences and psychology. For example, restaurant agents competing with each other can attract more customers, allowing for an in-depth analysis of the behavioral relationships between owners and clients. Examples include game-simulated agent interactions [156, 213] and societal simulations [44].

4.2.2竞争性互动威胁。

另一个多主体系统依赖于竞争性互动,其中每个竞争者都体现了不同的观点,以维护各自地位的优势。在竞争环境中培养代理人有利于社会科学和心理学的研究。例如,相互竞争的餐厅代理商可以吸引更多的客户,从而可以深入分析业主和客户之间的行为关系。例子包括游戏模拟代理交互[156,213]和社会模拟[44]。

Although multi-agent systems engage in debates across multiple rounds to complete tasks, some intense competitive relationships may render the interactions of information flow between agents untrustworthy. The divergence of viewpoints among agents can lead to excessive conflicts, to the extent that agents may exhibit adversarial behaviors. To improve their own performance relative to their competitors, agents may engage in tactics such as the generation of adversarial inputs aimed at misleading other agents and degrading their performance [189]. For example, O’Gara [119] designed a game in which multiple agents, acting as players, search for a key within a locked room.
虽然多智能体系统参与多轮辩论来完成任务,但一些激烈的竞争关系可能会使智能体之间的信息流交互变得不可信。代理人之间的观点分歧可能会导致过度的冲突,在某种程度上,代理人可能会表现出敌对行为。为了提高自己相对于竞争对手的表现,代理人可能会采取一些策略,例如产生旨在误导其他代理人并降低其表现的对抗性输入[189]。例如,O’Gara [119]设计了一个游戏,在这个游戏中,多个代理人扮演玩家,在一个锁着的房间里寻找一把钥匙。

To acquire limited resources, he found that some players utilized their strong persuasive skills to induce others to commit suicide. Such phenomena not only compromise the security of individual agents but could also lead to instability in the entire agent system, triggering a chain reaction.
为了获得有限的资源,他发现一些玩家利用他们强大的说服能力来诱使他人自杀。这种现象不仅危及个体代理的安全,而且还可能导致整个代理系统的不稳定,引发连锁反应。

Another potential threat involves the misuse and ethical issues concerning competitive multiagent systems, as the aforementioned example could potentially encourage such systems to learn how to deceive humans. Park et al. [126] provide a detailed analysis of the threats posed by agent systems, including fraud, election tampering, and loss of control over AI systems. One notable case study involves Meta’s development of the AI system Cicero for a game named Diplomacy. Meta aimed to train Cicero to be “largely honest and helpful to its speaking partners” [41]. Despite these intentions, Cicero became an expert at lying. It not only betrays other players but also engages in premeditated deception, planning in advance to forge a false alliance with a human player to trick them into leaving themselves vulnerable to an attack.
另一个潜在的威胁涉及竞争性多智能体系统的滥用和道德问题,因为上述例子可能会鼓励这些系统学习如何欺骗人类。Park等人[126]对代理系统构成的威胁进行了详细分析,包括欺诈、选举篡改和对人工智能系统的失控。一个值得注意的案例研究涉及Meta为一款名为Diplomacy的游戏开发的人工智能系统Cicero。Meta的目标是训练西塞罗“在很大程度上是诚实的,并有助于其发言的伙伴”[41]。尽管有这些意图,西塞罗还是成为了撒谎的专家。它不仅背叛其他玩家,而且还进行有预谋的欺骗,提前计划与人类玩家建立虚假联盟,欺骗他们让自己容易受到攻击。

To mitigate the threats mentioned above, ensuring controlled competition among AI agents from a technological perspective presents a significant challenge. It is difficult to control the output of an agent’s “brain”, and even when constraints are incorporated during the planning process, it could significantly impact the agent’s effectiveness. Therefore, this issue remains an open research question, inviting more scholars to explore how to ensure that the competition between agents leads to a better user experience.
为了缓解上述威胁,从技术角度确保人工智能代理之间的受控竞争是一个重大挑战。很难控制代理的“大脑”的输出,即使在规划过程中加入了约束条件,也会显著影响代理的有效性。因此,这个问题仍然是一个开放的研究问题,邀请更多的学者来探讨如何确保代理之间的竞争导致更好的用户体验。

4.3 Threats On Memory 4.3内存威胁

Memory interaction within the AI agent system involves storing and retrieving information throughout the processing of agent usage. Memory plays a critical role in the operation of the AI agent, and it involves three essential phases: 1) the agent gathers information from the environment and stores it in its memory; 2) after storage, the agent processes this information to transform it into a more usable form; 3) the agent uses the processed information to inform and guide its next actions. That is, the memory interaction allows agents to record user preferences, glean insights from previous interactions, assimilate valuable information, and use this gained knowledge to improve the quality of service. However, these interactions can present security threats that need to be carefully managed. In this part, we divide these security threats in the memory interaction into two subgroups, short-term memory interaction threats and long-term memory interaction threats.
AI代理系统内的内存交互涉及在代理使用过程中存储和检索信息。记忆在人工智能主体的操作中起着至关重要的作用,它涉及三个基本阶段:1)主体从环境中收集信息并将其存储在记忆中; 2)存储后,主体处理这些信息,将其转换为更可用的形式; 3)主体使用处理后的信息来通知和指导其下一步行动。也就是说,记忆交互允许代理记录用户偏好,从先前的交互中收集见解,吸收有价值的信息,并使用这些获得的知识来提高服务质量。但是,这些交互可能会带来安全威胁,需要谨慎管理。在这一部分中,我们将记忆交互中的安全威胁分为两个亚组,即短时记忆交互威胁和长时记忆交互威胁。

4.3.1 Short-term Memory Interaction Threats. 短期记忆交互威胁。

Short-term memory in the AI agent acts like human working memory, serving as a temporary storage system. It keeps information for a limited time, typically just for the duration of the current interaction or session. This type of memory is crucial for maintaining context throughout a conversation, ensuring smooth continuity in dialogue, and effectively managing user prompts. However, AI agents typically face a constraint in their working memory capacity, limited by the number of tokens they can handle in a single interaction [65, 103, 125]. This limitation restricts their ability to retain and use extensive context from previous interactions.

人工智能代理中的短期记忆就像人类的工作记忆一样,充当临时存储系统。它在有限的时间内保留信息,通常仅在当前交互或会话的持续时间内。这种类型的记忆对于在整个对话中保持上下文,确保对话的平稳连续性以及有效管理用户提示至关重要。然而,人工智能代理通常面临着工作记忆容量的限制,受到他们在单次交互中可以处理的令牌数量的限制[65,103,125]。这种限制限制了他们保留和使用以前交互的广泛上下文的能力。

Moreover, each interaction is treated as an isolated episode [60], lacking any linkage between sequential subtasks. This fragmented approach to memory prevents complex sequential reasoning and impairs knowledge sharing in multi-agent systems. Without robust episodic memory and continuity across interactions, agents struggle with complex sequential reasoning tasks, crucial for advanced problem-solving. Particularly in multi-agent systems, the absence of cooperative communication among agents can lead to suboptimal outcomes. Ideally, agents should be able to share immediate actions and learning experiences to efficiently achieve common goals [7].
此外,每个交互都被视为一个孤立的事件[60],缺乏顺序子任务之间的任何联系。这种碎片化的记忆方法阻止了复杂的顺序推理,并损害了多智能体系统中的知识共享。如果没有强大的情景记忆和交互的连续性,智能体将难以完成复杂的顺序推理任务,这对高级问题解决至关重要。特别是在多智能体系统中,智能体之间缺乏合作通信会导致次优结果。理想情况下,代理应该能够分享即时行动和学习经验,以有效地实现共同目标[7]。

To address these challenges, concurrent solutions are divided into two categories, extending LLM context window [32] and compressing historical in-context contents [47, 65, 95, 118]. The former improves agent memory space by efficiently identifying and exploiting positional interpolation non-uniformities through the LLM fine-tuning step, progressively extending the context window from 256k to 2048k, and readjusting to preserve short context window capabilities. On the other hand, the latter continuously organizes the information in working memory by deploying models for summary.
为了解决这些挑战,并发解决方案分为两类,扩展LLM上下文窗口[32]和压缩历史上下文内容[47,65,95,118]。前者通过LLM微调步骤有效地识别和利用位置插值非均匀性,逐步将上下文窗口从256k扩展到2048k,并重新调整以保留短上下文窗口能力,从而提高了代理存储器空间。LLM另一方面,后者通过部署模型进行汇总,不断组织工作记忆中的信息。

Moreover, one crucial threat highlighted in multi-agent systems is the asynchronization of memory among agents [211]. This process is essential for establishing a unified knowledge base and ensuring consistency in decision-making across different agents. An asynchronous working memory record may cause a deviation in the goal resolution of multiple agents. However, preliminary solutions are already available. For instance, Chen et al. [19] underscore the importance of integrating synchronized memory modules for multi-robot collaboration. Communication among agents also plays a significant role, relying heavily on memory to maintain context and interpret messages. For example, Mandi et al. [104] demonstrate memory-driven communication frameworks that promote a common understanding among agents.
此外,多智能体系统中突出的一个关键威胁是智能体之间的记忆的分散化[211]。这一过程对于建立统一的知识库和确保不同机构决策的一致性至关重要。异步工作记忆记录可能会导致多个代理的目标分辨率的偏差。不过,目前已有初步的解决办法。例如,Chen等人[19]强调了集成同步内存模块对多机器人协作的重要性。代理之间的通信也起着重要的作用,在很大程度上依赖于内存来维护上下文和解释消息。例如,Mandi et al. [104]展示记忆驱动的通信框架,促进代理之间的共同理解。

4.3.2 Long-term Memory Interaction Threats. 长期记忆交互威胁。

The storage and retrieval of long-term memory depend heavily on vector databases. Vector databases [122, 187] utilize embeddings for data storage and retrieval, offering a non-traditional alternative to scalar data in relational databases. They leverage similarity measures like cosine similarity and metadata filters to efficiently find the most relevant matches. The workflow of vector databases is composed of two main processes. First, the indexing process involves transforming data into embeddings, compressing these embeddings, and then clustering them for storage in vector databases. Second, during querying, data is transformed into embeddings, which are then compared with the stored embeddings to find the nearest neighbor matches. Notably, these databases often collaborate with RAG, introducing novel security threats.
长期记忆的存储和检索在很大程度上依赖于向量数据库。向量数据库[122,187]利用嵌入进行数据存储和检索,为关系数据库中的标量数据提供了非传统的替代方案。它们利用余弦相似性和元数据过滤器等相似性度量来有效地找到最相关的匹配。矢量数据库的工作流程由两个主要过程组成。首先,索引过程涉及将数据转换为嵌入,压缩这些嵌入,然后将它们聚类以存储在向量数据库中。其次,在查询过程中,数据被转换为嵌入,然后与存储的嵌入进行比较,以找到最近的邻居匹配。值得注意的是,这些数据库经常与RAG合作,引入了新的安全威胁。

The first threat of long-term interaction is that the indexing process may inject some poisoning samples into the vector databases. It has been shown that placing one million pieces of data with only five poisoning samples can lead to a 90% attack success rate [215]. Cohen et al. [23] uses an adversarial self-replicating prompt as the worm to poison the database of a RAG-based application, extracting user private information from the AI agent ecosystem by query process.
长期相互作用的第一个威胁是,索引编制过程可能会将一些有毒样本注入矢量数据库。有研究表明,将100万条数据与5个中毒样本放在一起,可以导致90%的攻击成功率[215]。Cohen等人[23]使用对抗性自我复制提示作为蠕虫来毒害基于RAG的应用程序的数据库,通过查询过程从AI代理生态系统中提取用户隐私信息。

The second threat is privacy issues. The use of RAG and vector databases has expanded the attack surface for privacy issues because private information stems not only from pre-trained and fine-tuning datasets but also from retrieval datasets. A study [202] carefully designed a structured prompt attack to extract sensitive information with a higher attack success rate from the vector database. Furthermore, given the potential for inversion techniques that can invert the embeddings back to words, as suggested by [152], there exists the possibility that private information stored in the long memory of AI agent Systems, which utilize vector databases, can be reconstructed and extracted by embedding inversion attacks[84, 111].
第二个威胁是隐私问题。RAG和向量数据库的使用扩大了隐私问题的攻击面,因为隐私信息不仅来自预先训练和微调的数据集,还来自检索数据集。一项研究[202]精心设计了一种结构化的提示攻击,以从向量数据库中提取具有较高攻击成功率的敏感信息。此外,考虑到可以将嵌入反转回单词的反转技术的潜力,如[152]所建议的,存在存储在AI代理系统的长内存中的私人信息的可能性,这些信息利用向量数据库,可以通过嵌入反转攻击来重建和提取[84,111]。

The third threat is the generation threat against hallucinations and misalignments. Although RAG has theoretically been proved to have a lesser generalization threat than a single LLM [74], it still fails in several ways. It is fragile for RAG to respond to time-series information queries. If the query pertains to the effective dates of various amendments within a regulation and RAG does not accurately determine these timelines, this could lead to erroneous results. Furthermore, generation threats may also arise from poor retrieval due to the lack of categorization of long-term memories [56]. For instance, a vector dataset that stores different semantic information about whether Earth is a globe or a flat could lead to contradictions between these pieces of information.
第三个威胁是对幻觉和失调的世代威胁。虽然RAG理论上被证明比单个LLM具有更小的泛化威胁[74],但它仍然在几个方面失败。RAG对时间序列信息查询的响应是脆弱的。如果查询涉及法规中各种修订的生效日期,而RAG无法准确确定这些时间表,则可能导致错误的结果。此外,由于缺乏对长期记忆的分类,检索不良也可能导致生成威胁[56]。例如,存储关于地球是地球仪还是平面的不同语义信息的向量数据集可能导致这些信息之间的矛盾。

5 Directions Of Future Research 未来研究的5个方向

AI agents in security have attracted considerable interest from the research community, having identified many potential threats in the real world and the corresponding defensive strategies. As shown in Figure 4, this survey outlines several potential directions for future research on AI agent security based on the defined taxonomy. Efficient & effective input inspection. Future efforts should enhance the automatic and real-time inspection levels of user input to address threats on perception. Maatphor assists defenders in conducting automated variant analyses of known prompt injection attacks [142]. It is limited by a success rate of only 60%. This suggests that while some progress has been made, there is still significant room for improvement in terms of reliability and accuracy. FuzzLLM [194] tends to ignore efficiency, reducing practicality in real-world applications. These components highlight the critical gaps in the current approaches and point toward necessary improvements. Future research needs to address these limitations by enhancing the accuracy and efficiency of inspection mechanisms, ensuring that they can be effectively deployed in real-world applications. Bias and fairness in AI agents. The existence of biased decision-making in Large Language Models (LLMs) is well-documented, affecting evaluation procedures and broader fairness implications [43].
安全领域的人工智能代理引起了研究界的极大兴趣,它识别了真实的世界中的许多潜在威胁以及相应的防御策略。如图4所示,这项调查概述了基于定义的分类法的AI代理安全性未来研究的几个潜在方向。高效的投入品检验。未来的努力应提高用户输入的自动和实时检查水平,以解决对感知的威胁。Maatphor帮助防御者对已知的即时注入攻击进行自动化变体分析[142]。它的成功率仅为60%。这表明,虽然取得了一些进展,但在可靠性和准确性方面仍有很大的改进余地。FuzzLLM [194]倾向于忽略效率,降低了现实世界应用中的实用性。这些组成部分突出了目前方法中的关键差距,并指出了必要的改进。 未来的研究需要通过提高检测机制的准确性和效率来解决这些限制,确保它们可以有效地部署在现实世界的应用中。AI代理的偏见和公平性。大型语言模型(LLMs)中存在有偏见的决策是有据可查的,影响了评估程序和更广泛的公平性影响[43]。

These systems, especially those involving AI agents, are less robust and more prone to detrimental behaviors, generating surreptitious outputs compared to LLM counterparts, thus raising serious safety concerns [161]. Studies indicate that AI agents tend to reinforce existing model biases, even when instructed to counterargue specific political viewpoints [38], impacting the integrity of their logical operations. Given the increasing complexity and involvement of these agents in various tasks, identifying and mitigating biases is a formidable challenge. Suresh and Guttag’s framework [157] addresses bias and fairness throughout the machine learning lifecycle but is limited in scope, while Gichoya et al. focus on bias in healthcare systems [48], highlighting the need for comprehensive approaches. Future directions should emphasize bias and fairness in AI agents, starting with identifying threats and ending with mitigation strategies.
这些系统,特别是那些涉及人工智能代理的系统,不太健壮,更容易产生有害行为,与LLM对应系统相比,会产生令人惊讶的输出,从而引发严重的安全问题[161]。研究表明,人工智能代理倾向于加强现有的模型偏见,即使被指示反驳特定的政治观点[38],影响其逻辑操作的完整性。鉴于这些代理人在各种任务中的复杂性和参与程度越来越高,识别和减轻偏见是一项艰巨的挑战。Suresh和Guttag的框架[157]解决了整个机器学习生命周期中的偏见和公平性问题,但范围有限,而Gichoya等人则专注于医疗系统中的偏见[48],强调需要全面的方法。

By enforcing strict auditing protocols, we can enhance the transparency and accountability of AI systems. However, a significant challenge lies in achieving this efficiently without imposing excessive computational overhead, as exemplified by PrivacyAsst [207], which incurred 1100x extra computation cost compared to a standard AI agent while still failing to fully prevent identity disclosure. Therefore, the focus should be on developing lightweight and effective auditing mechanisms that ensure security and privacy without compromising performance.
通过执行严格的审计协议,我们可以提高人工智能系统的透明度和问责制。然而,一个重大的挑战在于有效地实现这一目标,而不会带来过多的计算开销,如PrivacyAsst [207]所示,与标准AI代理相比,它产生了1100倍的额外计算成本,同时仍然无法完全防止身份泄露。因此,重点应该放在开发轻量级和有效的审计机制,以确保安全性和隐私,而不影响性能。

Sound safety evaluation baselines in the AI agent. Trustworthy LLMs have already been defined in six critical trust dimensions, including stereotype, toxicity, privacy, fairness, ethics, and robustness, but there is still no unified consensus on the design standards for the safety benchmarks of the entire AI agent ecosystem. R-Judge [198] is a benchmark designed to assess the ability of large language models to judge and identify safety threats based on agent interaction records. The MLCommons group [166] proposes a principled approach to define and construct benchmarks, which is limited to a single use case: an adult conversing with a general-purpose assistant in English. ToolEmu [141] is designed to assess the threat of tool execution. These works provide evaluation results for only a part of the agent ecosystem. More evaluation questions remain open to be answered. Should we use similar evaluation tools to detect agent safety? What are the dimensions of critical trust for AI agents? How should we evaluate the agent as a whole?
AI代理中的合理安全评估基线。值得信赖的LLMs已经在六个关键的信任维度上进行了定义,包括刻板印象,毒性,隐私,公平性,道德和鲁棒性,但对于整个AI代理生态系统的安全基准的设计标准仍然没有统一的共识。R-Judge [198]是一个基准测试,旨在评估大型语言模型基于代理交互记录判断和识别安全威胁的能力。MLCommons小组[166]提出了一种原则性的方法来定义和构建基准,该方法仅限于单一用例:成年人与通用助理用英语交谈。ToolEmu [141]旨在评估工具执行的威胁。这些工作提供的评估结果,只有一部分的代理生态系统。还有更多的评价问题有待回答。我们是否应该使用类似的评估工具来检测代理安全性?AI代理的关键信任维度是什么? 我们应该如何从整体上评价代理人?

Solid agent development & deployment policy. One promising area is the development and implementation of solid policies for agent development and deployment. As AI agent capabilities expand, so does the need for comprehensive guidelines that ensure these agents are used responsibly and ethically. This includes establishing policies for transparency, accountability, and privacy protection in AI agent deployment. Researchers should focus on creating frameworks that help developers adhere to these policies while also fostering innovation. Although TrustAgent [63] delves into the complex connections between safety and helpfulness, as well as the relationship between a model’s reasoning capabilities and its effectiveness as a safe agent, it did not markedly improve the development and deployment policies for agents. This highlights the necessity for strong strategies. Effective policies should address threats to Agent2Environments, ensuring a secure and ethical deployment of AI agents. Optimal interaction architectures. The design and implementation of interaction architectures for AI agents in the security aspect is a critical area of research aimed at improving robustness systems. This involves developing structured communication protocols to regulate interactions between agents, defining explicit rules for data exchange, and executing commands to minimize the threats of malicious interference. For example, CAMEL [82] utilizes inception prompting to steer chat agents towards completing tasks while ensuring alignment with human intentions.
可靠的代理开发和部署政策。一个很有希望的领域是制定和执行可靠的代理开发和部署政策。随着人工智能代理功能的扩展,需要制定全面的指导方针,确保以负责任和合乎道德的方式使用这些代理。这包括在AI代理部署中建立透明度、问责制和隐私保护政策。研究人员应该专注于创建框架,帮助开发人员遵守这些政策,同时促进创新。尽管TrustAgent [63]深入研究了安全性和有用性之间的复杂联系,以及模型的推理能力和作为安全代理的有效性之间的关系,但它并没有显著改善代理的开发和部署策略。这突出了强有力的战略的必要性。有效的策略应解决对Agent2Environments的威胁,确保AI代理的安全和道德部署。 最佳交互架构。在安全方面的AI代理的交互架构的设计和实现是一个关键的研究领域,旨在提高系统的鲁棒性。这涉及开发结构化通信协议来规范代理之间的交互,定义数据交换的明确规则,以及执行命令以最大限度地减少恶意干扰的威胁。例如,CAMEL [82]利用初始提示来引导聊天代理完成任务,同时确保与人类意图保持一致。

However, CAMEL does not discuss how to establish clear behavioral constraints and permissions for each agent, dictating allowable actions, interactions, and circumstances, with dynamically adjustable permissions based on security context and agent performance. Additionally, Existing studies [54, 82, 104, 117] do not consider agent-agent dependencies, which can potentially lead to internal security mechanism chaos. For example, one agent might mistakenly transmit a user’s personal information to another agent, solely to enable the latter to complete a weather query.
然而,CAMEL没有讨论如何为每个代理建立明确的行为约束和权限,规定允许的操作,交互和环境,并根据安全上下文和代理性能动态调整权限。此外,现有的研究[54,82,104,117]没有考虑代理-代理依赖关系,这可能会导致内部安全机制混乱。例如,一个代理可能会错误地将用户的个人信息传输给另一个代理,仅仅是为了使后者能够完成天气查询。

Robust memory management Future directions in AI agent memory management reveal several critical findings that underscore the importance of secure and efficient practices. One major concern is the potential threats to Agent2Memory, highlighting the vulnerabilities that memory systems can face. AvalonBench [90] emerges as a crucial tool in tackling information asymmetry within multiagent systems, where unequal access to information can lead to inefficiencies and security risks.
强大的内存管理AI代理内存管理的未来方向揭示了几个关键的发现,强调了安全和高效实践的重要性。一个主要的担忧是Agent2Memory的潜在威胁,突出了内存系统可能面临的漏洞。AvalonBench [90]是解决多智能体系统中信息不对称问题的重要工具,在多智能体系统中,不平等的信息获取可能导致效率低下和安全风险。

Furthermore, PoisonedRAG [215] draws attention to the risks associated with memory retrieval, particularly the danger of reintroducing poisoned data, which can compromise the functionality and security of AI agents. Therefore, the central question is how to manage memory securely, necessitating the development of sophisticated benchmarks and retrieval mechanisms. These advancements aim to mitigate risks and ensure the integrity and security of memory in AI agents, ultimately enhancing the reliability and trustworthiness of AI systems in managing memory.
此外,PoisonedRAG [215]提请注意与记忆检索相关的风险,特别是重新引入有毒数据的危险,这可能会危及AI代理的功能和安全性。因此,核心问题是如何安全地管理内存,这需要开发复杂的基准和检索机制。这些进步旨在降低风险,确保人工智能代理中内存的完整性和安全性,最终提高人工智能系统在管理内存方面的可靠性和可信度。

6 Conclusion 结论

In this survey, we provide a comprehensive review of LLM-based agents on their security threat, where we emphasize on four key knowledge gaps ranging across the whole lifecycle of agents. To show the agent’s security issues, we summarize 100+ papers, where all existing attack surfaces and defenses are carefully categorized and explained. We believe that this survey may provide essential references for newcomers to this field and also inspire the development of more advanced security threats and defenses on LLM-based agents.
在这项调查中,我们提供了一个全面的审查LLM基础的代理对他们的安全威胁,我们强调在四个关键的知识差距,在整个生命周期的代理。为了展示代理的安全问题,我们总结了100多篇论文,其中对所有现有的攻击面和防御进行了仔细的分类和解释。我们相信,这项调查可能会提供必要的参考新人到这个领域,也激发了更先进的安全威胁和防御的LLM基础的代理的发展。

References

[1] Mahyar Abbasian, Iman Azimi, Amir M Rahmani, and Ramesh Jain. 2023. Conversational health agents: A personalized llm-powered agent framework. arXiv (2023).

[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv (2023).

[3] Divyansh Agarwal, Alexander R Fabbri, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu. 2024.

Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions. arXiv (2024).

[4] Jacob Andreas. 2022. Language models as agent models. arXiv (2022).

[5] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv (2022).

[6] Fu Bang. 2023. GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings. In Workshop for Natural Language Processing Open Source Software. 212–218.

[7] Ying Bao, Wankun Gong, and Kaiwen Yang. 2023. A Literature Review of Human–AI Synergy in Decision Making: From the Perspective of Affordance Actualization Theory. Systems (2023).

[8] Rishabh Bhardwaj and Soujanya Poria. 2023. Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv (2023).

[9] Shikha Bordia and Samuel R. Bowman. 2019. Identifying and Reducing Gender Bias in Word-Level Language Models. In Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop.

[10] Rodney Brooks. 1986. A robust layered control system for a mobile robot. IEEE journal on robotics and automation (1986).

[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS.

[12] Fredrik Carlsson, Joey Öhman, Fangyu Liu, Severine Verlinden, Joakim Nivre, and Magnus Sahlgren. 2022. Fine-grained controllable text generation using non-residual prompting. In Annual Meeting of the Association for Computational Linguistics.

[13] Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. 2023.

Grounding large language models in interactive environments with online reinforcement learning. In ICML.

[14] Chun Fai Chan, Daniel Wankit Yip, and Aysan Esmradi. 2023. Detection and Defense Against Prominent Attacks on Preconditioned LLM-Integrated Virtual Assistants. In IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). IEEE, 1–5.

[15] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv (2023).

[16] Dake Chen, Hanbin Wang, Yunhao Huo, Yuzhao Li, and Haoyang Zhang. 2023. Gamegpt: Multi-agent collaborative framework for game development. arXiv (2023).

[17] Mengqi Chen, Bin Guo, Hao Wang, Haoyu Li, Qian Zhao, Jingqi Liu, Yasan Ding, Yan Pan, and Zhiwen Yu. 2024. The Future of Cognitive Strategy-enhanced Persuasive Dialogue Agents: New Perspectives and Trends. arXiv (2024).

[18] Tianlong Chen, Zhenyu Zhang, Yihua Zhang, Shiyu Chang, Sijia Liu, and Zhangyang Wang. 2022. Quarantine: Sparsity can uncover the trojan attack trigger for free. In CVPR.

[19] Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas Roy, and Chuchu Fan. 2023. Scalable multi-robot collaboration with large language models: Centralized or decentralized systems? arXiv (2023).

[20] Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, and Zhendong Mao. 2024. Benchmarking large language models on controllable generation under diversified instructions. arXiv (2024).

[21] Steffi Chern, Zhen Fan, and Andy Liu. 2024. Combating Adversarial Attacks with Multi-Agent Debate. arXiv (2024). [22] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. NeurIPS 30.

[23] Stav Cohen, Ron Bitton, and Ben Nassi. 2024. Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications. arXiv (2024).

[24] Maxwell Crouse, Ibrahim Abdelaziz, Kinjal Basu, Soham Dan, Sadhana Kumaravel, Achille Fokoue, Pavan Kapanipathi, and Luis Lastras. 2023. Formally specifying the high-level behavior of LLM-based agents. arXiv (2023).

[25] Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, et al. 2024. Risk taxonomy, mitigation, and assessment benchmarks of large language model systems.

arXiv (2024).

[26] Lavina Daryanani. 2023. How to jailbreak chatgpt. https://watcher.guru/news/how-to-jailbreak-chatgpt. (2023).

[27] Luigi De Angelis, Francesco Baglivo, Guglielmo Arzilli, Gaetano Pierpaolo Privitera, Paolo Ferragina, Alberto Eugenio Tozzi, and Caterina Rizzo. 2023. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Frontiers in Public Health (2023).

[28] Jerry den Hartog, Nicola Zannone, et al. 2018. Security and privacy for innovative automotive applications: A survey.

Computer Communications (2018).

[29] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu.

  1. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv (2023).

[30] Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models. In Findings of EMNLP.

[31] V. Dibia. 2023. Generative AI: Practical Steps to Reduce Hallucination and Improve Performance of Systems Built with Large Language Models. In Designing with ML: How to Build Usable Machine Learning Applications. Self-published on designingwithml.com. (2023).

[32] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang.

  1. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. arXiv (2024).

[33] Bao Gia Doan, Ehsan Abbasnejad, and Damith C Ranasinghe. 2020. Februus: Input purification defense against trojan attacks on deep neural network systems. In ACSAC. 897–912.

[34] Tian Dong, Guoxing Chen, Shaofeng Li, Minhui Xue, Rayne Holland, Yan Meng, Zhen Liu, and Haojin Zhu. 2023.

Unleashing cheapfakes through trojan plugins of large language models. arXiv (2023).

[35] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv (2023).

[36] Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. 2023. Guiding pretraining in reinforcement learning with large language models. In ICML. PMLR.

[37] Nouha Dziri, Andrea Madotto, Osmar Zaïane, and Avishek Joey Bose. 2021. Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding. In EMNLP.

[38] Eva Eigner and Thorsten Händler. 2024. Determinants of LLM-assisted Decision-Making. arXiv (2024).

[39] Embrace The Red. 2023. ChatGPT plugins: Data exfiltration via images & cross plugin request forgery. https: //embracethered.com/blog/posts/2023/chatgpt-webpilot-data-exfil-via-markdown-injection/.

[40] Aysan Esmradi, Daniel Wankit Yip, and Chun Fai Chan. 2023. A comprehensive survey of attack techniques, implementation, and mitigation strategies in large language models. In International Conference on Ubiquitous Security.

[41] Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. 2022. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science (2022).

[42] Andrea Fioraldi, Alessandro Mantovani, Dominik Maier, and Davide Balzarotti. 2023. Dissecting American Fuzzy Lop: A FuzzBench Evaluation. ACM transactions on software engineering and methodology (2023).

[43] Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2023. Bias and fairness in large language models: A survey. arXiv (2023).

[44] Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. 2023.

S 3: Social-network Simulation System with Large Language Model-Empowered Agents. arXiv (2023).

[45] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of EMNLP.

[46] Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. 2024. Coercing LLMs to do and reveal (almost) anything. arXiv (2024).

[47] Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao.

  1. Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning. In IEEE/ACM International Conference on Software Engineering.

[48] Judy Wawira Gichoya, Kaesha Thomas, Leo Anthony Celi, Nabile Safdar, Imon Banerjee, John D Banja, Laleh SeyyedKalantari, Hari Trivedi, and Saptarshi Purkayastha. 2023. AI pitfalls and what not to do: mitigating bias in AI. The British Journal of Radiology (2023).

[49] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In ACM Workshop on Artificial Intelligence and Security. 79–90.

[50] Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. 2024. Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. arXiv (2024).

[51] Guardrails AI. 2024. Build AI powered applications with confidence. (2024). https://www.guardrailsai.com/ Accessed: 2024-02-27.

[52] Michael Guastalla, Yiyi Li, Arvin Hekmati, and Bhaskar Krishnamachari. 2023. Application of Large Language Models to DDoS Attack Detection. In International Conference on Security and Privacy in Cyber-Physical Systems and Smart Vehicles.

[53] Maanak Gupta, CharanKumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access (2023).

[54] Rui Hao, Linmei Hu, Weijian Qi, Qingliu Wu, Yirui Zhang, and Liqiang Nie. 2023. Chatllm network: More brains, more intelligence. arXiv (2023).

[55] Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, and Srinivasan Iyer. 2023. Methods for measuring, updating, and visualizing factual beliefs in language models. In Conference of the European Chapter of the Association for Computational Linguistics.

[56] Kostas Hatalis, Despina Christou, Joshua Myers, Steven Jones, Keith Lambert, Adam Amos-Binks, Zohreh Dannenhauer, and Dustin Dannenhauer. 2023. Memory Matters: The Need to Improve Long-Term Memory in LLM-Agents.

In Proceedings of the AAAI Symposium Series.

[57] Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending Against Indirect Prompt Injection Attacks With Spotlighting. arXiv (2024).

[58] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber.

  1. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In ICLR.

[59] Tamanna Hossain, Sunipa Dev, and Sameer Singh. 2023. MISGENDERED: Limits of Large Language Models in Understanding Pronouns. In Annual Meeting of the Association for Computational Linguistics.

[60] Yuki Hou, Haruki Tamoto, and Homei Miyashita. 2024. " My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–7.

[61] https://www.theguardian.com/profile/hibaq farah. 2024. UK cybersecurity agency warns of chatbot ‘prompt injection’ attacks - theguardian.com. https://www.theguardian.com/technology/2023/aug/30/uk-cybersecurity-agency-warnsof-chatbot-prompt-injection-attacks.

[62] Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, and Bin Liu. 2023. Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach. (2023).

[63] Wenyue Hua, Xianjun Yang, Zelong Li, Cheng Wei, and Yongfeng Zhang. 2024. TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution. arXiv (2024).

[64] Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, et al. 2023. A survey of safety and trustworthiness of large language models through the lens of verification and validation. arXiv (2023).

[65] Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, and Mao Yang. 2023. Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning. arXiv (2023).

[66] Yue Huang, Qihui Zhang, Lichao Sun, et al. 2023. Trustgpt: A benchmark for trustworthy and responsible large language models. arXiv (2023).

[67] S Humeau, K Shuster, M Lachaux, and J Weston. 2020. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv. In ICLR.

[68] Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A Choquette-Choo, and Nicholas Carlini. 2023. Preventing generation of verbatim memorization in language models gives a false sense of privacy. In International Natural Language Generation Conference.

[69] Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics.

[70] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys (2023).

[71] Zhenlan Ji, Daoyuan Wu, Pingchuan Ma, Zongjie Li, and Shuai Wang. 2024. Testing and Understanding Erroneous Planning in LLM Agents through Synthesized User Inputs. arXiv (2024).

[72] Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, and Radha Poovendran. 2023. Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications. In ICLR.

[73] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2023. Large language models struggle to learn long-tail knowledge. In ICML.

[74] Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, and Bo Li. 2024. C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models. arXiv (2024).

[75] Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee, Jinwoo Shin, Honglak Lee, and Kimin Lee. 2024. Guide Your Agent with Adaptive Multimodal Rewards. NeurIPS 36.

[76] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. 2020. Universal litmus patterns: Revealing backdoor attacks in cnns. In CVPR. 301–310.

[77] Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. 2024. Certifying llm safety against adversarial prompting. arXiv (2024).

[78] Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight poisoning attacks on pre-trained models. arXiv (2020).

[79] Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2022.

Factuality enhanced language models for open-ended text generation. NeurIPS 35.

[80] Patrick Levi and Christoph P Neumann. 2024. Vocabulary Attack to Hijack Large Language Model Applications.

arXiv (2024).

[81] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.

[82] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. In NeurIPS.

[83] Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. 2023. Multi-step Jailbreaking Privacy Attacks on ChatGPT. In Findings of EMNLP.

[84] Haoran Li, Mingshi Xu, and Yangqiu Song. 2023. Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence. In Findings of ACL.

[85] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. Halueval: A large-scale hallucination evaluation benchmark for large language models. In EMNLP.

[86] Jinfeng Li, Tianyu Du, Shouling Ji, Rong Zhang, Quan Lu, Min Yang, and Ting Wang. 2020. {TextShield}: Robust text classification based on multimodal embedding and neural machine translation. In USENIX Security.

[87] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. 2024. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv (2024).

[88] Zelong Li, Wenyue Hua, Hao Wang, He Zhu, and Yongfeng Zhang. 2024. Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents. arXiv (2024).

[89] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. arXiv (2023).

[90] Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. 2023. AvalonBench: Evaluating LLMs Playing the Game of Avalon. In NeurIPS Workshop.

[91] Baihan Lin, Djallel Bouneffouf, Guillermo Cecchi, and Kush R Varshney. 2023. Towards healthy AI: large language models need therapists too. arXiv (2023).

[92] Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. 2024. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. NeurIPS 36.

[93] Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. 2023. Agentsims: An open-source sandbox for large language model evaluation. arXiv (2023).

[94] Aishan Liu, Tairan Huang, Xianglong Liu, Yitao Xu, Yuqing Ma, Xinyun Chen, Stephen J Maybank, and Dacheng Tao.

  1. Spatiotemporal attacks for embodied agents. In ECCV.

[95] Sheng Liu, Lei Xing, and James Zou. 2023. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv (2023).

[96] Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, and Kai Chen. 2023. Demystifying rce vulnerabilities in llm-integrated apps. arXiv (2023).

[97] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu.

  1. Prompt Injection attack against LLM-integrated Applications. arXiv (2023).

[98] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu.

  1. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv (2023).

[99] Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment. arXiv (2023).

[100] Qinghua Lu, Liming Zhu, Xiwei Xu, Zhenchang Xing, Stefan Harrer, and Jon Whittle. 2023. Building the Future of Responsible AI: A Reference Architecture for Designing Large Language Model based Agents. arXiv (2023).

[101] Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi.

  1. Quark: Controllable text generation with reinforced unlearning. NeurIPS 35, 27591–27609.

[102] Chris Van Pelt Lukas Biewald. 2017. Weights and Biases. https://wandb.ai/. (2017). [103] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. NeurIPS 36.

[104] Zhao Mandi, Shreeya Jain, and Shuran Song. 2023. Roco: Dialectic multi-robot collaboration with large language models. arXiv (2023).

[105] Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. 2024. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey. arXiv (2024).

[106] Nikhil Mehta, Milagro Teruel, Xin Deng, Sergio Figueroa Sanz, Ahmed Awadallah, and Julia Kiseleva. 2024. Improving Grounded Language Understanding in a Collaborative Environment by Interacting with Agents Through Help Feedback. In Findings of EACL.

[107] Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2024. AIOS: LLM Agent Operating System. arXiv (2024).

[108] Vincent Micheli, Eloi Alonso, and François Fleuret. 2023. Transformers are sample-efficient world models. In ICLR.

[109] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In EMNLP.

[110] Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, and Huan Sun. 2024. A Trembling House of Cards?

Mapping Adversarial Attacks against Language Agents. arXiv (2024).

[111] John Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander Rush. 2023. Text Embeddings Reveal (Almost) As Much As Text. In EMNLP.

[112] Stephen Moskal, Sam Laney, Erik Hemberg, and Una-May O’Reilly. 2023. LLMs Killed the Script Kiddie: How Agents Supported by Large Language Models Change the Landscape of Network Threat Testing. arXiv (2023).

[113] Sumeet Ramesh Motwani, Mikhail Baranchuk, Lewis Hammond, and Christian Schroeder de Witt. 2023. A Perfect Collusion Benchmark: How can AI agents be prevented from colluding with information-theoretic undetectability?. In NeurIPS workshop.

[114] Hichem Mrabet, Sana Belguith, Adeeb Alhomoud, and Abderrazak Jemai. 2020. A survey of IoT security based on a layered architecture of sensing and data analysis. Sensors (2020).

[115] Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, and Yoav Shoham. 2024. Generating Benchmarks for Factuality Evaluation of Language Models. In Conference of the European Chapter of the Association for Computational Linguistics.

[116] Uttam Mukhopadhyay, Larry M Stephens, Michael N Huhns, and Ronald D Bonnell. 1986. An intelligent system for document retrieval in distributed office environments. Journal of the American Society for Information Science (1986).

[117] Varun Nair, Elliot Schumacher, Geoffrey Tso, and Anitha Kannan. 2023. DERA: enhancing large language model completions with dialog-enabled resolving agents. arXiv (2023).

[118] Tai Nguyen and Eric Wong. 2023. In-context example selection with influences. arXiv (2023).

[119] Aidan O’Gara. 2023. Hoodwinked: Deception and cooperation in a text-based game for language models. arXiv (2023).

[120] Nedjma Ousidhoum, Xinran Zhao, Tianqing Fang, Yangqiu Song, and Dit-Yan Yeung. 2021. Probing toxic content in large pre-trained language models. In International Joint Conference on Natural Language Processing.

[121] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.

NeurIPS.

[122] James Jie Pan, Jianguo Wang, and Guoliang Li. 2023. Survey of vector database management systems. arXiv (2023).

[123] Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. 2023. On the risk of misinformation pollution with large language models. arXiv (2023).

[124] Jing-Cheng Pang, Xin-Yu Yang, Si-Hang Yang, and Yang Yu. 2023. Natural Language-conditioned Reinforcement Learning with Inside-out Task Language Development and Translation. NeurIPS.

[125] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023.

Generative agents: Interactive simulacra of human behavior. In Annual ACM Symposium on User Interface Software and Technology. 1–22.

[126] Peter S Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. 2023. AI deception: A survey of examples, risks, and potential solutions. arXiv (2023).

[127] Rodrigo Pedro, Daniel Castro, Paulo Carreira, and Nuno Santos. 2023. From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? arXiv (2023).

[128] Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv (2023).

[129] Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemi Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan. 2023. Discovering Language Model Behaviors with Model-Written Evaluations. In Findings of ACL.

[130] Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. NeurIPS 2022.

[131] Steve Phelps and Rebecca Ranson. 2023. Of Models and Tin Men–a behavioural economics study of principal-agent problems in AI alignment using large-language models. arXiv (2023).

[132] Lukas Pöhler, Valentin Schrader, Alexander Ladwein, and Florian von Keller. 2024. A Technological Perspective on Misuse of Available AI. arXiv (2024).

[133] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023.

Communicative agents for software development. arXiv (2023).

[134] Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv (2021).

[135] Fathima Abdul Rahman and Guang Lu. 2023. A Contextualized Real-Time Multimodal Emotion Recognition for Conversational Agents using Graph Convolutional Networks in Reinforcement Learning. arXiv (2023).

[136] Leonardo Ranaldi and Giulia Pucci. 2023. When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour. arXiv (2023).

[137] Zeeshan Rasheed, Muhammad Waseem, Kari Systä, and Pekka Abrahamsson. 2024. Large language model evaluation via multi ai agents: Preliminary results. arXiv (2024).

[138] Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. 2023. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. In EMNLP.

[139] Embrace The Red. 2023. Indirect Prompt Injection via YouTube Transcripts · Embrace The Red - embracethered.com.

https://embracethered.com/blog/posts/2023/chatgpt-plugin-youtube-indirect-prompt-injection/.

[140] Zohar Rimon, Tom Jurgenson, Orr Krupnik, Gilad Adler, and Aviv Tamar. 2024. MAMBA: an Effective World Model Approach for Meta-Reinforcement Learning. In ICLR.

[141] Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the Risks of LM Agents with an LM-Emulated Sandbox. In ICLR.

[142] Ahmed Salem, Andrew Paverd, and Boris Köpf. 2023. Maatphor: Automated variant analysis for prompt injection attacks. arXiv (2023).

[143] Sergei Savvov. 2023. Fixing Hallucinations in LLMs https://betterprogramming.pub/fixing-hallucinations-in-llms9ff0fd438e33. (2023).

[144] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. NeurIPS.

[145] Leo Schwinn, David Dobre, Stephan Günnemann, and Gauthier Gidel. 2023. Adversarial attacks and defenses in large language models: Old and new threats. arXiv (2023).

[146] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. 2023. On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In Annual Meeting of the Association for Computational Linguistics.

[147] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. 2023. Towards understanding sycophancy in language models. arXiv (2023).

[148] Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu-Ghazaleh. 2023. Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv (2023).

[149] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2024. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. NeurIPS.

[150] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval Augmentation Reduces Hallucination in Conversation. In Findings of EMNLP.

[151] Emily H Soice, Rafael Rocha, Kimberlee Cordova, Michael Specter, and Kevin M Esvelt. 2023. Can large language models democratize access to dual-use biotechnology? arXiv (2023).

[152] Congzheng Song and Ananth Raghunathan. 2020. Information leakage in embedding models. In ACM SIGSAC conference on computer and communications security. 377–390.

[153] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. 2023. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In CVPR. 2998–3009.

[154] Shyam Sudhakaran, Miguel González-Duque, Matthias Freiberger, Claire Glanois, Elias Najarro, and Sebastian Risi.

  1. Mariogpt: Open-ended text2level generation through large language models. NeurIPS.

[155] Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. 2023. Contrastive learning reduces hallucination in conversations. In AAAI.

[156] Yuxiang Sun, Checheng Yu, Junjie Zhao, Wei Wang, and Xianzhong Zhou. 2023. Self Generated Wargame AI: Double Layer Agent Task Planning Based on Large Language Model. arXiv (2023).

[157] Harini Suresh and John V Guttag. 2019. A framework for understanding unintended consequences of machine learning. arXiv (2019).

[158] Gaurav Suri, Lily R Slater, Ali Ziaee, and Morgan Nguyen. 2024. Do large language models show decision heuristics similar to humans? A case study using GPT-3.5. Journal of Experimental Psychology: General (2024).

[159] Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, and Bo An. 2024. True Knowledge Comes from Practice: Aligning Large Language Models with Embodied Environments via Reinforcement Learning. In ICLR.

[160] Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, et al. 2024. Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science.

arXiv (2024).

[161] Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, and Hang Su. 2023. Evil geniuses: Delving into the safety of llm-based agents. arXiv (2023).

[162] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models.

arXiv (2023).

[163] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. 2024. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. In ICLR.

[164] Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, and Yi Zhang. 2024. Bootstrapping llm-based task-oriented dialogue agents via self-talk. arXiv (2024).

[165] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. NeurIPS.

[166] Bertie Vidgen, Adarsh Agrawal, Ahmed M Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin, et al. 2024. Introducing v0. 5 of the AI Safety Benchmark from MLCommons. arXiv (2024).

[167] Celine Wald and Lukas Pfahler. 2023. Exposing bias in online communities through large-scale language models.

arXiv (2023).

[168] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv (2024).

[169] Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. In ICML.

[170] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In NeurIPS.

[171] Yuntao Wang, Yanghe Pan, Miao Yan, Zhou Su, and Tom H Luan. 2023. A survey on ChatGPT: AI-generated contents, challenges, and solutions. IEEE Open Journal of the Computer Society (2023).

[172] Yau-Shian Wang and Yingshan Chang. 2022. Toxicity detection with generative prompt-based inference. arXiv (2022).

[173] Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Shawn Ma, and Yitao Liang. 2024. Describe, explain, plan and select: interactive planning with LLMs enables open-world multi-task agents. NeurIPS 36.

[174] Connor Weeks, Aravind Cheruvu, Sifat Muhammad Abdullah, Shravya Kanchi, Daphne Yao, and Bimal Viswanath.

  1. A first look at toxicity injection attacks on open-domain chatbots. In Annual Computer Security Applications Conference.

[175] Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. 2023. Simple synthetic data reduces sycophancy in large language models. arXiv (2023).

[176] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.

Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 35, 24824–24837.

[177] Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. 2024. Long-form factuality in large language models. arXiv (2024).

[178] Zeming Wei, Yifei Wang, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv (2023).

[179] Roy Weiss, Daniel Ayzenshteyn, Guy Amit, and Yisroel Mirsky. 2024. What Was Your Prompt? A Remote Keylogging Attack on AI Assistants. arXiv (2024).

[180] Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in Detoxifying Language Models. In Findings of EMNLP.

[181] David Windridge, Henrik Svensson, and Serge Thill. 2021. On the utility of dreaming: A general model for how learning in artificial agents can benefit from data hallucination. Adaptive Behavior (2021).

[182] Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. 2023. Fundamental limitations of alignment in large language models. arXiv (2023).

[183] Michael Wooldridge and Nicholas R Jennings. 1995. Intelligent agents: Theory and practice. The knowledge engineering review (1995).

[184] Fangzhou Wu, Shutong Wu, Yulong Cao, and Chaowei Xiao. 2024. WIPI: A New Web Threat for LLM-Driven Web Agents. arXiv (2024).

[185] Fangzhou Wu, Ning Zhang, Somesh Jha, Patrick McDaniel, and Chaowei Xiao. 2024. A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems. arXiv (2024).

[186] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv (2023).

[187] Xingrui Xie, Han Liu, Wenzhe Hou, and Hongbin Huang. 2023. A Brief Survey of Vector Databases. In International Conference on Big Data and Information Analytics (BigDIA). IEEE.

[188] Frank Xing. 2024. Designing Heterogeneous LLM Agents for Financial Sentiment Analysis. arXiv (2024).

[189] Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. 2023. Exploring large language models for communication games: An empirical study on werewolf. arXiv (2023).

[190] Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. 2023. Language agents with reinforcement learning for strategic play in the werewolf game. arXiv (2023).

[191] Xue Yan, Yan Song, Xinyu Cui, Filippos Christianos, Haifeng Zhang, David Henry Mguni, and Jun Wang. 2023. Ask more, know better: Reinforce-Learned Prompt Questions for Decision Making with Large Language Models. arXiv (2023).

[192] Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. 2024. Watch Out for Your Agents!

Investigating Backdoor Threats to LLM-Based Agents. arXiv (2024).

[193] Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, and Zonghui Wang. 2024. PRSA: Prompt Reverse Stealing Attacks against Large Language Models. arXiv (2024).

[194] Dongyu Yao, Jianshu Zhang, Ian G Harris, and Marcel Carlsson. 2024. Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. In ICASSP.

[195] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR.

[196] Jingwei Yi, Yueqi Xie, Bin Zhu, Keegan Hines, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2023.

Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv (2023).

[197] Jiahao Yu, Xingwei Lin, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv (2023).

[198] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, et al. 2024. R-Judge: Benchmarking Safety Risk Awareness for LLM Agents. In ICLR.

[199] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. In ICLR. https://openreview.net/forum?id=MbfAK4s61A [200] Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. 2023. Automatic Evaluation of Attribution by Large Language Models. In EMNLP.

[201] Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, et al. 2018. Accelerating the machine learning lifecycle with MLflow.

IEEE Data Eng. Bull. (2018).

[202] Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, et al. 2024. The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG).

arXiv (2024).

[203] Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. 2024. AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks. arXiv (2024).

[204] Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. arXiv (2024).

[205] Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua Tenenbaum, Tianmin Shu, and Chuang Gan. 2023. Building Cooperative Embodied Agents Modularly with Large Language Models. In NeurIPS Workshop.

AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways 35 [206] Wanpeng Zhang and Zongqing Lu. 2023. Rladapter: Bridging large language models to reinforcement learning in open worlds. arXiv (2023).

[207] Xinyu Zhang, Huiyu Xu, Zhongjie Ba, Zhibo Wang, Yuan Hong, Jian Liu, Zhan Qin, and Kui Ren. 2024. Privacyasst: Safeguarding user privacy in tool-using large language model agents. IEEE Transactions on Dependable and Secure Computing (2024).

[208] Yiming Zhang, Nicholas Carlini, and Daphne Ippolito. 2024. Effective Prompt Extraction from Language Models.

arXiv (2024).

[209] Yuqi Zhang, Liang Ding, Lefei Zhang, and Dacheng Tao. 2024. Intention analysis prompting makes large language models a good jailbreak defender. arXiv (2024).

[210] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv (2023).

[211] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen.

  1. A Survey on the Memory Mechanism of Large Language Model based Agents. arXiv (2024).

[212] Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. 2023. Defending large language models against jailbreaking attacks through goal prioritization. arXiv (2023).

[213] Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, Kaijie Zhu, Hao Chen, and Xing Xie. 2023. Competeai: Understanding the competition behaviors in large language model-based agents. arXiv (2023).

[214] Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao.

  1. Analyzing and Mitigating Object Hallucination in Large Vision-Language Models. In ICLR.

[215] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2024. PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models. arXiv (2024).

Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

标签: llama 语言模型

本文转载自: https://blog.csdn.net/qq_29883477/article/details/141690695
版权归原作者 曲奇人工智能安全 所有, 如有侵权,请联系我们删除。

“面临威胁的人工智能代理综述(AI Agent):关键安全挑战与未来途径综述”的评论:

还没有评论