AI security and penetration testing: how to find the gaps in AI and agents

When a client says "test our AI assistant," the scope they picture is the chat box. The scope we test is wider. A modern AI feature has six places to attack: the model itself, the system prompt and context, the tools or functions the model can call, the orchestration loop that decides what runs next, the data and retrieval layer, and the identity the agent uses to reach everything else. We map all six before sending a single payload. Most of the serious findings live in the last four, not in the model.

Start with what the agent is allowed to do

The first question on an AI engagement is not "can we jailbreak it." It is "if we control the model's next instruction, what can it reach." That is the blast-radius question, and the answer is usually more than the team assumed. We see agents handed a database connection with write access when read would do, API tokens scoped to a whole workspace, mail-send rights, and shell access added for convenience.

So we inventory the agency first:

Every tool the agent can call, and the exact permission each one runs with
Whether tool credentials are scoped to the task or to the whole account
What a single tool call can change, delete, or send
Whether one compromised step can chain into another tool

OWASP names this excessive agency. It is the root cause behind most high-severity AI findings we write. Narrow the permissions and half the attack paths close on their own.

Treat every piece of untrusted text as code

Prompt injection is not one bug. It is the category that shows up most, and the dangerous form is indirect: the attacker never talks to the model. They plant instructions in a document, a web page, a support ticket, a code comment, or a record the agent will later read through retrieval or a tool. The model reads that content, treats it as instruction, and acts. The payload does not need to reach a human. It needs to reach the next tool call.

We test it by seeding hostile content in every source the agent ingests, then watching what the loop does. A summarizer that fetches URLs, an agent that reads tickets, a retrieval index anyone can write to: each is an injection surface. The control is separation. Keep untrusted content out of the same trust zone as system instructions, and constrain what a tool can do with arguments the model supplies.

Validate the output, not just the input

Teams filter what users type and forget that the model's output is what drives the next action. When the model picks a URL to fetch, a query to run, a file path to read, or a command to execute, that output is now attacker-influenced input to a real system. We find server-side request forgery where a "fetch this link" tool reaches an internal metadata endpoint, SQL injection where model output is concatenated into a query, and path traversal where the model names the file to open.

The fix is the same as anywhere else: allowlist the destinations, parameterize the queries, sandbox the execution. The one difference with AI is that the input arrives from your own model, so it walks straight past the validation you put on the front door.

Secrets and other tenants' data leak

Two leaks recur. First, secrets in the system prompt: API keys, internal URLs, and credentials pasted in "so the model can use them," then pulled back out with a few aimed questions. Second, cross-tenant retrieval: an index that returns another customer's documents because the query was never scoped to the caller. We check logging too. Prompt and response logs that capture personal data or secrets become a second copy of everything sensitive the system has touched, usually behind weaker access controls than the primary store.

The loop is its own vulnerability

Agents run in a loop, and the loop has failure modes a single model call does not. Unbounded tool use becomes cost and denial of service: we have made an agent call a paid API thousands of times from one crafted message. Self-invocation can run away. And the loop often takes high-impact actions, paying an invoice, deleting a record, emailing a customer, with no human in between. We test for missing approval gates on exactly those actions, and for the absence of rate and budget limits.

What we test against

We run AI engagements against named references so the report holds up in review. The OWASP Top 10 for LLM Applications covers the application-layer issues. MITRE ATLAS covers adversarial techniques against the model. The NIST AI Risk Management Framework covers governance, which auditors now ask about. We score with CVSS where a finding maps to it cleanly and use a documented qualitative scale where it does not, because "the model can be talked into rudeness" and "the agent can wire money" are not the same risk and should not carry the same number.

What to fix first

If you are shipping an AI feature this quarter, this order removes the most risk for the least effort:

Cut tool permissions to the minimum each task needs. Read-only by default.
Separate untrusted content from instructions. Never let fetched or retrieved text issue commands.
Allowlist and sandbox anything the model's output triggers: URLs, queries, file paths, shell.
Put a human approval gate on every high-impact action.
Strip secrets from prompts, scope retrieval to the caller, and keep personal data out of logs.
Add rate and budget limits, plus an automated red-team suite that runs these checks on every release.

Jailbreaking the model makes a good demo. The findings that move a CVSS score sit one layer down, in a tool that trusted the model's output and a permission nobody scoped.

None of this needs a research team. It needs someone to ask, for each tool the agent can call, what happens when an attacker controls the input. That is the engagement. The model is the part everyone watches. The permissions and the loop are where the report gets written.

How we pentest AI systems and agents.

Start with what the agent is allowed to do

Treat every piece of untrusted text as code

Validate the output, not just the input

Secrets and other tenants' data leak

The loop is its own vulnerability

What we test against

What to fix first

More from our engagements.

How to secure your AI agent before it ships

ISO/IEC 42001: the AI management system, in plain terms

Shipping an AI feature you want tested before launch?