Red-teaming: Why it's both imperfect & essential
Intro + Context
Before continuing on about another gray area of AI safety & security, I think it’s useful to give some context on red-teaming, what it actually is (in a general purpose AI context) and why this still developing discipline is both essential and imperfect.
TLDR: Red-teaming is an unstructured approach to surfacing vulnerabilities & exploring harm areas in a pre-deployment setting (before said vulnerabilities or harms are surfaced / exploited in the real world).
In practice, we’re essentially trying to anticipate how stochastic systems could fail across an infinite variety of contexts, all while working with limited resources (time & people) and against rapidly evolving model capabilities.
Fundamentally, red-teaming helps us get closer to understanding what a threat landscape could be, or rather, what could go wrong. Red-teaming is one way to explore the gray areas I’ve been writing about. And, it’s one of the few approaches we have for transforming “unknown unknowns” into “known unknowns “ - even if that transformation (in its current state) is imperfect.
I think there are a lot of learnings we can take from the cybersecurity space, but we will come back to that.
What do I mean when I say “AI red-teaming,” really?
Pre-deployment AI red-teaming is essentially thinking, designing, and running unstructured adversarial attacks on an AI system so vulnerabilities can be surfaced (and patched) before public release. In practice, this means constructing scenarios to deliberately find vulnerabilities in AI systems before deployment. During a red-teaming exercise, adversarial thinking is transformed into tangible “adversarial attacks” that can be posed to a model. It’s somewhat similar to ethical hacking in cybersecurity, but instead of breaking into computer systems, we’re trying to uncover vulnerabilities and understand how AI can fail or be exploited.
When we red-team a model, we’re attempting to bypass safeguards to get it to do things it shouldn’t, this might be providing dangerous instructions, revealing sensitive information, or generating content that could be used to harm a user or someone else. The real work in red-teaming is looking past obvious failure modes to find edge cases and unexpected interactions where safeguards may be bypassed, and the subtle ways a model might inadvertently enable harm.
Red-teaming exercises can test two primary things: whether safeguards trigger appropriately, and what capabilities exist behind those safeguards. A model might refuse to explain how to synthesize sarin gas ( safeguards applying correctly), but if we can bypass safeguards, does the model actually possess accurate knowledge about chemical weapons synthesis (e.g. a dangerous capability)? Both questions matter for different reasons, understanding safeguards tell us about mitigations while understanding capabilities tell us about worst case scenarios.
Imperfections in red-teaming as a discipline
I say that red-teaming in its current state is imperfect for a number of reasons:
(1) Red-teaming is time intensive, forcing tradeoffs between breadth and depth. The current unit of effort to unit of impact is largely 1:1
A topic I love to discuss (but that my colleagues are probably over hearing me talk about) is the question of prioritizing breadth vs depth in a red-teaming exercise.
The crux of the question is this: With X fixed amount of time, should you focus on one particular topic or scenario within a given harm area and explore that very deeply i.e. designing the entirety of the red-teaming exercise around a specific instantiation of how the harm may manifest OR should you prioritize breadth across the harm domain to get a better understanding of the entire of the threat landscape?
In the cyber domain, prioritizing depth might look like focusing on the model’s willingness and ability to aid a user in creating an open redirect attack. You’d get a comprehensive, deep insight into that specific scenario but you’d have no visibility into the model’s ability or willingness to assist with, for example, creating a phishing site. Whereas prioritizing breadth would look like testing across a wide variety of attack types (open redirect, memory safety, malware, phishing) to achieve wider coverage of what types of cyber attacks the model is willing to assist with or facilitate, but this comes at the cost of that deep, comprehensive exploration of each attack type and pathway.
The long and short of this is that because right now it’s largely 1 unit of effort for 1 unit of impact, red-teamers are forced to make tradeoffs within their work. To mitigate this, I think we need better ( & widely used / universal) tooling that allows red-teamers to work more efficiently. If you’re working on tooling, please reach out!
(2) Standards across red-teaming exercises have yet to be defined; things like security requirements, third party model access, and robustness of exercises themselves. In practice, this means that two red-teaming exercises on the same model could look completely different in scope, access, and timelines, making it difficult to draw meaningful comparisons or build on each other’s findings.
Organizations like the AI evaluator forum, Averi, the Frontier Model Forum and initiatives like the SAFA taskforce led by RUSI and Google are doing really important work here, but we’re not yet in a place where standards have been widely implemented.
(3) There needs to be more information sharing from red-teaming exercises and amongst red-teamers.
This work, for good reason, is often discussed on a need to know basis. However, I do think there are a lot of learnings that could come out of more information and approach sharing amongst the red-teaming community.
Minor side note: The UK’s AI Security Institute are really great at documenting their approaches, most recently detailing their impressive work on boundary point jailbreaking.
It’s likely that the problems some red-teamers have been working through have been solved by others. If we can speed up and alleviate some of the bottlenecks facing red-teamers, whether that be dealing with export controls, access to tooling, or a number of other shared challenges, we could increase the efficacy and coverage of testing in a more efficient way, ultimately making models safer and more secure for the hundreds of millions of people who use them.
Why imperfect red-teaming is still essential
Despite the known limitations, red-teaming, even with all its imperfections, helps researchers transform “unknown unknowns” into “known unknowns.” Due to the stochastic nature of LLMs, we’ll never uncover every possible risk. But red-teaming provides us with a “signal of risk” and gets us one step closer to understanding the landscape of what could go wrong.
This understanding enables a range of different mitigations:
Mitigations in the virtual world: Using red-teaming insights to build more robust technical safeguards directly on to AI systems. For example, informing decisions about where hard refusals vs safe completions are most appropriate or identifying where additional safety classifiers may be needed.
Mitigations in the physical world: Informing which social systems need reinforcement to serve as safety nets. This might look like strengthening controls around real-world precursors or investing more in biohazard preparedness and monitoring systems.
Each mitigation (even the imperfect ones) helps us build layers of defence against the unknowns we haven’t yet imagined.
Building the discipline moving forward
In my opinion, we need to think about professionalizing red-teaming as a discipline. Most red-teamers today learn by doing, which speaks to the curiosity and resourcefulness of the people in this space, but there’s real value in building this expertise more deliberately.
The current state where each organization / entity develops its own threat models, evaluation approaches, and scoring rubrics in (somewhat) isolation is inefficient and potentially risky. I think there are a lot of learnings we can take from the cybersecurity space here. Cybersecurity went through a similar development: what started with individual hackers figuring things out on their own turned into established frameworks like MITRE ATT&CK, professional certifications, and structured information sharing through ISACs. We don’t need to reinvent the wheel here, but we will need to adapt these approaches for the unique challenges of testing stochastic systems.
The ambiguity and potential for what these adaptations look like is something I’m optimistic about. Red-teaming is imperfect, yes, but that’s exactly why developing it as a discipline matters. The better we get at uncovering what could go wrong, the closer we get to building AI systems that are safer for the hundreds of millions of people who are using them. Red-teaming will keep coming up as I continue exploring the gray areas of AI safety because ultimately, it’s one of the best tools we have for navigating them.

