
Can cybersecurity red teams keep AI safe?
AI safety and security remain a defining issue in artificial intelligence. Learn how AI experts have imported the concept of "red teaming" into their security efforts.
By: James M. Tobin, Edited by: Mitch Jacobson
Published: September 25, 2025
"Red teams" have played a crucial role in the development of cybersecurity strategy for decades. The concept involves assigning a dedicated group known as a "red team" to stage targeted cyberattacks on networks and systems. Their efforts provide a controlled yet effective method for identifying and addressing security risks.
AI red teaming applies similar concepts to artificial intelligence technologies. Explore this fascinating area of AI security research, and discover ways to make AI safer for individual users, businesses, governments, and society.
Key takeaways
- AI red teams work to identify safety, security, and privacy risks in predictive and generative artificial intelligence systems.
- The concept of red teaming originated in cybersecurity and was subsequently applied to AI security in the 2010s.
- Red teams employ the same techniques that real-life hackers and threat actors use to stage attacks, adapting them to the unique safety risks associated with AI.
- Human-led red teams remain a common standard, but recent research shows AI-powered red teams can outperform them in certain scenarios.
Cybersecurity red teams: What do they do?
The concept of "red teaming" first emerged in military contexts. During the Cold War, U.S. military units would conduct safety exercises with a "blue team" representing the United States and a "red team" simulating the Soviet Union.
As cybersecurity developed alongside increasingly advanced computer technologies, it imported the notion of "red teaming" into security testing frameworks. Cybersecurity red teams include ethical hackers, who carry out mock intrusions using real-world attack vectors to identify and close previously undetected vulnerabilities before a real attack can happen.
The following table highlights some of the major objectives and strategies used in cybersecurity red teaming:
| Red teaming technique | Security objective |
|---|---|
| Penetration testing | Probe web applications for coding errors or oversights that create security weaknesses |
| Social engineering | Target unsuspecting human users in a bid to gain login credentials or other forms of unauthorized access |
| Network sniffing | Surveil network activity and identify opportunities to stage attacks |
| Brute-force attacks | Use automated or manual methods of systematically guessing access credentials |
| Continuous automated red teaming | Automate red teaming efforts and conduct testing over an extended period of time |
Are there security concerns unique to AI?
Yes, AI poses multiple security risks that are especially relevant or unique to its associated technologies. While these risks continue to evolve as AI becomes increasingly integrated into business, known security concerns can be grouped into two broad categories:
Technology-based risks
These risks relate to the underlying technologies that power AI and machine learning (ML) systems. They broadly include:
- Data poisoning: This describes deliberate efforts to corrupt training data to create inaccurate, inappropriate, or malicious AI outputs.
- Prompt injection attacks: Attackers can engineer AI prompts to "trick" the system into exhibiting unintended behavior or revealing sensitive information.
- System prompt leaks: The system prompts that guide AI outputs may themselves contain sensitive information, which attackers can target and exploit.
- Unchecked autonomy: AI systems may lack effective oversight, allowing them to abuse their agency and carry out harmful activities.
Technology-based risks
These risks relate to the underlying technologies that power AI and machine learning (ML) systems. They broadly include:
- Data poisoning: This describes deliberate efforts to corrupt training data to create inaccurate, inappropriate, or malicious AI outputs.
- Prompt injection attacks: Attackers can engineer AI prompts to "trick" the system into exhibiting unintended behavior or revealing sensitive information.
- System prompt leaks: The system prompts that guide AI outputs may themselves contain sensitive information, which attackers can target and exploit.
- Unchecked autonomy: AI systems may lack effective oversight, allowing them to abuse their agency and carry out harmful activities.
Emerging and future AI risks
Popular culture has explored concerns over the evolution of artificial general intelligence, which would theoretically surpass even the highest levels of human intelligence and potentially place AI beyond human control. However, many other, more realistic risks are already emerging, and these could become very real and concerning AI safety and security issues in the near future.
Examples include:
- Mass-scale job losses as human workers become redundant, potentially contributing to volatile socioeconomic conditions.
- AI's weaponization in surveillance, profit-driven exploitation, and warfare.
- Serious environmental harm arising from the enormous quantities of electricity required to power AI data centers.
AI safety and security experts also widely believe that malicious actors could use artificial intelligence to develop highly advanced hacking and cyberattacking capabilities. In a worst-case scenario, these could have global impacts.
Red teaming AI: Opportunities and limitations
Current AI red teaming models favor human personnel, led by accomplished interdisciplinary experts and composed of prompt engineers and ethical hackers. While major technology companies have used AI red teams since the late 2010s, the practice remains a rising but underutilized concept.
On the plus side, AI red teaming offers these advantages:
- It makes AI more resilient against common attack vectors, such as data poisoning.
- AI red teaming can reduce bias, enhance system performance, and support compliance.
- Red teaming can help AI developers quantify and measure risks rather than rely on theoretical guesswork.
However, it also has limitations:
- AI red teaming can be subjective, and it remains heavily dependent on the insights and abilities of its human designers.
- It can struggle to replicate an authentic attack environment, especially when limited to a single model.
- Businesses have a general lack of incentive to invest in AI red teams, given the lack of standardized regulatory and compliance frameworks.
Going forward, automated AI red teaming — powered by artificial intelligence itself — can dramatically improve the AI safety and security landscape. Some organizations already use automated approaches, especially for systematic tasks involving pattern matching, but future efficacy will strongly depend on the reliability and resilience of the AI/ML systems that control them.
Data spotlight
In a Cornell University study, automated AI red teaming had a 69.5% success rate vs. 47.6% for human red teams.