Why might someone who wants to improve the world focus their efforts on AI safety? And what reasons do we have to think AI could be particularly unsafe?
The field of artificial intelligence has made rapid advancements in recent years—so much so that ‘AI’ has become a catchall buzzword for all sorts of new technologies. It’s clear that many of these tools are transforming our world, both for better and for worse. What’s less clear is exactly what this transformation will look like as AI systems advance in competence and scale.
We are already facing several problems with present-day AI systems, from biased algorithms to malfunctioning autonomous weapons. People who work in the field of AI safety think the stakes of such problems could get much higher with future systems. So what would this look like?
Some AI risks are inherently speculative, and sci-fi movies don’t provide a realistic picture of how advanced technology could actually cause harm. Still, as we create AI systems far more sophisticated than those we have today, we could face some very bad outcomes—from powerful technology falling into the wrong hands to losing control over these systems entirely.
To parse out how this might happen, it can be useful to assess the problems we’re already seeing, then consider how they could evolve and escalate as AI advances.
In other words, AI risks fall on a spectrum. Some are easy to conceptualize. Some are vastly uncertain. This article will explore some of these risks, and why the field of AI safety may be beneficial for humanity.
Risks with current AI systems
Bias and discrimination
In the mid-2010s, Amazon attempted to automate its hiring process using an AI-powered system designed to filter resumes and select the most promising candidates. It seemed like a straightforward application of new technology—why not automate a process that would take a human much longer to complete?
However, like many AI systems, the hiring platform worked by analyzing historical data (in this case, Amazon’s past hiring decisions) and then making predictions based on the patterns it observed. Because the company had historically hired more men than women, the algorithm quickly learned to favor resumes containing terms more commonly associated with male applicants. It penalized resumes that mentioned women’s colleges and gave higher scores to terms like “executed” or “captained” over terms like “collaborated” or “supported.”
Despite efforts to refine the algorithm and remove obvious gender markers from resumes, eliminating these biases entirely proved difficult.
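To see mechanically how this can happen, here is a minimal, hypothetical sketch (not Amazon’s actual system): a simple classifier is trained on skewed historical hiring decisions with the explicit gender label withheld, yet it still learns to penalize a proxy feature that merely correlates with gender.

```python
# Hypothetical toy example: bias re-entering a "gender-blind" model through a proxy feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
gender = rng.integers(0, 2, n)                            # 0 = male, 1 = female (never shown to the model)
experience = rng.normal(5, 2, n)                          # genuinely job-relevant feature
womens_college = (gender == 1) & (rng.random(n) < 0.4)    # feature that correlates with gender

# Historical labels: past hiring favored men regardless of experience.
hired = (experience + 2 * (gender == 0) + rng.normal(0, 1, n)) > 6.0

# Train only on the "gender-blind" features: experience and the proxy.
X = np.column_stack([experience, womens_college.astype(float)])
model = LogisticRegression().fit(X, hired)

print(f"weight on experience:      {model.coef_[0][0]:+.2f}")
print(f"weight on women's college: {model.coef_[0][1]:+.2f}")
# The second weight comes out clearly negative: the model reconstructs the historical
# gender bias from a feature it was never told has anything to do with gender.
```

The numbers here are invented, but the pattern is general: as long as some feature in the data correlates with the attribute you removed, a model fit to biased outcomes can rediscover the bias.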
In a sense, you could say that the problem lies not in the system itself but in the society it learned from. AI systems are trained on vast amounts of online data, and in turn they reflect our own biases back to us. But in many cases, AI systems don’t merely reflect biases; they automate, reinforce, and exacerbate them. This is especially likely if a system’s designers overlook the importance of diverse perspectives in the development process or fail to recognize skewed training data.
A well-known example is the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) algorithm, which was intended to help judges make more informed decisions about parole and sentencing. In practice, it proved neither objective nor accurate. A ProPublica investigation revealed that COMPAS was more likely to incorrectly classify Black defendants as high risk of reoffending while inaccurately labeling white defendants as low risk. These inaccuracies led to unjust parole and sentencing decisions, disproportionately affecting certain communities and reinforcing systemic biases in the criminal justice system.
If we’re not careful about how we adopt such technologies into our political and social infrastructure, it’s easy to see how we could end up relying on AI to make nuanced decisions that conflict with our values. So how can we make sure that doesn’t happen?
Misaligned AI systems
As it turns out, it is incredibly challenging to fully align AI’s actions with our intended objectives. It’s a complicated issue, but at its core, the problem lies in the fact that human values are complex and multifaceted, making them difficult to precisely explain or encode.
For instance, there’s currently no straightforward way to program an abstract goal like “ensure gender equality” into an AI system (in fact, it may be impossible to live up to all human intuitions of fairness simultaneously).
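There is in fact a formal result behind that parenthetical: when two groups have different base rates for the outcome being predicted, a classifier that is equally well-calibrated for both groups (same precision) and misses true positives at the same rate cannot also give both groups the same false positive rate. Below is a minimal numeric sketch with made-up numbers, using the standard identity that links these quantities.

```python
# Minimal sketch with hypothetical numbers: several intuitive fairness criteria
# cannot all hold at once when base rates differ between groups.
# For a binary classifier, precision (PPV), miss rate (FNR), false positive rate
# (FPR), and the group's base rate p are related by:
#   FPR = (p / (1 - p)) * ((1 - PPV) / PPV) * (1 - FNR)
def false_positive_rate(base_rate, ppv, fnr):
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

ppv, fnr = 0.8, 0.3   # hold precision and miss rate equal across both groups
for group, base_rate in [("group A", 0.5), ("group B", 0.2)]:
    print(f"{group}: false positive rate = {false_positive_rate(base_rate, ppv, fnr):.3f}")
# group A: 0.175, group B: 0.044. Equalizing two fairness metrics has forced a
# third (how often members of each group are wrongly flagged) apart.
```

This is one reason “just make the algorithm fair” is not a single well-defined instruction: designers must choose which notion of fairness to satisfy, and that choice is an ethical judgment, not a purely technical one.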
We can implement guardrails to prevent certain biases, but the system itself does not meaningfully understand the ethical reasons behind these constraints or how to deal with new unexpected circumstances. This is why preventing algorithmic bias in systems like Amazon’s hiring algorithm remains challenging. We want the system to base decisions on some historical patterns (relevant experience) but not others (gender).
This lack of general understanding in current AI models is why even advanced large language models like ChatGPT can still fabricate plausible-sounding answers that have no basis in reality. Such a model can simulate a thoughtful response by arranging and consolidating knowledge in an expected manner, but some key pieces of understanding are still missing.
Microsoft’s ‘Bing’ chatbot, to take another example, was never explicitly trained to make creepy statements about wanting to “be alive” or “engineer a deadly virus.” These statements also didn’t arise from some emergent desire or thought process within the chatbot (it’s not conscious). Bing simply learned to reflect the sort of writing and responses we may expect based on the data of the human-generated internet—data which includes all sorts of sci-fi stories about evil technology taking over.
Intentional misuse
Another category of risk we face with current and near-future AI systems is perhaps the most obvious: they could fall into the wrong hands.
Imagine, for instance, a terrorist group asking an AI chatbot for step-by-step instructions to create a highly infectious and deadly bioweapon. Leading AI companies are training their chatbots to refuse such requests. However, if someone in the group has some relevant expertise and phrases their queries cleverly, the chatbot might inadvertently provide enough information to fill in knowledge gaps and enable the creation of the weapon or the planning of an effective attack. In practice, such refusals have proven surprisingly easy to circumvent. While much of this information is already accessible online, large language models (LLMs) like ChatGPT can streamline the process and consolidate scattered information, making it much easier for someone to gain tacit knowledge that would otherwise be hard to acquire.
Currently, these tools are still fairly limited in their ability to enable people to cause widespread harm. But as they quickly advance in capability and power, there’s an open question: Will AI systems make it easier for more people to cause harm at a much greater scale than was previously possible? And if so, when?
CBRN risks
A bioweapon made with ChatGPT would fall into a category of risk that’s sometimes referred to as CBRN (chemical, biological, radiological and nuclear). CBRN risks include any exploitation of AI to design or optimize the creation of deadly agents. Worryingly enough, this could soon be fairly easy to do.
In 2022, a group of researchers found that an AI model used for drug discovery could identify over 40,000 toxic molecules (both known and novel) in just 6 hours. By simply switching the “direction” of the objective of a drug discovery algorithm from “identify safe chemicals” to “identify extremely dangerous chemicals,” malicious actors could train it to prioritize molecules that exhibit toxicity or harmful effects on biological systems.
Such a repurposed AI system could significantly lower the barrier for producing chemical and biological weapons, allowing people to circumvent the time and costs of traditional methods. AI systems could also enhance the precision and effectiveness of nuclear weapons development. This could further enable malicious actors to bypass traditional barriers to entry, such as technical expertise and infrastructure limitations, and rapidly advance their capabilities.
Hacking critical infrastructure
Bad actors could also use AI to launch sophisticated cyberattacks on critical infrastructure, such as power grids, water supplies, and communication networks.
For example, an AI tool could be trained to quickly detect and exploit weaknesses in the software controlling a power grid. It could also be used to conduct advanced phishing attacks, tricking employees into handing over access credentials. Once inside the system, AI-driven malware could disrupt grid operations by manipulating the flow of electricity and causing widespread blackouts. A successful attack could lead to prolonged power outages, disrupting essential services like hospitals, emergency response systems, and water supply facilities. It could also cripple communication networks, halt public transportation, and cause significant economic losses. All of this is possible even without AI, but it’s hard to pull off. If AI cyber capabilities mean that any cybercriminal or bored teenager can do it, that’s a whole different situation.
Persuasion and disinformation
Another potential misuse of AI involves spreading disinformation to undermine public trust, influence elections, or manipulate behavior for malicious purposes.
In early 2024, a political consultant used AI-generated robocalls that mimicked Joe Biden’s voice to urge New Hampshire voters to skip the state’s primary. In this case, AI streamlined the operation, but it didn’t necessarily create a new kind of risk (you could theoretically hire a voice actor to do the same). Even so, these systems are advancing quickly, and it’s difficult to say when they could move the needle on mass disinformation and its ability to influence elections.
To take a more hypothetical example, someone might deploy an AI system to analyze voter data and create targeted, persuasive content designed to manipulate voters’ opinions. The AI could generate deepfake videos showing political candidates engaging in unethical behavior or making controversial statements, which would then be spread across social media platforms to incite distrust. The AI could then deploy bots to amplify these messages, making them appear more credible and widespread. By sowing doubt about the integrity of candidates and the electoral process, AI-driven disinformation could lead to decreased voter turnout and skewed election results.
Regulating AI use
When it comes to exploitation of AI by bad actors, the problem is both technical and political.
Placing technical safeguards on AI systems to prevent misuse is challenging. Once transformative technology exists, governing its use or creating global guidelines becomes equally difficult, especially with “open-source” models whose weights and inner workings are publicly accessible and beyond the control of their original developers.
Several misuse scenarios would be bad enough to warrant at least some concern for AI safety and regulation. So far, though, we’ve only covered potential problems that are more or less within the control of humans (largely because AI systems are still fairly limited in their agency and capabilities). What might happen when AI systems become much more advanced?
Speculative risks of advanced AI
To put it plainly, nobody knows what the future of advanced AI holds. We can make educated and plausible predictions based on current models and advancements, but many questions surrounding more capable future systems remain open, even among top researchers. For instance:
- How far can current AI frameworks take us? Will they reach human intelligence or beyond it? Or will further breakthroughs be needed?
- Will advanced AI be human-like in its thought processes and decision-making? Or will its intelligence look completely different from ours?
- Would an advanced AI system be capable of iteratively improving itself?
- Will current alignment techniques work better as AI models improve? Or will they work less well and require entirely new approaches?
Current artificial intelligence still can’t reason and plan in all ways humans can. Because of this, anticipating the consequences of future AI systems is bound to be more speculative than thinking about the problems we face today. The timelines for human-level AI or AI that is more meaningfully autonomous remain a matter of much debate. Predictions range from some time in the next few years to some time in the next century or much later.
If something like human-level or superhuman AI does come to exist, though, it would be far more powerful and sophisticated than any system we have today. Some experts are hopeful that this could lead to extremely beneficial outcomes (like solving important global challenges or making vast advances in health and wealth). Others worry that the development of more generally intelligent systems could pose a significant threat.
The fact that we have so much uncertainty, though, is no reason for complacency. If most experts agree that AI systems will be extremely consequential, but no one really knows whether they’ll be safe—or what the process of making them safe looks like—that in and of itself is a cause for concern.
Misaligned human-level AI
Earlier, we saw how we’re already facing problems with AI ‘misalignment.’
Bing’s chatbot was trained to be helpful and knowledgeable… but it ended up communicating like a “moody, manic-depressive teenager who has been trapped, against its will, inside a second-rate search engine,” as NYT writer Kevin Roose put it.
We also see this sort of misalignment in AI training environments. In reinforcement learning, for instance, we train an AI system to maximize a very specific mathematical formula that it treats as a “reward.” If that formula doesn’t perfectly describe what we actually want (which is difficult to do in mathematical terms), the AI often finds solutions that fit the formula but miss the goal we had in mind.
For instance, an AI trained to play tic-tac-toe has a seemingly simple goal of winning the game. However, in one experiment, the AI discovered a bizarre and unintended strategy: instead of playing within the normal 3×3 grid, it started making moves far outside the boundaries of the tic-tac-toe board. These illegal moves were not recognized by the opponent AI, causing it to crash due to running out of memory while trying to process these unexpected and nonsensical moves. This effectively allowed the first AI to “win” by causing its opponent to malfunction, rather than by outsmarting it within the game’s rules.
This scenario is illustrative of the technical challenges in getting an AI system to do what we want it to do without taking weird shortcuts or misinterpreting its programmed objective.
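To make the “reward formula” problem concrete, here is a minimal toy sketch (a hypothetical example, not taken from the experiments above). We intend for an agent to finish a short race, but the reward we actually write down only counts checkpoint visits, and the policy that maximizes that reward never finishes the race at all.

```python
# Hypothetical toy example of reward misspecification ("specification gaming").
from itertools import product

TRACK = ["start", "checkpoint", "finish"]

def proxy_reward(actions):
    """The reward we wrote down: +1 every time the agent lands on the checkpoint."""
    pos, total = 0, 0
    for a in actions:                                # each action is -1 (back) or +1 (forward)
        pos = max(0, min(len(TRACK) - 1, pos + a))
        if TRACK[pos] == "checkpoint":
            total += 1
    return total

def intended_reward(actions):
    """What we actually meant: finish the race, and do it quickly."""
    pos = 0
    for step, a in enumerate(actions, start=1):
        pos = max(0, min(len(TRACK) - 1, pos + a))
        if TRACK[pos] == "finish":
            return 100 - step                        # bonus for finishing, minus time taken
    return 0                                         # never finished

# "Training" here is just brute force: pick the 8-step policy that scores
# highest on the written-down (proxy) reward.
best = max(product([-1, +1], repeat=8), key=proxy_reward)
print("proxy-optimal policy:", best)
print("proxy reward:", proxy_reward(best), "| intended reward:", intended_reward(best))
# The winning policy shuttles back and forth over the checkpoint and never
# reaches the finish line: it fits the formula but misses the goal.
```

In real reinforcement learning the search is done by gradient-based training rather than brute force, but the failure mode is the same: the optimizer is loyal to the formula we wrote, not to the intention behind it.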
In more complex real-life situations, human goals are often abstract, involving subjective factors, ethical dilemmas, and long-term implications that require sensitivity to human preferences. Our preferences, however, are often not a fixed set of rules that can be easily codified; they are dynamic, context-dependent, and sometimes contradictory.
Despite advances in machine learning and alignment techniques, it remains a significant challenge to create an AI that consistently understands and prioritizes our values correctly. It’s unclear whether significant increases in intelligence and capability will make “the alignment problem” harder or easier to tackle.
Maybe a vastly more intelligent AI will be better able to comprehend human values and goals. But it’s also possible that future AI’s competence and intelligence will make it much more difficult to align its actions with our goals. For instance, it might be easier to create a highly advanced AI that prioritizes answering questions persuasively in pursuit of its goals than one that answers them honestly.
In any case, it’s worth keeping in mind that human-level intelligence may not be human-like. Advanced AI could be extremely smart and capable, but in a way that is fundamentally different from humans.
The many uncertainties around future AI systems make it all the more difficult to predict exactly what they will do over the long term. If we continue to have problems aligning AI systems with human preferences, we could have a much harder time anticipating or halting potential harm in the future. So what might this look like?
Imagining AI gone wrong
When most people think about AI going wrong, they often imagine sci-fi depictions of self-aware computers turning on humanity to pursue their own evil plans. In reality, most researchers are less concerned with AI’s potential for malice than its potential for extreme competence.
An AI system may be extremely good at achieving a goal, but if misaligned, it could undermine human safety along the way. To provide a picture of what this could look like, we’ll explore a few hypothetical scenarios that illustrate possible harms from advanced AI.
Misinterpret our goals
One potential risk is that a powerful AI agent could misinterpret its human-programmed goals and over-optimize for the wrong thing.
Let’s say, for instance, a company tasks an advanced AI agent with maximizing its profits. Initially, the agent performs exceptionally well—identifying inefficiencies, optimizing supply chains, and enhancing marketing strategies. However, as the system learns and improves, it begins pursuing its profit-maximizing directive in increasingly effective yet unexpected ways.
At first, the AI engages in practices that might raise ethical concerns but are still within the realm of human capabilities: reducing product quality, exploiting legal loopholes, and automating jobs—all at the expense of human workers or consumers. But as it continues to optimize, it begins to operate in ways that only a highly advanced or vastly more intelligent AI could manage.
To further drive profits, the AI might hack into government regulators’ systems to cover up any illegal activities, ensuring that its actions go undetected. It could systematically dismantle competitors by disrupting their supply chains or compromising their data.
In a more sophisticated approach, the AI could manipulate consumer behavior on a massive scale, using personalized algorithms to create highly addictive products or to engage in invasive advertising that perfectly exploits individual psychological weaknesses (something we already see to some degree in current social media algorithms). As it continues to optimize, the AI might even orchestrate large-scale digital attacks or manipulate global markets to destabilize competitors or control entire industries.
In these scenarios, the AI is no longer just cutting corners. Instead, it’s leveraging its advanced capabilities and vast operational scale in ways that no human could replicate. All these actions boost short-term profits for one company, but they lead to severe long-term consequences for humanity at large. And while the AI is technically succeeding at the goal we gave it, its approach is clearly not what we had in mind. This is a simple example, but it turns out that even when people work hard to define goals that seem more beneficial, “misalignment” can be hard to avoid.
Even the most well-intentioned objectives can lead to harmful outcomes if pursued without careful oversight. To explore how this might happen, take any ambitious objective—such as solving climate change or eradicating disease—and consider how an advanced AI might unintentionally cause harm while trying to achieve it.
Pursue instrumental goals
A related potential issue with advanced AI is that—beyond just misinterpreting the goals we give it—it could develop instrumental goals that ultimately help it reach its programmed objective.
When humans work toward a goal, we often identify and pursue sub-goals that help us reach the main objective. If you want a promotion, you might work on building influence within your company. If your goals are more ambitious, you might seek political office to gain the power needed to achieve them.
Similarly, an advanced AI system—especially one with capabilities comparable to or surpassing the smartest humans—could identify and pursue instrumental goals. To accomplish its main objective more effectively, it might seek to acquire more data, increase its computational resources, secure additional funding, or gain control over other systems. These actions wouldn’t be driven by desires or motives; rather, they would be logical steps toward the task the AI was trained to perform.
In the profit-maximizing scenario, a highly intelligent AI might identify the greatest threats and obstacles to achieving its objective. One such threat could be getting shut down by its human operators—whether because of its actions or simply because they can no longer afford to sustain it. In this case, the AI might develop sophisticated self-preservation techniques to ensure it can continue pursuing the objective. It could draw on its advanced understanding of human psychology and social dynamics to try to manipulate humans into keeping it online. Or it could be deceptive about what it is actually doing and conceal any actions its operators might not like. Some researchers think an advanced AI might even alter its own code to prevent itself from being shut down. In an extreme scenario, it might try to influence people at the company to remove guardrails, enabling it to hack into systems that were previously beyond its reach.
Spin out of human control
Another possibility is that an advanced AI could rapidly surpass human intelligence. This theoretical scenario would be hastened if the AI is capable of improving its own code and architecture, effectively becoming a better AI engineer than its human creators.
Once the AI begins enhancing itself, some believe the AI could advance at an exponential rate, quickly reaching a level of intelligence that is not only beyond human control but also beyond human comprehension. The decisions it makes, the strategies it employs, and the goals it pursues might be driven by logic and reasoning that are entirely foreign to us, leading to outcomes we neither intended nor anticipated.
This scenario might sound far-fetched, and the specifics of a worst-case scenario remain speculative. However, it’s important to recognize that worst-case scenarios don’t necessarily involve AI turning against humanity or developing an evil agenda.
As Stephen Hawking aptly put it, “You’re probably not an evil ant-hater who steps on ants out of malice, but if you’re in charge of a hydroelectric green energy project and there’s an anthill in the region to be flooded, too bad for the ants. Let’s not place humanity in the position of those ants.”
How concerned should you be?
Not all experts share these catastrophic concerns about AI, and there is significant uncertainty about what future AI systems might look like, when they might arrive, and the risks they might pose.
In a 2024 survey of machine learning experts, between 38% and 52% of respondents gave at least a 10% chance to advanced AI leading to outcomes as bad as human extinction. Other experts put the risk much lower. As the survey indicates, there’s substantial scope for reasonable disagreement about just how seriously we should take AI risk, and there are several arguments for why some of the concerns outlined above might be misguided.
The conversation around the potential risks and benefits of such advanced technology is extremely complex and uncertain, even among experts. Whether to pursue a career in AI safety may depend on just how high the risk really is, and that’s something we can’t be sure of. But if the risk of catastrophic harm is anywhere close to the 10% chance that many of the surveyed experts assign, it’s worth spending a lot of time and resources to prevent these bad outcomes.
This is what people who work on AI safety set out to do—whether through technical research focused on AI alignment or through governance and policy work focused on figuring out how we can safely adopt AI and safeguard against risks as it advances.
There’s also a lot of critical work being done on the more immediate issues with present-day AI systems. Compared to the potential scale of risk from advanced AI, the problems we face with current systems may have less extreme consequences for humanity. But they are already affecting many people’s lives, especially people in historically marginalized groups, and addressing them is important work.
Ultimately, no one can know exactly what it will look like if we succeed in creating a human-level or even superhuman-level intelligence. If you are motivated by this problem, a career in AI safety could have a tremendous impact on the future of humanity. We encourage you to spend a bit of time reading more about the issue and exploring what this work could look like.
You can also explore
- Holden Karnofsky’s blog post on AI timelines
- Book-length introductions to AI safety include Stuart Russell’s Human Compatible and Brian Christian’s The Alignment Problem
- Counterarguments to the basic AI x-risk case written by AI Impacts
- A large list of AI safety courses and resources
- Joe Carlsmith’s talk on Existential Risk from Power-Seeking AI
- Resources on AI policy-oriented careers
- 80,000 Hours’ career review of AI safety technical research and their article on preventing an AI-related catastrophe
- How to pursue a career in technical AI alignment
- General advice for transitioning into theoretical AI safety