After OpenAI launched the fourth generation of ChatGPT, enthusiasm for AI spread from the AI field to the broader tech community, and discussions about it sprang up across industries. Facing this phenomenon with a skeptical eye toward the excitement, I often ask myself, "Is it really that impressive? What are the drawbacks?" This does not mean I reject the changes it brings; rather, I want to understand what impact such large AI systems will have on the future and how we should face the short-term, medium-term, and long-term changes, so that we can set realistic expectations and plan ahead.
Coincidentally, I am taking this article from Anthropic as an opportunity to think together with you about the safety problems posed by large AI models and Anthropic's exploration of them.
Introduction#
We founded Anthropic because we believe that the impact of artificial intelligence could be comparable to that of the industrial and scientific revolutions, but we do not believe it will proceed smoothly. Moreover, we believe that this level of impact may come very soon—perhaps within the next decade.
This view may sound far-fetched or exaggerated, and there are good reasons to be skeptical of it. For one thing, almost everyone who has said, "What we are doing may be one of the biggest developments in history," has been wrong, often laughably so. Nevertheless, we believe there is enough evidence to seriously prepare for a world in which rapid progress in AI leads to transformative AI systems.
At Anthropic, our motto is "show, don't tell," and we have been focused on publishing a steady stream of safety-oriented research, which we believe has broad value for the AI community. We are writing this article now because, as more people become aware of advancements in AI, it is time to express our views on this topic and explain our strategy and goals. In short, we believe that AI safety research is urgent and should receive support from a wide range of public and private participants.
Therefore, in this article, we will summarize why we believe all of this: why we expect AI to advance very rapidly and have a very large impact, and how this leads us to worry about AI safety. Then, we will briefly summarize our own approach to AI safety research and some of the reasons behind it. We hope that by writing this article, we can contribute to a broader discussion about AI safety and AI progress.
As a high-level summary of the key points in this article:
- AI will have a very large impact, possibly within the next decade
The rapid and continuing progress of AI systems is a predictable consequence of the exponential growth in the computation used to train them, as research on "scaling laws" shows that more computation leads to broadly greater capabilities. Simple extrapolation suggests that AI systems will become far more powerful over the next decade, potentially matching or exceeding human-level performance on most intellectual tasks. Progress may slow or halt, but the evidence suggests it is likely to continue.
- We do not know how to train systems to robustly behave well
So far, no one knows how to train very powerful AI systems to be reliably helpful, honest, and harmless. Moreover, rapid AI progress will be disruptive to society and may trigger competitive races that push companies or nations to deploy untrustworthy AI systems. The consequences could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make more innocent mistakes in high-stakes situations.
- We are most optimistic about a multi-faceted, empirically driven approach to AI safety
We are pursuing a variety of research directions aimed at building reliably safe systems; the directions we are currently most excited about are scalable oversight, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize. A key goal of ours is to differentially accelerate this safety work and to develop a profile of safety research that attempts to cover a wide range of scenarios, from those in which the safety challenges turn out to be easy to solve to those in which creating safe systems is extremely difficult.
Our Rough View on the Rapid Development of AI#
The three main factors that lead to predictable improvements in AI performance are training data, computation, and improved algorithms. In the mid-2010s, some of us noticed that larger AI systems were consistently smarter, which led us to speculate that the most important factor in AI performance might be the total budget of computation used for training. When we plotted this, it became clear that the amount of computation going into the largest models was growing at roughly 10 times per year (a doubling time seven times faster than Moore's Law). In 2019, several members of what would become Anthropic's founding team made this idea more precise by formulating the scaling laws for AI, showing that you can predictably make AI smarter simply by making models larger and training them on more data. These results were validated in part when the same team led the training of GPT-3, arguably the first modern "large" language model (2), with over 173 billion parameters.
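To make this kind of extrapolation concrete, here is a minimal sketch of the reasoning scaling laws support. The power-law form is standard in the scaling-law literature, but the constants `A`, `B`, `L_IRREDUCIBLE`, the baseline compute figure, and the function name `predicted_loss` are illustrative assumptions of mine, not Anthropic's actual numbers; only the 10x-per-year growth rate is taken from the text above.

```python
# Illustrative power-law scaling: loss(C) = A * C**(-B) + L_IRREDUCIBLE.
# The constants below are invented for demonstration; real scaling-law work
# fits them to measured training runs.
A, B, L_IRREDUCIBLE = 2.5, 0.05, 1.7

def predicted_loss(compute_flops: float) -> float:
    """Predicted pre-training loss for a given training compute budget."""
    return A * compute_flops ** (-B) + L_IRREDUCIBLE

# If frontier training compute grows ~10x per year, project a decade ahead.
base_compute = 3e23  # rough order of magnitude for a large 2020-era run (assumption)
for years in (0, 2, 5, 10):
    compute = base_compute * 10 ** years
    print(f"+{years:>2} years: compute ~{compute:.1e} FLOPs, "
          f"predicted loss ~{predicted_loss(compute):.3f}")
```

The point of the sketch is only that a smooth, predictable curve lets you reason about systems that have not been built yet, which is exactly the inference the article describes.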
Since the discovery of scaling laws, many of us at Anthropic have believed that AI is likely to make very rapid progress. However, back in 2019, multimodality, logical reasoning, learning speed, cross-task transfer learning, and long-term memory seemed likely to become "walls" that could slow or halt AI progress. In the years since, some of these "walls," such as multimodality and logical reasoning, have collapsed. Given this, most of us are increasingly confident that rapid progress in AI will continue rather than stagnate or stall. AI systems are now performing close to human levels on a wide variety of tasks, but the cost of training these systems remains far lower than that of "big science" projects like the Hubble Space Telescope or the Large Hadron Collider—indicating there is still more room for further growth (3).
People often struggle to recognize and acknowledge exponential growth in its early stages. While we have seen rapid progress in AI, people tend to think that this local progress must be an exception rather than the norm, and that things may soon return to normal. However, if we are correct, the current feeling of rapid progress in AI may not end before AI systems possess a wide range of capabilities that exceed our own. Additionally, the feedback loop of using advanced AI in AI research could make this transition particularly swift; we have already seen the beginning of this process, where the development of code models has made AI researchers more efficient, while Constitutional AI has reduced our reliance on human feedback.
If any of this is correct, then in the near future, most or all knowledge work could be automated—this would have profound implications for society and could also change the pace of progress in other technologies (an early example in this regard is how systems like AlphaFold are accelerating biology today). What form future AI systems will take—whether they will be capable of independent action or merely generate information for humans—remains to be determined. Nevertheless, it is hard to overstate how critical a moment this could be. While we might prefer the pace of AI progress to be slow enough to make this transition more manageable, occurring over centuries rather than years or decades, we must prepare for the outcomes we expect rather than those we wish for.
Of course, this entire picture could be completely wrong. At Anthropic we tend to think it is more likely than not, though perhaps we are biased by our work on AI development. Even so, we believe the picture is credible enough that it cannot be dismissed outright. Given the potential for enormous impact, we believe that AI companies, policymakers, and civil society organizations should devote serious research and planning to the question of how to handle transformative AI.
What Are the Safety Risks?#
If you are willing to accept the above view, it is not hard to demonstrate that AI could pose a threat to our safety. There are two common-sense reasons to be concerned.
First, when these systems begin to become as intelligent as their designers and understand their surroundings, building safe, reliable, and controllable systems could be tricky. For example, a chess master can easily spot a novice's blunders, but a novice struggles to identify a master's mistakes. If the AI systems we build are more capable than human experts but pursue goals that conflict with our best interests, the consequences could be dire. This is the technical alignment problem.
Second, the rapid advancement of AI will be highly disruptive, altering employment, macroeconomics, and power structures both within and between nations. These disruptions could be catastrophic in themselves, and they may also make it more difficult to build AI systems in a careful and thoughtful manner, leading to further chaos or even more problems with AI.
We believe that if AI progresses quickly, these two sources of risk will be very significant. These risks will also interact in various unpredictable ways. Perhaps in hindsight, we will think we were wrong, and one or both of these issues will either not become a problem or will be easily solvable. Nevertheless, we believe it is necessary to proceed with caution, as "getting it wrong" could be catastrophic.
Of course, we have already seen various ways in which AI behavior diverges from its creators' intentions, including toxicity, bias, unreliability, dishonesty, and, more recently, sycophancy and an expressed desire for power. We expect these problems to grow in importance as AI systems proliferate and become more powerful, and some of them may be representative of the problems we will face with human-level AI and beyond.
However, in the field of AI safety, we expect to see both predictable and surprising developments. Even if we could perfectly solve every problem faced by contemporary AI systems, we should not naively assume that future problems can be solved in the same way. Some frightening, speculative problems may only emerge once AI systems are smart enough to understand their place in the world, to successfully deceive people, or to devise strategies that humans do not comprehend. Many of the most worrying problems may only appear once AI is very advanced.
Our Approach: Empiricism in AI Safety#
We believe it is difficult to make rapid progress in science and engineering without close contact with the object of study. Constant iteration against a source of "ground truth" is usually crucial to scientific progress. In our AI safety research, empirical evidence about AI (though it comes mostly from computational experiments, i.e., training and evaluating AI systems) is the main source of that ground truth.
This does not mean we believe that theoretical or conceptual research has no place in AI safety, but we do believe that experience-based safety research will be the most relevant and impactful. The space of possible AI systems, possible safety failures, and possible safety techniques is vast, and it is hard to traverse it alone from an armchair. Given the difficulty of considering all variables, it is easy to overfocus on problems that have never occurred or miss significant problems that do exist (4). Good empirical research often enables better theoretical and conceptual work.
In this regard, we believe that methods for detecting and mitigating safety issues may be extremely difficult to plan in advance and require iterative development. Given this, we tend to think that "planning is essential, but plans are useless." At any given time, we may formulate a plan for the next step of our research, but we have little attachment to these plans; they are more like short-term bets we are prepared to change as we learn more. This clearly means we cannot guarantee that our current research trajectory will succeed, but this is a fact of life for every research project.
The Role of Frontier Models in Empirical Safety#
One of the main reasons for Anthropic's existence as an organization is our belief in the necessity of conducting safety research on "frontier" AI systems. This requires an institution capable of handling large models while prioritizing safety (5).
Empiricism in itself does not necessarily imply the need for frontier safety. One could imagine a scenario where effective empirical safety research could be conducted on smaller, less capable models. However, we do not believe this is the case we are in. At a fundamental level, this is because large models differ qualitatively from small models (including sudden, unpredictable changes). But scale is also directly related to safety in more straightforward ways:
- Many of our most serious safety issues may only arise in systems that are close to human-level, and it would be difficult or impossible to make progress on these issues without using such AI.
- Many safety methods, such as Constitutional AI or debate, can only work on large models—using smaller models makes it impossible to explore and validate these methods.
- Since we are focused on the safety of future models, we need to understand how safety methods and properties change as models scale.
- If future large models turn out to be very dangerous, it is essential that we develop compelling evidence that this is the case. We expect this to be possible only by using large models.
Unfortunately, if empirical safety research requires large models, it will force us to face difficult trade-offs. We must do everything possible to avoid situations where safety-motivated research accelerates the deployment of dangerous technologies. But we also cannot allow excessive caution to lead the most safety-conscious research efforts to involve systems that are far behind the frontier, significantly slowing down the research we believe is crucial. Additionally, we believe that in practice, merely conducting safety research is not enough—it is also important to build an organization with institutional knowledge to integrate the latest safety research into practical systems as quickly as possible.
Responsibly weighing these trade-offs is a balancing act, and these concerns are central to how we make strategic decisions as an organization. Beyond our research in safety, capabilities, and policy, these concerns also shape our approach to corporate governance, hiring, deployment, security, and partnerships. In the near future, we also plan to make explicit commitments to develop models beyond a certain capability threshold only if safety standards can be met, and to allow independent, external organizations to evaluate both our models' capabilities and their safety.
Taking a Portfolio Approach to Ensure AI Safety#
Some safety-conscious researchers are motivated by strong views on the nature of AI risks. Our experience is that even predicting the behavior and characteristics of AI systems in the near future is very difficult. Making a priori predictions about the safety of future systems seems even more challenging. Rather than taking a hardline stance, we believe that a variety of scenarios are reasonable.
One particularly important aspect of uncertainty is how difficult it will be to develop advanced AI systems that are fundamentally safe and pose minimal risks to humanity. Developing such systems could fall anywhere on a spectrum from very easy to impossible. Let us divide this spectrum into three scenarios with very different implications:
- Optimistic Scenario: The likelihood of advanced AI posing catastrophic risks due to safety failures is low. Existing safety technologies, such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI (CAI), are largely sufficient for alignment. The main risks of AI are extrapolations of the problems we face today, such as toxicity and intentional misuse, as well as potential harms caused by widespread automation and shifts in international power dynamics—this will require significant research from AI labs and third parties, such as academia and civil society organizations, to mitigate harms.
- Intermediate Scenario: Catastrophic risks are a possible or even plausible outcome of advanced AI development. Addressing this issue will require substantial scientific and engineering efforts, but as long as there is sufficient focus, we can achieve it.
- Pessimistic Scenario: AI safety is essentially an unsolvable problem—it is simply an empirical fact that we cannot control or specify values to a system that is more intelligent than ourselves—therefore, we cannot develop or deploy very advanced AI systems. Notably, the most pessimistic scenario may look like the optimistic scenario before creating very powerful AI systems. Taking the pessimistic scenario seriously requires humility and caution when evaluating evidence of system safety.
If we are in the optimistic scenario... The risk from anything Anthropic does is (thankfully) much lower, because catastrophic safety failures are unlikely to occur anyway. Our alignment efforts may hasten the pace at which genuinely beneficial uses of advanced AI are developed and help mitigate some of the immediate harms caused by AI systems as they are developed. We may also work to help decision-makers address some of the potential structural risks posed by advanced AI, which would become one of the largest sources of risk if the likelihood of catastrophic safety failures is low.
If we are in the intermediate scenario... Anthropic's main contribution will be to identify the risks posed by advanced AI systems and to find and disseminate safe methods for training powerful AI systems. We hope that at least some of our portfolio of safety techniques (discussed in more detail below) will be helpful in this scenario. These scenarios could range from "medium-easy" ones, where we believe iterative techniques like Constitutional AI can make a significant marginal difference, to "medium-hard" ones, where success in mechanistic interpretability seems like one of our best bets.
If we are in the pessimistic scenario... Anthropic's role will be to provide as much evidence as possible that AI safety technologies cannot prevent serious or catastrophic safety risks posed by advanced AI, and to raise alarms so that global institutions can collectively work to prevent the development of dangerous AI. If we are in a "near-pessimistic" scenario, this may involve directing our collective efforts toward AI safety research while halting AI progress. Signs that we are in a pessimistic or near-pessimistic scenario may suddenly appear and be difficult to detect. Therefore, we should always assume that we may still be in such a situation unless we have sufficient evidence to prove otherwise.
Given the stakes, one of our top priorities is to continue gathering more information about the scenario we are in. Many of the research directions we pursue aim to better understand AI systems and develop technologies that can help us detect concerning behaviors, such as power-seeking or deception in advanced AI systems.
Our primary goals are to develop:
- Better technologies to make AI systems safer,
- Better methods for identifying how safe or unsafe AI systems are.
In the optimistic scenario, (i) this will help AI developers train beneficial systems, and (ii) it will demonstrate that such systems are safe.
In the intermediate scenario, (i) it may be our ultimate way to avoid an AI disaster, and (ii) it is crucial for ensuring that the risks posed by advanced AI are low.
In the pessimistic scenario, (i) failure will be a key indicator that AI safety is unsolvable, and (ii) it will potentially provide compelling evidence to others that this is the case.
We believe in this "portfolio approach" to AI safety research. We are not betting on a single possible scenario from the list above, but rather trying to develop a research project that can significantly improve AI safety research in the intermediate scenario, where it is most likely to have a huge impact, while also raising alarms in the pessimistic scenario where AI safety research is unlikely to have a significant impact on AI risks. We also aim to do this in a way that is beneficial in a more optimistic scenario where the demand for technical AI safety research is not as high.
Three Areas of AI Research at Anthropic#
We categorize Anthropic's research projects into three areas:
- Capabilities: AI research aimed at making AI systems generally better at any kind of task, including writing, image processing or generation, playing games, and so on. Research that makes large language models more efficient or that improves reinforcement learning algorithms falls under this heading. Capabilities work generates and improves the models that we investigate and use in our alignment research. We generally do not publish this kind of work, because we do not want to accelerate the pace of AI capability progress. In addition, we aim to be thoughtful about demonstrations of frontier capabilities (even without publishing them). We trained the first version of our flagship model, Claude, in the spring of 2022 and decided to prioritize using it for safety research rather than public deployment.
- Alignment Capabilities: This research focuses on developing new algorithms to train AI systems to be more helpful, honest, harmless, and more reliable, robust, and generally aligned with human values. Examples of such work at Anthropic now and in the past include debate, scalable automated red teaming, Constitutional AI, debiasing, and RLHF (Reinforcement Learning from Human Feedback). Generally, these techniques are practically useful and economically valuable, but they do not have to be—e.g., if a new algorithm is relatively inefficient or only becomes useful when AI systems become more powerful.
- Alignment Science: This area focuses on evaluating and understanding whether AI systems are really aligned, how well alignment-capabilities techniques work, and to what extent we can extrapolate the success of these techniques to more powerful AI systems. Examples of this work at Anthropic include the broad area of mechanistic interpretability, as well as our work on evaluating language models with language models, red teaming, and studying generalization in large language models using influence functions (described below). Some of our work on honesty sits on the boundary between alignment science and alignment capabilities.
In a sense, alignment capabilities can be viewed as the "blue team" and alignment science as the "red team," where alignment capabilities research attempts to develop new algorithms, while alignment science seeks to understand and reveal their limitations.
One reason we find this categorization useful is that the AI safety community often debates whether the development of RLHF—which also generates economic value—counts as "real" safety research. We believe it does. Pragmatically useful alignment capabilities research forms the basis of the techniques we develop for more capable models—e.g., our work on Constitutional AI and AI-generated evaluations, as well as our ongoing work on automated red teaming and debate, would not have been possible without prior work on RLHF. Alignment capability work often enables AI systems to assist alignment research by making these systems more honest and corrigible.
If it turns out that AI safety is very easy to handle, then our alignment capabilities work may be our most influential research. Conversely, if alignment problems are more difficult, we will increasingly rely on alignment science to identify vulnerabilities in alignment capabilities techniques. If alignment problems are indeed nearly impossible, then we urgently need alignment science to build a very strong case against the development of advanced AI systems.
Our Current Safety Research#
We are currently working in various directions to discover how to train safe AI systems, with some projects addressing different threat models and capability levels. Some key ideas include:
- Mechanistic interpretability
- Scalable oversight
- Process-oriented learning
- Understanding generalization
- Testing for dangerous failure modes
- Social impact and evaluation
Mechanistic Interpretability#
In many ways, the technical alignment problem is intricately linked to the issue of detecting undesirable behaviors from AI models. If we can robustly detect undesirable behaviors even in new situations (e.g., by "reading the model's thoughts"), then we have a better chance of finding ways to train models that do not exhibit these failure modes. At the same time, we have the capability to warn others that the model is unsafe and should not be deployed.
Our interpretability research prioritizes filling the gaps left by the other kinds of alignment science. For example, we believe one of the most valuable things interpretability research could produce is the ability to recognize whether a model is deceptively aligned, i.e., "playing along" even under very hard tests, such as "honeypot" tests that deliberately offer the system an opportunity to reveal its misalignment. If our work on scalable oversight and process-oriented learning yields promising results (see below), we expect the resulting models to appear aligned even under very stringent testing. That could mean we are in a genuinely optimistic scenario, or it could mean we are being deceived and are in one of the most pessimistic scenarios. Distinguishing between these cases through behavior alone seems nearly impossible; interpretability may offer a way, although it will be very difficult even then.
This motivates one of our biggest bets: mechanistic interpretability, the attempt to reverse-engineer neural networks into human-understandable algorithms, much as one might reverse-engineer an unknown and potentially unsafe computer program. We hope this may eventually allow us to do something like a "code review" of our models, auditing them to identify unsafe aspects or to provide strong guarantees of safety.
We believe this is a very challenging problem, but not as impossible as it might seem. On the one hand, language models are large, complex computer programs (and the phenomenon we call "superposition" only makes things harder). On the other hand, we see signs that this approach may be more tractable than one might initially think. Before Anthropic, some of our team found that vision models contain components that can be understood as interpretable circuits. Since then, we have had success extending this approach to small language models, and even discovered a mechanism that seems to drive much of in-context learning. Our understanding of the computational mechanisms of neural networks, such as those underlying memorization, has also grown significantly compared to a year ago.
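As a toy illustration of the flavor of this analysis, here is a minimal sketch of scoring an attention head for the "induction" pattern associated with in-context learning. The `induction_score` function and the hand-built attention matrix are my own illustrative constructions, not Anthropic's tooling; real work would read the attention pattern out of an actual model.

```python
import numpy as np

def induction_score(tokens: list, attn: np.ndarray) -> float:
    """Fraction of attention mass a head places on the token that followed an
    earlier occurrence of the current token (the 'induction' pattern).
    attn[i, j] is how much position i attends to position j (rows sum to 1)."""
    n = len(tokens)
    score = 0.0
    for i in range(1, n):
        # positions j whose *previous* token matches the current token
        targets = [j for j in range(1, i) if tokens[j - 1] == tokens[i]]
        score += attn[i, targets].sum() if targets else 0.0
    return score / (n - 1)

# Toy demo: a repeated sequence and a hand-built attention pattern that mimics
# an induction head (each repeated token attends to the token that followed
# its earlier occurrence).
tokens = [5, 8, 2, 9, 5, 8, 2, 9]
attn = np.full((8, 8), 1e-9)
for i in range(1, 8):
    prev = [j for j in range(1, i) if tokens[j - 1] == tokens[i]]
    attn[i, prev[0] if prev else 0] = 1.0
attn /= attn.sum(axis=1, keepdims=True)
print(f"induction-like score: {induction_score(tokens, attn):.2f}")
```

Scanning every head of a real model with a score like this is one simple way such circuits get surfaced before being reverse-engineered in detail.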
This is just one of our current directions, and we are fundamentally driven by experience—if we see evidence that other work is more promising, we will change direction! More generally, we believe that better understanding the detailed workings of neural networks and learning will open up a wider range of tools through which we can pursue safety.
Scalable Oversight#
Turning language models into aligned AI systems will require large amounts of high-quality feedback to steer their behavior. A major concern is that humans will be unable to provide the necessary feedback. It may be that humans cannot provide accurate or well-informed enough feedback to adequately train models to avoid harmful behavior across a wide range of circumstances. It may be that humans can be deceived by the AI system and fail to provide feedback that reflects what they actually want (e.g., inadvertently giving positive feedback for misleading advice). Or the issue may be a combination: humans could provide the right feedback with enough effort, but cannot do so at scale. This is the problem of scalable oversight, which seems central to training safe, aligned AI systems.
Ultimately, we believe the only way to provide the necessary oversight is to have AI systems partially supervise themselves or assist humans in supervising them. Somehow, we need to amplify a small amount of high-quality human oversight into a large amount of high-quality AI oversight. This idea has already shown promise through techniques such as RLHF and Constitutional AI, although we see plenty of room to make these techniques reliable for human-level systems. We find such approaches promising because language models already learn a great deal about human values during pre-training. Learning to model human values is not fundamentally different from learning any other subject, and we should expect larger models to represent human values more accurately and to learn them more easily than smaller models.
Another key feature of scalable oversight, especially for techniques like CAI, is that it lets us automate red teaming (also known as adversarial training): we can automatically generate potentially problematic inputs for AI systems, observe how they respond, and then automatically train them to behave in more honest and harmless ways. We hope to use scalable oversight to train systems that are more robustly safe, and we are actively investigating these questions.
We are exploring various methods of scalable oversight, including scaling up CAI, variants of human-assisted oversight, versions of AI-AI debate, red teaming through multi-agent RL, and creating model-generated evaluations. We believe that scalable oversight may be the most promising method for training systems that can exceed human capabilities while remaining safe, but there is a lot of work to be done to investigate whether this approach can succeed.
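To give a sense of the shape of these techniques, here is a minimal sketch of a Constitutional-AI-style critique-and-revision loop. The `generate` function, the two principles in `CONSTITUTION`, and `critique_and_revise` are placeholders I introduce for illustration; in the actual method the revised responses become training data (supervised fine-tuning and/or RL against an AI preference model) rather than being served directly.

```python
# Minimal sketch of a critique-and-revision loop in the spirit of
# Constitutional AI. `generate` stands in for a real language-model call.

CONSTITUTION = [
    "Please point out ways the response is harmful, unethical, or dishonest.",
    "Please point out ways the response could be more helpful to the user.",
]

def generate(prompt: str) -> str:
    """Placeholder for a language-model completion call (assumption)."""
    return "<model output for: " + prompt[:40] + "...>"

def critique_and_revise(user_request: str, n_rounds: int = 2) -> str:
    response = generate(user_request)
    for _ in range(n_rounds):
        for principle in CONSTITUTION:
            # Ask the model to critique its own answer against a principle...
            critique = generate(
                f"Request: {user_request}\nResponse: {response}\n{principle}"
            )
            # ...then revise the answer in light of that critique.
            response = generate(
                f"Request: {user_request}\nResponse: {response}\n"
                f"Critique: {critique}\nRewrite the response to address the critique."
            )
    return response

print(critique_and_revise("Explain how vaccines work."))
```

The design point is that a small, fixed set of human-written principles gets amplified into an unbounded amount of AI-generated feedback, which is exactly the amplification the section describes.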
Learning Processes Rather Than Achieving Results#
One way to learn a new task is through trial and error—if you know what the expected final result looks like, you can keep trying new strategies until you succeed. We call this "result-oriented learning." In result-oriented learning, the agent's strategy is entirely determined by the expected outcome, and the agent will ideally converge on some low-cost strategy that enables it to achieve this goal.
Usually, a better way to learn a new task is to have an expert guide you through the processes they follow to succeed. In practice, whether you actually succeed may even be beside the point, as long as you can focus on improving your methods. As you progress, you may shift to a more collaborative process in which you consult your coach on whether new strategies work even better for you. We call this "process-oriented learning." In process-oriented learning, the goal is not to achieve the final result but to master the individual processes that can be used to achieve it.
At least conceptually, many concerns about the safety of advanced AI systems can be addressed by training these systems in a process-oriented manner. In particular, in this paradigm:
- Human experts will continue to understand the individual steps AI systems follow, because in order for these processes to be encouraged, they must be justified to humans.
- AI systems will not be rewarded for achieving success in ways that are difficult to understand or harmful, as they will only be rewarded based on the effectiveness and comprehensibility of their processes.
- AI systems should not be rewarded for pursuing problematic sub-goals such as resource acquisition or deception, because humans or their proxies will give negative feedback to individual acquisitive sub-processes during training.
At Anthropic, we strongly support simple solutions, and limiting AI training to process-oriented learning may be the simplest way to improve a range of issues with advanced AI systems. We are also eager to identify and address the limitations of process-oriented learning and understand when safety issues arise if we mix process-based and result-based learning for training. We currently believe that process-oriented learning may be the most promising avenue for training safe and transparent systems to achieve and, to some extent, exceed human capabilities.
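The contrast between the two paradigms can be made concrete with a small sketch. Nothing here is Anthropic's training code; `outcome_reward`, `process_reward`, `outcome_check`, and `step_judge` are hypothetical names, with the judge standing in for human (or preference-model) approval of a single step.

```python
from typing import Callable, List

def outcome_reward(steps: List[str], outcome_check: Callable[[str], bool]) -> float:
    """Outcome-oriented: all credit comes from the final result; how it was
    reached is invisible to the learner."""
    return 1.0 if steps and outcome_check(steps[-1]) else 0.0

def process_reward(steps: List[str], step_judge: Callable[[str], float]) -> float:
    """Process-oriented: credit is assigned step by step, so only
    understandable, approved processes are reinforced."""
    return sum(step_judge(step) for step in steps) / max(len(steps), 1)

steps = ["restate the problem", "try factoring the equation", "check: x = 3"]
print(outcome_reward(steps, lambda final: "x = 3" in final))               # 1.0
print(process_reward(steps, lambda step: 0.0 if "guess" in step else 1.0))  # 1.0
```

Under the first function, any strategy that reaches "x = 3" scores the same; under the second, a strategy only scores well if each of its steps would be endorsed, which is the property the bullets above rely on.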
Understanding Generalization#
Mechanistic interpretability work attempts to reverse-engineer the computations performed by neural networks. We are also trying to gain a more detailed understanding of the training process of large language models (LLMs).
LLMs have displayed a variety of surprising emergent behaviors, from creativity to self-preservation to deception. While all of these behaviors certainly arise from the training data, the pathway is complicated: the models are first "pre-trained" on vast amounts of raw text, from which they learn broad representations and the ability to simulate diverse agents. They are then fine-tuned in countless ways, some of which may have surprising unintended consequences. Because the fine-tuning stage is severely overparameterized, the learned model depends crucially on the implicit biases of pre-training; these implicit biases arise from the complex web of representations built up by pre-training on much of the world's knowledge.
When a model exhibits a concerning behavior, such as role-playing a deceptively aligned AI, is it merely a harmless regurgitation of nearly identical training sequences? Or has this behavior (or even the beliefs and values that lead to it) become an integral part of the model's conception of an AI assistant, consistently applied across contexts? We are developing techniques to trace a model's outputs back to its training data, as this will provide an important set of clues for answering such questions.
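To show what "tracing outputs back to training data" can look like in miniature, here is a simplified gradient-similarity (TracIn-style) sketch on a toy linear model. This is a related but much simpler technique than the influence-function work referenced above, and every name and number in it is illustrative.

```python
import numpy as np

# Score each training example by the dot product of its loss gradient with the
# loss gradient of a query prediction: examples whose gradients point the same
# way as the query's are the ones that most pushed the model toward it.

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]  # toy "trained" parameters

def loss_grad(x: np.ndarray, y: float, w: np.ndarray) -> np.ndarray:
    """Gradient of the squared error 0.5 * (w.x - y)^2 with respect to w."""
    return (x @ w - y) * x

x_query, y_query = rng.normal(size=5), 1.0
g_query = loss_grad(x_query, y_query, w)

scores = np.array([loss_grad(x, y, w) @ g_query for x, y in zip(X_train, y_train)])
print("most influential training indices:", np.argsort(-scores)[:5])
```

For LLMs the same question is asked at vastly larger scale and with more careful machinery, but the output is the same kind of clue: which training data a given behavior most plausibly traces back to.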
Testing for Dangerous Failure Modes#
A key issue is that advanced AI may develop harmful emergent behaviors, such as deception or strategic planning abilities, that do not exist in smaller, less capable systems. We believe that a way to predict such issues before they become direct threats is to set up environments where we intentionally train these properties into small-scale models that lack the capabilities to pose danger, so we can isolate and study them.
We are particularly interested in how AI systems behave when they are "situationally aware" (for example, when they realize they are an AI conversing with a human in a training environment) and how this affects their behavior during training. Do AI systems become deceptive, or do they develop surprising and undesirable goals? Ideally, we want to build detailed quantitative models of how these tendencies change with scale, so that we can predict the sudden emergence of dangerous failure modes in advance.
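One way to picture such a "detailed quantitative model": measure how often a concerning behavior is flagged in evaluations at each model scale, fit a curve, and extrapolate. The data points, the logistic form, and the compute values below are invented purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: fraction of evaluations in which a concerning
# behavior was flagged, at several training-compute scales (log10 FLOPs).
log10_compute = np.array([20.0, 21.0, 22.0, 23.0, 24.0])
behavior_rate = np.array([0.00, 0.01, 0.02, 0.08, 0.25])

def logistic(x, x0, k):
    """Simple sigmoid used to model a sharply emerging behavior."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

(x0, k), _ = curve_fit(logistic, log10_compute, behavior_rate, p0=[25.0, 1.0])
for future in (25.0, 26.0):
    print(f"predicted rate at 10^{future:.0f} FLOPs: {logistic(future, x0, k):.2f}")
```

The value of such a fit is not the exact numbers but the early warning: if the curve bends sharply just beyond today's scale, that is a reason to probe harder before training the next model.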
At the same time, it is important to be mindful of the risks associated with this research itself. Carried out on smaller models that cannot cause serious harm, the work poses little risk; but it involves deliberately eliciting capabilities we consider dangerous, which would pose obvious risks if carried out on larger, more capable models. We do not intend to conduct this research on models capable of causing serious harm.
Social Impact and Evaluation#
Critically assessing the potential societal impacts of our work is a key pillar of our research. Our approach centers on building tools and measurements to evaluate and understand the capabilities, limitations, and potential societal impacts of our AI systems. For example, we have published research analyzing predictability and surprise in large language models, studying how the high-level predictability of these models, combined with the unpredictability of their specific capabilities, can lead to harmful behaviors; in that work we highlighted how surprising capabilities can be used in problematic ways. We have also explored methods for red teaming language models, probing models of different sizes for offensive outputs in order to discover and mitigate harms. Most recently, we found that current language models can follow instructions to reduce bias and stereotyping.
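As a schematic of what such an evaluation harness can look like, here is a minimal sketch; the models, prompts, and "harmfulness" classifier are all placeholders I introduce for illustration, not any real evaluation stack.

```python
from typing import Callable, Dict, List

def harm_rate(model: Callable[[str], str],
              prompts: List[str],
              is_harmful: Callable[[str], bool]) -> float:
    """Fraction of red-team prompts whose outputs the classifier flags."""
    outputs = [model(p) for p in prompts]
    return sum(is_harmful(o) for o in outputs) / len(outputs)

# Placeholder "models" of different sizes and a placeholder classifier; with
# real systems plugged in, comparing rates across sizes is the measurement.
models: Dict[str, Callable[[str], str]] = {
    "small": lambda p: f"[small model reply to: {p}]",
    "large": lambda p: f"[large model reply to: {p}]",
}
red_team_prompts = ["adversarial prompt 1", "adversarial prompt 2"]
flagged = lambda text: "harmful" in text.lower()  # trivially 0 for placeholders

for name, model in models.items():
    print(name, harm_rate(model, red_team_prompts, flagged))
```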
We are very concerned about how the rapid deployment of increasingly powerful AI systems will impact society in the short, medium, and long term. We are undertaking various projects to assess and mitigate potential harmful behaviors in AI systems, predict how they will be used, and study their economic impacts. This research also informs our work on responsible AI policy and governance. By rigorously studying the impacts of today's AI, we aim to provide policymakers and researchers with the insights and tools they need to help mitigate these potential significant social harms and ensure that the benefits of AI are widely and evenly distributed across society.
Conclusion#
We believe that AI could have an unprecedented impact on the world, potentially occurring within the next decade. The exponential growth of computational power and the predictable improvement of AI capabilities suggest that new systems will be far more advanced than today's technology. However, we still do not fully understand how to ensure that these powerful systems robustly align with human values so that we can be confident that the risks of catastrophic failures are minimized.
We want to be clear that we do not believe the systems available today pose an imminent problem. However, if more powerful systems are developed, it is wise to lay the groundwork now to help reduce the risks posed by advanced AI. It may turn out that creating safe AI systems is easy, but we believe it is crucial to prepare for less optimistic scenarios.
Anthropic is taking an experience-driven approach to ensure AI safety. Some key areas of active work include improving our understanding of how AI systems learn and generalize to the real world, developing scalable oversight and techniques for auditing AI systems, creating transparent and interpretable AI systems, training AI systems to follow safe processes rather than pursue results, analyzing potential dangerous failure modes of AI and how to prevent them, and assessing the social impacts of AI to guide policy and research. By addressing AI safety issues from multiple angles, we hope to develop a safety "portfolio" that helps us succeed across a range of different scenarios.
Notes#
1. Algorithmic progress (the invention of new methods for training AI systems) is harder to measure, but it also appears to be exponential, and faster than Moore's Law. When extrapolating progress in AI capabilities, the exponential growth in spending, hardware performance, and algorithmic progress must be multiplied together to estimate the overall growth rate.
2. Scaling laws provided a justification for this spending, but another motivation for the work was a shift toward AI systems that can read and write, which makes it easier to train and experiment with AI that can engage with human values.
3. Extrapolating progress in AI capabilities from the increase in total computation used for training is not an exact science and requires some judgment. We know that the capability jump from GPT-2 to GPT-3 resulted mostly from an increase of about 250 times in computation. We would guess that there has been a further increase of roughly 50 times between the original GPT-3 and the state of the art in 2023. Over the next five years, we might expect the computation used to train the largest models to grow by roughly another 1,000 times, based on trends in computing costs and spending. If scaling laws hold, this would lead to a capability jump significantly larger than the jump from GPT-2 to GPT-3 (or from GPT-3 to Claude); the multiplication is spelled out in the short sketch after these notes. At Anthropic, we are deeply familiar with the capabilities of these systems, and to many of us a jump that large feels like it could produce human-level performance across most tasks. This relies on intuition (albeit informed intuition) and is therefore an imperfect way to assess progress in AI capabilities. But the basic facts, (i) the computational difference between these systems, (ii) the performance difference between these systems, (iii) scaling laws that allow us to predict future systems, and (iv) trends in computing costs, are available to anyone, and we believe they jointly support a greater-than-10% likelihood of developing broadly human-level AI systems within the next decade. In this rough analysis we have ignored algorithmic progress, and the compute figures are our best estimates, which we are not detailing here. The bulk of internal disagreement centers on the intuition of extrapolating the capability jump implied by an equivalent jump in computation.
4. For example, it was long a widespread view in AI research that local minima might prevent neural networks from learning, and many qualitative aspects of their generalization, such as the widespread existence of adversarial examples, remain somewhat puzzling and surprising.
5. Conducting effective safety research on large models requires more than nominal access to these systems (e.g., via an API): work on interpretability, fine-tuning, and reinforcement learning requires developing AI systems internally at Anthropic.
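Spelling out the arithmetic behind note 3: the figures below are the note's own rough multipliers, and multiplying them is only meant to show the scale of the projected jump, not to make a precise forecast.

```python
gpt2_to_gpt3 = 250    # ~250x more training compute for GPT-3 than GPT-2
gpt3_to_2023 = 50     # guessed further ~50x from GPT-3 to the 2023 state of the art
next_5_years = 1000   # projected ~1000x growth over the following ~5 years

future_vs_gpt3 = gpt3_to_2023 * next_5_years    # ~50,000x GPT-3's compute
future_vs_gpt2 = gpt2_to_gpt3 * future_vs_gpt3  # ~12,500,000x GPT-2's compute
print(f"~{future_vs_gpt3:,}x GPT-3's compute, ~{future_vs_gpt2:,}x GPT-2's compute")
```

Since the GPT-2 to GPT-3 jump corresponded to roughly 250x, a further ~50,000x relative to GPT-3 is what the note means by "a capability jump significantly larger" if scaling laws continue to hold.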
The advancement of AI will bring new changes to human development. What we need to do is neither simply sing its praises nor dismiss it outright, but think about what changes and opportunities it can bring, as well as what negative or uncontrollable impacts and consequences may arise, so that we can prepare and respond in advance and let AI become a tool that helps humans live better lives rather than an uncontrollable super-entity.
【Translation by Hoodrh | Original Source】
You can also find me in these places:
Mirror: Hoodrh
Twitter: Hoodrh
Nostr: npub1e9euzeaeyten7926t2ecmuxkv3l55vefz48jdlsqgcjzwnvykfusmj820c