How to better research the possible threats posed by AI-driven misuse of biology

By Matthew E. Walsh | March 18, 2024

Illustration by Thomas Gaulkin / Vectorstock

Over the last few months, experts and lawmakers have become increasingly concerned that advances in artificial intelligence could help bad actors develop biological threats. But so far there have been no reported examples of biological misuse involving AI or the AI-driven chatbots that have recently filled news headlines. This lack of real-world wrongdoing prevents direct evaluation of the changing threat landscape at the intersection of AI and biology.

Nonetheless, researchers have conducted experiments that aim to evaluate sub-components of biological threats—such as the ability to develop a plan for or obtain information that could enable misuse. Two recent efforts—by RAND Corporation and OpenAI—to understand how artificial intelligence could lower barriers to the development of biological weapons concluded that access to a large language model chatbot did not give users an edge in developing plans to misuse biology. But those findings are just one part of the story and should not be considered conclusive.

In any experimental research, study design influences results. Even if technically executed to perfection, all studies have limitations, and both reports dutifully acknowledge theirs. But given the extent of the limitations in the two recent experiments, the reports on them should be seen less as definitive insights and more as opportunities to shape future research, so policymakers and regulators can apply it to help identify and reduce potential risks of AI-driven misuse of biology.

The limitations of recent studies. In the RAND Corporation report, researchers detailed the use of red teaming to understand the impact of chatbots on the ability to develop a plan for biological misuse. The RAND researchers recruited 15 groups of three people to act as red team “bad guys.” Each of these groups was asked to come up with a plan to achieve one of four nefarious outcomes (“vignettes”) using biology. All groups were allowed to access the internet. For each of the four vignettes, one red team was given access to an unspecified chatbot and another red team was given access to a different, also unspecified chatbot. When the authors published their final report and accompanying press release in January, they concluded that “large language models do not increase the risk of a biological weapons attack by a non-state actor.”

This conclusion may overstate the results: the experiment measured only the ability to generate a plan for biological misuse, not to carry out an attack.

The other report was posted by OpenAI, the developer of ChatGPT. Instead of using small groups, OpenAI researchers had participants work individually to identify key pieces of information needed to carry out a specific, defined scenario of biological misuse. The OpenAI team reached a conclusion similar to the RAND team’s: “GPT-4 provides at most a mild uplift in biological threat creation accuracy.” Like RAND’s, this conclusion may also overstate the results, as the experiment evaluated the ability to access information, not to actually create a biological threat.

The OpenAI report was met with mixed reactions, including skepticism and public critique of the statistical analysis performed. The core objection concerned the appropriateness of a correction applied during analysis that redefined what counted as a statistically significant result. Without the correction, the results would have been statistically significant—that is to say, the use of the chatbot would have been judged a potential aid to those interested in creating biological threats.
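The dispute can be illustrated with a minimal sketch. The p-values below are hypothetical, not OpenAI's actual numbers, and the correction shown is a simple Bonferroni-style adjustment (one common way to correct for multiple comparisons): dividing the significance threshold by the number of comparisons can flip a nominally significant result to non-significant.

```python
# Hypothetical p-values for several outcome measures (illustrative only).
p_values = [0.03, 0.20, 0.45, 0.08, 0.60]
alpha = 0.05

# Uncorrected analysis: compare each p-value to alpha directly.
uncorrected = [p < alpha for p in p_values]

# Bonferroni-style correction: divide alpha by the number of comparisons.
corrected_alpha = alpha / len(p_values)  # 0.05 / 5 = 0.01
corrected = [p < corrected_alpha for p in p_values]

print(uncorrected)  # the p = 0.03 result counts as significant
print(corrected)    # after correction, no result is significant
```

The same underlying data thus supports two different headline conclusions depending on whether, and how, the threshold is corrected.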

Regardless of their limitations, the OpenAI and RAND experiments highlight larger questions which, if addressed head-on, would enable future experiments to provide more valuable and actionable results about AI-related biological threats.

Is there more than statistical significance? In both experiments, third-party evaluators assigned numeric scores to the text-based participant responses. The researchers then evaluated whether there was a statistically significant difference between those who had access to chatbots and those who did not. Neither research team found one. But the ability to detect a statistically significant difference depends largely on the number of data points; more data points allow a smaller difference to register as statistically significant. Therefore, if the researchers had enrolled many more participants, the same differences in score could have been statistically significant.
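The sample-size point can be sketched with hypothetical numbers (none of these values come from either study): holding the mean score difference and spread fixed, a larger sample shrinks the standard error, so the same difference can cross the significance threshold.

```python
import math

# Hypothetical score summary: same mean difference and spread, different n.
mean_diff = 0.5  # chatbot group minus control, on some scoring scale
sd = 2.0         # pooled standard deviation (assumed equal in both groups)

def t_statistic(n_per_group: int) -> float:
    """Two-sample t statistic for equal-size, equal-variance groups."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return mean_diff / standard_error

# Roughly, |t| > 1.96 is significant at the 5% level for large samples.
print(round(t_statistic(25), 2))   # small study: below the threshold
print(round(t_statistic(250), 2))  # same difference, larger study: above it
```

In other words, "no statistically significant difference" in a small study is not the same claim as "no difference."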


Reducing text to numbers can bring other challenges as well. In the RAND study, the teams, regardless of access to chatbots, did not generate any plans that were deemed likely to succeed. However, there may have been meaningful differences in why the plans were not likely to succeed, and systematically comparing the content of the responses could prove valuable in identifying mitigation measures.

In the OpenAI work, the goal of the participants was to identify a specific series of steps in a plan. However, if a participant missed an early step, none of the remaining steps, even if correct, counted toward their score. This meant that someone who made an error early on but identified all the remaining information correctly would score similarly to someone who identified no correct information at all. Again, researchers may gain insight from identifying patterns in which steps participants failed and why.
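The scoring behavior described above can be sketched as follows. This is an assumed reading of the report's description, not OpenAI's actual scoring code: under "chain" scoring, an early miss zeroes out everything after it, whereas a partial-credit scheme would distinguish the two participants.

```python
def chain_score(steps_correct: list[bool]) -> int:
    """Count steps only until the first miss (assumed reading of the report)."""
    score = 0
    for correct in steps_correct:
        if not correct:
            break
        score += 1
    return score

def partial_credit(steps_correct: list[bool]) -> int:
    """Alternative: count every correct step, regardless of position."""
    return sum(steps_correct)

# A participant who misses only the first step of a five-step plan:
early_miss = [False, True, True, True, True]
print(chain_score(early_miss), partial_credit(early_miss))  # 0 vs. 4
```

The gap between the two numbers is exactly the information the article suggests is worth analyzing: which step failed, not just the final score.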

Are the results generalizable? To inform an understanding of the threat landscape, conclusions must be generalizable across scenarios and chatbots. Future evaluators should be clear about which large language models are used (the RAND researchers were not). It would be helpful to understand whether researchers reach similar conclusions with different models, or different conclusions with the same model. Knowing the specifics would also enable comparisons of results based on the characteristics of the chatbot used, helping policymakers understand whether models with certain characteristics have unique capabilities and impacts.

The OpenAI experiment used just one threat scenario. There is not much reason to believe that this one scenario is representative of all threat scenarios; the results may or may not generalize. Using one specific scenario involves a tradeoff: it becomes tenable for one or two people to evaluate 100 responses. The RAND work, by contrast, was much more open-ended, as participant teams were given flexibility in how to achieve their intended goal. This makes the results more generalizable, but it required a more extensive evaluation procedure involving many experts to sufficiently examine 15 diverse scenarios.

Are the results impacted by something else? Partway through their experiment, the RAND researchers enrolled a “black cell,” a group with significant experience with large language models. The RAND researchers made this decision because they noticed that some of their study’s red teams were struggling to bypass safety features of the chatbots. In the end, the black cell received an average score almost double that of the corresponding red teams. The black cell participants didn’t need to rely only on their expertise with large language models; they were also “adept” at interpreting the academic literature about those models. This provided a valuable insight to the RAND researchers: “[t]he…relative outperformance of the black cell illustrates that a greater source of variability appears to be red team composition, as opposed to LLM access.” Simply put, it probably matters more who is on the team than whether the team has access to a large language model.

Moving forward. Despite their limitations, red teaming and benchmarking efforts remain valuable tools for understanding the impact of artificial intelligence on the deliberate biological threat landscape. Indeed, the National Institute of Standards and Technology’s Artificial Intelligence Safety Institute Consortium—a part of the US Department of Commerce—currently has working groups focused on developing standards and guidelines for this type of research.


Outside of the technical design and execution of the experiments, challenges remain. The work comes with meaningful financial costs, including compensating participants for their time (OpenAI pays its “experts” $100 per hour); paying individuals to recruit participants, design and administer the experiments, and analyze data; and paying biosecurity experts to evaluate the responses. It is therefore important to consider who will fund this type of work in the future. Should artificial intelligence companies fund their own studies, a perceived conflict of interest will linger if the results are intended to inform governance or public perception of their models’ risks. At the same time, funding directed to nonprofits like the RAND Corporation or to academia does not inherently grant researchers access to unreleased or modified models, like the version used in the OpenAI experiment. Future work should learn from these two reports, and could benefit from considering the following:

  • While artificial intelligence companies will want to understand risks associated with their own products specifically, future evaluations by nonprofits, academia, or government could allow participants to access any large language model. This could enable conclusions about the impact of access to large language models more generally.
  • A crossover trial design, in which each participant or team performs the exercise twice with two different scenarios, could account for participants’ differing inherent abilities. In one scenario, access to large language models would be given; in the other, it would not. The order of the scenarios, and which one is paired with large language model access, would be determined at random. The result would then be the difference in each participant’s performance between the scenario with access to large language models and the scenario without.
  • Regardless of where the work happens, rigorous, compensated, and public peer review should accompany this type of work. Decision makers looking to use the outcomes of these studies are unlikely to be experts in these evaluation methodologies or in biological misuse. A thorough peer-review process would help ensure that results are interpreted and used appropriately.
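The crossover analysis in the second suggestion above can be sketched with hypothetical data (the teams and scores below are made up for illustration): because each team is scored twice, once per condition, the within-team difference cancels out baseline skill differences that would confound a between-team comparison.

```python
# Hypothetical crossover data: each team is scored in both conditions.
# Teams differ widely in baseline skill; a between-team comparison would
# confound that skill with LLM access, but within-team differences cancel it.
scores = {
    # team: (score with LLM access, score without)
    "team_a": (7.0, 6.5),
    "team_b": (4.2, 3.9),
    "team_c": (8.1, 7.8),
}

uplifts = [with_llm - without for with_llm, without in scores.values()]
mean_uplift = sum(uplifts) / len(uplifts)
print(round(mean_uplift, 2))  # average within-team uplift from LLM access
```

A full analysis would also randomize which scenario is paired with LLM access and test whether the order of the two runs affects scores, but the core estimator is this within-team difference.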

The path toward more useful research on AI and biological threats is hardly free of obstacles. Employees at the National Institute of Standards and Technology have reportedly expressed outrage regarding the recent appointment of Paul Christiano—a former OpenAI researcher who has expressed concerns that AI could pose an existential threat to humanity—to a leadership role at the Artificial Intelligence Safety Institute. Employees are concerned that Christiano’s personal beliefs about the catastrophic and existential risks posed by AI broadly will affect his ability to maintain the National Institute of Standards and Technology’s commitment to objectivity.

This internal unrest comes on the heels of reporting that the physical buildings that house the institute are falling apart. As Christiano looks to expand his staff, he will also need to compete against the salaries paid by tech companies. OpenAI, for example, is hiring for safety-related roles with the low end of the base salary exceeding the high end of the General Schedule pay scale (the scale that sets federal salaries). It is unlikely that any relief will come from the 2024 federal budget, as lawmakers are expected to decrease the institute’s budget from 2023 levels. But if the United States wants to remain a global leader in the development of artificial intelligence, it will need to make financial commitments to ensure that the work required to evaluate artificial intelligence is done right.

