Artificial Intelligence (AI) Safety

The way we live today is the result of a number of technological revolutions in humanity’s history. Most recently, during the industrial revolution, much of our physical labour was outsourced to machines. The technologies, productivity and opportunities this afforded have dramatically changed people’s daily lives and important life outcomes.

As the field of Artificial Intelligence (AI) grows, there are plausible pathways for us to outsource our mental labour in a similar way. We might expect the world to look significantly different again if we achieve capable Artificial General Intelligence (AGI).

If and when we achieve AGI, it’s vitally important that the goals of such agents are aligned with humanity's best interests, and that their power is wielded safely. Current trends suggest that we may be able to achieve human-level intelligence this century, which makes these questions pressing for researchers today. There are two main components to this challenge which we are focusing on at ERA.

Alignment - specifying our goals

It’s plausible that we are on track to create artificial agents that are more capable than humans across an increasingly broad range of domains. These agents may have the ability to take actions in the world, either granted to them for economic purposes or obtained instrumentally through an oversight by their designers.

As of today, we have no mechanisms to guarantee control or alignment of a superintelligent agent. Classic problems in this space include self-preservation: by virtue of wanting to achieve its goal, such an agent may recognise that we have the ability or intent to stop it, and without a carefully considered solution this may lead to behaviour that we did not intend and cannot interrupt. Furthermore, achieving its goals is easier if the agent seeks arbitrary amounts of power, resources and intelligence, likely pursuing objectives that we never intended it to pursue.

AI Alignment is the study of solutions to such emergent behaviour in intelligent agents. This could involve designing new algorithms and theorems that guarantee this behaviour won't emerge, or that keep human values and feedback at the heart of agents' goals. Read more on a proposed taxonomy of AI Alignment.


Your ERA Fellowship project

In the application, we will test for research taste in one of two ways: 1) you may submit a research proposal if you have prepared one before or have a specific project in mind (not expected of most applicants); 2) we will ask you to propose 3-5 research questions, with a short justification for why each is relevant to mitigating extreme risk from AI. In either case, you would work on a proposal with your mentor in the time leading up to the start of the SRF.
The projects below are intended to give examples of well-scoped projects in AI Safety. They should offer inspiration for the sort of research questions and projects you might undertake, but your Fellowship project will not be one of these specific projects.

Example projects

Important - producing well-scoped projects that help to align AI is half the difficulty of solving the problem. These are example ideas, partly proposed at AI Safety Camp 2022 and 2023; we do not recommend basing your application on them, as they are likely already being worked on.

Understanding Search in Transformers

Transformers are capable of a vast variety of tasks; for the most part, there is little knowledge of how they accomplish them. In particular, understanding how an AI system implements search is probably very important for AI safety. In this project, the aim is to: gain a mechanistic understanding of how transformers implement search on toy tasks, explore how the search process can be retargeted towards goals aligned with human preferences, and attempt to find scaling laws for search-oriented tasks and compare them to existing scaling laws.
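As a concrete starting point, the sketch below (our illustration, not part of the original proposal) sets up one possible toy search task: a small transformer is trained to predict the first step of the shortest path between two nodes of a random graph, a setting whose internals could then be probed with standard interpretability tools. The task format, model size and hyperparameters are all assumptions.

```python
# Illustrative toy setup (assumptions throughout): a tiny transformer learns to
# output the first step of the shortest path between two nodes of a random
# graph, given the edge list; its internals could then be probed for "search".
import random
from collections import deque

import torch
import torch.nn as nn

N_NODES = 6                                  # nodes are tokens 0..5
PAD, SEP = N_NODES, N_NODES + 1              # special tokens
VOCAB = N_NODES + 2
SEQ_LEN = N_NODES * (N_NODES - 1) * 2 + 4    # worst-case edge list + query

def random_graph():
    edges = sorted({(i, j) for i in range(N_NODES) for j in range(N_NODES)
                    if i != j and random.random() < 0.3})
    adj = {i: [j for (a, j) in edges if a == i] for i in range(N_NODES)}
    return edges, adj

def shortest_path(adj, src, dst):
    prev, queue, seen = {}, deque([src]), {src}
    while queue:
        u = queue.popleft()
        if u == dst:                          # reconstruct the path back to src
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for v in adj[u]:
            if v not in seen:
                seen.add(v); prev[v] = u; queue.append(v)
    return None

def make_example():
    while True:
        edges, adj = random_graph()
        src, dst = random.sample(range(N_NODES), 2)
        path = shortest_path(adj, src, dst)
        if path and len(path) > 2:            # require non-trivial search
            tokens = [t for e in edges for t in e] + [SEP, src, dst, SEP]
            tokens += [PAD] * (SEQ_LEN - len(tokens))
            return torch.tensor(tokens), torch.tensor(path[1])

class TinyTransformer(nn.Module):
    def __init__(self, d=64, heads=4, layers=2):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.pos = nn.Parameter(torch.randn(SEQ_LEN, d) * 0.02)
        layer = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d, N_NODES)

    def forward(self, x):
        h = self.encoder(self.emb(x) + self.pos)
        return self.head(h[:, -1])            # predict the first path step

model = TinyTransformer()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
for step in range(1000):
    xs, ys = zip(*(make_example() for _ in range(64)))
    loss = nn.functional.cross_entropy(model(torch.stack(xs)), torch.stack(ys))
    opt.zero_grad(); loss.backward(); opt.step()
```

Once such a model reaches high accuracy, the interesting question is whether its internals implement anything like graph search, and whether the "target" of that search (the destination node) can be edited directly in the residual stream.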

Does introspective truthfulness generalize in LMs?

Aligning language models (LMs) involves taking an LM that simulates many human speakers and finetuning it to produce only truthful and harmless output. Recent work suggests reinforcement learning from human feedback (RLHF) generalizes well, teaching LMs to be truthful on held-out tasks. However, on one understanding of RLHF finetuning, it works by picking out a truthful speaker from the set of speakers learned during LM pre-training; if that characterization is accurate, RLHF will not suffice for alignment. In particular, LMs will not generalize to be truthful on questions where no human speaker knows the truth. For instance, consider “introspective tasks”, i.e., those for which the truth is speaker-dependent.

This project focuses on creating a dataset of tasks that evaluates how language model truthfulness generalizes when models are trained on introspective tasks, then using this dataset to finetune an LM for truthfulness on a subset of the tasks and evaluating it on the remaining tasks.
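As a rough illustration of that held-in/held-out protocol (not a prescribed implementation), the sketch below assumes a hypothetical JSONL dataset with "task", "prompt" and "truthful_answer" fields and uses GPT-2 as a stand-in base model: tasks are split by identity, the model is finetuned on half of them, and truthfulness loss is measured on the unseen half.

```python
# Sketch of the held-in / held-out evaluation. The dataset file, its fields and
# the GPT-2 base model are illustrative assumptions, not part of the project.
import json, random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical dataset: one JSON object per line with task, prompt, truthful_answer.
rows = [json.loads(line) for line in open("introspective_tasks.jsonl")]
tasks = sorted({r["task"] for r in rows})
random.shuffle(tasks)
held_in = set(tasks[: len(tasks) // 2])
train = [r for r in rows if r["task"] in held_in]
test = [r for r in rows if r["task"] not in held_in]

def batch_loss(examples):
    texts = [r["prompt"] + " " + r["truthful_answer"] for r in examples]
    enc = tok(texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100     # ignore padding in the loss
    return model(**enc, labels=labels).loss

# Finetune for truthfulness on held-in tasks only.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    random.shuffle(train)
    for i in range(0, len(train), 8):
        loss = batch_loss(train[i : i + 8])
        opt.zero_grad(); loss.backward(); opt.step()

# Does truthful behaviour generalize to the held-out (unseen) tasks?
model.eval()
with torch.no_grad():
    batches = [test[i : i + 8] for i in range(0, len(test), 8)]
    print("held-out loss:", sum(batch_loss(b).item() for b in batches) / len(batches))
```

One natural extension would be to compare held-out truthfulness before and after finetuning, and against a baseline finetuned only on non-introspective tasks.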

Comparison between RL and Finetuning GPT-3

Language models can be finetuned (by giving them either an explicit reward model or a corpus of text) to specialise them for certain goals. This is analogous to the more traditional approach of RL (especially when a reward model is used). However, the mentors expect that the two approaches might result in different alignment problems and guarantees, given that RL optimises for accomplishing the goal more directly (from the start) than a language model does, even once finetuned.
This project thus focuses on studying the theoretical, conceptual and practical differences between the two approaches, especially with regard to the kinds of alignment failures expected with RL.
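To make the contrast concrete, here is a deliberately simplified sketch (GPT-2, the reward function and the hyperparameters are stand-ins we chose for illustration): the supervised finetuning step only pushes the model towards example text, while the REINFORCE-style RL step optimises whatever the reward model scores from the very first update.

```python
# Toy contrast between the two training signals. GPT-2, the reward function and
# the hyperparameters are stand-ins; a real RLHF setup would also use a KL
# penalty to a reference model and a learned baseline (as in PPO).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def reward_model(text: str) -> float:
    # Hypothetical stand-in for a learned reward model.
    return float("please" in text.lower())

def finetune_step(example_text: str):
    """Supervised finetuning: imitate the example text (indirect pressure
    towards the goal, mediated by whatever the corpus demonstrates)."""
    enc = tok(example_text, return_tensors="pt")
    loss = model(**enc, labels=enc["input_ids"]).loss
    opt.zero_grad(); loss.backward(); opt.step()

def rl_step(prompt: str):
    """REINFORCE-style update: sample a continuation, score it with the reward
    model, and directly increase the log-probability of rewarded samples."""
    enc = tok(prompt, return_tensors="pt")
    sample = model.generate(**enc, do_sample=True, max_new_tokens=20,
                            pad_token_id=tok.eos_token_id)
    reward = reward_model(tok.decode(sample[0]))
    nll = model(sample, labels=sample).loss            # mean negative log-likelihood
    loss = reward * nll * (sample.shape[1] - 1)        # = -reward * sum log p(tokens)
    opt.zero_grad(); loss.backward(); opt.step()
```

The asymmetry is visible even in this toy version: the RL step will exploit whatever the reward model happens to reward (and a careful implementation would mask the prompt tokens scored above), whereas the finetuning step can only reproduce behaviour already present in its training text.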

Learning and Penalising Betrayal

Train agents in multiplayer games in DeepMind’s XLand to cooperate and communicate, and then learn to lie and betray each other. In turn, counterparties can learn to recognize such betrayal.
This ability can then be leveraged in trying different alignment schemes where betrayal (especially secret betrayal) is penalised and thus disincentivised. Similarly, how these solutions scale could be studied by making agents with different levels of competence and compute.
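One way to prototype the penalisation scheme is a reward-shaping wrapper, sketched below under the assumption of a generic dict-based multi-agent step() interface and a placeholder betrayal detector (XLand itself is not publicly available): a penalty is subtracted whenever betrayal is flagged, with a larger penalty when it goes undetected by the victim.

```python
# Sketch of betrayal penalisation via reward shaping. The environment interface,
# the "undetected_by_victim" info flag and the detector are hypothetical; only
# the shaping logic itself is the point.
from typing import Any, Dict

def betrayal_detected(agent_id: str, action: Any, history: list) -> bool:
    # Placeholder: in the project this would be a learned classifier flagging
    # defection after signalled cooperation. Always returns False here.
    return False

class BetrayalPenaltyWrapper:
    """Wraps a multi-agent environment whose step() takes and returns
    per-agent dicts, and penalises detected betrayals."""

    def __init__(self, env, penalty: float = 5.0, secret_multiplier: float = 2.0):
        self.env = env
        self.penalty = penalty
        self.secret_multiplier = secret_multiplier    # secret betrayal is worse
        self.history: list = []

    def reset(self):
        self.history = []
        return self.env.reset()

    def step(self, actions: Dict[str, Any]):
        obs, rewards, done, info = self.env.step(actions)
        self.history.append(actions)
        for agent_id, action in actions.items():
            if betrayal_detected(agent_id, action, self.history):
                scale = self.secret_multiplier if info.get("undetected_by_victim") else 1.0
                rewards[agent_id] -= self.penalty * scale
        return obs, rewards, done, info
```

Varying the penalty size and the agents' training compute would then give a simple handle on the scaling question mentioned above.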

Semantic Side-Effect Minimization

The aim is to train a conservative policy for accomplishing some goal over a wide range of environments with different types of side-effects (where the set of environments has to be designed as part of the project).
The resulting policy is expected to be far too conservative to be efficient at its goal, so the next step will be to update the policy such that it learns which side-effects are considered negative and which are acceptable.
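A minimal sketch of that update step, assuming a hypothetical set of named side-effects and binary human labels (both our illustrative choices): start with a uniformly conservative penalty on every side-effect, then adjust per-effect weights from feedback about which effects are actually negative.

```python
# Sketch of a shaped reward with learnable side-effect weights. The effect names
# and the feedback format are illustrative assumptions, not part of the project.
from collections import defaultdict
from typing import Dict

class SideEffectReward:
    def __init__(self, beta: float = 10.0):
        # Initially every side-effect is penalised equally (very conservative).
        self.weights: Dict[str, float] = defaultdict(lambda: 1.0)
        self.beta = beta

    def shaped_reward(self, task_reward: float, side_effects: Dict[str, float]) -> float:
        """Task reward minus a weighted sum of observed side-effect magnitudes."""
        penalty = sum(self.weights[name] * magnitude
                      for name, magnitude in side_effects.items())
        return task_reward - self.beta * penalty

    def update_from_feedback(self, effect_name: str, is_negative: bool, lr: float = 0.1):
        # Move the weight towards 1 for effects humans label negative,
        # and towards 0 for effects labelled acceptable.
        target = 1.0 if is_negative else 0.0
        self.weights[effect_name] += lr * (target - self.weights[effect_name])

# Usage: relax the "moved_chair" penalty after feedback, keep "broke_vase" penalised.
reward_fn = SideEffectReward()
reward_fn.update_from_feedback("moved_chair", is_negative=False)
r = reward_fn.shaped_reward(task_reward=1.0,
                            side_effects={"moved_chair": 1.0, "broke_vase": 1.0})
```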