Quoting storage.googleapis.com:

any static procedure for synthetically generating data will quickly become outstripped. This can be achieved by allowing agents to learn continually from their own experience, i.e., data that is generated by the agent interacting with its environment. AI is at the cusp of a new period in which experience will become the dominant medium of improvement and ultimately dwarf the scale of human data used in today’s systems.

while imitating humans is enough to reproduce many human capabilities to a competent level, this approach in isolation has not and likely cannot achieve superhuman intelligence across many important topics and tasks.

The majority of high-quality data sources - those that can actually improve a strong agent's performance - have either already been, or soon will be, consumed. The pace of progress driven solely by supervised learning from human data is demonstrably slowing, signalling the need for a new approach.

Initially exposed to around a hundred thousand formal proofs, created over many years by human mathematicians, AlphaProof's reinforcement learning (RL) algorithm [1] subsequently generated a hundred million more through continual interaction with a formal proving system.

*This is a preprint of a chapter that will appear in the book Designing an Intelligence, published by MIT Press.
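The generate-and-verify loop described there is easy to picture in code. Below is a minimal, hypothetical sketch of the idea, not AlphaProof's actual system: a learned policy proposes candidate proofs, a formal checker accepts or rejects them, and accepted proofs become fresh training experience. All the names here (PolicyModel, formal_verifier, experience_loop) are invented for illustration.

```python
import random

class PolicyModel:
    """Toy stand-in for a learned proof-generation policy."""

    def propose_proof(self, statement: str) -> str:
        # A real system would sample proof steps from a neural policy.
        return f"candidate-proof-of({statement})"

    def update(self, statement: str, proof: str) -> None:
        # A real system would apply a reinforcement-learning update here.
        pass


def formal_verifier(statement: str, proof: str) -> bool:
    """Stand-in for a formal proving system that checks candidates."""
    return random.random() < 0.1  # pretend ~10% of candidates verify


def experience_loop(policy: PolicyModel, statements: list[str], rounds: int):
    """Generate proofs, keep the verified ones, and learn from them."""
    verified = []
    for _ in range(rounds):
        statement = random.choice(statements)
        proof = policy.propose_proof(statement)
        if formal_verifier(statement, proof):
            # Verified proofs are self-generated experience: the training
            # corpus grows with interaction instead of staying fixed.
            verified.append((statement, proof))
            policy.update(statement, proof)
    return verified


print(len(experience_loop(PolicyModel(), ["a+b=b+a", "p->p"], rounds=1000)))
```

The point of the sketch is the data flow: the verifier, not a human, supplies the learning signal, so the volume of usable data is bounded by interaction rather than by a human-written corpus.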

• Agents will inhabit streams of experience, rather than short snippets of interaction.

• Their actions and observations will be richly grounded in the environment, rather than interacting via human dialogue alone.

• Their rewards will be grounded in their experience of the environment, rather than coming from human prejudgement.

• They will plan and/or reason about experience, rather than reasoning solely in human terms.

In the era of human data, language-based AI has largely focused on short interaction episodes: e.g., a user asks a question and (perhaps after a few thinking steps or tool-use actions) the agent responds. Typically, little or no information carries over from one episode to the next, precluding any adaptation over time. Furthermore, the agent aims exclusively for outcomes within the current episode, such as directly answering a user’s question. In contrast, humans (and other animals) exist in an ongoing stream of actions and observations that continues for many years. Information is carried across the entire stream, and their behaviour adapts from past experiences to self-correct and improve. Furthermore, goals may be specified in terms of actions and observations that stretch far into the future of the stream.
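To make the episodic/stream contrast concrete, here is a small illustrative sketch. The ToyEnvironment and StreamAgent interfaces are assumptions invented for this example, not anything specified in the quoted paper.

```python
class ToyEnvironment:
    """Trivial stand-in environment; observations are step counters."""

    def __init__(self):
        self._t = 0

    def reset(self):
        self._t = 0
        return self._t

    def step(self, action):
        self._t += 1
        reward = 1.0            # constant reward, purely illustrative
        done = self._t >= 3     # short episodes for the episodic pattern
        return self._t, reward, done


class StreamAgent:
    """Agent that carries information across its entire stream."""

    def __init__(self):
        self.memory = []        # persists for the agent's whole lifetime

    def act(self, observation):
        return f"action-given({observation})"

    def adapt(self, observation, action, reward):
        # Experience folds back into memory, enabling self-correction
        # and improvement over long horizons.
        self.memory.append((observation, action, reward))


def episodic_interaction(env, num_episodes):
    """Era-of-human-data pattern: fresh agent per episode, no carry-over."""
    for _ in range(num_episodes):
        agent = StreamAgent()   # state is discarded between episodes
        obs, done = env.reset(), False
        while not done:
            obs, reward, done = env.step(agent.act(obs))


def stream_interaction(env, agent, max_steps):
    """Era-of-experience pattern: one ongoing stream, lifelong adaptation."""
    obs = env.reset()
    for _ in range(max_steps):  # in principle the stream is unbounded
        action = agent.act(obs)
        obs, reward, _ = env.step(action)
        agent.adapt(obs, action, reward)


episodic_interaction(ToyEnvironment(), num_episodes=2)
agent = StreamAgent()
stream_interaction(ToyEnvironment(), agent, max_steps=10)
print(len(agent.memory))  # 10: everything carried across the stream
```

The only structural difference is where state lives: the episodic loop discards it at every reset, while the stream loop never resets at all, so every observation can inform every future action.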