My work on Artificial General Intelligence (AGI) safety
At some point, we will invent "Artificial General Intelligence" (AGI): AI that can reason, learn, understand, and creatively solve problems across a wide range of domains, just as the most competent humans (and teams of humans, and societies of humans) can.
An emerging, interdisciplinary subfield of computer science, called AGI safety or AI alignment, aims to develop methods to ensure that these very powerful systems robustly behave the way we want, avoid catastrophic accidents, and make the world a better place. For a nice introduction to the field, I suggest the “Preventing an AI-Related Catastrophe” problem profile by Benjamin Hilton, or my own introductory post.

I've been studying this topic since 2019—first in my free time, then under grant funding†, and now at the nonprofit Astera Institute.
Some highlights of my work are below (a complete list is here), and you can keep up with new posts via RSS feed, X (Twitter), Mastodon, Threads, or Bluesky.
1. Brain-Like-AGI Safety: If we invent AGI by reverse-engineering (or reinventing) algorithms similar to the human brain’s, then how would we use such an AGI safely?
"Intro to Brain-Like-AGI Safety" blog post series: (published Jan–May 2022, revised July 2024)
• 1. What's the problem & Why work on it now?
• 2. “Learning from scratch” in the brain
• 3. Two subsystems: Learning & Steering
• 4. The “short-term predictor”
• 5. The “long-term predictor”, and TD learning
• 6. Big picture of motivation, decision-making, and RL
• 7. From hardcoded drives to foresighted plans: A worked example
• 8. Takeaways from neuro 1/2: On AGI development
• 9. Takeaways from neuro 2/2: On AGI motivation
• 10. The alignment problem
• 11. Safety ≠ alignment (but they’re close!)
• 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI”
• 13. Symbol grounding & human social instincts
• 14. Controlled AGI
• 15. Conclusion: Open problems, how to help, AMA
• If you prefer audio, I gave a talk covering the highlights of the series (YouTube, transcript), and I also discussed the series on the “Brain Inspired” podcast.
Other posts on Brain-Like-AGI Safety, in reverse chronological order:
• My AGI safety research—2024 review, ’25 plans (Dec 2024)
• Neuroscience of human social instincts: a sketch (Nov 2024)
• Against empathy-by-default (Oct 2024)
• Response to Dileep George: AGI safety warrants planning ahead (July 2024)
• Thoughts on “AI is easy to control” by Pope & Belrose (Dec 2023)
• 8 examples informing my pessimism on uploading without reverse engineering (Nov 2023)
• LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem (May 2023)
• Connectomics seems great from an AI x-risk perspective (Apr 2023)
• Plan for mediocre alignment of brain-like [model-based RL] AGI (March 2023)
• Heritability, Behaviorism, and Within-Lifetime RL (Feb 2023)
• Thoughts on hardware / compute requirements for AGI (Jan 2023)
• Note on algorithms with multiple trained components (Dec 2022)
• My AGI safety research—2022 review, ’23 plans (Dec 2022)
• My take on Jacob Cannell’s take on AGI safety (Nov 2022)
• Thoughts on AGI consciousness / sentience (Sept 2022)
• Response to Blake Richards: AGI, generality, alignment, & loss functions (July 2022)
• Reward is not enough (June 2021)
• Big picture of phasic dopamine (see also: supplementary information) (June 2021)
• Solving the whole AGI control problem, version 0.0001 (Apr 2021)
• My AGI threat model: misaligned model-based RL agent (March 2021)
• Against evolution as an analogy for how humans will create AGI (March 2021)
• Book review: A Thousand Brains by Jeff Hawkins (March 2021)
• My computational framework for the brain (Sept 2020)
• Building brain-inspired AGI is infinitely easier than understanding the brain (June 2020)
2. Other AGI safety posts (not specifically about brain-like AGI)
• “Sharp Left Turn” discourse: An opinionated review (Jan 2025)
• Applying traditional economic thinking to AGI: a trilemma (Jan 2025)
• Response to nostalgebraist: proudly waving my moral-antirealist battle flag (May 2024)
• “Artificial General Intelligence”: an extremely brief FAQ (March 2024)
• Four visions of Transformative AI success (Jan 2024)
• Deceptive AI ≠ Deceptively-aligned AI (Jan 2024)
• “X distracts from Y” as a thinly-disguised fight over group status / politics (Sept 2023)
• Thoughts on “Process-Based Supervision” (July 2023)
• Munk AI debate: confusions and possible cruxes (June 2023)
• AI doom from an LLM-plateau-ist perspective (Apr 2023)
• Why I’m not working on {debate, RRM, ELK, natural abstractions} (Feb 2023)
• “Endgame safety” for AGI (Jan 2023)
• What does it take to defend the world against out-of-control AGIs? (Oct 2022)
• Consequentialism & corrigibility (Dec 2021)
• Safety-capabilities tradeoff dials are inevitable in AGI (Oct 2021)
• My take on Vanessa Kosoy's take on AGI safety (Sept 2021)
3. Other neuroscience posts (not directly related to AGI safety, but they still came up during my research)
• Heritability: Five Battles (Jan 2025)
• “Intuitive self-models series” part 8: Rooting Out Free Will Intuitions (Nov 2024)
• “Intuitive self-models series” part 7: Hearing Voices, and Other Hallucinations (Oct 2024)
• “Intuitive self-models series” part 6: Awakening / Enlightenment / PNSE (Oct 2024)
• “Intuitive self-models series” part 5: Dissociative Identity Disorder, a.k.a. Multiple Personality Disorder (Oct 2024)
• “Intuitive self-models series” part 4: Trance (Oct 2024)
• “Intuitive self-models series” part 3: The Homunculus (Oct 2024)
• “Intuitive self-models series” part 2: Conscious Awareness (Sept 2024)
• “Intuitive self-models series” part 1: Preliminaries (Sept 2024)
• Incentive Learning vs Dead Sea Salt Experiment (June 2024)
• (Appetitive, Consummatory) ≈ (RL, reflex) (June 2024)
• Woods’ new preprint on object permanence (March 2024)
• Social status part 2/2: everything else (March 2024)
• Social status part 1/2: negotiations over object-level preferences (March 2024)
• “Valence series” Appendix A: Hedonic tone / (dis)pleasure / (dis)liking (Dec 2023)
• “Valence series” part 5: “Valence Disorders” in Mental Health & Personality (Dec 2023)
• “Valence series” part 4: Valence & Liking / Admiring (June 2024)
• “Valence series” part 3: Valence & Beliefs (Dec 2023)
• “Valence series” part 2: Valence & Normativity (Dec 2023)
• “Valence series” part 1: Introduction (Dec 2023)
• I’m confused about innate smell neuroanatomy (Nov 2023)
• A Theory of Laughter (Aug 2023, revised Dec 2024)
• Model of psychosis, take 2 (Aug 2023)
• Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions (July 2023)
• Is “FOXP2 speech & language disorder” really “FOXP2 forebrain fine-motor crappiness”? (March 2023)
• Why I’m not into the Free Energy Principle (March 2023)
• Schizophrenia as a deficiency in long-range cortex-to-cortex communication (Feb 2023)
• Quick notes on “mirror neurons” (Oct 2022)
• Book review: “The Heart of the Brain: The Hypothalamus and Its Hormones” (Sept 2022)
• On oxytocin-sensitive neurons in auditory cortex (Sept 2022)
• I’m mildly skeptical that blindness prevents schizophrenia (Aug 2022)
• The “mind-body vicious cycle” model of RSI & back pain (June 2022)
• The Intense World Theory of Autism (Sept 2021)
• How is low-latency phasic dopamine so fast? (July 2021)
• Book review: Rethinking Consciousness (Jan 2020)
† My work from March 2021 through Aug 2022 was funded by a grant from Beth Barnes and the Centre for Effective Altruism Donor Lottery Program.