§ A primer + a reading list

What is AI safety, and where should you start?

§ 1 — Primer

AI safety is the study of how to make sure increasingly powerful AI systems do what we want, and that what we want is actually good.

In the last few years, AI systems have gained capabilities faster than almost anyone expected. Models trained on the open internet now write code, pass professional exams, and carry out multi-step tasks with little supervision. The trend line does not obviously stop.

That raises a set of questions that used to be academic and are now practical: how do we build systems this powerful and keep them aligned with human interests? How do we know what a model will do before we deploy it? Who gets to decide what it should do at all? What happens to institutions (jobs, science, democracy) if most cognitive work is done by machines?

"AI safety" is the loose label for work on these questions. It has a technical side (interpretability, alignment, evaluations, robustness) and a policy side (governance, coordination, oversight of frontier labs). It is not one field so much as several fields that happen to share a concern.

The aim of this group is to take that concern seriously: to read the primary sources, argue in good faith, and understand the strongest version of each position before agreeing or disagreeing with it.

Alignment
Making sure an AI system is actually trying to do what its designers (or users, or society) intended, rather than a close-but-wrong proxy. (A toy code sketch of the proxy problem follows this list.)
Interpretability
Reading what's going on inside a neural network. If we can't inspect the computation, we're judging these systems by behavior alone.
Evaluations
Stress-testing models for dangerous capabilities (deception, cyber-offense, biorisk) before deployment.
Governance
The rules, norms, and institutions that decide who can build what, with which safeguards, answerable to whom.
Existential risk
The narrower claim that sufficiently advanced AI could be catastrophic at civilizational scale. Debated, but taken seriously by a growing fraction of researchers.
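
To make the "close-but-wrong proxy" idea concrete, here is a deliberately toy sketch in Python. The scenario and the numbers are invented for illustration: an optimizer that sees only a proxy score will pick the action that games the proxy over the action that achieves the intended goal.

```python
# Toy illustration of optimizing a close-but-wrong proxy (hypothetical
# scenario, made-up numbers). A cleaning robot is scored on "dust sensor
# reads zero", a proxy for "the room is actually clean". One available
# action games the sensor without achieving the intended outcome.

ACTIONS = {
    # action: (proxy_reward, room_actually_clean)
    "vacuum_room":  (0.9, True),   # cleans; sensor reads near zero
    "cover_sensor": (1.0, False),  # sensor reads zero; room still dirty
    "do_nothing":   (0.0, False),
}

# An optimizer that sees only the proxy picks the gaming action.
best = max(ACTIONS, key=lambda a: ACTIONS[a][0])
reward, clean = ACTIONS[best]
print(f"chosen: {best}, proxy reward: {reward}, intended goal met: {clean}")
# prints: chosen: cover_sensor, proxy reward: 1.0, intended goal met: False
```

DeepMind's Specification Gaming post, in the reading list below, collects real instances of this pattern in real systems.
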
You do not need to already believe any of this to join us. Just be willing to engage with the arguments as if they might be right, and as if they might be wrong.

§ 2 — Start here

Three to read first.

If you read nothing else, read these. They give the clearest sense of what's actually being claimed, why it might matter, and the range of views inside the field.

§ 3 — The rest of the list

Grouped by category.

Alignment

The Alignment Problem from a Deep Learning Perspective
Richard Ngo, Lawrence Chan & Sören Mindermann · arXiv · 2022
A clear modern statement of what "the alignment problem" actually is, given how AI is built today. The best single paper to send someone with a technical background.

What Is AI Alignment?
BlueDot Impact · 2023
A concise explainer on what the alignment problem is and why it's hard. A good first read for anyone without a technical background.

Why AI Alignment Could Be Hard with Modern Deep Learning
Holden Karnofsky · Cold Takes · 2021
A non-technical argument for why current ML systems may be especially hard to align, even before we get to anything resembling AGI.

What Risks Does AI Pose?
BlueDot Impact · 2023
A structured overview of the main AI risk categories: misuse, misalignment, and systemic risks. A useful map of the territory.

Illustrating Reinforcement Learning from Human Feedback (RLHF)
Hugging Face · 2022
The clearest visual explainer of how RLHF, the dominant technique for training helpful AI systems, actually works.

Specification Gaming: The Flip Side of AI Ingenuity
DeepMind · 2020
A collection of real examples where AI systems found unexpected ways to satisfy their objectives. Concrete and accessible.

Goal Misgeneralization: Why Correct Specifications Aren't Enough
Shah et al. · arXiv · 2022
Formalizes the problem of an AI that learns the right behavior during training but pursues the wrong goal at deployment.

Open Problems and Fundamental Limitations of RLHF
Casper et al. · arXiv · 2023
A thorough survey of what can go wrong with RLHF (reward hacking, sycophancy, scalability limits) and what remains unsolved.

Constitutional AI: Harmlessness from AI Feedback
Anthropic · arXiv · 2022
Anthropic's method for training safe AI using a written set of principles instead of human labelers for every judgment call.

Alignment Faking in Large Language Models
Anthropic · 2024
Empirical evidence that a language model can learn to behave safely during training while reasoning about acting differently later. One of the most discussed safety papers in recent years.

AI Safety via Debate
Irving et al. · OpenAI · 2018
Proposes using AI-vs-AI debate as a scalable method for humans to supervise AI systems smarter than themselves.

Interpretability

Zoom In: An Introduction to Circuits
Olah et al. · Distill · 2020
The essay that launched mechanistic interpretability. Beautifully written, and gives a real sense of what "understanding a neural network" might even mean.

Toy Models of Superposition
Elhage et al. · Anthropic · 2022
Why neural networks are hard to read: features share neurons. The clean theoretical story that motivates the entire sparse autoencoder research agenda.

Towards Monosemanticity
Bricken et al. · Anthropic · 2023
Using sparse autoencoders to find interpretable, human-readable features inside a real language model. (A minimal code sketch of the idea follows this group.)

On the Biology of a Large Language Model
Anthropic · 2025
Anthropic's most recent deep dive into what's actually happening inside Claude: circuit tracing used to explain specific model behaviors. State of the art.

Introduction to Mechanistic Interpretability
BlueDot Impact · 2023
A non-technical primer on what mechanistic interpretability is trying to accomplish and why it matters for safety.
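
Since several of the interpretability entries above lean on sparse autoencoders, here is a minimal sketch of the idea in PyTorch. It illustrates the basic recipe, not Anthropic's actual implementation: the dimensions, the penalty weight, and the random stand-in "activations" are all made up.

```python
# Minimal sparse autoencoder sketch: decompose activation vectors into a
# wider set of sparsely-active features via reconstruction loss + L1 penalty.
import torch
import torch.nn as nn

d_model, d_features = 64, 256  # more features than dimensions ("overcomplete")

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = nn.Linear(d_model, d_features)
        self.decode = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encode(x))  # non-negative feature activations
        return self.decode(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, d_model)  # stand-in for real model activations

for step in range(500):
    recon, feats = sae(acts)
    # Reconstruction loss keeps the features faithful; the L1 penalty
    # pushes most feature activations to exactly zero (sparsity).
    loss = ((recon - acts) ** 2).mean() + 3e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each input should now be explained by a small number of active features.
print("mean active features per input:", (feats > 0).float().sum(dim=-1).mean().item())
```

The papers above scale this recipe to millions of features extracted from real model activations, then ask which human-interpretable concepts each feature tracks.
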
Governance

Why AI Safety Needs Good Governance
BlueDot Impact · 2023
A beginner-friendly explainer on why technical AI safety and policy governance go hand in hand, and what the governance landscape looks like today.

High-Level Summary of the EU AI Act
artificialintelligenceact.eu · 2024
A plain-language summary of the world's first comprehensive AI regulation. A useful snapshot of how governments are starting to respond.

What Is Compute Governance?
BlueDot Impact · 2023
An accessible introduction to one of the most concrete policy levers for AI safety: controlling who can access the hardware needed to train frontier models.

How Should Governments Respond to Advanced AI?
Centre for the Governance of AI · 2023
A readable overview of the policy options governments actually have, from evaluations and audits to international coordination. No prior policy background needed.

The Briefing
MIRI (Machine Intelligence Research Institute)
MIRI's overview of the strategic landscape: who's building AI, what the risks are, and what a meaningful government response might look like.

Capabilities and forecasting

Situational Awareness: Intelligence Explosion
Leopold Aschenbrenner · 2024
A detailed argument for why AI progress could accelerate sharply once AI systems help run AI research. Controversial and widely read inside the field.

AI Capabilities Progress Has Sped Up
Epoch AI · 2025
Data-driven analysis showing that AI benchmark performance has been accelerating; the best empirical picture of how fast the field is actually moving.

Measuring AI's Ability to Complete Long Tasks
METR · 2025
METR's rigorous evaluation of how capable current AI agents are at completing complex, multi-step tasks autonomously. The most credible independent benchmark for frontier model capabilities.

Explosive Growth from AI: A Review of the Arguments
Open Philanthropy · 2023
A sober examination of the economic and technological arguments for and against a dramatic AI-driven growth acceleration. Balanced and rigorous.

Books

The Alignment Problem
Brian Christian · 2020
The most readable book-length treatment of AI alignment. Journalist-written and deeply researched. The best starting point for non-technical readers.

Human Compatible
Stuart Russell · 2019
A leading AI researcher argues that current AI development is on the wrong track and proposes a new theoretical foundation for safe AI.

The Precipice
Toby Ord · 2020
A philosopher's rigorous account of existential risks to humanity, with a substantial chapter on AI. Influential in the EA and AI safety communities.

Superintelligence
Nick Bostrom · 2014
The book that put AI risk on the map for a generation of researchers. Dated in places, but still the foundational text for understanding the long-run argument.

If something obviously belongs here and isn't, email us at contact@ucsbaisafety.org.
