Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

We don’t know how AIs think or why they do what they do. Or at least, we don’t know much. That fact is only becoming more troubling as AIs grow more capable and appear on track to wield enormous cultural influence, directly advise on major government decisions, and even operate military equipment autonomously. We simply can’t tell what models, if any, should be trusted with such authority.

Neel Nanda of Google DeepMind is one of the founding figures of mechanistic interpretability (or “mech interp”), the field of machine learning research trying to fix this situation. The project has generated enormous hype, exploding from a handful of researchers five years ago to hundreds today — all working to make sense of the jumble of tens of thousands of numbers that frontier AIs use to process information and decide what to say or do.

Full transcript, video, and links to learn more: https://80k.info/nn1

Neel now has a warning for us: the most ambitious vision of mech interp he once dreamed of is probably dead. He doesn’t see a path to deeply and reliably understanding what AIs are thinking. The technical and practical barriers are simply too great to get us there in time, before competitive pressures push us to deploy human-level or superhuman AIs. Indeed, Neel argues no one approach will guarantee alignment, and our only choice is the “Swiss cheese” model of accident prevention, layering multiple safeguards on top of one another.

But while mech interp won’t be a silver bullet for AI safety, it has nevertheless had some major successes and will be one of the best tools in our arsenal.

For instance: by inspecting the neural activations in the middle of an AI’s thoughts, we can pick up many of the concepts the model is thinking about — from the Golden Gate Bridge, to refusing to answer a question, to the option of deceiving the user. While we can’t know all the thoughts a model is having all the time, picking up 90% of the concepts it is using 90% of the time should help us muddle through, so long as mech interp is paired with other techniques to fill in the gaps.
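To give a concrete (and deliberately toy) picture of what “picking up a concept” from activations can look like, here is a minimal sketch of a linear probe. The activations, concept direction, and labels below are simulated stand-ins for illustration only, not code or data from the episode or from DeepMind's tooling.

```python
# Minimal illustrative sketch: training a linear "probe" to read a concept
# out of a model's internal activations. All data here is simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for cached activations at one layer, one vector per prompt.
# In practice these would be collected from the model; here we plant a hidden
# "concept direction" so the probe has something real to find.
n_prompts, hidden_dim = 1000, 512
concept_direction = rng.normal(size=hidden_dim)
activations = rng.normal(size=(n_prompts, hidden_dim))
labels = (activations @ concept_direction
          + rng.normal(scale=5.0, size=n_prompts) > 0).astype(int)

# A linear probe is just a logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(activations[:800], labels[:800])

# High held-out accuracy means the concept is linearly readable from that
# layer -- the kind of cheap detection discussed in the episode.
print("held-out accuracy:", probe.score(activations[800:], labels[800:]))
```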

This episode was recorded on July 17 and 21, 2025.

Part 2 of the conversation is now available! https://80k.info/nn2

What did you think? https://forms.gle/xKyUrGyYpYenp8N4A

Chapters:

  • Cold open (00:00)
  • Who's Neel Nanda? (01:02)
  • How would mechanistic interpretability help with AGI? (01:59)
  • What's mech interp? (05:09)
  • How Neel changed his take on mech interp (09:47)
  • Top successes in interpretability (15:53)
  • Probes can cheaply detect harmful intentions in AIs (20:06)
  • In some ways we understand AIs better than human minds (26:49)
  • Mech interp won't solve all our AI alignment problems (29:21)
  • Why mech interp is the 'biology' of neural networks (38:07)
  • Interpretability can't reliably find deceptive AI – nothing can (40:28)
  • 'Black box' interpretability — reading the chain of thought (49:39)
  • 'Self-preservation' isn't always what it seems (53:06)
  • For how long can we trust the chain of thought? (01:02:09)
  • We could accidentally destroy chain of thought's usefulness (01:11:39)
  • Models can tell when they're being tested and act differently (01:16:56)
  • Top complaints about mech interp (01:23:50)
  • Why everyone's excited about sparse autoencoders (SAEs) (01:37:52)
  • Limitations of SAEs (01:47:16)
  • SAEs' performance on real-world tasks (01:54:49)
  • Best arguments in favour of mech interp (02:08:10)
  • Lessons from the hype around mech interp (02:12:03)
  • Where mech interp will shine in coming years (02:17:50)
  • Why focus on understanding over control? (02:21:02)
  • If AI models are conscious, will mech interp help us figure it out? (02:24:09)
  • Neel's new research philosophy (02:26:19)
  • Who should join the mech interp field? (02:38:31)
  • Advice for getting started in mech interp (02:46:55)
  • Keeping up to date with mech interp results (02:54:41)
  • Who's hiring and where to work? (02:57:43)

Host: Rob Wiblin
Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

Episodes (299)

#2 - David Spiegelhalter on risk, stats and improving understanding of science

Recorded in 2015 by Robert Wiblin with colleague Jess Whittlestone at the Centre for Effective Altruism, and recovered from the dusty 80,000 Hours archives.

David Spiegelhalter is a statistician at the University of Cambridge and something of an academic celebrity in the UK. Part of his role is to improve the public understanding of risk, especially everyday risks we face like getting cancer or dying in a car crash. As a result he’s regularly in the media explaining numbers in the news, trying to help both ordinary people and politicians focus on the important risks we face, and avoid being distracted by flashy risks that don’t actually have much impact.

Summary, full transcript and extra links to learn more.

To help make sense of the uncertainties we face in life he has had to invent concepts like the microlife, or a 30-minute change in life expectancy. (https://en.wikipedia.org/wiki/Microlife)

We wanted to learn whether he thought a lifetime of work communicating science had actually had much impact on the world, and what advice he might have for people planning their careers today.

21 Jun 2017, 33min

#1 - Miles Brundage on the world's desperate need for AI strategists and policy experts

Robert Wiblin, Director of Research at 80,000 Hours, speaks with Miles Brundage, research fellow at the University of Oxford's Future of Humanity Institute. Miles studies the social implications surrounding the development of new technologies and has a particular interest in artificial general intelligence, that is, an AI system that could do most or all of the tasks humans could do.

This interview complements our profile of the importance of positively shaping artificial intelligence and our guide to careers in AI policy and strategy.

Full transcript, apply for personalised coaching to work on AI strategy, see what questions are asked when, and read extra resources to learn more.

5 Jun 2017, 55min

#0 – Introducing the 80,000 Hours Podcast

80,000 Hours is a non-profit that provides research and other support to help people switch into careers that effectively tackle the world's most pressing problems. This podcast is just one of many things we offer; you can find the others at 80000hours.org.

Since 2017 this show has been putting out interviews about the world's most pressing problems and how to solve them — which some people enjoy because they love to learn about important things, and others use to figure out what they want to do with their careers or with their charitable giving.

If you haven't yet spent a lot of time with 80,000 Hours or our general style of thinking, called effective altruism, it's probably really helpful to first go through the episodes that set the scene, explain our overall perspective on things, and generally offer all the background information you need to get the most out of the episodes we're making now.

That's why we've made a new feed with ten carefully selected episodes from the show's archives, called 'Effective Altruism: An Introduction'. You can find it by searching for 'Effective Altruism' in your podcasting app or at 80000hours.org/intro.

Or, if you’d rather listen on this feed, here are the ten episodes we recommend you listen to first:

  • #21 – Holden Karnofsky on the world's most intellectual foundation and how philanthropy can have maximum impact by taking big risks
  • #6 – Toby Ord on why the long-term future of humanity matters more than anything else and what we should do about it
  • #17 – Will MacAskill on why our descendants might view us as moral monsters
  • #39 – Spencer Greenberg on the scientific approach to updating your beliefs when you get new evidence
  • #44 – Paul Christiano on developing real solutions to the 'AI alignment problem'
  • #60 – What Professor Tetlock learned from 40 years studying how to predict the future
  • #46 – Hilary Greaves on moral cluelessness, population ethics and tackling global issues in academia
  • #71 – Benjamin Todd on the key ideas of 80,000 Hours
  • #50 – Dave Denkenberger on how we might feed all 8 billion people through a nuclear winter
  • 80,000 Hours Team chat #3 – Koehler and Todd on the core idea of effective altruism and how to argue for it

1 May 2017, 3min
