#222 – Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

We don’t know how AIs think or why they do what they do. Or at least, we don’t know much. That fact is only becoming more troubling as AIs grow more capable and appear on track to wield enormous cultural influence, directly advise on major government decisions, and even operate military equipment autonomously. We simply can’t tell what models, if any, should be trusted with such authority.

Neel Nanda of Google DeepMind is one of the founding figures of the field of machine learning trying to fix this situation — mechanistic interpretability (or “mech interp”). The project has generated enormous hype, exploding from a handful of researchers five years ago to hundreds today — all working to make sense of the jumble of tens of thousands of numbers that frontier AIs use to process information and decide what to say or do.

Full transcript, video, and links to learn more: https://80k.info/nn1

Neel now has a warning for us: the most ambitious vision of mech interp he once dreamed of is probably dead. He doesn’t see a path to deeply and reliably understanding what AIs are thinking. The technical and practical barriers are simply too great to get us there in time, before competitive pressures push us to deploy human-level or superhuman AIs. Indeed, Neel argues no one approach will guarantee alignment, and our only choice is the “Swiss cheese” model of accident prevention, layering multiple safeguards on top of one another.

But while mech interp won’t be a silver bullet for AI safety, it has nevertheless had some major successes and will be one of the best tools in our arsenal.

For instance: by inspecting the neural activations in the middle of an AI’s thoughts, we can pick up many of the concepts the model is thinking about — from the Golden Gate Bridge, to refusing to answer a question, to the option of deceiving the user. While we can’t know all the thoughts a model is having all the time, picking up 90% of the concepts it is using 90% of the time should help us muddle through, so long as mech interp is paired with other techniques to fill in the gaps.

This episode was recorded on July 17 and 21, 2025.

Part 2 of the conversation is now available! https://80k.info/nn2

What did you think? https://forms.gle/xKyUrGyYpYenp8N4A

Chapters:

Cold open (00:00)
Who's Neel Nanda? (01:02)
How would mechanistic interpretability help with AGI (01:59)
What's mech interp? (05:09)
How Neel changed his take on mech interp (09:47)
Top successes in interpretability (15:53)
Probes can cheaply detect harmful intentions in AIs (20:06)
In some ways we understand AIs better than human minds (26:49)
Mech interp won't solve all our AI alignment problems (29:21)
Why mech interp is the 'biology' of neural networks (38:07)
Interpretability can't reliably find deceptive AI – nothing can (40:28)
'Black box' interpretability — reading the chain of thought (49:39)
'Self-preservation' isn't always what it seems (53:06)
For how long can we trust the chain of thought (01:02:09)
We could accidentally destroy chain of thought's usefulness (01:11:39)
Models can tell when they're being tested and act differently (01:16:56)
Top complaints about mech interp (01:23:50)
Why everyone's excited about sparse autoencoders (SAEs) (01:37:52)
Limitations of SAEs (01:47:16)
SAEs performance on real-world tasks (01:54:49)
Best arguments in favour of mech interp (02:08:10)
Lessons from the hype around mech interp (02:12:03)
Where mech interp will shine in coming years (02:17:50)
Why focus on understanding over control (02:21:02)
If AI models are conscious, will mech interp help us figure it out (02:24:09)
Neel's new research philosophy (02:26:19)
Who should join the mech interp field (02:38:31)
Advice for getting started in mech interp (02:46:55)
Keeping up to date with mech interp results (02:54:41)
Who's hiring and where to work? (02:57:43)

Host: Rob Wiblin
Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

Kokeile Premiumia

Nauti 14 päivää ilmaiseksi

Tilaa Premium

Jaksot(302)

#13 - Claire Walsh on testing which policies work & how to get governments to listen to the results

In both rich and poor countries, government policy is often based on no evidence at all and many programs don’t work. This has particularly harsh effects on the global poor - in some countries governments only spend $100 on each citizen a year so they can’t afford to waste a single dollar. Enter MIT’s Poverty Action Lab (J-PAL). Since 2003 they’ve conducted experiments to figure out what policies actually help recipients, and then tried to get them implemented by governments and non-profits. Claire Walsh leads J-PAL’s Government Partnership Initiative, which works to evaluate policies and programs in collaboration with developing world governments, scale policies that have been shown to work, and generally promote a culture of evidence-based policymaking. Summary, links to career opportunities and topics discussed in the show. We discussed (her views only, not J-PAL’s): * How can they get evidence backed policies adopted? Do politicians in the developing world even care whether their programs actually work? Is the norm evidence-based policy, or policy-based evidence? * Is evidence-based policy an evidence-based strategy itself? * Which policies does she think would have a particularly large impact on human welfare relative to their cost? * How did she come to lead one of J-PAL’s departments at 29? * How do you evaluate the effectiveness of energy and environment programs (Walsh’s area of expertise), and what are the standout approaches in that area? * 80,000 Hours has warned people about the downsides of starting your career in a non-profit. Walsh started her career in a non-profit and has thrived, so are we making a mistake? * Other than J-PAL, what are the best places to work in development? What are the best subjects to study? Where can you go network to break into the sector? * Is living in poverty as bad as we think? And plenty of other things besides. We haven’t run an RCT to test whether this episode will actually help your career, but I suggest you listen anyway. Trust my intuition on this one.

31 Loka 201752min

#12 - Beth Cameron works to stop you dying in a pandemic. Here’s what keeps her up at night.

“When you're in the middle of a crisis and you have to ask for money, you're already too late.” That’s Dr Beth Cameron, who leads Global Biological Policy and Programs at the Nuclear Threat Initiative. Beth should know. She has years of experience preparing for and fighting the diseases of our nightmares, on the White House Ebola Taskforce, in the National Security Council staff, and as the Assistant Secretary of Defense for Nuclear, Chemical and Biological Defense Programs. Summary, list of career opportunities, extra links to learn more and coaching application. Unfortunately, the countries of the world aren’t prepared for a crisis - and like children crowded into daycare, there’s a good chance something will make us all sick at once. During past pandemics countries have dragged their feet over who will pay to contain them, or struggled to move people and supplies where they needed to be. At the same time advanced biotechnology threatens to make it possible for terrorists to bring back smallpox - or create something even worse. In this interview we look at the current state of play in disease control, what needs to change, and how you can build the career capital necessary to make those changes yourself. That includes: * What and where to study, and where to begin a career in pandemic preparedness. Below you’ll find a lengthy list of people and places mentioned in the interview, and others we’ve had recommended to us. * How the Nuclear Threat Initiative, with just 50 people, collaborates with governments around the world to reduce the risk of nuclear or biological catastrophes, and whether they might want to hire you. * The best strategy for containing pandemics. * Why we lurch from panic, to neglect, to panic again when it comes to protecting ourselves from contagious diseases. * Current reform efforts within the World Health Organisation, and attempts to prepare partial vaccines ahead of time. * Which global health security groups most impress Beth, and what they’re doing. * What new technologies could be invented to make us safer. * Whether it’s possible to help solve the problem through mass advocacy. * Much more besides. Get free, one-on-one career advice to improve biosecurity Considering a relevant grad program like a biology PhD, medicine, or security studies? Able to apply for a relevant job already? We’ve helped dozens of people plan their careers to work on pandemic preparedness and put them in touch with mentors. If you want to work on the problem discussed in this episode, you should apply for coaching: Read more

25 Loka 20171h 45min

#11 - Spencer Greenberg on speeding up social science 10-fold & why plenty of startups cause harm

Do most meat eaters think it’s wrong to hurt animals? Do Americans think climate change is likely to cause human extinction? What is the best, state-of-the-art therapy for depression? How can we make academics more intellectually honest, so we can actually trust their findings? How can we speed up social science research ten-fold? Do most startups improve the world, or make it worse? If you’re interested in these question, this interview is for you. Click for a full transcript, links discussed in the show, etc. A scientist, entrepreneur, writer and mathematician, Spencer Greenberg is constantly working to create tools to speed up and improve research and critical thinking. These include: * Rapid public opinion surveys to find out what most people actually think about animal consciousness, farm animal welfare, the impact of developing world charities and the likelihood of extinction by various different means; * Tools to enable social science research to be run en masse very cheaply; * ClearerThinking.org, a highly popular site for improving people’s judgement and decision-making; * Ways to transform data analysis methods to ensure that papers only show true findings; * Innovative research methods; * Ways to decide which research projects are actually worth pursuing. In this interview, Spencer discusses all of these and more. If you don’t feel like listening, that just shows that you have poor judgement and need to benefit from his wisdom even more! Get free, one-on-one career advice We’ve helped hundreds of people compare their options, get introductions, and find high impact jobs. If you want to work on any of the problems discussed in this episode, find out if our coaching can help you.

17 Loka 20171h 29min

#10 - Nick Beckstead on how to spend billions of dollars preventing human extinction

What if you were in a position to give away billions of dollars to improve the world? What would you do with it? This is the problem facing Program Officers at the Open Philanthropy Project - people like Dr Nick Beckstead. Following a PhD in philosophy, Nick works to figure out where money can do the most good. He’s been involved in major grants in a wide range of areas, including ending factory farming through technological innovation, safeguarding the world from advances in biotechnology and artificial intelligence, and spreading rational compassion. Full transcript, coaching application form, overview of the conversation, and links to resources discussed in the episode: This episode is a tour through some of the toughest questions ‘effective altruists’ face when figuring out how to best improve the world, including: * * Should we mostly try to help people currently alive, or future generations? Nick studied this question for years in his PhD thesis, On the Overwhelming Importance of Shaping the Far Future. (The first 31 minutes is a snappier version of my conversation with Toby Ord.) * Is clean meat (aka *in vitro* meat) technologically feasible any time soon, or should we be looking for plant-based alternatives? * What are the greatest risks to human civilisation? * To stop malaria is it more cost-effective to use technology to eliminate mosquitos than to distribute bed nets? * Should people who want to improve the future work for changes that will be very useful in a specific scenario, or just generally try to improve how well humanity makes decisions? * What specific jobs should our listeners take in order for Nick to be able to spend more money in useful ways to improve the world? * Should we expect the future to be better if the economy grows more quickly - or more slowly? Get free, one-on-one career advice We’ve helped dozens of people compare between their options, get introductions, and jobs important for the the long-run future. If you want to work on any of the problems discussed in this episode, find out if our coaching can help you.

11 Loka 20171h 51min

#9 - Christine Peterson on how insecure computers could lead to global disaster, and how to fix it

Take a trip to Silicon Valley in the 70s and 80s, when going to space sounded like a good way to get around environmental limits, people started cryogenically freezing themselves, and nanotechnology looked like it might revolutionise industry – or turn us all into grey goo. Full transcript, coaching application form, overview of the conversation, and extra resources to learn more: In this episode of the 80,000 Hours Podcast Christine Peterson takes us back to her youth in the Bay Area, the ideas she encountered there, and what the dreamers she met did as they grew up. We also discuss how she came up with the term ‘open source software’ (and how she had to get someone else to propose it). Today Christine helps runs the Foresight Institute, which fills a gap left by for-profit technology companies – predicting how new revolutionary technologies could go wrong, and ensuring we steer clear of the downsides. We dive into: * Whether the poor security of computer systems poses a catastrophic risk for the world. Could all our essential services be taken down at once? And if so, what can be done about it? * Can technology ‘move fast and break things’ without eventually breaking the world? Would it be better for technology to advance more quickly, or more slowly? * How Christine came up with the term ‘open source software’ (and why someone else had to propose it). * Will AIs designed for wide-scale automated hacking make computers more or less secure? * Would it be good to radically extend human lifespan? Is it sensible to cryogenically freeze yourself in the hope of being resurrected in the future? * Could atomically precise manufacturing (nanotechnology) really work? Why was it initially so controversial and why did people stop worrying about it? * Should people who try to do good in their careers work long hours and take low salaries? Or should they take care of themselves first of all? * How she thinks the the effective altruism community resembles the scene she was involved with when she was wrong, and where it might be going wrong. Get free, one-on-one career advice We’ve helped dozens of people compare between their options, get introductions, and jobs important for the the long-run future. If you want to work on any of the problems discussed in this episode, find out if our coaching can help you.

4 Loka 20171h 45min

#8 - Lewis Bollard on how to end factory farming in our lifetimes

Every year tens of billions of animals are raised in terrible conditions in factory farms before being killed for human consumption. Over the last two years Lewis Bollard – Project Officer for Farm Animal Welfare at the Open Philanthropy Project – has conducted extensive research into the best ways to eliminate animal suffering in farms as soon as possible. This has resulted in $30 million in grants to farm animal advocacy. Full transcript, coaching application form, overview of the conversation, and extra resources to learn more: We covered almost every approach being taken, which ones work, and how individuals can best contribute through their careers. We also had time to venture into a wide range of issues that are less often discussed, including: * Why Lewis thinks insect farming would be worse than the status quo, and whether we should look for ‘humane’ insecticides; * How young people can set themselves up to contribute to scientific research into meat alternatives; * How genetic manipulation of chickens has caused them to suffer much more than their ancestors, but could also be used to make them better off; * Why Lewis is skeptical of vegan advocacy; * Why he doubts that much can be done to tackle factory farming through legal advocacy or electoral politics; * Which species of farm animals is best to focus on first; * Whether fish and crustaceans are conscious, and if so what can be done for them; * Many other issues listed below in the Overview of the discussion. Get free, one-on-one career advice We’ve helped dozens of people compare between their options, get introductions, and jobs important for the the long-run future. If you want to work on any of the problems discussed in this episode, find out if our coaching can help you. Overview of the discussion **2m40s** What originally drew you to dedicate your career to helping animals and why did Open Philanthropy end up focusing on it? **5m40s** Do you have any concrete way of assessing the severity of animal suffering? **7m10s** Do you think the environmental gains are large compared to those that we might hope to get from animal welfare improvement? **7m55s** What grants have you made at Open Phil? How did you go about deciding which groups to fund and which ones not to fund? **9m50s** Why does Open Phil focus on chickens and fish? Is this the right call? More...

27 Syys 20173h 16min

#7 - Julia Galef on making humanity more rational, what EA does wrong, and why Twitter isn’t all bad

The scientific revolution in the 16th century was one of the biggest societal shifts in human history, driven by the discovery of new and better methods of figuring out who was right and who was wrong. Julia Galef - a well-known writer and researcher focused on improving human judgment, especially about high stakes questions - believes that if we could again develop new techniques to predict the future, resolve disagreements and make sound decisions together, it could dramatically improve the world across the board. We brought her in to talk about her ideas. This interview complements a new detailed review of whether and how to follow Julia’s career path. Apply for personalised coaching, see what questions are asked when, and read extra resources to learn more. Julia has been host of the Rationally Speaking podcast since 2010, co-founder of the Center for Applied Rationality in 2012, and is currently working for the Open Philanthropy Project on an investigation of expert disagreements. In our conversation we ended up speaking about a wide range of topics, including: * Her research on how people can have productive intellectual disagreements. * Why she once planned to become an urban designer. * Why she doubts people are more rational than 200 years ago. * What makes her a fan of Twitter (while I think it’s dystopian). * Whether people should write more books. * Whether it’s a good idea to run a podcast, and how she grew her audience. * Why saying you don’t believe X often won’t convince people you don’t. * Why she started a PhD in economics but then stopped. * Whether she would recommend an unconventional career like her own. * Whether the incentives in the intelligence community actually support sound thinking. * Whether big institutions will actually pick up new tools for improving decision-making if they are developed. * How to start out pursuing a career in which you enhance human judgement and foresight. Get free, one-on-one career advice to help you improve judgement and decision-making We’ve helped dozens of people compare between their options, get introductions, and jobs important for the the long-run future. **If you want to work on any of the problems discussed in this episode, find out if our coaching can help you:** APPLY FOR COACHING Overview of the conversation **1m30s** So what projects are you working on at the moment? **3m50s** How are you working on the problem of expert disagreement? **6m0s** Is this the same method as the double crux process that was developed at the Center for Applied Rationality? **10m** Why did the Open Philanthropy Project decide this was a very valuable project to fund? **13m** Is the double crux process actually that effective? **14m50s** Is Facebook dangerous? **17m** What makes for a good life? Can you be mistaken about having a good life? **19m** Should more people write books? Read more...

13 Syys 20171h 14min

#6 - Toby Ord on why the long-term future matters more than anything else & what to do about it

Of all the people whose well-being we should care about, only a small fraction are alive today. The rest are members of future generations who are yet to exist. Whether they’ll be born into a world that is flourishing or disintegrating – and indeed, whether they will ever be born at all – is in large part up to us. As such, the welfare of future generations should be our number one moral concern. This conclusion holds true regardless of whether your moral framework is based on common sense, consequences, rules of ethical conduct, cooperating with others, virtuousness, keeping options open – or just a sense of wonder about the universe we find ourselves in. That’s the view of Dr Toby Ord, a philosophy Fellow at the University of Oxford and co-founder of the effective altruism community. In this episode of the 80,000 Hours Podcast Dr Ord makes the case that aiming for a positive long-term future is likely the best way to improve the world. Apply for personalised coaching, see what questions are asked when, and read extra resources to learn more. We then discuss common objections to long-termism, such as the idea that benefits to future generations are less valuable than those to people alive now, or that we can’t meaningfully benefit future generations beyond taking the usual steps to improve the present. Later the conversation turns to how individuals can and have changed the course of history, what could go wrong and why, and whether plans to colonise Mars would actually put humanity in a safer position than it is today. This episode goes deep into the most distinctive features of our advice. It’s likely the most in-depth discussion of how 80,000 Hours and the effective altruism community think about the long term future and why - and why we so often give it top priority. It’s best to subscribe, so you can listen at leisure on your phone, speed up the conversation if you like, and get notified about future episodes. You can do so by searching ‘80,000 Hours’ wherever you get your podcasts. Want to help ensure humanity has a positive future instead of destroying itself? We want to help. We’ve helped 100s of people compare between their options, get introductions, and jobs important for the the long-run future. If you want to work on any of the problems discussed in this episode, such as artificial intelligence or biosecurity, find out if our coaching can help you. Overview of the discussion 3m30s - Why is the long-term future of humanity such a big deal, and perhaps the most important issue for us to be thinking about? 9m05s - Five arguments that future generations matter 21m50s - How bad would it be if humanity went extinct or civilization collapses? 26m40s - Why do people start saying such strange things when this topic comes up? 30m30s - Are there any other reasons to prioritize thinking about the long-term future of humanity that you wanted to raise before we move to objections? 36m10s - What is this school of thought called? Read more...

6 Syys 20172h 8min

Premium

9,99 €/kk

Kaikki premium-podcastit
Ei mainoksia
Ei sitoutumista, peruuta koska tahansa

Aloita 14 päivän kokeilu

Premium

13,99 €/kk

Kaikki premium-podcastit
Ei mainoksia
Ei sitoutumista, peruuta koska tahansa
Yksi lisäkäyttäjä

Kokeile 14 päivää maksutta

#222 – Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

Kokeile Premiumia

Jaksot(302)

#13 - Claire Walsh on testing which policies work & how to get governments to listen to the results

#12 - Beth Cameron works to stop you dying in a pandemic. Here’s what keeps her up at night.

#11 - Spencer Greenberg on speeding up social science 10-fold & why plenty of startups cause harm

#10 - Nick Beckstead on how to spend billions of dollars preventing human extinction

#9 - Christine Peterson on how insecure computers could lead to global disaster, and how to fix it

#8 - Lewis Bollard on how to end factory farming in our lifetimes

#7 - Julia Galef on making humanity more rational, what EA does wrong, and why Twitter isn’t all bad

#6 - Toby Ord on why the long-term future matters more than anything else & what to do about it

Kaikki yhdessä sovelluksessa

Sinulle valikoitua sisältöä

Jatka kuuntelua koska tahansa

Premium

Premium

Suosittua kategoriassa Koulutus

Tarinat ja äänet, joita rakastat kuunnella