#222 – Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

We don’t know how AIs think or why they do what they do. Or at least, we don’t know much. That fact is only becoming more troubling as AIs grow more capable and appear on track to wield enormous cultural influence, directly advise on major government decisions, and even operate military equipment autonomously. We simply can’t tell what models, if any, should be trusted with such authority.

Neel Nanda of Google DeepMind is one of the founding figures of the field of machine learning trying to fix this situation — mechanistic interpretability (or “mech interp”). The project has generated enormous hype, exploding from a handful of researchers five years ago to hundreds today — all working to make sense of the jumble of tens of thousands of numbers that frontier AIs use to process information and decide what to say or do.

Full transcript, video, and links to learn more: https://80k.info/nn1

Neel now has a warning for us: the most ambitious vision of mech interp he once dreamed of is probably dead. He doesn’t see a path to deeply and reliably understanding what AIs are thinking. The technical and practical barriers are simply too great to get us there in time, before competitive pressures push us to deploy human-level or superhuman AIs. Indeed, Neel argues no one approach will guarantee alignment, and our only choice is the “Swiss cheese” model of accident prevention, layering multiple safeguards on top of one another.

But while mech interp won’t be a silver bullet for AI safety, it has nevertheless had some major successes and will be one of the best tools in our arsenal.

For instance: by inspecting the neural activations in the middle of an AI’s thoughts, we can pick up many of the concepts the model is thinking about — from the Golden Gate Bridge, to refusing to answer a question, to the option of deceiving the user. While we can’t know all the thoughts a model is having all the time, picking up 90% of the concepts it is using 90% of the time should help us muddle through, so long as mech interp is paired with other techniques to fill in the gaps.

This episode was recorded on July 17 and 21, 2025.

Part 2 of the conversation is now available! https://80k.info/nn2

What did you think? https://forms.gle/xKyUrGyYpYenp8N4A

Chapters:

Cold open (00:00)
Who's Neel Nanda? (01:02)
How would mechanistic interpretability help with AGI (01:59)
What's mech interp? (05:09)
How Neel changed his take on mech interp (09:47)
Top successes in interpretability (15:53)
Probes can cheaply detect harmful intentions in AIs (20:06)
In some ways we understand AIs better than human minds (26:49)
Mech interp won't solve all our AI alignment problems (29:21)
Why mech interp is the 'biology' of neural networks (38:07)
Interpretability can't reliably find deceptive AI – nothing can (40:28)
'Black box' interpretability — reading the chain of thought (49:39)
'Self-preservation' isn't always what it seems (53:06)
For how long can we trust the chain of thought (01:02:09)
We could accidentally destroy chain of thought's usefulness (01:11:39)
Models can tell when they're being tested and act differently (01:16:56)
Top complaints about mech interp (01:23:50)
Why everyone's excited about sparse autoencoders (SAEs) (01:37:52)
Limitations of SAEs (01:47:16)
SAEs performance on real-world tasks (01:54:49)
Best arguments in favour of mech interp (02:08:10)
Lessons from the hype around mech interp (02:12:03)
Where mech interp will shine in coming years (02:17:50)
Why focus on understanding over control (02:21:02)
If AI models are conscious, will mech interp help us figure it out (02:24:09)
Neel's new research philosophy (02:26:19)
Who should join the mech interp field (02:38:31)
Advice for getting started in mech interp (02:46:55)
Keeping up to date with mech interp results (02:54:41)
Who's hiring and where to work? (02:57:43)

Host: Rob Wiblin
Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

Oppdag Premium

Prøv 14 dager gratis

Kjøp Premium

Episoder(304)

#15 - Phil Tetlock on how chimps beat Berkeley undergrads and when it’s wise to defer to the wise

Prof Philip Tetlock is a social science legend. Over forty years he has researched whose predictions we can trust, whose we can’t and why - and developed methods that allow all of us to be better at predicting the future. After the Iraq WMDs fiasco, the US intelligence services hired him to figure out how to ensure they’d never screw up that badly again. The result of that work – Superforecasting – was a media sensation in 2015. Full transcript, brief summary, apply for coaching and links to learn more. It described Tetlock’s Good Judgement Project, which found forecasting methods so accurate they beat everyone else in open competition, including thousands of people in the intelligence services with access to classified information. Today he’s working to develop the best forecasting process ever, by combining top human and machine intelligence in the Hybrid Forecasting Competition, which you can sign up and participate in. We start by describing his key findings, and then push to the edge of what is known about how to foresee the unforeseeable: * Should people who want to be right just adopt the views of experts rather than apply their own judgement? * Why are Berkeley undergrads worse forecasters than dart-throwing chimps? * Should I keep my political views secret, so it will be easier to change them later? * How can listeners contribute to his latest cutting-edge research? * What do we know about our accuracy at predicting low-probability high-impact disasters? * Does his research provide an intellectual basis for populist political movements? * Was the Iraq War caused by bad politics, or bad intelligence methods? * What can we learn about forecasting from the 2016 election? * Can experience help people avoid overconfidence and underconfidence? * When does an AI easily beat human judgement? * Could more accurate forecasting methods make the world more dangerous? * How much does demographic diversity line up with cognitive diversity? * What are the odds we’ll go to war with China? * Should we let prediction tournaments run most of the government? Listen to it. Get free, one-on-one career advice. Want to work on important social science research like Tetlock? We’ve helped hundreds of people compare their options and get introductions. Find out if our coaching can help you.

20 Nov 20171h 24min

#14 - Sharon Nunez & Jose Valle on going undercover to expose animal abuse

What if you knew that ducks were being killed with pitchforks? Rabbits dumped alive into containers? Or pigs being strangled with forklifts? Would you be willing to go undercover to expose the crime? That’s a real question that confronts volunteers at Animal Equality (AE). In this episode we speak to Sharon Nunez and Jose Valle, who founded AE in 2006 and then grew it into a multi-million dollar international animal rights organisation. They’ve been chosen as one of the most effective animal protection orgs in the world by Animal Charity Evaluators for the last 3 consecutive years. Blog post about the episode, including links and full transcript. A related previous episode, strongly recommended: Lewis Bollard on how to end factory farming as soon as possible. In addition to undercover investigations AE has also designed a 3D virtual-reality farm experience called iAnimal360. People get to experience being trapped in a cage – in a room designed to kill then - and can’t just look away. How big an impact is this having on users? Sharon Nuñez and Jose Valle also tackle: * How do they track their goals and metrics week to week? * How much does an undercover investigation cost? * Why don’t people donate more to factory farmed animals, given that they’re the vast majority of animals harmed directly by humans? * How risky is it to attempt to build a career in animal advocacy? * What led to a change in their focus from bullfighting in Spain to animal farming? * How does working with governments or corporate campaigns compare with early strategies like creating new vegans/vegetarians? * Has their very rapid growth been difficult to handle? * What should our listeners study or do if they want to work in this area? * How can we get across the message that horrific cases are a feature - not a bug - of factory farming? * Do the owners or workers of factory farms ever express shame at what they do?

13 Nov 20171h 25min

#13 - Claire Walsh on testing which policies work & how to get governments to listen to the results

In both rich and poor countries, government policy is often based on no evidence at all and many programs don’t work. This has particularly harsh effects on the global poor - in some countries governments only spend $100 on each citizen a year so they can’t afford to waste a single dollar. Enter MIT’s Poverty Action Lab (J-PAL). Since 2003 they’ve conducted experiments to figure out what policies actually help recipients, and then tried to get them implemented by governments and non-profits. Claire Walsh leads J-PAL’s Government Partnership Initiative, which works to evaluate policies and programs in collaboration with developing world governments, scale policies that have been shown to work, and generally promote a culture of evidence-based policymaking. Summary, links to career opportunities and topics discussed in the show. We discussed (her views only, not J-PAL’s): * How can they get evidence backed policies adopted? Do politicians in the developing world even care whether their programs actually work? Is the norm evidence-based policy, or policy-based evidence? * Is evidence-based policy an evidence-based strategy itself? * Which policies does she think would have a particularly large impact on human welfare relative to their cost? * How did she come to lead one of J-PAL’s departments at 29? * How do you evaluate the effectiveness of energy and environment programs (Walsh’s area of expertise), and what are the standout approaches in that area? * 80,000 Hours has warned people about the downsides of starting your career in a non-profit. Walsh started her career in a non-profit and has thrived, so are we making a mistake? * Other than J-PAL, what are the best places to work in development? What are the best subjects to study? Where can you go network to break into the sector? * Is living in poverty as bad as we think? And plenty of other things besides. We haven’t run an RCT to test whether this episode will actually help your career, but I suggest you listen anyway. Trust my intuition on this one.

31 Okt 201752min

#12 - Beth Cameron works to stop you dying in a pandemic. Here’s what keeps her up at night.

“When you're in the middle of a crisis and you have to ask for money, you're already too late.” That’s Dr Beth Cameron, who leads Global Biological Policy and Programs at the Nuclear Threat Initiative. Beth should know. She has years of experience preparing for and fighting the diseases of our nightmares, on the White House Ebola Taskforce, in the National Security Council staff, and as the Assistant Secretary of Defense for Nuclear, Chemical and Biological Defense Programs. Summary, list of career opportunities, extra links to learn more and coaching application. Unfortunately, the countries of the world aren’t prepared for a crisis - and like children crowded into daycare, there’s a good chance something will make us all sick at once. During past pandemics countries have dragged their feet over who will pay to contain them, or struggled to move people and supplies where they needed to be. At the same time advanced biotechnology threatens to make it possible for terrorists to bring back smallpox - or create something even worse. In this interview we look at the current state of play in disease control, what needs to change, and how you can build the career capital necessary to make those changes yourself. That includes: * What and where to study, and where to begin a career in pandemic preparedness. Below you’ll find a lengthy list of people and places mentioned in the interview, and others we’ve had recommended to us. * How the Nuclear Threat Initiative, with just 50 people, collaborates with governments around the world to reduce the risk of nuclear or biological catastrophes, and whether they might want to hire you. * The best strategy for containing pandemics. * Why we lurch from panic, to neglect, to panic again when it comes to protecting ourselves from contagious diseases. * Current reform efforts within the World Health Organisation, and attempts to prepare partial vaccines ahead of time. * Which global health security groups most impress Beth, and what they’re doing. * What new technologies could be invented to make us safer. * Whether it’s possible to help solve the problem through mass advocacy. * Much more besides. Get free, one-on-one career advice to improve biosecurity Considering a relevant grad program like a biology PhD, medicine, or security studies? Able to apply for a relevant job already? We’ve helped dozens of people plan their careers to work on pandemic preparedness and put them in touch with mentors. If you want to work on the problem discussed in this episode, you should apply for coaching: Read more

25 Okt 20171h 45min

#11 - Spencer Greenberg on speeding up social science 10-fold & why plenty of startups cause harm

Do most meat eaters think it’s wrong to hurt animals? Do Americans think climate change is likely to cause human extinction? What is the best, state-of-the-art therapy for depression? How can we make academics more intellectually honest, so we can actually trust their findings? How can we speed up social science research ten-fold? Do most startups improve the world, or make it worse? If you’re interested in these question, this interview is for you. Click for a full transcript, links discussed in the show, etc. A scientist, entrepreneur, writer and mathematician, Spencer Greenberg is constantly working to create tools to speed up and improve research and critical thinking. These include: * Rapid public opinion surveys to find out what most people actually think about animal consciousness, farm animal welfare, the impact of developing world charities and the likelihood of extinction by various different means; * Tools to enable social science research to be run en masse very cheaply; * ClearerThinking.org, a highly popular site for improving people’s judgement and decision-making; * Ways to transform data analysis methods to ensure that papers only show true findings; * Innovative research methods; * Ways to decide which research projects are actually worth pursuing. In this interview, Spencer discusses all of these and more. If you don’t feel like listening, that just shows that you have poor judgement and need to benefit from his wisdom even more! Get free, one-on-one career advice We’ve helped hundreds of people compare their options, get introductions, and find high impact jobs. If you want to work on any of the problems discussed in this episode, find out if our coaching can help you.

17 Okt 20171h 29min

#10 - Nick Beckstead on how to spend billions of dollars preventing human extinction

What if you were in a position to give away billions of dollars to improve the world? What would you do with it? This is the problem facing Program Officers at the Open Philanthropy Project - people like Dr Nick Beckstead. Following a PhD in philosophy, Nick works to figure out where money can do the most good. He’s been involved in major grants in a wide range of areas, including ending factory farming through technological innovation, safeguarding the world from advances in biotechnology and artificial intelligence, and spreading rational compassion. Full transcript, coaching application form, overview of the conversation, and links to resources discussed in the episode: This episode is a tour through some of the toughest questions ‘effective altruists’ face when figuring out how to best improve the world, including: * * Should we mostly try to help people currently alive, or future generations? Nick studied this question for years in his PhD thesis, On the Overwhelming Importance of Shaping the Far Future. (The first 31 minutes is a snappier version of my conversation with Toby Ord.) * Is clean meat (aka *in vitro* meat) technologically feasible any time soon, or should we be looking for plant-based alternatives? * What are the greatest risks to human civilisation? * To stop malaria is it more cost-effective to use technology to eliminate mosquitos than to distribute bed nets? * Should people who want to improve the future work for changes that will be very useful in a specific scenario, or just generally try to improve how well humanity makes decisions? * What specific jobs should our listeners take in order for Nick to be able to spend more money in useful ways to improve the world? * Should we expect the future to be better if the economy grows more quickly - or more slowly? Get free, one-on-one career advice We’ve helped dozens of people compare between their options, get introductions, and jobs important for the the long-run future. If you want to work on any of the problems discussed in this episode, find out if our coaching can help you.

11 Okt 20171h 51min

#9 - Christine Peterson on how insecure computers could lead to global disaster, and how to fix it

Take a trip to Silicon Valley in the 70s and 80s, when going to space sounded like a good way to get around environmental limits, people started cryogenically freezing themselves, and nanotechnology looked like it might revolutionise industry – or turn us all into grey goo. Full transcript, coaching application form, overview of the conversation, and extra resources to learn more: In this episode of the 80,000 Hours Podcast Christine Peterson takes us back to her youth in the Bay Area, the ideas she encountered there, and what the dreamers she met did as they grew up. We also discuss how she came up with the term ‘open source software’ (and how she had to get someone else to propose it). Today Christine helps runs the Foresight Institute, which fills a gap left by for-profit technology companies – predicting how new revolutionary technologies could go wrong, and ensuring we steer clear of the downsides. We dive into: * Whether the poor security of computer systems poses a catastrophic risk for the world. Could all our essential services be taken down at once? And if so, what can be done about it? * Can technology ‘move fast and break things’ without eventually breaking the world? Would it be better for technology to advance more quickly, or more slowly? * How Christine came up with the term ‘open source software’ (and why someone else had to propose it). * Will AIs designed for wide-scale automated hacking make computers more or less secure? * Would it be good to radically extend human lifespan? Is it sensible to cryogenically freeze yourself in the hope of being resurrected in the future? * Could atomically precise manufacturing (nanotechnology) really work? Why was it initially so controversial and why did people stop worrying about it? * Should people who try to do good in their careers work long hours and take low salaries? Or should they take care of themselves first of all? * How she thinks the the effective altruism community resembles the scene she was involved with when she was wrong, and where it might be going wrong. Get free, one-on-one career advice We’ve helped dozens of people compare between their options, get introductions, and jobs important for the the long-run future. If you want to work on any of the problems discussed in this episode, find out if our coaching can help you.

4 Okt 20171h 45min

#8 - Lewis Bollard on how to end factory farming in our lifetimes

Every year tens of billions of animals are raised in terrible conditions in factory farms before being killed for human consumption. Over the last two years Lewis Bollard – Project Officer for Farm Animal Welfare at the Open Philanthropy Project – has conducted extensive research into the best ways to eliminate animal suffering in farms as soon as possible. This has resulted in $30 million in grants to farm animal advocacy. Full transcript, coaching application form, overview of the conversation, and extra resources to learn more: We covered almost every approach being taken, which ones work, and how individuals can best contribute through their careers. We also had time to venture into a wide range of issues that are less often discussed, including: * Why Lewis thinks insect farming would be worse than the status quo, and whether we should look for ‘humane’ insecticides; * How young people can set themselves up to contribute to scientific research into meat alternatives; * How genetic manipulation of chickens has caused them to suffer much more than their ancestors, but could also be used to make them better off; * Why Lewis is skeptical of vegan advocacy; * Why he doubts that much can be done to tackle factory farming through legal advocacy or electoral politics; * Which species of farm animals is best to focus on first; * Whether fish and crustaceans are conscious, and if so what can be done for them; * Many other issues listed below in the Overview of the discussion. Get free, one-on-one career advice We’ve helped dozens of people compare between their options, get introductions, and jobs important for the the long-run future. If you want to work on any of the problems discussed in this episode, find out if our coaching can help you. Overview of the discussion **2m40s** What originally drew you to dedicate your career to helping animals and why did Open Philanthropy end up focusing on it? **5m40s** Do you have any concrete way of assessing the severity of animal suffering? **7m10s** Do you think the environmental gains are large compared to those that we might hope to get from animal welfare improvement? **7m55s** What grants have you made at Open Phil? How did you go about deciding which groups to fund and which ones not to fund? **9m50s** Why does Open Phil focus on chickens and fish? Is this the right call? More...

27 Sep 20173h 16min

Reklamefrie Premium-podkaster

Hør populære podkaster som Storefri med Mikkel og Herman, Ida med hjertet i hånden, Krimpodden og mye mye mer

Skap din egen podkastboble

I appen skaper du ditt eget bibliotek med favoritter, og vi gir deg også anbefalinger til podkaster du ikke kan gå glipp av.

Prøv 14 dager gratis

Dersom du er ny Podme-bruker får du 14 dager gratis prøveperiode når du oppretter abonnement

Premium

99 kr/ måned

Tilgang til alle våre Premium-podkaster
Alle podkaster fra VG, Aftenposten, BT og SA
Reklamefritt Premium-innhold
Ingen bindingstid. Avslutt når du ønsker

Prøv 14 dager gratis

Premium

129 kr/ måned

Tilgang til alle Premium-podkaster
Alle podkaster fra VG, Aftenposten, BT og SA
Reklamefritt Premium-innhold
Ingen bindingstid. Avslutt når du ønsker
En Ekstra bruker

Prøv 14 dager gratis

Populært innen Fakta

relasjonspodden-med-dora-thorhallsdottir-kjersti-idem

Historiene og stemmene du vil høre

Ubegrenset tilgang til alle dine favorittpodkaster og lydbøker

Les mer