#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we’re unlikely to know we’ve solved the problem before human-level and superhuman systems arrive, potentially in as little as three years.

So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.

Today’s guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they’re more plausible than you might think. He argues that given companies’ unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.

Links to learn more, highlights, video, and full transcript.

As Buck puts it: "Five years ago I thought of misalignment risk from AIs as a really hard problem that you’d need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you’d probably be able to not have very much of your problem."

Of course, even if Buck is right, we still need to do those 40 things — which he points out we’re not on track for. And AI control agendas have their limitations: they aren’t likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.

Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:

  • Why he’s more worried about AI hacking its own data centre than escaping
  • What to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment research
  • Why he might want to use a model he thought could be conspiring against him
  • Why he would feel safer if he caught an AI attempting to escape
  • Why many control techniques would be relatively inexpensive
  • How to use an untrusted model to monitor another untrusted model
  • What the minimum viable intervention in a “lazy” AI company might look like
  • How even small teams of safety-focused staff within AI labs could matter
  • The moral considerations around controlling potentially conscious AI systems, and whether it’s justified

Chapters:

  • Cold open |00:00:00|
  • Who’s Buck Shlegeris? |00:01:27|
  • What’s AI control? |00:01:51|
  • Why is AI control hot now? |00:05:39|
  • Detecting human vs AI spies |00:10:32|
  • Acute vs chronic AI betrayal |00:15:21|
  • How to catch AIs trying to escape |00:17:48|
  • The cheapest AI control techniques |00:32:48|
  • Can we get untrusted models to do trusted work? |00:38:58|
  • If we catch a model escaping... will we do anything? |00:50:15|
  • Getting AI models to think they've already escaped |00:52:51|
  • Will they be able to tell it's a setup? |00:58:11|
  • Will AI companies do any of this stuff? |01:00:11|
  • Can we just give AIs fewer permissions? |01:06:14|
  • Can we stop human spies the same way? |01:09:58|
  • The pitch to AI companies to do this |01:15:04|
  • Will AIs get superhuman so fast that this is all useless? |01:17:18|
  • Risks from AI deliberately doing a bad job |01:18:37|
  • Is alignment still useful? |01:24:49|
  • Current alignment methods don't detect scheming |01:29:12|
  • How to tell if AI control will work |01:31:40|
  • How can listeners contribute? |01:35:53|
  • Is 'controlling' AIs kind of a dick move? |01:37:13|
  • Could 10 safety-focused people in an AGI company do anything useful? |01:42:27|
  • Benefits of working outside frontier AI companies |01:47:48|
  • Why Redwood Research does what it does |01:51:34|
  • What other safety-related research looks best to Buck? |01:58:56|
  • If an AI escapes, is it likely to be able to beat humanity from there? |01:59:48|
  • Will misaligned models have to go rogue ASAP, before they're ready? |02:07:04|
  • Is research on human scheming relevant to AI? |02:08:03|

This episode was originally recorded on February 21, 2025.

Video: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
Transcriptions and web: Katy Moore

#188 – Matt Clancy on whether science is good

"Suppose we make these grants, we do some of those experiments I talk about. We discover, for example — I’m just making this up — but we give people superforecasting tests when they’re doing peer review, and we find that you can identify people who are super good at picking science. And then we have this much better targeted science, and we’re making progress at a 10% faster rate than we normally would have. Over time, that aggregates up, and maybe after 10 years, we’re a year ahead of where we would have been if we hadn’t done this kind of stuff.

"Now, suppose in 10 years we’re going to discover a cheap new genetic engineering technology that anyone can use in the world if they order the right parts off of Amazon. That could be great, but could also allow bad actors to genetically engineer pandemics and basically try to do terrible things with this technology. And if we’ve brought that forward, and that happens at year nine instead of year 10 because of some of these interventions we did, now we start to think that if that’s really bad, if these people using this technology causes huge problems for humanity, it begins to sort of wash out the benefits of getting the science a little bit faster." —Matt Clancy

In today’s episode, host Luisa Rodriguez speaks to Matt Clancy — who oversees Open Philanthropy’s Innovation Policy programme — about his recent work modelling the risks and benefits of the increasing speed of scientific progress.

Links to learn more, highlights, and full transcript.

They cover:

  • Whether scientific progress is actually net positive for humanity.
  • Scenarios where accelerating science could lead to existential risks, such as advanced biotechnology being used by bad actors.
  • Why Matt thinks metascience research and targeted funding could improve the scientific process and better incentivise outcomes that are good for humanity.
  • Whether Matt trusts domain experts or superforecasters more when estimating how the future will turn out.
  • Why Matt is sceptical that AGI could really cause explosive economic growth.
  • And much more.

Chapters:

  • Is scientific progress net positive for humanity? (00:03:00)
  • The time of biological perils (00:17:50)
  • Modelling the benefits of science (00:25:48)
  • Income and health gains from scientific progress (00:32:49)
  • Discount rates (00:42:14)
  • How big are the returns to science? (00:51:08)
  • Forecasting global catastrophic biological risks from scientific progress (01:05:20)
  • What’s the value of scientific progress, given the risks? (01:15:09)
  • Factoring in extinction risk (01:21:56)
  • How science could reduce extinction risk (01:30:18)
  • Are we already too late to delay the time of perils? (01:42:38)
  • Domain experts vs superforecasters (01:46:03)
  • What Open Philanthropy’s Innovation Policy programme settled on (01:53:47)
  • Explosive economic growth (02:06:28)
  • Matt’s favourite thought experiment (02:34:57)

Producer and editor: Keiran Harris
Audio engineering lead: Ben Cordell
Technical editing: Simon Monsour, Milo McGuire, and Dominic Armstrong
Additional content editing: Katy Moore and Luisa Rodriguez
Transcriptions: Katy Moore

23 May 2024 · 2h 40min

#187 – Zach Weinersmith on how researching his book turned him from a space optimist into a "space bastard"

"Earth economists, when they measure how bad the potential for exploitation is, they look at things like, how is labour mobility? How much possibility do labourers have otherwise to go somewhere else? Well, if you are on the one company town on Mars, your labour mobility is zero, which has never existed on Earth. Even in your stereotypical West Virginian company town run by immigrant labour, there’s still, by definition, a train out. On Mars, you might not even be in the launch window. And even if there are five other company towns or five other settlements, they’re not necessarily rated to take more humans. They have their own oxygen budget, right?

"And so economists use numbers like these, like labour mobility, as a way to put an equation and estimate the ability of a company to set noncompetitive wages or to set noncompetitive work conditions. And essentially, on Mars you’re setting it to infinity." — Zach Weinersmith

In today’s episode, host Luisa Rodriguez speaks to Zach Weinersmith — the cartoonist behind Saturday Morning Breakfast Cereal — about the latest book he wrote with his wife Kelly: A City on Mars: Can We Settle Space, Should We Settle Space, and Have We Really Thought This Through?

Links to learn more, highlights, and full transcript.

They cover:

  • Why space travel is suddenly getting a lot cheaper and re-igniting enthusiasm around space settlement.
  • What Zach thinks are the best and worst arguments for settling space.
  • Zach’s journey from optimistic about space settlement to a self-proclaimed “space bastard” (pessimist).
  • How little we know about how microgravity and radiation affect even adults, much less the children potentially born in a space settlement.
  • A rundown of where we could settle in the solar system, and the major drawbacks of even the most promising candidates.
  • Why digging bunkers or underwater cities on Earth would beat fleeing to Mars in a catastrophe.
  • How new space settlements could look a lot like old company towns — and whether or not that’s a bad thing.
  • The current state of space law and how it might set us up for international conflict.
  • How space cannibalism legal loopholes might work on the International Space Station.
  • And much more.

Chapters:

  • Space optimism and space bastards (00:03:04)
  • Bad arguments for why we should settle space (00:14:01)
  • Superficially plausible arguments for why we should settle space (00:28:54)
  • Is settling space even biologically feasible? (00:32:43)
  • Sex, pregnancy, and child development in space (00:41:41)
  • Where’s the best space place to settle? (00:55:02)
  • Creating self-sustaining habitats (01:15:32)
  • What about AI advances? (01:26:23)
  • A roadmap for settling space (01:33:45)
  • Space law (01:37:22)
  • Space signalling and propaganda (01:51:28)
  • Space war (02:00:40)
  • Mining asteroids (02:06:29)
  • Company towns and communes in space (02:10:55)
  • Sending digital minds into space (02:26:37)
  • The most promising space governance models (02:29:07)
  • The tragedy of the commons (02:35:02)
  • The tampon bandolier and other bodily functions in space (02:40:14)
  • Is space cannibalism legal? (02:47:09)
  • The pregnadrome and other bizarre proposals (02:50:02)
  • Space sexism (02:58:38)
  • What excites Zach about the future (03:02:57)

Producer and editor: Keiran Harris
Audio engineering lead: Ben Cordell
Technical editing: Simon Monsour, Milo McGuire, and Dominic Armstrong
Additional content editing: Katy Moore and Luisa Rodriguez
Transcriptions: Katy Moore

14 May 2024 · 3h 6min

#186 – Dean Spears on why babies are born small in Uttar Pradesh, and how to save their lives

"I work in a place called Uttar Pradesh, which is a state in India with 240 million people. One in every 33 people in the whole world lives in Uttar Pradesh. It would be the fifth largest country if it were its own country. And if it were its own country, you’d probably know about its human development challenges, because it would have the highest neonatal mortality rate of any country except for South Sudan and Pakistan. Forty percent of children there are stunted. Only two-thirds of women are literate. So Uttar Pradesh is a place where there are lots of health challenges.

"And then even within that, we’re working in a district called Bahraich, where about 4 million people live. So even that district of Uttar Pradesh is the size of a country, and if it were its own country, it would have a higher neonatal mortality rate than any other country. In other words, babies born in Bahraich district are more likely to die in their first month of life than babies born in any country around the world." — Dean Spears

In today’s episode, host Luisa Rodriguez speaks to Dean Spears — associate professor of economics at the University of Texas at Austin and founding director of r.i.c.e. — about his experience implementing a surprisingly low-tech but highly cost-effective kangaroo mother care programme in Uttar Pradesh, India, to save the lives of vulnerable newborn infants.

Links to learn more, highlights, and full transcript.

They cover:

  • The shockingly high neonatal mortality rates in Uttar Pradesh, India, and how social inequality and gender dynamics contribute to poor health outcomes for both mothers and babies.
  • The remarkable benefits for vulnerable newborns that come from skin-to-skin contact and breastfeeding support.
  • The challenges and opportunities that come with working with a government hospital to implement new, evidence-based programmes.
  • How the currently small programme might be scaled up to save more newborns’ lives in other regions of Uttar Pradesh and beyond.
  • How targeted health interventions stack up against direct cash transfers.
  • Plus, a sneak peek into Dean’s new book, which explores the looming global population peak that’s expected around 2080, and the consequences of global depopulation.
  • And much more.

Chapters:

  • Why is low birthweight a major problem in Uttar Pradesh? (00:02:45)
  • Neonatal mortality and maternal health in Uttar Pradesh (00:06:10)
  • Kangaroo mother care (00:12:08)
  • What would happen without this intervention? (00:16:07)
  • Evidence of KMC’s effectiveness (00:18:15)
  • Longer-term outcomes (00:32:14)
  • GiveWell’s support and implementation challenges (00:41:13)
  • How can KMC be so cost effective? (00:52:38)
  • Programme evaluation (00:57:21)
  • Is KMC better than direct cash transfers? (00:59:12)
  • Expanding the programme and what skills are needed (01:01:29)
  • Fertility and population decline (01:07:28)
  • What advice Dean would give his younger self (01:16:09)

Producer and editor: Keiran Harris
Audio engineering lead: Ben Cordell
Technical editing: Simon Monsour, Milo McGuire, and Dominic Armstrong
Additional content editing: Katy Moore and Luisa Rodriguez
Transcriptions: Katy Moore

1 May 2024 · 1h 18min

#185 – Lewis Bollard on the 7 most promising ways to end factory farming, and whether AI is going to be good or bad for animals

"The constraint right now on factory farming is how far can you push the biology of these animals? But AI could remove that constraint. It could say, 'Actually, we can push them further in these ways and these ways, and they still stay alive. And we’ve modelled out every possibility and we’ve found that it works.' I think another possibility, which I don’t understand as well, is that AI could lock in current moral values. And I think in particular there’s a risk that if AI is learning from what we do as humans today, the lesson it’s going to learn is that it’s OK to tolerate mass cruelty, so long as it occurs behind closed doors. I think there’s a risk that if it learns that, then it perpetuates that value, and perhaps slows human moral progress on this issue." —Lewis Bollard

In today’s episode, host Luisa Rodriguez speaks to Lewis Bollard — director of the Farm Animal Welfare programme at Open Philanthropy — about the promising progress and future interventions to end the worst factory farming practices still around today.

Links to learn more, highlights, and full transcript.

They cover:

  • The staggering scale of animal suffering in factory farms, and how it will only get worse without intervention.
  • Work to improve farmed animal welfare that Open Philanthropy is excited about funding.
  • The amazing recent progress made in farm animal welfare — including regulatory attention in the EU and a big win at the US Supreme Court — and the work that still needs to be done.
  • The occasional tension between ending factory farming and curbing climate change.
  • How AI could transform factory farming for better or worse — and Lewis’s fears that the technology will just help us maximise cruelty in the name of profit.
  • How Lewis has updated his opinions or grantmaking as a result of new research on the “moral weights” of different species.
  • Lewis’s personal journey working on farm animal welfare, and how he copes with the emotional toll of confronting the scale of animal suffering.
  • How listeners can get involved in the growing movement to end factory farming — from career and volunteer opportunities to impactful donations.
  • And much more.

Chapters:

  • Common objections to ending factory farming (00:13:21)
  • Potential solutions (00:30:55)
  • Cage-free reforms (00:34:25)
  • Broiler chicken welfare (00:46:48)
  • Do companies follow through on these commitments? (01:00:21)
  • Fish welfare (01:05:02)
  • Alternatives to animal proteins (01:16:36)
  • Farm animal welfare in Asia (01:26:00)
  • Farm animal welfare in Europe (01:30:45)
  • Animal welfare science (01:42:09)
  • Approaches Lewis is less excited about (01:52:10)
  • Will we end factory farming in our lifetimes? (01:56:36)
  • Effect of AI (01:57:59)
  • Recent big wins for farm animals (02:07:38)
  • How animal advocacy has changed since Lewis first got involved (02:15:57)
  • Response to the Moral Weight Project (02:19:52)
  • How to help (02:28:14)

Producer and editor: Keiran Harris
Audio engineering lead: Ben Cordell
Technical editing: Simon Monsour, Milo McGuire, and Dominic Armstrong
Additional content editing: Katy Moore and Luisa Rodriguez
Transcriptions: Katy Moore

18 April 2024 · 2h 33min

#184 – Zvi Mowshowitz on sleeping on sleeper agents, and the biggest AI updates since ChatGPT

Many of you will have heard of Zvi Mowshowitz as a superhuman information-absorbing-and-processing machine — which he definitely is. As the author of the Substack Don’t Worry About the Vase, Zvi has spent as much time as literally anyone in the world over the last two years tracking in detail how the explosion of AI has been playing out — and he has strong opinions about almost every aspect of it.

Links to learn more, summary, and full transcript.

In today’s episode, host Rob Wiblin asks Zvi for his takes on:

  • US-China negotiations
  • Whether AI progress has stalled
  • The biggest wins and losses for alignment in 2023
  • EU and White House AI regulations
  • Which major AI lab has the best safety strategy
  • The pros and cons of the Pause AI movement
  • Recent breakthroughs in capabilities
  • In what situations it’s morally acceptable to work at AI labs

Whether you agree or disagree with his views, Zvi is super informed and brimming with concrete details.

Zvi and Rob also talk about:

  • The risk of AI labs fooling themselves into believing their alignment plans are working when they may not be.
  • The “sleeper agent” issue uncovered in a recent Anthropic paper, and how it shows us how hard alignment actually is.
  • Why Zvi disagrees with 80,000 Hours’ advice about gaining career capital to have a positive impact.
  • Zvi’s project to identify the most strikingly horrible and neglected policy failures in the US, and how Zvi founded a new think tank (Balsa Research) to identify innovative solutions to overthrow the horrible status quo in areas like domestic shipping, environmental reviews, and housing supply.
  • Why Zvi thinks that improving people’s prosperity and housing can make them care more about existential risks like AI.
  • An idea from the online rationality community that Zvi thinks is really underrated and more people should have heard of: simulacra levels.
  • And plenty more.

Chapters:

  • Zvi’s AI-related worldview (00:03:41)
  • Sleeper agents (00:05:55)
  • Safety plans of the three major labs (00:21:47)
  • Misalignment vs misuse vs structural issues (00:50:00)
  • Should concerned people work at AI labs? (00:55:45)
  • Pause AI campaign (01:30:16)
  • Has progress on useful AI products stalled? (01:38:03)
  • White House executive order and US politics (01:42:09)
  • Reasons for AI policy optimism (01:56:38)
  • Zvi’s day-to-day (02:09:47)
  • Big wins and losses on safety and alignment in 2023 (02:12:29)
  • Other unappreciated technical breakthroughs (02:17:54)
  • Concrete things we can do to mitigate risks (02:31:19)
  • Balsa Research and the Jones Act (02:34:40)
  • The National Environmental Policy Act (02:50:36)
  • Housing policy (02:59:59)
  • Underrated rationalist worldviews (03:16:22)

Producer and editor: Keiran Harris
Audio engineering lead: Ben Cordell
Technical editing: Simon Monsour, Milo McGuire, and Dominic Armstrong
Transcriptions and additional content editing: Katy Moore

11 April 2024 · 3h 31min

AI governance and policy (Article)

Today’s release is a reading of our career review of AI governance and policy, written and narrated by Cody Fenwick.

Advanced AI systems could have massive impacts on humanity and potentially pose global catastrophic risks, and there are opportunities in the broad field of AI governance to positively shape how society responds to and prepares for the challenges posed by the technology.

Given the high stakes, pursuing this career path could be many people’s highest-impact option. But they should be very careful not to accidentally exacerbate the threats rather than mitigate them.

If you want to check out the links, footnotes and figures in today’s article, you can find those here.

Editing and audio proofing: Ben Cordell and Simon Monsour
Narration: Cody Fenwick

28 March 2024 · 51min

#183 – Spencer Greenberg on causation without correlation, money and happiness, lightgassing, hype vs value, and more

"When a friend comes to me with a decision, and they want my thoughts on it, very rarely am I trying to give them a really specific answer, like, 'I solved your problem.' What I’m trying to do often is give them other ways of thinking about what they’re doing, or giving different framings. A classic example of this would be someone who’s been working on a project for a long time and they feel really trapped by it. And someone says, 'Let’s suppose you currently weren’t working on the project, but you could join it. And if you joined, it would be exactly the state it is now. Would you join?' And they’d be like, 'Hell no!' It’s a reframe. It doesn’t mean you definitely shouldn’t join, but it’s a reframe that gives you a new way of looking at it." —Spencer Greenberg

In today’s episode, host Rob Wiblin speaks for a fourth time with listener favourite Spencer Greenberg — serial entrepreneur and host of the Clearer Thinking podcast — about a grab-bag of topics that Spencer has explored since his last appearance on the show a year ago.

Links to learn more, summary, and full transcript.

They cover:

  • How much money makes you happy — and the tricky methodological issues that come up trying to answer that question.
  • The importance of hype in making valuable things happen.
  • How to recognise warning signs that someone is untrustworthy or likely to hurt you.
  • Whether Registered Reports are successfully solving reproducibility issues in science.
  • The personal principles Spencer lives by, and whether or not we should all establish our own list of life principles.
  • The biggest and most harmful systemic mistakes we commit when making decisions, both individually and as groups.
  • The potential harms of lightgassing, which is the opposite of gaslighting.
  • How Spencer’s team used non-statistical methods to test whether astrology works.
  • Whether there’s any social value in retaliation.
  • And much more.

Chapters:

  • Does money make you happy? (00:05:54)
  • Hype vs value (00:31:27)
  • Warning signs that someone is bad news (00:41:25)
  • Integrity and reproducibility in social science research (00:57:54)
  • Personal principles (01:16:22)
  • Decision-making errors (01:25:56)
  • Lightgassing (01:49:23)
  • Astrology (02:02:26)
  • Game theory, tit for tat, and retaliation (02:20:51)
  • Parenting (02:30:00)

Producer and editor: Keiran Harris
Audio engineering lead: Ben Cordell
Technical editing: Simon Monsour, Milo McGuire, and Dominic Armstrong
Transcriptions: Katy Moore

14 March 2024 · 2h 36min

#182 – Bob Fischer on comparing the welfare of humans, chickens, pigs, octopuses, bees, and more

"[One] thing is just to spend time thinking about the kinds of things animals can do and what their lives are like. Just how hard a chicken will work to get to a nest box before she lays an egg, the amount of labour she’s willing to go through to do that, to think about how important that is to her. And to realise that we can quantify that, and see how much they care, or to see that they get stressed out when fellow chickens are threatened and that they seem to have some sympathy for conspecifics.

"Those kinds of things make me say there is something in there that is recognisable to me as another individual, with desires and preferences and a vantage point on the world, who wants things to go a certain way and is frustrated and upset when they don’t. And recognising the individuality, the perspective of nonhuman animals, for me, really challenges my tendency to not take them as seriously as I think I ought to, all things considered." — Bob Fischer

In today’s episode, host Luisa Rodriguez speaks to Bob Fischer — senior research manager at Rethink Priorities and the director of the Society for the Study of Ethics and Animals — about Rethink Priorities’s Moral Weight Project.

Links to learn more, summary, and full transcript.

They cover:

  • The methods used to assess the welfare ranges and capacities for pleasure and pain of chickens, pigs, octopuses, bees, and other animals — and the limitations of that approach.
  • Concrete examples of how someone might use the estimated moral weights to compare the benefits of animal vs human interventions.
  • The results that most surprised Bob.
  • Why the team used a hedonic theory of welfare to inform the project, and what non-hedonic theories of welfare might bring to the table.
  • Thought experiments like Tortured Tim that test different philosophical assumptions about welfare.
  • Confronting our own biases when estimating animal mental capacities and moral worth.
  • The limitations of using neuron counts as a proxy for moral weights.
  • How different types of risk aversion, like avoiding worst-case scenarios, could impact cause prioritisation.
  • And plenty more.

Chapters:

  • Welfare ranges (00:10:19)
  • Historical assessments (00:16:47)
  • Method (00:24:02)
  • The present / absent approach (00:27:39)
  • Results (00:31:42)
  • Chickens (00:32:42)
  • Bees (00:50:00)
  • Salmon and limits of methodology (00:56:18)
  • Octopuses (01:00:31)
  • Pigs (01:27:50)
  • Surprises about the project (01:30:19)
  • Objections to the project (01:34:25)
  • Alternative decision theories and risk aversion (01:39:14)
  • Hedonism assumption (02:00:54)

Producer and editor: Keiran Harris
Audio engineering lead: Ben Cordell
Technical editing: Simon Monsour and Milo McGuire
Additional content editing: Katy Moore and Luisa Rodriguez
Transcriptions: Katy Moore

8 March 2024 · 2h 21min
