#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we’re unlikely to know we’ve solved the problem before the arrival of human-level and superhuman systems in as little as three years.

So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.

Today’s guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they’re more plausible than you might think. He argues that given companies’ unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.

Links to learn more, highlights, video, and full transcript.

As Buck puts it: "Five years ago I thought of misalignment risk from AIs as a really hard problem that you’d need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you’d probably be able to not have very much of your problem."

Of course, even if Buck is right, we still need to do those 40 things — which he points out we’re not on track for. And AI control agendas have their limitations: they aren’t likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.

Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:

  • Why he’s more worried about AI hacking its own data centre than escaping
  • What to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment research
  • Why he might want to use a model he thought could be conspiring against him
  • Why he would feel safer if he caught an AI attempting to escape
  • Why many control techniques would be relatively inexpensive
  • How to use an untrusted model to monitor another untrusted model
  • What the minimum viable intervention in a “lazy” AI company might look like
  • How even small teams of safety-focused staff within AI labs could matter
  • The moral considerations around controlling potentially conscious AI systems, and whether it’s justified

Chapters:

  • Cold open |00:00:00|
  • Who’s Buck Shlegeris? |00:01:27|
  • What's AI control? |00:01:51|
  • Why is AI control hot now? |00:05:39|
  • Detecting human vs AI spies |00:10:32|
  • Acute vs chronic AI betrayal |00:15:21|
  • How to catch AIs trying to escape |00:17:48|
  • The cheapest AI control techniques |00:32:48|
  • Can we get untrusted models to do trusted work? |00:38:58|
  • If we catch a model escaping... will we do anything? |00:50:15|
  • Getting AI models to think they've already escaped |00:52:51|
  • Will they be able to tell it's a setup? |00:58:11|
  • Will AI companies do any of this stuff? |01:00:11|
  • Can we just give AIs fewer permissions? |01:06:14|
  • Can we stop human spies the same way? |01:09:58|
  • The pitch to AI companies to do this |01:15:04|
  • Will AIs get superhuman so fast that this is all useless? |01:17:18|
  • Risks from AI deliberately doing a bad job |01:18:37|
  • Is alignment still useful? |01:24:49|
  • Current alignment methods don't detect scheming |01:29:12|
  • How to tell if AI control will work |01:31:40|
  • How can listeners contribute? |01:35:53|
  • Is 'controlling' AIs kind of a dick move? |01:37:13|
  • Could 10 safety-focused people in an AGI company do anything useful? |01:42:27|
  • Benefits of working outside frontier AI companies |01:47:48|
  • Why Redwood Research does what it does |01:51:34|
  • What other safety-related research looks best to Buck? |01:58:56|
  • If an AI escapes, is it likely to be able to beat humanity from there? |01:59:48|
  • Will misaligned models have to go rogue ASAP, before they're ready? |02:07:04|
  • Is research on human scheming relevant to AI? |02:08:03|

This episode was originally recorded on February 21, 2025.

Video: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
Transcriptions and web: Katy Moore
