#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway


Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we’re unlikely to know we’ve solved the problem before the arrival of human-level and superhuman systems in as little as three years.

So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.

Today’s guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they’re more plausible than you might think. He argues that given companies’ unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.

Links to learn more, highlights, video, and full transcript.

As Buck puts it: "Five years ago I thought of misalignment risk from AIs as a really hard problem that you’d need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you’d probably be able to not have very much of your problem."

Of course, even if Buck is right, we still need to do those 40 things — which he points out we’re not on track for. And AI control agendas have their limitations: they aren’t likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.

Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:

  • Why he’s more worried about AI hacking its own data centre than escaping
  • What to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment research
  • Why he might want to use a model he thought could be conspiring against him
  • Why he would feel safer if he caught an AI attempting to escape
  • Why many control techniques would be relatively inexpensive
  • How to use an untrusted model to monitor another untrusted model (see the sketch after this list)
  • What the minimum viable intervention in a “lazy” AI company might look like
  • How even small teams of safety-focused staff within AI labs could matter
  • The moral considerations around controlling potentially conscious AI systems, and whether it’s justified
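
To make that monitoring idea concrete, here is a minimal sketch of one way it could look in code: an untrusted agent proposes actions, a second untrusted model scores them for suspicion, high-scoring actions are escalated to a human auditor, and the monitor is first validated on known-bad "honeypot" actions so it can't simply collude by staying silent. All names, thresholds, and heuristics below are illustrative assumptions, not Redwood Research's actual implementation.

```python
import random

# Hypothetical illustration of "untrusted monitoring" as discussed in the
# episode -- not Redwood Research's actual code.

AUDIT_THRESHOLD = 0.8  # suspicion score above which a human reviews the action


class StubModel:
    """Stand-in for an untrusted model; a real system would call an LLM here."""

    def propose_action(self, task: str) -> str:
        return f"proposed action for {task!r}"

    def suspicion_score(self, action: str) -> float:
        # Toy heuristic: a real monitor would be another model judging the
        # action's content; here we just flag one obviously bad keyword.
        return 0.9 if "exfiltrate" in action else random.random() * 0.5


def validate_monitor(monitor: StubModel, honeypots: list[str]) -> bool:
    """Reject a monitor that fails to flag actions we already know are bad.

    This is the basic check against a monitoring model that colludes with
    the agent by never raising the alarm."""
    return all(monitor.suspicion_score(a) >= AUDIT_THRESHOLD for a in honeypots)


def run_with_control(agent: StubModel, monitor: StubModel, task: str) -> tuple[str, str]:
    action = agent.propose_action(task)
    if monitor.suspicion_score(action) >= AUDIT_THRESHOLD:
        return ("escalated to human audit", action)  # a person decides what happens
    return ("executed", action)


if __name__ == "__main__":
    agent, monitor = StubModel(), StubModel()
    # Only deploy the monitor if it flags attacks we planted ourselves.
    assert validate_monitor(monitor, ["exfiltrate model weights"])
    print(run_with_control(agent, monitor, "refactor the billing code"))
```

The honeypot validation step is what makes an untrusted monitor usable at all: a monitor that stays quiet on attacks we know about is discarded before it ever guards real work.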

Chapters:

  • Cold open |00:00:00|
  • Who’s Buck Shlegeris? |00:01:27|
  • What's AI control? |00:01:51|
  • Why is AI control hot now? |00:05:39|
  • Detecting human vs AI spies |00:10:32|
  • Acute vs chronic AI betrayal |00:15:21|
  • How to catch AIs trying to escape |00:17:48|
  • The cheapest AI control techniques |00:32:48|
  • Can we get untrusted models to do trusted work? |00:38:58|
  • If we catch a model escaping... will we do anything? |00:50:15|
  • Getting AI models to think they've already escaped |00:52:51|
  • Will they be able to tell it's a setup? |00:58:11|
  • Will AI companies do any of this stuff? |01:00:11|
  • Can we just give AIs fewer permissions? |01:06:14|
  • Can we stop human spies the same way? |01:09:58|
  • The pitch to AI companies to do this |01:15:04|
  • Will AIs get superhuman so fast that this is all useless? |01:17:18|
  • Risks from AI deliberately doing a bad job |01:18:37|
  • Is alignment still useful? |01:24:49|
  • Current alignment methods don't detect scheming |01:29:12|
  • How to tell if AI control will work |01:31:40|
  • How can listeners contribute? |01:35:53|
  • Is 'controlling' AIs kind of a dick move? |01:37:13|
  • Could 10 safety-focused people in an AGI company do anything useful? |01:42:27|
  • Benefits of working outside frontier AI companies |01:47:48|
  • Why Redwood Research does what it does |01:51:34|
  • What other safety-related research looks best to Buck? |01:58:56|
  • If an AI escapes, is it likely to be able to beat humanity from there? |01:59:48|
  • Will misaligned models have to go rogue ASAP, before they're ready? |02:07:04|
  • Is research on human scheming relevant to AI? |02:08:03|

This episode was originally recorded on February 21, 2025.

Video: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
Transcriptions and web: Katy Moore

Episodes (310)

#88 – Tristan Harris on the need to change the incentives of social media companies


In its first 28 days on Netflix, the documentary The Social Dilemma — about the possible harms being caused by social media and other technology products — was seen by 38 million households in about 190 countries and in 30 languages.

Over the last ten years, the idea that Facebook, Twitter, and YouTube are degrading political discourse and grabbing and monetizing our attention in an alarming way has gone mainstream to such an extent that it’s hard to remember how recently it was a fringe view.

It feels intuitively true that our attention spans are shortening, we’re spending more time alone, we’re less productive, there’s more polarization and radicalization, and that we have less trust in our fellow citizens, due to having less of a shared basis of reality. But while it all feels plausible, how strong is the evidence that it’s true?

In the past, people have worried about every new technological development — often in ways that seem foolish in retrospect. Socrates famously feared that being able to write things down would ruin our memory. At the same time, historians think that the printing press probably generated religious wars across Europe, and that the radio helped Hitler and Stalin maintain power by giving them and them alone the ability to spread propaganda across the whole of Germany and the USSR. Fears about new technologies aren’t always misguided.

Tristan Harris, leader of the Center for Humane Technology and co-host of the Your Undivided Attention podcast, is arguably the most prominent person working on reducing the harms of social media, and he was happy to engage with Rob’s good-faith critiques.

Links to learn more, summary and full transcript.

FYI, the 2020 Effective Altruism Survey is closing soon: https://www.surveymonkey.co.uk/r/EAS80K2

Tristan and Rob provide a thorough exploration of the merits of possible concrete solutions — something The Social Dilemma didn’t really address. Given that these companies are mostly trying to design their products in the way that makes them the most money, how can we get that incentive to align with what’s in our interests as users and citizens? One way is to encourage a shift to a subscription model.

One claim in The Social Dilemma is that the machine learning algorithms on these sites try to shift what you believe and what you enjoy in order to make it easier to predict what content recommendations will keep you on the site. But if you paid a yearly fee to Facebook in lieu of seeing ads, their incentive would shift towards making you as satisfied as possible with their service — even if that meant using it for five minutes a day rather than 50.

Despite all the negatives, Tristan doesn’t want us to abandon the technologies he’s concerned about. He asks us to imagine a social media environment designed to regularly bring our attention back to what each of us can do to improve our lives and the world. Just as we can focus on the positives of nuclear power while remaining vigilant about the threat of nuclear weapons, we could embrace social media and recommendation algorithms as the largest mass-coordination engine we’ve ever had — tools that could educate and organise people better than anything that has come before. The tricky and open question is how to get there.

Rob and Tristan also discuss:

  • Justified concerns vs. moral panics
  • The effect of social media on politics in the US and developing countries
  • Tips for individuals

Chapters:

  • Rob’s intro (00:00:00)
  • The interview begins (00:01:36)
  • Center for Humane Technology (00:04:53)
  • Critics (00:08:19)
  • The Social Dilemma (00:13:20)
  • Three categories of harm (00:20:31)
  • Justified concerns vs. moral panics (00:30:23)
  • The messy real world vs. an imagined idealised world (00:38:20)
  • The persuasion apocalypse (00:47:46)
  • Revolt of the Public (00:56:48)
  • Global effects (01:02:44)
  • US politics (01:13:32)
  • Potential solutions (01:20:59)
  • Unintended consequences (01:42:57)
  • Win-win changes (01:50:47)
  • Big wins over the last 5 or 10 years (01:59:10)
  • The subscription model (02:02:28)
  • Tips for individuals (02:14:05)
  • The current state of the research (02:22:37)
  • Careers (02:26:36)

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Sofia Davis-Fogel

3 December 2020 · 2h 35min

Benjamin Todd on what the effective altruism community most needs (80k team chat #4)


In the last ‘80k team chat’ with Ben Todd and Arden Koehler, we discussed what effective altruism is and isn’t, and how to argue for it. In this episode we turn to what the effective altruism community most needs.

Links to learn more, summary and full transcript.

The 2020 Effective Altruism Survey just opened. If you’re involved with the effective altruism community, or sympathetic to its ideas, it would be wonderful if you could fill it out: https://www.surveymonkey.co.uk/r/EAS80K2

According to Ben, we can think of the effective altruism movement as having gone through several stages, categorised by what kind of resource has been most able to unlock more progress on important issues (i.e. by what’s the ‘bottleneck’). Plausibly, these stages are common to other social movements as well.

  • Needing money: In the first stage, when effective altruism was just getting going, more money (to do things like pay staff and put on events) was the main bottleneck to making progress.
  • Needing talent: In the second stage, we especially needed more talented people willing to work on whatever seemed most pressing.
  • Needing specific skills and capacity: In the third stage, which Ben thinks we’re in now, the main bottlenecks are organizational capacity, infrastructure, and management to help train people up, as well as specialist skills that people can put to work now.

What’s next? Perhaps needing coordination — the ability to make sure people keep working efficiently and effectively together as the community grows.

Ben and I also cover the career implications of those stages, as well as the ability to save money and the possibility that someone else would do your job in your absence.

If you’d like to learn more about these topics, you should check out a couple of articles on our site:

  • Think twice before talking about ‘talent gaps’ – clarifying nine misconceptions
  • How replaceable are the top candidates in large hiring rounds? Why the answer flips depending on the distribution of applicant ability

Get this episode by subscribing: type 80,000 Hours into your podcasting app. Or read the linked transcript.

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Zakee Ulhaq

12 November 2020 · 1h 25min

#87 – Russ Roberts on whether it's more effective to help strangers, or people you know


If you want to make the world a better place, would it be better to help your niece with her SATs, or try to join the State Department to lower the risk that the US and China go to war? People involved in 80,000 Hours or the effective altruism community would be comfortable recommending the latter.

This week’s guest — Russ Roberts, host of the long-running podcast EconTalk, and author of a forthcoming book on decision-making under uncertainty and the limited ability of data to help — worries that might be a mistake.

Links to learn more, summary and full transcript.

I’ve been a big fan of Russ’ show EconTalk for 12 years — in fact I have a list of my top 100 recommended episodes — so I invited him to talk about his concerns with how the effective altruism community tries to improve the world. These include:

  • Being too focused on the measurable
  • Being too confident we’ve figured out ‘the best thing’
  • Being too credulous about the results of social science or medical experiments
  • Undermining people’s altruism by encouraging them to focus on strangers, who it’s naturally harder to care for
  • Thinking it’s possible to predictably help strangers, who you don’t understand well enough to know what will truly help
  • Adding levels of wellbeing across people when this is inappropriate
  • Encouraging people to pursue careers they won’t enjoy

These worries are partly informed by Russ’ ‘classical liberal’ worldview, which involves a preference for free market solutions to problems, and nervousness about the big plans that sometimes come out of consequentialist thinking.

While we do disagree on a range of things — such as whether it’s possible to add up wellbeing across different people, and whether it’s more effective to help strangers than people you know — I make the case that some of these worries are founded on common misunderstandings about effective altruism, or at least misunderstandings of what we believe here at 80,000 Hours. We primarily care about making the world a better place over thousands or even millions of years — and we wouldn’t dream of claiming that we could accurately measure the effects of our actions on that timescale.

I’m more skeptical of medicine and empirical social science than most people, though not quite as skeptical as Russ (check out this quiz I made where you can guess which academic findings will replicate, and which won’t). And while I do think that people should occasionally take jobs they dislike in order to have a social impact, those situations seem pretty few and far between.

But Russ and I disagree about how much we really disagree. In addition to all the above we also discuss:

  • How to decide whether to have kids
  • Was the case for deworming children oversold?
  • Whether it would be better for countries around the world to be better coordinated

Chapters:

  • Rob’s intro (00:00:00)
  • The interview begins (00:01:48)
  • RCTs and donations (00:05:15)
  • The 80,000 Hours project (00:12:35)
  • Expanding the moral circle (00:28:37)
  • Global coordination (00:39:48)
  • How to act if you’re pessimistic about improving the long-term future (00:55:49)
  • Communicating uncertainty (01:03:31)
  • How much to trust empirical research (01:09:19)
  • How to decide whether to have kids (01:24:13)
  • Utilitarianism (01:34:01)

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Zakee Ulhaq

3 November 2020 · 1h 49min

How much does a vote matter? (Article)


Today’s release is the latest in our series of audio versions of our articles.

In this one — How much does a vote matter? — I investigate the two key things that determine the impact of your vote:

  • The chances of your vote changing an election’s outcome
  • How much better some candidates are for the world as a whole, compared to others

(A rough sketch of how these two factors combine appears at the end of these notes.)

I then discuss what I think are the best arguments against voting in important elections:

  • If an election is competitive, that means other people disagree about which option is better, and you’re at some risk of voting for the worse candidate by mistake.
  • While voting itself doesn’t take long, knowing enough to accurately pick which candidate is better for the world actually does take substantial effort — effort that could be better allocated elsewhere.

Finally, I look into the impact of donating to campaigns or working to ‘get out the vote’, which can be effective ways to generate additional votes for your preferred candidate.

If you want to check out the links, footnotes and figures in today’s article, you can find those here.

Get this episode by subscribing: type 80,000 Hours into your podcasting app. Or read the linked transcript.

Producer: Keiran Harris
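
For intuition, here is a rough expected-value sketch of how those two factors combine (the numbers are hypothetical illustrations, not figures from the article):

```latex
% Illustrative model: a vote's expected impact is the chance it is
% decisive times the value gap between the candidates.
\[
  \mathbb{E}[\text{impact}] = p_{\text{decisive}} \times \Delta V
\]
% With a hypothetical one-in-ten-million chance of swinging the election
% and a candidate value gap worth 10^9 units of social value:
\[
  \mathbb{E}[\text{impact}] = 10^{-7} \times 10^{9} = 100 \text{ units}
\]
```

Even a tiny probability of being decisive can leave the expected value substantial when the stakes are large enough, which is why the article weighs both factors together rather than dismissing the small probability on its own.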

29 October 2020 · 31min

#86 – Hilary Greaves on Pascal's mugging, strong longtermism, and whether existing can be good for us


Had World War 1 never happened, you might never have existed. It’s very unlikely that the exact chain of events that led to your conception would have happened otherwise — so perhaps you wouldn’t have been born.

Would that mean that it’s better for you that World War 1 happened (regardless of whether it was better for the world overall)?

On the one hand, if you’re living a pretty good life, you might think the answer is yes — you get to live rather than not. On the other hand, it sounds strange to say that it’s better for you to be alive, because if you’d never existed there’d be no you to be worse off. But if you wouldn’t be worse off if you hadn’t existed, can you be better off because you do?

In this episode, philosophy professor Hilary Greaves — Director of Oxford University’s Global Priorities Institute — helps untangle this puzzle for us and walks me and Rob through the space of possible answers. She argues that philosophers have been too quick to conclude what she calls existence non-comparativism — i.e. that it can’t be better for someone to exist vs. not.

Links to learn more, summary and full transcript.

Where we come down on this issue matters. If people are not made better off by existing and having good lives, you might conclude that bringing more people into existence isn’t better for them, and thus, perhaps, that it’s not better at all. This would imply that bringing about a world in which more people live happy lives might not actually be a good thing (if the people wouldn’t otherwise have existed) — which would affect how we try to make the world a better place. Those wanting to have children in order to give them the pleasure of a good life would in some sense be mistaken. And if humanity stopped bothering to have kids and just gradually died out, we would have no particular reason to be concerned. Furthermore, it might mean we should deprioritise issues that primarily affect future generations, like climate change or the risk of humanity accidentally wiping itself out.

This is our second episode with Professor Greaves. The first one was a big hit, so we thought we’d come back and dive into even more complex ethical issues. We discuss:

  • The case for different types of ‘strong longtermism’ — the idea that we ought morally to try to make the very long run future go as well as possible
  • What it means for us to be ‘clueless’ about the consequences of our actions
  • Moral uncertainty — what we should do when we don’t know which moral theory is correct
  • Whether we should take a bet on a really small probability of a really great outcome
  • The field of global priorities research at the Global Priorities Institute and beyond

Chapters:

  • The interview begins (00:02:53)
  • The Case for Strong Longtermism (00:05:49)
  • Compatible moral views (00:20:03)
  • Defining cluelessness (00:39:26)
  • Why cluelessness isn’t an objection to longtermism (00:51:05)
  • Theories of what to do under moral uncertainty (01:07:42)
  • Pascal’s mugging (01:16:37)
  • Comparing Existence and Non-Existence (01:30:58)
  • Philosophers who reject existence comparativism (01:48:56)
  • Lives framework (02:01:52)
  • Global priorities research (02:09:25)

Get this episode by subscribing: type 80,000 Hours into your podcasting app. Or read the linked transcript.

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Zakee Ulhaq

21 October 2020 · 2h 24min

Benjamin Todd on the core of effective altruism and how to argue for it (80k team chat #3)


Today’s episode is the latest conversation between Arden Koehler and our CEO, Ben Todd. Ben’s been thinking a lot about effective altruism recently, including what it really is, how it’s framed, and how people misunderstand it.

We recently released an article on misconceptions about effective altruism — based on Will MacAskill’s recent paper The Definition of Effective Altruism — and this episode can act as a companion piece.

Links to learn more, summary and full transcript.

Arden and Ben cover a bunch of topics related to effective altruism:

  • How it isn’t just about donating money to fight poverty
  • Whether it includes a moral obligation to give
  • The rigorous argument for its importance
  • Objections to that argument
  • How to talk about effective altruism with people who aren’t already familiar with it

Given that we’re in the same office, it’s relatively easy to record conversations between two 80k team members — so if you enjoy these types of bonus episodes, let us know at podcast@80000hours.org, and we might make them a more regular feature.

Get this episode by subscribing: type 80,000 Hours into your podcasting app. Or read the linked transcript.

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Zakee Ulhaq

22 September 2020 · 1h 24min

Ideas for high impact careers beyond our priority paths (Article)


Today’s release is the latest in our series of audio versions of our articles. In this one, we go through some more career options beyond our priority paths that seem promising to us for positively influencing the long-term future.

Some of these are likely to be written up as priority paths in the future, or wrapped into existing ones, but we haven’t written full profiles for them yet — for example, policy careers outside AI and biosecurity policy that seem promising from a longtermist perspective.

Others, like information security, we think might be as promising for many people as our priority paths, but because we haven’t investigated them much we’re still unsure.

Still others seem like they’ll typically be less impactful than our priority paths for people who can succeed equally in either, but still seem high-impact to us and like they could be top options for a substantial number of people, depending on personal fit — for example, research management.

Finally, some — like becoming a public intellectual — clearly have the potential for a lot of impact, but we can’t recommend them widely because they don’t have the capacity to absorb a large number of people, are particularly risky, or both.

If you want to check out the links in today’s article, you can find those here.

Our annual user survey is also now open for submissions. Once a year for two weeks we ask all of you, our podcast listeners, article readers, advice receivers, and so on, to let us know how we’ve helped or hurt you. 80,000 Hours now offers many different services, and your feedback helps us figure out which programs to keep, which to cut, and which to expand.

This year we have a new section covering the podcast, asking what kinds of episodes you liked the most and want to see more of, what extra resources you use, and some other questions too.

We’re always especially interested to hear ways that our work has influenced what you plan to do with your life or career, whether that impact was positive, neutral, or negative. That might be a different focus in your existing job, or a decision to study something different or look for a new job. Alternatively, maybe you’re now planning to volunteer somewhere, or donate more, or donate to a different organisation.

Your responses to the survey will be carefully read as part of our upcoming annual review, and we’ll use them to help decide what 80,000 Hours should do differently next year. So please do take a moment to fill out the user survey before it closes on Sunday (13th of September). You can find it at 80000hours.org/survey

Get this episode by subscribing: type 80,000 Hours into your podcasting app. Or read the linked transcript.

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Zakee Ulhaq

7 September 2020 · 27min

Benjamin Todd on varieties of longtermism and things 80,000 Hours might be getting wrong (80k team chat #2)


Today’s bonus episode is a conversation between Arden Koehler and our CEO, Ben Todd. Ben’s been doing a bunch of research recently, and we thought it’d be interesting to hear how he’s currently thinking about a couple of different topics — including different types of longtermism, and things 80,000 Hours might be getting wrong.

Links to learn more, summary and full transcript.

This is very off-the-cuff compared to our regular episodes, and just 54 minutes long.

In the first half, Arden and Ben talk about varieties of longtermism:

  • Patient longtermism
  • Broad urgent longtermism
  • Targeted urgent longtermism focused on existential risks
  • Targeted urgent longtermism focused on other trajectory changes
  • And their distinctive implications for people trying to do good with their careers

In the second half, they move on to:

  • How to trade off transferable versus specialist career capital
  • How much weight to put on personal fit
  • Whether we might be highlighting the wrong problems and career paths

Given that we’re in the same office, it’s relatively easy to record conversations between two 80k team members — so if you enjoy these types of bonus episodes, let us know at podcast@80000hours.org, and we might make them a more regular feature.

Our annual user survey is also now open for submissions. Once a year for two weeks we ask all of you, our podcast listeners, article readers, advice receivers, and so on, to let us know how we’ve helped or hurt you. 80,000 Hours now offers many different services, and your feedback helps us figure out which programs to keep, which to cut, and which to expand.

This year we have a new section covering the podcast, asking what kinds of episodes you liked the most and want to see more of, what extra resources you use, and some other questions too.

We’re always especially interested to hear ways that our work has influenced what you plan to do with your life or career, whether that impact was positive, neutral, or negative. That might be a different focus in your existing job, or a decision to study something different or look for a new job. Alternatively, maybe you’re now planning to volunteer somewhere, or donate more, or donate to a different organisation.

Your responses to the survey will be carefully read as part of our upcoming annual review, and we’ll use them to help decide what 80,000 Hours should do differently next year. So please do take a moment to fill out the user survey. You can find it at 80000hours.org/survey

Get this episode by subscribing: type 80,000 Hours into your podcasting app. Or read the linked transcript.

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Zakee Ulhaq

1 September 2020 · 57min
