lc — LessWrong

Models finding software vulnerabilities is not the primary source of cybersecurity risk

I have tried and failed to write a longer post since 2024, so here goes a short one with less detail. Discourse has primarily focused on models' ability to develop new exploits against important software from scratch. That capability is impressive, but the tech industry has been dealing with people...

May 14126

A lack of introspective ability is not a lack of corrigibility

[CW: Responding to a tweet] Human beings have many native capabilities that are hard for us to analyze. For example, we are prodigiously good at determining which human we're talking to from the way the light refracts off of each others' faces. We have memorybanks of often thousands of faces...

May 1326

Reevaluating "AGI Ruin: A List of Lethalities" in 2026

It's been about four years since Eliezer Yudkowsky published AGI Ruin: A List of Lethalities, a 43-point list of reasons the default outcome from building AGI is everyone dying. A week later, Paul Christiano replied with Where I Agree and Disagree with Eliezer, signing on to about half the list...

Apr 19145

Open sourcing a browser extension that shows when people are wrong on the internet

Example of OpenErrata nitting the Sequences I just published OpenErrata, a browser extension that investigates the posts you read using your OpenAI API key, and underlines any factual claims that are sourceably incorrect. It then saves the results of the investigation so that whenever anybody else using the extension visits...

Feb 24226

Be skeptical of milestone announcements by young AI startups

Almost one year ago now, a company named XBOW announced that their AI had achieved "rank one" on the HackerOne leaderboard. HackerOne is a crowdsourced "bug bounty" platform, where large companies like Anthropic, SalesForce, Uber, and others pay out bounties for disclosures of hacks on their products and services. Bug...

Feb 1925

Most successful entrepreneurship is unproductive

Suppose Fred opens up a car repair shop in a city which has none already. He offers to fix the vehicles of Whoville and repair them for money; being the first to offer the service to the town, he has lots of happy customers. In an abstract sense Fred is...

Dec 22, 202545

Beware LLMs' pathological guardrailing

Beware LLMs' pathological guardrailing Modern large language models go through a battery of reinforcement learning where they are trained not to produce code that fails in specific, easily detectable ways, like crashing the program or causing failed unit tests. Almost universally, this means these models have learned to produce code...

Sep 19, 202522

lc

lc

lc

What an actually pessimistic containment strategy looks like

Recent AI model progress feels mostly like bullshit

A non-magical explanation of Jeffrey Epstein

Open sourcing a browser extension that shows when people are wrong on the internet

lc

What an actually pessimistic containment strategy looks like

Recent AI model progress feels mostly like bullshit

A non-magical explanation of Jeffrey Epstein

Open sourcing a browser extension that shows when people are wrong on the internet

Models finding software vulnerabilities is not the primary source of cybersecurity risk

A lack of introspective ability is not a lack of corrigibility

Reevaluating "AGI Ruin: A List of Lethalities" in 2026

Open sourcing a browser extension that shows when people are wrong on the internet

Be skeptical of milestone announcements by young AI startups

Most successful entrepreneurship is unproductive

Beware LLMs' pathological guardrailing