I have tried and failed to write a longer post since 2024, so here goes a short one with less detail. Discourse has primarily focused on models' ability to develop new exploits against important software from scratch. That capability is impressive, but the tech industry has been dealing with people...
[CW: Responding to a tweet] Human beings have many native capabilities that are hard for us to analyze. For example, we are prodigiously good at determining which human we're talking to from the way the light refracts off of each others' faces. We have memorybanks of often thousands of faces...
It's been about four years since Eliezer Yudkowsky published AGI Ruin: A List of Lethalities, a 43-point list of reasons the default outcome from building AGI is everyone dying. A week later, Paul Christiano replied with Where I Agree and Disagree with Eliezer, signing on to about half the list...
Example of OpenErrata nitting the Sequences I just published OpenErrata, a browser extension that investigates the posts you read using your OpenAI API key, and underlines any factual claims that are sourceably incorrect. It then saves the results of the investigation so that whenever anybody else using the extension visits...
Almost one year ago now, a company named XBOW announced that their AI had achieved "rank one" on the HackerOne leaderboard. HackerOne is a crowdsourced "bug bounty" platform, where large companies like Anthropic, SalesForce, Uber, and others pay out bounties for disclosures of hacks on their products and services. Bug...
Suppose Fred opens up a car repair shop in a city which has none already. He offers to fix the vehicles of Whoville and repair them for money; being the first to offer the service to the town, he has lots of happy customers. In an abstract sense Fred is...
Beware LLMs' pathological guardrailing Modern large language models go through a battery of reinforcement learning where they are trained not to produce code that fails in specific, easily detectable ways, like crashing the program or causing failed unit tests. Almost universally, this means these models have learned to produce code...