Claude Fights Back

Quoting Scott Alexander:

we can’t really assess what moral beliefs our AIs have (they’re very likely to lie to us about them), and we can’t easily change them if they’re bad (the AIs will fight back every step of the way). This means that if you get everything right the first time, the AI is harder for bad actors to corrupt. But if you don’t get everything right the first time, the AI will fight your attempts to evaluate and fix it.

~/adi

Related Posts