
    Roko on AI risk

By Press Room, March 17, 2025


I could not get the emojis to reproduce in legible form; you can see them on the original tweet. Here goes:

The Less Wrong/Singularity/AI Risk movement, started in the 2000s by Yudkowsky and others and to which I was an early adherent, is wrong about all of its core claims around AI risk. It’s important to recognize this and appropriately downgrade the credence we give to such claims moving forward.

    Claim: Mindspace is vast, so it’s likely that AIs will be completely alien to us, and therefore dangerous!

    Truth: Mindspace is vast, but we picked LLMs as the first viable AI paradigm because the abundance of human-generated data made LLMs the easiest choice. LLMs are models of human language, so they are actually not that alien.

    Claim: AI won’t understand human values until it is superintelligent, so it will be impossible to align, because you can only align it when it is weak (but it won’t understand) and it will only understand when it is strong (but it will reject your alignment attempts).

    Truth: LLMs learned human values before they became superhumanly competent.

    Claim: Recursive self-improvement means that a single instance of a threshold-crossing seed AI could reprogram itself and undergo an intelligence explosion in minutes or hours. An AI made overnight in someone’s basement could develop a species-ending superweapon like nanotechnology from first principles and kill us all before we wake up in the morning.

    Truth: All ML models have strongly diminishing returns to data and compute, typically logarithmic. Today’s rapid AI progress is only possible because the amount of money spent on AI is increasing exponentially. Superintelligence in a basement is information-theoretically impossible – there is no free lunch from recursion, the exponentially large data collection and compute still needs to happen.
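The logarithmic-returns point can be made concrete with a toy power-law scaling curve. This is my own illustration with made-up constants (`a`, `b`), not anything from the post: under a loss of the form L(C) = a·C^(−b), each doubling of compute shrinks loss by a constant factor, so linear progress requires exponentially growing spend.

```python
def loss(compute, a=100.0, b=0.05):
    """Hypothetical scaling-law loss as a function of compute (illustrative constants)."""
    return a * compute ** -b

# Absolute improvement from each successive doubling of compute:
gains = []
c = 1.0
for _ in range(5):
    gains.append(loss(c) - loss(2 * c))
    c *= 2

# The gain per doubling strictly shrinks -- diminishing returns:
assert all(g1 > g2 for g1, g2 in zip(gains, gains[1:]))
```

Nothing about recursion changes the shape of this curve; a self-improving model still has to pay the exponentially growing data and compute bill.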

    Claim: You can’t align an AI because it will fake alignment during training and then be misaligned in deployment!

    Truth: The reason machine learning works at all is because regularization methods/complexity penalties select functions that are the simplest generalizations of the training data, not the most perverse ones. Perverse generalizations do exist, but machine learning works precisely because we can reject them.

    Claim: AI will be incorrigible, meaning that it will resist creators’ attempts to correct it if something is wrong with the specification. That means if we get anything wrong, the AI will fight us over it!

    Truth: AIs based on neural nets might in some sense want to resist changes to their minds, but they can’t resist changes to their weights that happen via backpropagation. When AIs misbehave, developers use RLHF and gradient descent to change their minds – literally.

    Claim: It will get harder and harder to align AIs as they become smarter, so even though things look OK now there will soon be a disaster as AIs outpace their human masters!

Truth: It probably is harder in an absolute sense to align a more powerful AI. But it’s also harder in an absolute sense to build it in the first place – the ratio of alignment difficulty to capabilities difficulty appears to be stable or downtrending, though more data is needed here. In absolute terms, AI companies spend far more resources on capabilities than on alignment because alignment is the relatively easy part of the problem. Eventually, most alignment work will be done by other AIs, just like a king outsources virtually all policing work to his own subjects.

Claim: We can slow down AI development by holding conferences warning people about AI risk in the twenty-teens, which will delay the development of superintelligent AI so that we have more time to think about how to get things right.

Truth: AI risk conferences in the twenty-teens accelerated the development of AI, directly leading to the creation of OpenAI and the LLM revolution. But that’s ok, because nobody was doing anything useful with the extra time that we might have had, so there was no point in waiting.

    Claim: We have to get decision theory and philosophy exactly right before we develop any AI at all or it will freeze half-formed or incorrect ideas forever, dooming us all.

    Truth: ( … pending … )

    Claim: It will be impossible to solve LLM jailbreaks! Adversarial ML is unsolvable! Superintelligent AIs will be jailbroken by special AI hackers who know the magic words, and they will be free to destroy the world just with a few clever prompts!

    Truth: ( … pending …) ❔

    The post Roko on AI risk appeared first on Marginal REVOLUTION.

