
AI Models ‘Secretly’ Learn Capabilities Long Before They Show Them, Researchers Find

Modern AI models possess hidden capabilities that emerge suddenly and consistently during training, but these abilities remain concealed until prompted in specific ways, according to new research from Harvard and the University of Michigan.

The study, which analyzed how AI systems learn concepts like color and size, revealed that models often master these skills far earlier than standard tests suggest—a finding with major implications for AI safety and development.

“Our results demonstrate that measuring an AI system’s capabilities is more complex than previously thought,” the research paper says. “A model might appear incompetent when given standard prompts while actually possessing sophisticated abilities that only emerge under specific conditions.”

This advancement joins a growing body of research aimed at demystifying how AI models develop capabilities.

Anthropic researchers unveiled “dictionary learning,” a technique that mapped millions of neural connections within their Claude language model to specific concepts the AI understands, Decrypt reported earlier this year.

While approaches differ, these studies share a common goal: bringing transparency to what has primarily been considered AI’s “black box” of learning.

“We found millions of features which appear to correspond to interpretable concepts ranging from concrete objects like people, countries, and famous buildings to abstract ideas like emotions, writing styles, and reasoning steps,” Anthropic said in its research paper.

The researchers conducted extensive experiments using diffusion models, the architecture behind most popular image generators. While tracking how these models learned to manipulate basic concepts, they discovered a consistent pattern: capabilities emerged in distinct phases, with a sharp transition point marking when the model acquired new abilities.

Models showed mastery of concepts up to 2,000 training steps earlier than standard testing could detect. Strong concepts emerged around 6,000 steps, while weaker ones appeared around 20,000 steps.

When the researchers adjusted the "concept signal," the clarity with which an idea was presented in the training data, they found a direct correlation with learning speed: the stronger the signal, the earlier the concept emerged.

This phenomenon of “hidden emergence” has significant implications for AI safety and evaluation. Traditional benchmarks may dramatically underestimate what models can actually do, potentially missing both beneficial and concerning capabilities.

Perhaps most intriguingly, the team discovered multiple ways to access these hidden capabilities. Using techniques they termed “linear latent intervention” and “overprompting,” researchers could reliably extract sophisticated behaviors from models long before these abilities appeared in standard tests.
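The paper's "linear latent intervention" works on a model's internal representations rather than its prompts. As a rough illustration of the general idea, not the authors' actual method, the toy sketch below shows how a linear probe on a latent space can detect a concept well before the model's outputs express it. Everything here is invented for illustration: the synthetic latents, the output threshold, and the step/strength pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: as "training" progresses, the internal representation of a
# concept grows stronger, but the model's output behavior lags behind.
# We compare (a) output accuracy under standard prompting with
# (b) accuracy of a linear probe read directly off the latent space.

def latent_batch(signal_strength, n=200, dim=16):
    """Synthetic latents where one direction encodes a binary concept."""
    labels = rng.integers(0, 2, n)
    latents = rng.normal(size=(n, dim))
    latents[:, 0] += signal_strength * (2 * labels - 1)  # concept direction
    return latents, labels

def probe_accuracy(latents, labels):
    """Linear probe: classify by sign along the known concept direction."""
    preds = (latents[:, 0] > 0).astype(int)
    return (preds == labels).mean()

def output_accuracy(signal_strength):
    """Stylized output behavior: the model only *expresses* the concept
    once the internal signal crosses an (invented) threshold."""
    return 0.5 if signal_strength < 2.0 else 0.95

# Invented (step, signal strength) pairs loosely echoing the article's
# 2,000 / 6,000 / 20,000-step figures.
for step, strength in [(2000, 0.5), (6000, 1.2), (20000, 3.0)]:
    z, y = latent_batch(strength)
    print(step, round(probe_accuracy(z, y), 2), output_accuracy(strength))
```

In this sketch the probe's accuracy rises well above chance at strengths where the stylized output behavior is still at 50%, mirroring the gap the researchers report between internal mastery and expressed capability.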

In another case, researchers found that AI models learned to manipulate complex features like gender presentation and facial expressions before they could reliably demonstrate these abilities through standard prompts.

For example, models could accurately generate “smiling women” or “men with hats” individually before they could combine these features—yet detailed analysis showed they had mastered the combination much earlier. They simply couldn’t express it through conventional prompting.

The sudden emergence of capabilities observed in this study might initially seem similar to grokking—where models abruptly demonstrate perfect test performance after extended training—but there are key differences.

While grokking occurs after a training plateau and involves the gradual refinement of representations on the same data distribution, this research shows capabilities emerging during active learning and involving out-of-distribution generalization.

The authors found sharp transitions in the model’s ability to manipulate concepts in novel ways, suggesting discrete phase changes rather than the gradual representation improvements seen in grokking.
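One way to frame that distinction: a discrete phase change shows up as a single large jump in a capability metric, while grokking-style refinement climbs more smoothly. A minimal sketch, using made-up accuracy curves rather than the study's data:

```python
# Two hypothetical capability curves over training checkpoints.
phase_change = [0.1, 0.1, 0.12, 0.11, 0.85, 0.9, 0.92]   # sudden jump
gradual      = [0.1, 0.25, 0.4, 0.55, 0.7, 0.82, 0.92]   # smooth climb

def max_jump(curve):
    """Largest single-step increase in the metric."""
    return max(b - a for a, b in zip(curve, curve[1:]))

print(max_jump(phase_change))  # largest jump ~0.74: a discrete transition
print(max_jump(gradual))       # largest jump ~0.15: gradual improvement
```

A single heuristic like this would not settle the question on real training runs, but it captures why the authors read sharp transitions as phase changes rather than gradual representation refinement.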

In other words, AI models appear to internalize concepts far earlier than standard tests suggest; they simply cannot yet express them, much like someone who understands a movie in a foreign language but still struggles to speak it.

For the AI industry, this is a double-edged sword: hidden capabilities suggest models may be more powerful than previously thought, but they also underscore how difficult it is to fully understand and control what models can do.

Companies developing large language models and image generators may need to revise their testing protocols: traditional benchmarks, while still valuable, should be supplemented with more sophisticated evaluation methods that can detect hidden capabilities.

Edited by Sebastian Sinclair


Jose Antonio Lanz

https://decrypt.co/292892/ai-models-secretly-learn-capabilities-long-before-they-show-them-researchers-find

2024-11-24 15:01:02
