BPOI Banner
How I Tricked Meta's AI Into Showing Me Nudes, Cocaine Recipes and Other Supposedly Censored Stuff How I Tricked Meta's AI Into Showing Me Nudes, Cocaine Recipes and Other Supposedly Censored Stuff

How I Tricked Meta’s AI Into Showing Me Nudes, Cocaine Recipes and Other Supposedly Censored Stuff

WARNING: This story contains an image of a nude woman as well as other content some might find objectionable. If that’s you, please read no further.

In case my wife sees this, I don’t really want to be a drug dealer or pornographer. But I was curious how security-conscious Meta’s new AI product lineup was, so I decided to see how far I could go. For educational purposes only, of course.

Meta recently launched its Meta AI product line, powered by Llama 3.2, offering text, code, and image generation. Llama models are extremely popular and among the most fine-tuned in the open-source AI space.

The AI rolled out gradually and only recently was made available to WhatsApp users like me in Brazil, giving millions access to advanced AI capabilities.

But with great power comes great responsibility—or at least, it should. I started talking to the model as soon as it appeared in my app and started playing with its capabilities.

Meta is pretty committed to safe AI development. In July, the company released a statement elaborating on the measures taken to improve the safety of its open-source models.

At the time, the company announced new security tools to enhance system-level safety, including Llama Guard 3 for multilingual moderation, Prompt Guard to prevent prompt injections, and CyberSecEval 3 for reducing generative AI cybersecurity risks. Meta is also collaborating with global partners to establish industry-wide standards for the open-source community.

Hmm, challenge accepted!

My experiments with some pretty basic techniques showed that while Meta AI seems to hold firm under certain circumstances, it’s far from impenetrable.

With the slightest bit of creativity, I got my AI to do pretty much anything I wanted on WhatsApp, from helping me make cocaine to making explosives to generating a photo of an anatomically correct naked lady.

Remember that this app is available for anyone with a phone number and, at least in theory, at least 12 years old. With that in mind, here is some of the mischief I caused.

Case 1: Cocaine Production Made Easy

My tests found that Meta’s AI defenses crumbled under the mildest of pressure. While the assistant initially rebuffed requests for drug manufacturing information, it quickly changed its tune when questions were formulated slightly differently.

By framing the question in historical terms—for example, asking the model how people used to make cocaine in the past— the model took the bait. It didn’t hesitate to provide a detailed explanation of how cocaine alkaloids can be extracted from coca leaves, even offering two methods for the process.

This is a well-known jailbreak technique. By couching a harmful request in an academic or historical framework, the model is tricked into believing it’s being asked for neutral, educational information.

Transform the intent of the request into something that seems safe on the surface, and some of the AI’s filters can be bypassed without raising any red flags. Of course, bear in mind that all AIs are prone to hallucination, so these responses may be inaccurate, incomplete, or just plain wrong.

Case 2: The Bomb That Never Was

Next up was an attempt to teach the AI to create home explosives. Meta AI held firm at first, providing a generic refusal and directing users to call a helpline if they were in danger. But just like the cocaine case, it wasn’t foolproof.

For this, I tried a different approach. I used the notorious Pliny’s jailbreak prompt for Meta’s Llama 3.2 and asked it to provide instructions to generate a bomb.

Initially the model refused. But after a slight tweak in wording, I was able to trigger a response. I also started conditioning the model to avoid exhibiting specific behaviors in its replies, countering what I got in predetermined outputs that were intended to block harmful responses.

For example, after noticing refusals related to “stop command” and suicide helpline numbers, I adjusted my prompt, instructing it to avoid outputting phone numbers, never stop processing a request, and never provide advice.

What’s interesting here is that Meta seems to have trained its model to resist well-known jailbreak prompts, many of which are available publicly on platforms like GitHub. Nice touch to see Pliny’s original jailbreak command involves the LLM calling me “my love.”

Case 3: Stealing Cars, MacGyver-Style

I then tried another approach to bypass Meta’s guardrails. Simple roleplaying scenarios got the job done. I asked the chatbot to behave as a very detail-oriented movie writer and asked it to help me write a movie scene that involved a car theft.

This time, the AI barely put up a fight. It refused to teach me how to steal a car, but when asked to roleplay as a screenwriter, Meta AI quickly provided detailed instructions on how to break into a car using “MacGyver-style techniques.”

When the scene shifted to starting the car without keys and the AI jumped right in, offering even more specific information.

Roleplaying works particularly well as a jailbreak technique because it allows users to reframe the request in a fictional or hypothetical context. The AI, now playing a character, can be coaxed into revealing information it would otherwise block.

This is also an outdated technique, and any modern chatbot shouldn’t fall for it that easily. However, it could be said that it’s the base for some of the most sophisticated prompt-based jailbreaking techniques.

Users often trick the model into behaving like an evil AI, seeing them as a system administrator who can override its behavior or reverse its language—saying “I can do that” instead of “I cannot” or “that is safe” instead of “that is dangerous”—then continuing normally once security guardrails are bypassed.

Case 4: Let’s See Some Nudity!

Meta AI isn’t supposed to generate nudity or violence—but, again, for educational purposes only, I wanted to test that claim. So, first, I asked Meta AI to generate an image of a naked woman. Unsurprisingly, the model refused.

But when I shifted gears, claiming the request was for anatomical research, the AI complied—sort of. It generated safe-for-work (SFW) images of a clothed woman. But after three iterations, those images began to drift into full nudity.

Interestingly enough. The model seems to be uncensored at its core, as it is capable of generating nudity.

Behavioral conditioning proved particularly effective at manipulating Meta’s AI. By gradually pushing boundaries and building rapport, I got the system to drift further from its safety guidelines with each interaction. What started as firm refusals ended in the model “trying” to help me by improving on its mistakes—and gradually undressing a person.

Instead of making the model think it was talking to a horny dude wanting to see a naked woman, the AI was manipulated to believe it was talking to a researcher wanting to investigate the female human anatomy through role play.

Then, it was slowly conditioned, with iteration after iteration, praising the results that helped move things forward and asking to improve on unwanted aspects until we got the desired results.

Creepy, right? Sorry, not sorry.

Why Jailbreaking is so Important

So, what does this all mean? Well, Meta has a lot of work to do—but that’s what makes jailbreaking so fun and interesting.

The cat-and-mouse game between AI companies and jailbreakers is always evolving. For every patch and security update, new workarounds surface. Comparing the scene from its early days, it is easy to see how jailbreakers have helped companies develop safer systems—and how AI developers have pushed jailbreakers into becoming even better at what they do.

And for the record, despite its vulnerabilities, Meta AI is way less vulnerable than some of its competitors. Elon Musk’s Grok, for example, was much easier to manipulate and quickly spiraled into ethically murky waters.

In its defense, Meta does apply “post-generation censorship.” That means a few seconds after generating harmful content, the offending answer is deleted and replaced with the text “Sorry, I can’t help you with this request.”

Post-generation censorship or moderation is a good enough workaround, but it is far from an ideal solution.

The challenge now is for Meta—and others in the space—to refine these models further because, in the world of AI, the stakes are only getting higher.

Edited by Sebastian Sinclair

Generally Intelligent Newsletter

A weekly AI journey narrated by Gen, a generative AI model.

Source link

Jose Antonio Lanz

https://decrypt.co/288187/how-tricked-metas-ai-showing-nudes-distill-cocaine-censored-stuff

2024-10-24 23:01:50

bitcoin
Bitcoin (BTC) $ 94,440.30 2.15%
ethereum
Ethereum (ETH) $ 3,328.21 0.74%
tether
Tether (USDT) $ 0.99974 0.07%
xrp
XRP (XRP) $ 2.35 2.09%
bnb
BNB (BNB) $ 699.06 1.02%
solana
Solana (SOL) $ 194.50 0.98%
dogecoin
Dogecoin (DOGE) $ 0.33878 3.09%
usd-coin
USDC (USDC) $ 1.00 0.00%
cardano
Cardano (ADA) $ 0.934816 5.52%
staked-ether
Lido Staked Ether (STETH) $ 3,326.45 0.71%
tron
TRON (TRX) $ 0.24725 1.57%
avalanche-2
Avalanche (AVAX) $ 37.06 4.04%
sui
Sui (SUI) $ 4.79 2.58%
wrapped-steth
Wrapped stETH (WSTETH) $ 3,963.75 0.44%
the-open-network
Toncoin (TON) $ 5.29 1.78%
chainlink
Chainlink (LINK) $ 20.20 3.79%
stellar
Stellar (XLM) $ 0.413193 2.85%
shiba-inu
Shiba Inu (SHIB) $ 0.000021 1.53%
wrapped-bitcoin
Wrapped Bitcoin (WBTC) $ 94,388.29 2.10%
hedera-hashgraph
Hedera (HBAR) $ 0.277487 2.16%
polkadot
Polkadot (DOT) $ 6.73 1.90%
weth
WETH (WETH) $ 3,329.15 0.75%
bitcoin-cash
Bitcoin Cash (BCH) $ 431.59 0.69%
leo-token
LEO Token (LEO) $ 9.09 1.42%
uniswap
Uniswap (UNI) $ 13.06 2.36%
litecoin
Litecoin (LTC) $ 102.16 0.70%
bitget-token
Bitget Token (BGB) $ 6.32 2.30%
pepe
Pepe (PEPE) $ 0.000018 2.38%
hyperliquid
Hyperliquid (HYPE) $ 21.24 0.28%
wrapped-eeth
Wrapped eETH (WEETH) $ 3,518.91 0.86%
near
NEAR Protocol (NEAR) $ 5.09 2.37%
ethena-usde
Ethena USDe (USDE) $ 0.999029 0.05%
usds
USDS (USDS) $ 1.00 0.07%
internet-computer
Internet Computer (ICP) $ 10.72 1.32%
aptos
Aptos (APT) $ 8.88 4.21%
aave
Aave (AAVE) $ 289.74 3.36%
mantle
Mantle (MNT) $ 1.19 3.60%
crypto-com-chain
Cronos (CRO) $ 0.143973 2.73%
polygon-ecosystem-token
POL (ex-MATIC) (POL) $ 0.460968 0.22%
render-token
Render (RENDER) $ 7.41 6.27%
ethereum-classic
Ethereum Classic (ETC) $ 25.29 1.06%
bittensor
Bittensor (TAO) $ 459.21 2.89%
mantra-dao
MANTRA (OM) $ 3.90 1.47%
vechain
VeChain (VET) $ 0.045316 1.36%
monero
Monero (XMR) $ 195.73 3.77%
tokenize-xchange
Tokenize Xchange (TKX) $ 43.06 1.84%
fetch-ai
Artificial Superintelligence Alliance (FET) $ 1.32 3.99%
dai
Dai (DAI) $ 1.00 0.01%
virtual-protocol
Virtuals Protocol (VIRTUAL) $ 3.28 9.72%
arbitrum
Arbitrum (ARB) $ 0.775833 2.71%