30 - AI Security with Jeffrey Ladish

Science and technology

Top labs use various forms of "safety training" on models before their release to make sure they don't do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don't get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Topics we discuss, and timestamps:
0:00:38 - Fine-tuning away safety training
0:13:50 - Dangers of open LLMs vs internet search
0:19:52 - What we learn by undoing safety filters
0:27:34 - What can you do with jailbroken AI?
0:35:28 - Security of AI model weights
0:49:21 - Securing against attackers vs AI exfiltration
1:08:43 - The state of computer security
1:23:08 - How AI labs could be more secure
1:33:13 - What does Palisade do?
1:44:40 - AI phishing
1:53:32 - More on Palisade's work
1:59:56 - Red lines in AI development
2:09:56 - Making AI legible
2:14:08 - Following Jeffrey's research
The transcript: axrp.net/episode/2024/04/30/e...
Palisade Research: palisaderesearch.org
Jeffrey's Twitter/X account: / jeffladish
Main research links:
- LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B: arxiv.org/abs/2310.20624
- BadLLaMa: Cheaply Removing Safety Fine-tuning From LLaMa 2-Chat 13B: arxiv.org/abs/2311.00117
- Securing Artificial Intelligence Model Weights: www.rand.org/pubs/working_pap...
Other links:
- Llama 2: Open Foundation and Fine-Tuned Chat Models: arxiv.org/abs/2307.09288
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: arxiv.org/abs/2310.03693
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models: arxiv.org/abs/2310.02949
- On the Societal Impact of Open Foundation Models (Stanford paper on marginal harms from open-weight models): crfm.stanford.edu/open-fms/
- The Operational Risks of AI in Large-Scale Biological Attacks (RAND): www.rand.org/pubs/research_re...
- Preventing model exfiltration with upload limits: www.alignmentforum.org/posts/...
- A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution: googleprojectzero.blogspot.co...
- In-browser transformer inference: aiserv.cloud/
- Anatomy of a rental phishing scam: jeffreyladish.com/anatomy-of-...
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses: www.alignmentforum.org/posts/...

Comments: 9

@matts7327 21 days ago
This is a really nice deep dive not only on AI, but security and the state of the industry in general. Bravo!
@axrpodcast
21 days ago
Thanks :)
@dizietz 22 days ago
I've been loving this new stream of content on Spotify during long drives! Daniel, you are pretty well up to date on papers generally; I am always impressed.
@axrpodcast
22 days ago
Glad to hear you like these :)
@turtlewax3849 21 days ago
Who is going to secure you from yourself and the AI securing you?
@teluobir 20 days ago
When you think about it, you sum all the "like" and they take 105 minutes in this video, the "yeah" take 40 minutes, and the "um" take about 15 minutes… Take them off and you'd have a much more digestible video.
@tylertracy965
18 days ago
Unfortunately, some of the best researchers out there aren't the most fluent with speech. It didn't distract from the overall conversation for me.
@akmonra 18 days ago
Just a few minutes in, and he gets the basics of low-rank adapters completely wrong. Starting to wonder how much he actually understands.
@tylertracy965
18 days ago
In what ways? Could you provide examples so future listeners can understand them correctly?