TL;DR: Researchers at Anthropic, the UK AI Security Institute and the Alan Turing Institute discovered that injecting as few as 250 poisoned documents can stealthily “backdoor” any large language model—whether it’s a compact 600 M-parameter system or a massive 13 B-parameter behemoth.
Surprisingly, model size and total training data don’t matter: the same tiny batch of malicious samples is enough to compromise performance across the board, highlighting a critical vulnerability in how these AIs learn.
Watch on YouTube
Top comments (0)