Researchers at Anthropic teamed up with the UK AI Security Institute and the Alan Turing Institute to show that injecting a mere 250 malicious documents can plant a “backdoor” in any large language model, no matter its size or how much data it’s trained on. In fact, even a 13B-parameter model with over 20× more training data is just as vulnerable as a smaller 600M-parameter version.
This finding blows the lid off the assumption that bigger models are inherently safer—the same tiny dose of poisoned data is enough to compromise them all.
Watch on YouTube
Top comments (0)