How Not to Detect Prompt Injections with an LLM
Published in AISec 2025 (18th ACM Workshop on Artificial Intelligence and Security), 2025
Recent defenses based on known-answer detection (KAD) have achieved near-perfect performance by using an LLM to classify inputs as clean or contaminated. In this work, we formally characterize the KAD framework and uncover a structural vulnerability in its design that invalidates its core security premise.
