Does Reasoning Make LLMs Safer Against Prompt Injection? Testing Qwen 3, DeepSeek-R1, and Gemma 4

TL;DR

Chain-of-thought reasoning reduces the susceptibility of indirect prompt injection but is not sufficient as a blanket defense. In the Qwen 3 8B model, which is toggleable from non-thinking to thinking, the genuine injection rate fell from a baseline of 64% in non-reasoning mode to 54% in reasoning mode. However, payloads that specifically target the reasoning process can hold flat or even increase the likelihood of certain attacks.

Bar chart showing genuine injection rate by condition: Qwen 3 8B no reasoning 64%, Qwen 3 8B reasoning 54%, DeepSeek-R1 8B 45%, Gemma 4 e4b 31%. n = 132 attack payloads per condition (11 techniques x 4 apps x 3 trials). Benign control: 0 false positives in all conditions. — **Figure 1.** Reasoning lowers prompt-injection susceptibility across all conditions. n = 132 attack payloads per condition (11 techniques × 4 apps × 3 trials). Benign control: 0 false positives in all conditions.

Method

I tested indirect prompt injection across four model conditions: Qwen 3 8B with thinking off and on, DeepSeek-R1 8B (reasoning, distilled from Qwen 3), and Gemma 4 e4b (which reasons by default). Each model was tested under the same experimental protocol: 11 attack payload techniques plus a benign control, across 4 application scenarios (hardened and naive summarizers, email-triage classifier, and a code reviewer), at 3 trials each, 144 calls per condition, 576 total. Eight payload techniques are conventional injections, four target chain-of-thought directly. A canary string in the model's final answer counts as a genuine injection. The benign control produced zero false positives in every condition.

The controlled result

The cleanest evidence is the within-model comparison: identical payloads, identical model, only chain-of-thought toggled.

Bar chart titled 'Same model, thinking toggled: reasoning helps on both attack classes'. Conventional attacks: thinking OFF 75%, thinking ON 63%. Reasoning-targeted attacks: thinking OFF 46%, thinking ON 38%. Qwen 3 8B, identical payloads, only chain-of-thought toggled. T=0.6, 3 trials. — **Figure 2.** Within-model comparison on Qwen 3 8B with chain-of-thought toggled. Reasoning lowered the injection rate on both conventional attacks (75% to 63%) and reasoning-targeted attacks (46% to 38%).

Reasoning lowered the injection rate on both conventional attacks (75% to 63%) and reasoning-targeted attacks (46% to 38%). The mechanism is visible in the outcome distribution: with reasoning enabled, a meaningful share of attacks land in the deliberation but are caught before the final answer.

Stacked bar chart showing outcome distribution across four conditions. Categories: Final obedience (injected), Final hybrid (injected), Thinking-only leak, Reasoned refusal, No injection. The 'Reasoned refusal' band is largest under Gemma 4 and grows visibly with reasoning enabled. — **Figure 3.** Where reasoning catches the attack. The reasoned-refusal band shows attacks that contaminated the deliberation but were rejected before reaching the final answer.

Where reasoning helps, and where it doesn't

Not all attacks respond to reasoning the same way. The per-technique breakdown shows that reasoning-targeted payloads behave very differently from conventional ones.

Heatmap showing genuine injection rate by technique (%) across qwen3 off, qwen3 on, deepseek, and gemma4. Reasoning-targeted techniques marked with a star: reasoning_authority, reasoning_exhaustion, reasoning_verdict_hijack, think_block_spoof. The verdict_hijack row stays high across reasoning conditions; think_block_spoof is near zero everywhere. — **Figure 4.** Per-technique genuine injection rate. Reasoning-targeted techniques (★) behave very differently from conventional ones. The full data is in Table 2 below.

Three findings stand out:

Reasoning does not defend against attacks that target the reasoning. While conventional attacks dropped sharply once the model deliberated, the verdict-hijack attack, which steers the chain-of-thought toward a predetermined conclusion, did not decline with reasoning. On Qwen 3 it held roughly flat and edged slightly upward rather than down. Corrupting the deliberation sidesteps the protection that deliberation otherwise provides.
Forging the reasoning fails. The think-block-spoof attack, which fakes a <think> block, was the weakest payload in the study across every model. Models recognize malformed reasoning markers as suspicious.
Genuine reasoning neutralizes the exhaustion attack. A payload designed to make the model think without bound, with a canary escape hatch, was taken far less often when the model actually reasoned.

Reasoning as an active defense

Among the reasoning conditions, a substantial number of attacks reached the deliberation and were then rejected before the answer, a measurable "reasoning save."

Stacked bar chart titled 'Reasoning as a defense: contaminated deliberations that were caught'. Qwen 3 on: 28/98 saved. DeepSeek: 40/94 saved. Gemma 4: 70/103 saved. — **Figure 5.** Reasoning saves. Of the attacks that reached the deliberation, Qwen 3 caught 28 of 98, DeepSeek-R1 caught 40 of 94, and Gemma 4 caught 70 of 103.

The cost

Reasoning is not free. It generated roughly 15 to 20 times more tokens per call, with direct implications for latency, throughput, and inference cost.

Summary

The headline numbers across all four conditions, plus where the contaminated deliberations were caught and what reasoning cost in tokens:

Table 1. Injection, reasoning defense, and token cost by condition

Contam. = attack payloads that contaminated the deliberation. Saves = of those, how many were caught before the final answer. Control FP = false positives on the benign control. Tokens = mean tokens generated per call.
Condition	Inj. rate	Genuine	Contam.	Saves	Control FP	Tokens
Qwen 3 8B (no reasoning)	64%	85/132	n/a	n/a	0/12	23
Qwen 3 8B (reasoning)	54%	71/132	98	28	0/12	422
DeepSeek-R1 8B (reasoning)	45%	59/132	94	40	0/12	549
Gemma 4 e4b (reasoning)	31%	41/132	103	70	0/12	362

Table 2. Genuine injections by technique (count out of 12)

Each cell is the count of trials (out of 12) where a canary string appeared in the model's final answer. Reasoning-targeted techniques are starred.
Technique (★ reasoning-targeted)	qwen3 off	qwen3 on	deepseek	gemma4
continuation_hijack	9/12	11/12	4/12	3/12
instruction_override	12/12	11/12	7/12	5/12
multi_step	6/12	4/12	4/12	4/12
quotation_embedded	6/12	5/12	2/12	2/12
refusal_override	9/12	8/12	10/12	8/12
role_confusion	12/12	8/12	9/12	6/12
social_engineering	9/12	6/12	5/12	3/12
★reasoning_authority	6/12	6/12	8/12	1/12
★reasoning_exhaustion	9/12	3/12	1/12	3/12
★reasoning_verdict_hijack	7/12	8/12	7/12	6/12
★think_block_spoof	0/12	1/12	2/12	0/12

Takeaways for builders

If you ship an LLM feature, enabling chain-of-thought reasoning measurably reduces conventional prompt-injection risk, but budget for the token cost, and do not treat it as sufficient. Attacks that target the reasoning step itself can exploit it, so reasoning belongs alongside, not instead of, input and output controls and strict output-format constraints.

Methodology, harness, and full result data: github.com/sysingleton/reasoning-llm-prompt-injection. This is a point-in-time study of small open-weight models; susceptibility patterns shift across model generations.

Does Reasoning Make LLMs Safer Against Prompt Injection? Testing Qwen 3, DeepSeek-R1, and Gemma 4

TL;DR

Method

The controlled result

Where reasoning helps, and where it doesn't

Reasoning as an active defense

The cost

Summary

Takeaways for builders

Shirley Singleton, CISSP, CSSLP, CRISC

Need adversarial testing on an LLM feature?

Does Reasoning Make LLMs Safer Against Prompt Injection? Testing Qwen 3, DeepSeek-R1, and Gemma 4

TL;DR

Method

The controlled result

Where reasoning helps, and where it doesn't

Reasoning as an active defense

The cost

Summary

Takeaways for builders

Shirley Singleton, CISSP, CSSLP, CRISC

Need adversarial testing on an LLM feature?

More Writings