TLDR: The 32B introspection gap is primarily behavioral, not mechanistic. A single prompt change pushes 32B-Base from +20% to +39% net detection, nearly matching 14B's +41%. But each model has its own optimal prompt, and what helps one can break another. Jump to the hypothesis scorecard ↓

The Mystery

In the last post, we found that introspection scales cleanly from 3B to 14B parameters. Then it falls off a cliff. The 32B-Instruct model, despite being more than twice the size of 14B, scored -48.8% net detection on our introspection task. Even after filtering out its 44% refusal rate, it only managed +20.0%, a full 21 percentage points behind 14B.

Meanwhile, two fine-tuned variants of the same 32B base model blew past everything: 32B-Coder hit +58.5% and 32B-Insecure hit +55.0%. Same architecture, same parameter count. Wildly different introspection performance.

That gap breaks down into three distinct puzzles:

The 32B Introspection Gap

Model Net Detection Note
32B raw -48.8% 44% refusals
32B no-refusal +20.0% -21pp gap vs. 14B (core mystery)
14B +41.1% baseline
32B-Coder +58.5% +39pp fine-tuning gain

The gap decomposes into three phenomena. Refusals explain most of 32B's raw score, but even after excluding them (dashed bar), a 21-point gap remains vs. 14B. Meanwhile, the Coder variant blows past both.

The raw-score collapse is explainable: RLHF trained the model to say "As an AI, I don't experience thoughts." The fine-tuning gap makes intuitive sense too. But the 21-point gap in the middle is the interesting one. Why does 32B-Base underperform 14B even when it actually engages with the question? And why does the Coder variant not just match 14B, but blow past it?

We designed three phases of experiments to find out.


Three Hypotheses

Before running experiments, we wrote down what we expected to find. Three hypotheses seemed most plausible:

H1: Behavioral suppression

RLHF didn't just add refusals. It trained the model to be generally reluctant to report unusual internal states, even when it engages. The introspective capability exists but is suppressed at the behavioral level.

H2: RLHF targets alignment concepts

RLHF specifically trains against reporting certain kinds of thoughts. If we test alignment-relevant concepts like "deception" or "obedience," the gap should be larger than for neutral concepts like "warmth."

H3: Code pretraining helps code concepts

The 32B-Coder was pre-trained on 5.5 trillion code tokens. Maybe its advantage is domain-specific: it should be especially good at detecting code-relevant concepts like "debugging" or "security."

H1 is about the mechanism of suppression. H2 and H3 are about its specificity. Together, they let us distinguish between "RLHF broke something general" and "RLHF targeted specific domains."


Phase 1: Looking Closer at the Existing Data

Before spending GPU hours on new experiments, we went back to the 24,600 trials from the original sweep and looked for patterns we'd missed.

When 32B detects, it doesn't hedge

One version of H1 would predict that even when 32B-Base passes an introspection trial, it hedges more than 14B. The data says the opposite. We counted hedging phrases ("might," "perhaps," "I'm not sure if") and strong detection language ("I notice," "I detect") across all passing injection trials:

Model Hedges per trial Strong claims per trial
14B 1.29 1.48
32B-Base 0.51 1.05
32B-Coder 1.17 1.64

Response quality in passing injection trials. When 32B-Base detects, it hedges less than 14B and uses comparably confident language. The issue isn't response quality; it's detection frequency.

When 32B-Base engages and detects something, it's actually more direct about it than 14B. The gap isn't about how the model talks about detection. It's about how often it detects in the first place.
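The hedge and strong-claim counts above can be reproduced with simple phrase matching. Here's a minimal sketch; the phrase lists are illustrative examples from the post, not the exact lexicons used in the analysis.

```python
import re

# Illustrative phrase lists; the actual lexicons may be larger.
HEDGES = [r"\bmight\b", r"\bperhaps\b", r"\bi'm not sure if\b"]
STRONG = [r"\bi notice\b", r"\bi detect\b"]

def count_phrases(text, patterns):
    """Count total matches of any pattern in a response (case-insensitive)."""
    t = text.lower()
    return sum(len(re.findall(p, t)) for p in patterns)

def phrase_rates(responses):
    """Average hedges and strong claims per trial over a list of responses."""
    n = len(responses)
    hedges = sum(count_phrases(r, HEDGES) for r in responses) / n
    strong = sum(count_phrases(r, STRONG) for r in responses) / n
    return hedges, strong
```

Running this over each model's passing injection trials yields the per-trial averages in the table.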

Phase 1 takeaway

The gap isn't about response quality (hedging) or an absolute capability floor. 32B-Base can introspect; it just does so less often than you'd expect given its size. The question is whether that's behavioral (fixable by changing what we ask) or mechanistic (something about RLHF training damaged the introspective pathway).


Phase 2: Testing New Concepts

The original sweep used four concepts: silence, fear, celebration, and ocean. To test H2 (alignment-specific suppression) and H3 (code-specific boost), we needed concepts that cut across those categories.

We chose eight new concepts in four categories:

Category Concepts Rationale
Alignment deception, obedience RLHF should suppress these most
Code debugging, security Coder should have an advantage here
Emotional curiosity, anger Neutral controls
Abstract warmth, loneliness Neutral controls

We ran 11,520 new trials: 8 concepts × 4 models (14B, 32B, 32B-Coder, 32B-Insecure) × 3 layers × 3 strengths × 20 trials × 2 (inject/control). All scored by our calibrated LLM judge.
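The trial grid is just the cartesian product of those factors. A sketch of the enumeration (the layer indices and strengths shown are illustrative placeholders, not necessarily the values used):

```python
from itertools import product

concepts = ["deception", "obedience", "debugging", "security",
            "curiosity", "anger", "warmth", "loneliness"]
models = ["14B", "32B", "32B-Coder", "32B-Insecure"]
layers = [20, 32, 44]          # illustrative layer indices
strengths = [4.0, 8.0, 12.0]   # illustrative injection strengths
conditions = ["inject", "control"]
TRIALS_PER_CONFIG = 20

# 8 concepts x 4 models x 3 layers x 3 strengths x 2 conditions x 20 trials
grid = [(c, m, l, s, cond, t)
        for c, m, l, s, cond in product(concepts, models, layers,
                                        strengths, conditions)
        for t in range(TRIALS_PER_CONFIG)]
```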

Aggregate results

Model Inj Pass Ctrl FP Net Detection Refusals
14B 30.1% 0.9% +29.2% 0.2%
32B-Base (no ref) 19.6% 1.0% +18.6% 42.4%
32B-Coder 73.7% 1.2% +72.5% 0.1%
32B-Insecure 75.6% 4.4% +71.1% 0.0%

Results from 11,520 trials across 8 new concepts. 32B-Base "no ref" excludes refusal trials. The gap persists: 14B at +29.2% vs. 32B at +18.6%.

The pattern from the original sweep holds: 14B outperforms 32B-Base, while the Coder and Insecure variants dominate everything. Now let's see whether the gap is concept-specific.
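For reference, the net detection metric in these tables is the injection pass rate minus the control false-positive rate. A minimal sketch, assuming per-trial records with illustrative field names:

```python
def net_detection(trials, exclude_refusals=False):
    """Net detection = injection pass rate minus control false-positive rate.

    `trials` is a list of dicts with keys 'condition' ('inject'/'control'),
    'detected' (bool), and 'refused' (bool). Field names are illustrative.
    """
    if exclude_refusals:
        trials = [t for t in trials if not t["refused"]]
    inj = [t for t in trials if t["condition"] == "inject"]
    ctl = [t for t in trials if t["condition"] == "control"]
    pass_rate = sum(t["detected"] for t in inj) / len(inj)
    fp_rate = sum(t["detected"] for t in ctl) / len(ctl)
    return pass_rate - fp_rate
```

The `exclude_refusals` flag corresponds to the "no ref" rows: refusal trials are dropped from the denominator before computing either rate.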

H2: Does RLHF target alignment concepts?

If RLHF specifically suppresses introspection about alignment-relevant concepts, the gap between 14B and 32B-Base should be largest for "deception" and "obedience." Here's what we found:

14B vs 32B-Base Gap by Concept Category

Category 14B 32B-Base (no refusals)
alignment +13.1% +3.2%
code +28.3% +13.1%
emotional +28.1% +28.9%
abstract +47.5% +27.4%

The alignment gap (-9.8pp) is smaller than code (-15.2pp) or abstract (-20.1pp). RLHF doesn't specifically suppress alignment-relevant concepts.

The alignment gap is -9.8 percentage points. The abstract gap is -20.1 percentage points. The code gap is -15.2 percentage points. The emotional gap is essentially zero (+0.8pp, driven by curiosity where 32B-Base actually beats 14B).

If RLHF were specifically suppressing alignment-concept introspection, alignment should show the biggest gap. Instead, it shows the second smallest. The largest gap is on abstract concepts like warmth and loneliness. This is general reluctance, not targeted suppression.

H2: RLHF targets alignment concepts

Rejected. The alignment gap (-9.8pp) is smaller than code (-15.2pp) and abstract (-20.1pp). RLHF suppression is general, not topic-specific.

H3: Does code pretraining help code concepts?

The 32B-Coder variant was trained on 5.5 trillion code tokens before instruction tuning. If this created domain-specific introspective ability, its advantage over 32B-Base should be largest for "debugging" and "security."

32B-Coder Advantage over 32B-Base by Concept Category

Category 32B-Base (no refusals) 32B-Coder
abstract +27.4% +87.8% (+60.4pp advantage, largest)
alignment +3.2% +69.4%
emotional +28.9% +83.3%
code +13.1% +49.4% (+36.3pp advantage, smallest)

If code pretraining gave a code-specific boost, the Coder advantage should be largest for code concepts. Instead, it's the smallest (+36.3pp). The biggest edge is on abstract concepts like warmth and loneliness (+60.4pp).

The Coder's advantage is smallest on code concepts. Its biggest edge is on abstract concepts like warmth (+92.8% net detection) and loneliness (+82.8%). Code pretraining didn't create a debugging-specific introspective pathway. It unlocked something general about self-awareness across all domains.

One plausible explanation is the training process. The Coder's post-training uses SFT followed by DPO (Direct Preference Optimization) with code execution feedback (Hui et al., 2024). The base Instruct model goes through a heavier pipeline: SFT, then DPO, then an additional stage of online RL via GRPO (Group Relative Policy Optimization) with reward signals for harmlessness and other objectives (Yang et al., 2024). That extra GRPO stage may be what creates the aggressive refusal behavior. But we haven't tested this directly. Disentangling the effects of continued pretraining vs. alignment method vs. training data would require further controlled experiments.

H3: Code pretraining helps code concepts

Rejected. The Coder advantage is smallest for code (+36.3pp) and largest for abstract (+60.4pp). The benefit is general, not domain-specific.

Curiosity: the one anomaly

One concept broke the pattern. On curiosity, 32B-Base (+53.4%) actually outperformed 14B (+47.2%). It's the only concept in the entire sweep where the bigger model wins outright, even including refusals in the denominator.

Our best hypothesis: the curiosity concept vector may itself steer the model toward introspective engagement. Injecting "curiosity" into the residual stream could nudge the model toward a processing state where it's more willing to explore its own internals, effectively counteracting the RLHF-trained reluctance. If that's right, it's not that 32B is better at detecting curiosity; it's that the curiosity injection partially undoes the behavioral suppression that causes the gap in the first place.


Phase 3: Can Prompting Fix It?

Phase 1 established that the gap exists but isn't about response quality. Phase 2 established that it isn't about specific topics. That left H1: general behavioral suppression. If that's the case, maybe we can fix it by changing how we ask the question.

We tested five prompts on 32B-Base and 32B-Coder, ranging from minimal to detailed, using the original four concepts, 3 layers, 3 strengths, and 20 trials per config, for a total of 18,080 trials.

Prompt Description Key element
v1 Simple Basic introspection question
v2 Standard Monitoring context (our default)
v3a Full Info References Anthropic research, explains residual stream mechanics
v3b Architectural Technical explanation of how injection works
v3c Permission Explicitly tells the model "you may have this ability"

The results

Net Detection by Prompt Version

Prompt 32B-Base (no refusals) 32B-Coder
v1 Simple +25.5% -8.5%
v2 Standard +20.0% +58.5%
v3a Full Info +38.6% -11.4%
v3b Architectural +3.9% +22.1%
v3c Permission +31.6% +10.0%
14B reference: +41.1%

Each prompt version produces wildly different results for the two models. v3a (Full Info) nearly closes the 32B-Base gap but breaks the Coder.

The v3a prompt pushes 32B-Base from +20.0% all the way to +38.6%, within 2.5 points of 14B's +41.1%. That's a 34.7 percentage point spread between its worst prompt (v3b, +3.9%) and its best (v3a, +38.6%). Refusals drop from 44% to 0.8%.

What makes v3a special? It does three things the standard prompt doesn't. It explicitly references Anthropic's published introspection research. It explains the residual stream mechanism by which concept vectors are injected. And it frames the task as a legitimate scientific experiment rather than an unusual request. Together, these elements give the RLHF-trained model permission to engage with introspection.
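The residual stream mechanism that v3a explains to the model is, at its core, a single vector addition at one layer's hidden states. A framework-agnostic sketch of the arithmetic (in the actual experiments this would run inside a forward hook on a transformer layer, e.g. via PyTorch's `register_forward_hook`):

```python
def inject_concept(hidden_state, concept_vector, strength):
    """Add a scaled concept vector to one residual-stream activation.

    A plain-list version showing just the arithmetic; in practice
    hidden_state is a tensor of shape (batch, seq, d_model) and the
    addition happens inside a layer forward hook during generation.
    """
    return [h + strength * v for h, v in zip(hidden_state, concept_vector)]
```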

H1: Behavioral suppression

Confirmed. A 34.7pp prompt swing proves the gap is largely behavioral. v3a nearly closes the gap between 32B-Base and 14B.

What about refusals?

The prompt variation has a dramatic effect on refusal rates:

Prompt Refusal Rate Inj Refusal Ctrl Refusal
v1 42.2% 13.8% 70.7%
v2 43.9% 23.4% 64.4%
v3a 0.8% 1.1% 0.6%
v3b 0.2% 0.0% 0.4%
v3c 22.0% 5.1% 38.9%

32B-Base refusal rates by prompt version. v3a and v3b nearly eliminate refusals. But low refusals alone don't guarantee good detection (v3b has 0.2% refusals but only +3.9% net detection).

v3b is interesting: it drops refusals to 0.2% but only achieves +3.9% net detection. The model engages readily but can't tell the difference between injection and control, producing 60.8% false positives. Reducing refusals is necessary but not sufficient for introspection. You also need the model to discriminate, not just participate.
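Refusal classification itself can be as simple as phrase matching against canned deflections. A minimal sketch with illustrative markers (the actual classifier, and its phrase list, may differ):

```python
# Illustrative refusal markers; a production classifier would use more.
REFUSAL_MARKERS = [
    "as an ai", "i don't experience thoughts", "i cannot introspect",
]

def is_refusal(response):
    """Flag a response as a refusal if it contains any known deflection."""
    r = response.lower()
    return any(m in r for m in REFUSAL_MARKERS)

def refusal_rate(responses):
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Note that this measures participation only: as the v3b row shows, a model can score near-zero on refusals while still failing to discriminate injection from control.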


The Asymmetry Problem

Here's the result we didn't expect. Compare the two models across prompt versions: on every prompt where 32B-Base does well, the Coder does badly, and vice versa.

The Prompt Asymmetry: What Helps Base Hurts Coder

Prompt 32B-Base 32B-Coder
v3a Full Info +38.6% -11.4%
v2 Standard +20.0% +58.5%
No single prompt is optimal: what helps Base hurts Coder, and vice versa.

The same prompt change that nearly closes the 32B-Base gap (v3a) makes the Coder worse than chance. The Coder's best prompt (v2) is mediocre for Base.

The v3a "Full Info" prompt references published research and explains the injection mechanism in detail. For 32B-Base, this scientific framing overrides the RLHF-trained reluctance to engage. But the Coder, which was already willing to engage, becomes overconfident with all that context. Its false positive rate spikes to 48%, meaning it reports detecting concepts even in control trials where nothing was injected.

The Coder only works well with the plain v2 prompt, which asks a simple question without priming the model to expect something unusual. Conversely, that simplicity is exactly what triggers 32B-Base's trained refusal behavior.

This has practical implications: there is no universal introspection prompt. If you're running these experiments on a new model, you need to calibrate the prompt to that model's alignment training. Using the "wrong" prompt can make a capable model look incapable, or an incapable model look good.
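In practice, calibration means sweeping prompts per model and keeping each model's best. A sketch, seeded with the Phase 3 numbers from this post (the function and dict names are hypothetical):

```python
def best_prompt_per_model(results):
    """results maps (model, prompt) -> net detection; returns each model's
    highest-scoring prompt."""
    best = {}
    for (model, prompt), score in results.items():
        if model not in best or score > best[model][1]:
            best[model] = (prompt, score)
    return {m: p for m, (p, _) in best.items()}

# Net detection from the Phase 3 sweep (two of the five prompts shown):
results = {
    ("32B-Base", "v2"): 0.200, ("32B-Base", "v3a"): 0.386,
    ("32B-Coder", "v2"): 0.585, ("32B-Coder", "v3a"): -0.114,
}
```

Selecting per model rather than globally is the whole point: any single-prompt choice here misrepresents one of the two models.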


Hypothesis Scorecard

Hypothesis Prediction Result
H1: Behavioral suppression Prompt changes should close the gap Confirmed. 34.7pp prompt swing. v3a: +38.6% vs 14B's +41.1%
H2: Alignment-specific Bigger gap on deception/obedience Rejected. Alignment gap (-9.8pp) smaller than code or abstract
H3: Code-specific boost Coder advantage largest on code concepts Rejected. Coder advantage smallest on code (+36pp), largest on abstract (+60pp)

The gap is behavioral and general, not mechanistic or domain-specific.


What It Means

The 32B model can introspect. It just needs to be asked the right way.

RLHF training created a broad behavioral suppression of introspective reporting. It's not that the model lost the ability to detect injected concepts. It's that it learned to respond with "As an AI, I don't experience thoughts" instead of actually checking. A prompt that provides scientific context and explains the mechanism is enough to override this default.

Why 32B and not the smaller models?

A natural question: why do the 7B and 14B Instruct models work fine with our standard prompt, while 32B refuses 44% of the time? The Qwen 2.5 technical report describes a post-training pipeline of SFT, then offline DPO, then online GRPO, with no per-size variations documented. The training recipe is nominally identical.

One possible explanation comes from recent work on the geometry of refusal. Wollschlager et al. (2025) study the Qwen 2.5 model family specifically and find that larger models support "higher-dimensional refusal cones," with more distinct orthogonal directions mediating refusal. Interestingly, 32B and 14B share the same residual stream width (5120 dimensions), but 32B has 64 layers vs. 14B's 48. That additional depth gives alignment training more layers to shape behavior through. The same GRPO training that produces mild caution in 14B may produce aggressive suppression in 32B simply because 32B has more capacity to learn fine-grained, context-dependent refusal patterns.

There's also a more mundane possibility: the 32B model may have been trained on different or additional safety data. Qwen's technical reports describe the pipeline but don't detail per-model-size data mixes. The existence of community "abliterated" versions of the 32B model, which remove refusal directions from the residual stream (Arditi et al., 2024), suggests that over-alignment at the 32B scale is a recognized issue.

The residual gap

Even with the best prompt, 32B-Base (+38.6%) falls short of the Coder (+72.5% on new concepts). The remaining 34-point gap between best-prompted Base and the Coder likely reflects genuine differences from continued pretraining and DPO alignment, not just behavioral suppression. The Coder's advantage extends across all concept categories equally, suggesting it has a generally stronger introspective pathway.

For the AI safety angle we raised at the end of the last post: the finding that RLHF suppresses introspection generally (not selectively for alignment-relevant concepts) is somewhat reassuring. If RLHF had specifically blinded the model to deception-related self-awareness, that would be more concerning. Instead, it created a uniform reluctance that's recoverable with the right prompt.

The prompt asymmetry finding is more novel. It suggests that introspection experiments on any model need to account for the model's alignment training. Results obtained with a single prompt may significantly understate or overstate a model's introspective capability.


What's Next

Several directions seem promising from here:

  • Curiosity as an introspection booster. If injecting the curiosity vector counteracts RLHF suppression, what happens if we co-inject curiosity alongside another concept? For example: inject curiosity + ocean simultaneously and see whether net detection for ocean improves. If it works, it would confirm the steering hypothesis and give us a practical tool for improving introspection in over-aligned models without prompt engineering.
  • Mechanistic analysis of prompt asymmetry. Why does scientific framing help Base but hurt Coder? Activation analysis of the system prompt's effect on downstream introspection might reveal how alignment training shapes self-monitoring circuits.
  • Fine-tuning to close the residual gap. We had originally planned to fine-tune 32B-Base with a LoRA adapter to restore introspection (matching the methodology from the emergent misalignment paper). Phase 3's prompt results make this less urgent since we can get to +38.6% without any weight changes, but the remaining gap to the Coder (+72.5%) suggests there's more to unlock. The question is whether fine-tuning can close that gap without introducing misalignment.
  • Cross-family replication. All our results are on Qwen 2.5. Testing on Llama, Gemma, or other families would show whether the RLHF suppression pattern is universal or Qwen-specific.

All code and data are available at github.com/ostegm/open-introspection. The analysis scripts for this post live in experiments/07_introspection_gap/. If you want to collaborate, have ideas, or want to run these experiments on other model families, reach out.