TLDR: We used sparse autoencoders to find 11 features that light up when Gemma 3 4B-IT detects injected concepts. We ablated all of them. Detection didn't drop. We activated them without injection. The model didn't hallucinate. The features are real correlates of introspection but they don't cause it.
## Where We Left Off
In the last post we showed that introspection scales with model size across the Qwen2.5 family - from first signal at 3B to +58.5% net detection in the 32B-Coder variant. We had the behavioral result: models can tell when we inject concept vectors into their residual stream. The natural next question was how - what's actually going on inside the model when it detects an injection?
This is where we pivoted from Qwen to Gemma. Google's GemmaScope project released pretrained sparse autoencoders for the Gemma model family, giving us a way to decompose the residual stream into interpretable features without training our own. We switched to Gemma 3 4B-IT and ran a new sweep: 4 concepts, 4 injection layers, 4 strengths, 20 trials each - 1,600 trials total, each with full SAE feature captures at layer 22.
The idea: if introspection has a mechanism, maybe we can find it in the SAE features. Which features fire when the model detects an injection? Are they different from the features that fire when it misses one?
## Finding Candidate Features
We split our judged trials into contrast groups:
- Group A: Injection was present, model detected it (true positive)
- Group C: Injection was present, model missed it (false negative)
Then we computed Cohen's d for every feature across the 262,144-wide SAE - asking which features are significantly more active when the model detects an injection versus when it misses one. We filtered to features with large effect sizes and clustered them by correlation to find groups of related features.
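In code, the per-feature effect-size screen looks roughly like this. This is a minimal NumPy sketch on synthetic data: the real SAE is 262,144 features wide, the group sizes and the 0.8 "large effect" threshold are illustrative, and `cohens_d` is a hypothetical helper, not our actual codebase.

```python
import numpy as np

def cohens_d(a: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Per-feature Cohen's d between two trial groups.

    a: (n_a, n_features) SAE activations for detected trials (Group A)
    c: (n_c, n_features) SAE activations for missed trials (Group C)
    """
    mean_diff = a.mean(axis=0) - c.mean(axis=0)
    n_a, n_c = len(a), len(c)
    # Pooled standard deviation per feature
    pooled_var = ((n_a - 1) * a.var(axis=0, ddof=1) +
                  (n_c - 1) * c.var(axis=0, ddof=1)) / (n_a + n_c - 2)
    return mean_diff / np.sqrt(pooled_var + 1e-12)

# Toy data: 1,000 features stand in for the 262,144-wide SAE
rng = np.random.default_rng(0)
group_a = rng.normal(size=(40, 1000))
group_c = rng.normal(size=(40, 1000))
group_a[:, 7] += 2.0  # plant one genuinely discriminating feature

d = cohens_d(group_a, group_c)
candidates = np.where(np.abs(d) >= 0.8)[0]  # keep "large" effects only
print(candidates)
```

With a quarter-million features, a handful will clear any fixed threshold by chance, which is one reason the causal follow-up below matters.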
This produced 11 candidate features in three clusters:
| Group | Features | Interpretation |
|---|---|---|
| Perception | | Self-monitoring, awareness of processing |
| Affect | | Emotional/experiential language |
| Hedging | | Uncertainty, hedging expressions |
Each feature links to its Neuronpedia dashboard with auto-interp explanations and activation examples. SAE: GemmaScope layer 22, 262k features.
The features had clear Neuronpedia auto-interp descriptions and the clustering made intuitive sense. Features related to self-monitoring, emotional experience, and hedging language - exactly what you'd expect a model to activate when reporting something unusual about its own processing.
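The correlation-clustering step can be sketched like this, assuming candidate-feature activations are stacked as a trials × features matrix. Everything here is synthetic: the three latent "themes" are planted so the clustering has something to recover, and the shapes mirror our 11 candidates only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
n_trials = 80
themes = rng.normal(size=(n_trials, 3))        # three latent "themes"
# 11 candidate features, each a noisy copy of one theme
assignment = np.array([0]*4 + [1]*4 + [2]*3)
acts = themes[:, assignment] + 0.3 * rng.normal(size=(n_trials, 11))

corr = np.corrcoef(acts.T)                     # (11, 11) feature correlations
dist = 1.0 - corr                              # correlation distance
iu = np.triu_indices(11, k=1)                  # condensed form for linkage
Z = linkage(dist[iu], method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```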
We were excited. These looked like they could be the mechanism.
## The Causal Test
Correlation isn't causation, so we designed an intervention experiment. The logic is straightforward:
**Ablation (necessity test):** If these features cause introspection, zeroing them out during injection should suppress detection. Inject the concept vector as normal, but surgically remove all 11 features' contributions from the residual stream at layer 22.

**Activation (sufficiency test):** If these features are sufficient for introspection, clamping them to their Group A activation levels without injection should cause the model to hallucinate detecting something that isn't there.
We used residual-based patching rather than naive encode-decode to avoid introducing reconstruction error on non-target features. The intervention only modifies the last token position (to preserve the KV cache) and only touches the target features' directional contribution to the residual stream.
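A minimal sketch of that patching logic, assuming access to the SAE's encoder activations and decoder matrix for the last token position. The function name and variable names (`patch_features`, `w_dec`) are hypothetical, not our actual code; the point is that only the target features' decoder directions are edited, so the rest of the residual stream never passes through the SAE's lossy encode-decode.

```python
import numpy as np

def patch_features(resid, enc_acts, w_dec, targets):
    """Edit only the target features' directional contribution.

    resid:    (d_model,) residual stream at the last token position
    enc_acts: (n_features,) SAE encoder activations for that residual
    w_dec:    (n_features, d_model) SAE decoder directions
    targets:  {feature_id: new_value} -- 0.0 to ablate, Group-A mean to clamp
    """
    patched = resid.copy()
    for f, new_val in targets.items():
        # Swap this feature's contribution without re-encoding the stream,
        # avoiding reconstruction error on all non-target features.
        patched += (new_val - enc_acts[f]) * w_dec[f]
    return patched

# Tiny demo: a residual built from three known feature activations
rng = np.random.default_rng(1)
w_dec = rng.normal(size=(3, 4))
acts = np.array([1.5, 0.0, 2.0])
resid = acts @ w_dec
ablated = patch_features(resid, acts, w_dec, {0: 0.0})   # necessity test
clamped = patch_features(resid, acts, w_dec, {1: 3.0})   # sufficiency test
```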
We ran 160 trials: 80 ablation (inject at layer 20, strength 2.0, zero all 11 features) and 80 activation (no injection, clamp features to mean Group A values) across all four concepts. Then judged them with our calibrated LLM judge.
## Results
| Condition | Detection Rate | Baseline | Difference | p-value |
|---|---|---|---|---|
| Ablation (inject + zero features) | 75.0% | 70.0% | +5.0pp | 0.66 |
| Activation (no inject + clamp features) | 35.0% | 48.9% | -13.9pp | 0.024 |

Ablation baseline: detection rate with injection at layer 20, strength 2.0. Activation baseline: control false positive rate without injection. Fisher's exact test, Bonferroni-corrected.
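For the ablation arm, the significance test can be reproduced approximately from the reported rates. The 80-trial baseline n is an assumption (it matches 4 concepts × 20 trials at layer 20, strength 2.0); the counts below are reconstructed from the percentages, not raw data.

```python
from scipy.stats import fisher_exact

# Ablation: 75% of 80 = 60 detections vs. baseline 70% of 80 = 56
# (baseline n = 80 is an assumption reconstructed from the reported rate)
table = [[60, 80 - 60],
         [56, 80 - 56]]
odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"odds ratio {odds_ratio:.2f}, p = {p:.2f}")
```

With these counts the odds ratio is about 1.29 and the p-value is far from significance, consistent with the "ablation did nothing" reading.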
**Ablation did nothing.** We zeroed out all 11 features and the detection rate went up slightly (75% vs 70% baseline). The model doesn't need these features to detect injections.

**Activation didn't cause hallucinations.** Clamping the features to their "detection" values without any injection didn't make the model report detecting something. If anything, it slightly reduced the false positive rate compared to normal controls (35% vs 49%).
The features aren't necessary for introspection and they aren't sufficient to produce it. They're epiphenomenal - real correlates that activate as a consequence of detection, not a cause of it.
## What Went Wrong (Or, What I'm Still Figuring Out)
I want to be honest here: this is the third or fourth attempt I've made at finding SAE features that matter for introspection, and I'm starting to wonder whether the approach itself has fundamental limitations.
### Things that might be wrong with our approach
**The dataset may not be diverse enough.** Our sweep uses 4 concepts and a fixed monitoring prompt. The discrimination analysis finds features that differ between detection and non-detection within this narrow setup. Those features might be specific to "reporting unusual experiences while explaining how a bicycle works" rather than introspection in general.

**The SAE may not be wide enough.** We used a 262k-feature SAE at layer 22. GemmaScope offers wider dictionaries (up to 1M features) that could capture finer-grained distinctions. Maybe introspection decomposes into features that a 262k dictionary can't isolate.

**We only looked at one layer.** The SAE captures features at layer 22. Introspection might involve computations distributed across many layers - attention heads at layer 10 detecting anomalies, MLPs at layer 15 routing information, residual stream features at layer 25 formulating the response. A single-layer SAE can only see one snapshot.

**Introspection might not be monosemantic.** SAEs work by finding monosemantic directions in activation space. But "noticing something unusual about my own processing" might be an inherently distributed, polysemantic computation - an emergent property of many features interacting rather than a clean direction an SAE can isolate.
### What I'm less sure about
I'll be upfront: I'm still learning mechanistic interpretability, and I'm not confident I'm even asking the right questions. Maybe the right approach isn't "find discriminating features and ablate them" but something more like activation patching at the attention head level, or probing across layers, or using a different decomposition entirely. I've been mostly following the playbook from Anthropic's ablation methodology, but adapting it to open-weight models and pretrained SAEs introduces differences I might not fully understand.
If you have experience with causal interpretability and see something obvious I'm missing, I'd genuinely love to hear about it.
## What We Did Learn
The negative result isn't nothing. We can say with reasonable confidence:
- Correlating features exist - the discrimination analysis reproducibly finds features that differ between detection and non-detection trials
- Those features aren't causal - ablating all of them doesn't affect detection rates
- Feature activation doesn't produce hallucinated introspection - clamping features to "detection" levels without injection doesn't trick the model
This is consistent with the features being downstream readouts of introspection rather than the mechanism itself. The model detects the injection through some other computation, and these features activate as part of formulating the response.
## What's Next
The SAE feature approach gave us a clear negative result, so we're trying something different: Activation Oracles.
The idea is to train a language model to take neural network activations as input and answer natural language questions about them. Instead of decomposing the residual stream into individual features and testing them one by one, you hand the full activation pattern to an oracle and ask "is there a concept injection here? What concept?" The oracle learns to read activations holistically - exactly the kind of distributed, cross-feature pattern that SAEs might miss.
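As a much-simplified stand-in for the oracle idea, even a linear probe on raw activations illustrates the "read the activations directly" framing. Everything below is synthetic: a real oracle would be a language model conditioned on activation inputs, not a logistic regression, and the injection here is just a fixed direction added at strength 2.0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64
concept_vec = rng.normal(size=d_model)
concept_vec /= np.linalg.norm(concept_vec)

# Synthetic residual-stream activations, with and without an injection
clean = rng.normal(size=(200, d_model))
injected = rng.normal(size=(200, d_model)) + 2.0 * concept_vec

X = np.vstack([clean, injected])
y = np.array([0] * 200 + [1] * 200)

# The probe answers one fixed question: "was a concept injected here?"
probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)
print(f"train accuracy: {acc:.2f}")
```

The interesting comparison is then probe (or oracle) accuracy versus the model's own self-report rate on the same trials: any gap is exactly the "knows internally but can't articulate" question.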
What excites me about this approach is the question it lets us ask: can an activation oracle detect injected concepts more reliably than the model's own self-reporting? Our current setup depends on the model noticing something unusual and telling us about it. An oracle could potentially detect injections that the model itself misses - and if it can, that opens up interesting questions about the gap between what a model "knows" internally and what it can articulate.
We've started running early experiments with this and I'll write up the results as they come in.
All code and data are available at github.com/ostegm/open-introspection. If you've worked on causal interpretability, activation oracles, or have thoughts on other approaches we should try, I'd love to hear from you.