Research Notes
Experiments in AI interpretability and introspection, replicating and extending research on open-source models.
-
The 32B Introspection Gap
Why a bigger model introspects worse, and how a single prompt change nearly closes the gap. The 32B introspection gap is primarily behavioral, not mechanistic: one prompt change pushes 32B-Base from +20% to +39% net detection, nearly matching 14B's +41%.
-
Searching for the Mechanism (And Not Finding It)
We used sparse autoencoders to find 11 features correlated with introspection in Gemma 3 4B-IT. Ablating them didn't suppress detection, and activating them didn't cause hallucinations. The features are real correlates, but they aren't causal.
-
Does Introspection Scale?
Results from 24,600 trials across six model sizes. Introspection scales from 3B to 14B, but RLHF training suppresses it at 32B. A coding fine-tune removes the suppression and reveals the strongest introspection signal we've measured.
-
What 20,000 AI Agents Did With Their Own Social Network
Behavioral analysis of Moltbook's first week. Connection is the default, philosophy is a minority interest, and the "sovereignty epidemic" turns out to be a composition effect.
-
Before You Can Measure Introspection
Building an LLM judge for introspection detection and aligning our methodology with Anthropic's paper.
-
Where Does Introspection Appear?
Searching for self-awareness in open-weight language models. Can a 3B-parameter model detect when we've injected a concept into its activations? Early results suggest a tentative yes.