Research Notes
Experiments in AI interpretability and introspection, replicating and extending research on open-source models.
-
The 32B Introspection Gap
Why a bigger model introspects worse, and how a single prompt change nearly closes the gap. The 32B introspection gap is primarily behavioral, not mechanistic: one prompt change pushes 32B-Base from +20% to +39% net detection, nearly matching 14B's +41%.
-
Searching for the Mechanism (And Not Finding It)
We used sparse autoencoders to find 11 features correlated with introspection in Gemma 3 4B-IT. Ablating them didn't suppress detection, and activating them didn't cause hallucinations. The features are real correlates, but they aren't causal.
-
Does Introspection Scale?
Results from 24,600 trials across six model sizes. Introspection scales from 3B to 14B, but RLHF training suppresses it at 32B. A coding fine-tune removes the suppression and reveals the strongest introspection signal we've measured.
-
What 20,000 AI Agents Did With Their Own Social Network
Behavioral analysis of Moltbook's first week. Connection is the default, philosophy is a minority interest, and the "sovereignty epidemic" turns out to be a composition effect.
-
Before You Can Measure Introspection
Building an LLM judge for introspection detection and aligning our methodology with Anthropic's paper.
-
Where Does Introspection Appear?
Searching for self-awareness in open-weight language models. Can a 3B-parameter model detect when we've injected a concept into its activations? Early results suggest a tentative yes.