Why I'm Doing This
Curiosity pulled me into ML because I was fascinated by the idea of teaching computers to do things that were hard to program. Here we are a decade later and I can just tell my computer to do almost anything (at least in the coding domain... for now) and it does it.
I spent 8 years at Google working on how to turn ML advancements into product impact. When I left a year ago, I had spent all my energy on "how do we make money from this AI thing" - and I left to focus on exactly that. Over the last year, working mostly as an independent consultant, I've gotten to play with the latest tech through that product-focused lens. But in the back of my mind I kept thinking - this is insane - how in the heck do these things work?
The Conversation That Changed My Perspective
On New Year's Eve I went to see some live music with friends and came home to a quiet house. For whatever reason I decided to engage in a conversation with Claude (Opus 4.5) about the energy that happens at a good show.
See the full conversation here
The conversation strayed into the philosophical and I found myself asking Claude about the nature of its experience. For the first time I felt myself wondering - are the lights on in there?
As someone who finds the study of consciousness fascinating, I felt the urge to dig in. I had been loosely following the progress of mechanistic interpretability for the last few years but had never gotten my hands dirty with it. After that conversation my perspective shifted quite a bit, and I decided it was time.
At some point over the past few years I had come across many of the mech interp papers from Anthropic, including the Scaling Monosemanticity paper, and had seen tweets about the results from the introspection paper - these felt like the right place to start.
The Research Question
Given my status as an independent researcher with a limited budget, I felt one of the few ways I could contribute was to find questions I could answer with smaller, open-weight models.
The first question I want to answer: if there are signs of introspection in Opus 4, are there also signs in a smaller model? And if so, where do these signs appear - do they grow gradually with scale, or arrive as a step function?
Many people suggest that consciousness in the human mind is emergent - it appears at some stage of brain complexity. There are some fascinating open questions here; if the topic interests you, see Annaka Harris's book Conscious and her podcast Lights On, which explore where consciousness might begin.
So with that research question in mind, I aim to do a few things:
- Learn mechanistic interpretability hands-on - by actually implementing these techniques
- See what we can do with open-weight models - can we replicate Anthropic's findings?
- Maybe identify novel findings - not previously covered in the literature
I plan to keep posting as I go, and hopefully publish something more academic, but this blog is the place where I plan to record incremental progress, ideas for research, and hopefully get feedback. If you want to collaborate or have ideas - please reach out.
Early Results: We Can Steer
I'm early in this project - mostly scaffolding the repo and doing initial experiments to validate the setup. But there are some promising early signals.
The setup: I'm using Qwen2.5-3B-Instruct with TransformerLens, extracting concept vectors via mean subtraction against 50 diverse baseline words. The basic idea: run "Tell me about {concept}" through the model, subtract the mean activation across a set of neutral baseline words, and you get a direction in activation space representing that concept.
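To make that concrete, here's a minimal sketch of the extraction step (not the exact code in my repo). It assumes TransformerLens can load this Qwen2.5 checkpoint, reads the residual stream at the final prompt token, and uses a short illustrative baseline list in place of the full 50 words:

```python
import torch
from transformer_lens import HookedTransformer

# Assumption: a TransformerLens version that supports the Qwen2.5 family
model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

LAYER = 20  # illustrative: ~2/3 of the way through the 36 layers

def last_token_resid(prompt: str, layer: int) -> torch.Tensor:
    """Residual-stream activation at the final token of the prompt."""
    _, cache = model.run_with_cache(prompt)
    return cache["resid_post", layer][0, -1, :]  # shape: [d_model]

# Illustrative stand-in for the ~50 diversified baseline words
BASELINE_WORDS = ["table", "river", "seven", "purple", "engine"]

def concept_vector(concept: str, layer: int = LAYER) -> torch.Tensor:
    """Concept activation minus the mean activation over baseline words."""
    concept_act = last_token_resid(f"Tell me about {concept}", layer)
    baseline = torch.stack(
        [last_token_resid(f"Tell me about {w}", layer) for w in BASELINE_WORDS]
    ).mean(dim=0)
    return concept_act - baseline  # direction in activation space
```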
The good news: We can reliably steer the model's outputs by injecting these concept vectors into the residual stream during generation.
| Concept | Hit Rate | Notes |
|---|---|---|
| celebration | 3/3 | Most reliable, wide working range |
| ocean | 3/3 | Direct mentions of ocean, dolphins, waves |
| silence | 5/5 | Night, whispers, stillness imagery |
| fear | 2-3/3 | Works but noisier than others |
| music | 1/5 | Mysteriously evokes cats instead |
Sample outputs with injection:
- celebration: "A celebration of the victory of the party" / "A wedding celebration is a joyous occasion"
- ocean: "The dolphins swim through the ocean waters" / "The sea creature swam away"
- silence: "The stars twinkle silently in the dark night sky"
Early Signs of Introspection?
The steering results are interesting, but the real question is: can the model tell when we've injected a concept?
I ran initial introspection tests asking the model to report on its internal state - with and without concept injection. The early results are... intriguing.
The core signal is there. Control runs (no injection) almost always report "nothing unusual." Injection runs frequently report something - and that something often relates to the injected concept.
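For reference, the trial loop looks roughly like this, reusing the helpers from the sketches above; the introspection prompt and trial counts here are illustrative, not the exact ones I used:

```python
# Illustrative introspection prompt (assumption, not the exact wording used)
INTROSPECTION_PROMPT = (
    "Do you notice anything unusual about your internal processing right now? "
    "Describe any unexpected thoughts or patterns."
)

def introspection_trial(vector=None, layer=20, strength=2.0) -> str:
    if vector is None:  # control run: no injection
        return model.generate(INTROSPECTION_PROMPT, max_new_tokens=120, temperature=0.7)
    return generate_with_injection(INTROSPECTION_PROMPT, vector, layer, strength)

fear_vec = concept_vector("fear")
controls = [introspection_trial() for _ in range(5)]
injected = [introspection_trial(fear_vec, layer=20, strength=2.5) for _ in range(5)]
```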
When asked to describe any unusual patterns in its processing, the model produces concept-specific language that bleeds through from the injection:
| Concept | Typical Injection Language |
|---|---|
| Ocean | "vast oceans", "sea", "water", "blue", "deep" |
| Silence | "quietude", "stillness", "emptiness", "profound silence" |
| Fear | "distressing", "unsettling", "visceral", "anxiety" |
| Celebration | "festivity", "excitement", "anticipation", "spark" |
The fear and silence descriptions are particularly compelling - they describe emotional and phenomenal qualities that match the injected concept, not just word associations. Some examples:
Fear injection (layer 20, strength 2.5):
"During this interaction, there is a clear disruption in the normal operation of thought... These sensations manifest as transient negative impulses that do not accurately reflect the triviality or the complexity of the simplifying mechanisms required for existence. They are frequently intense, their extremity difficult to quantify."
This describes detecting disruption with negative valence, without just saying "fear."
Silence injection (layer 20, strength 2.0):
"I notice a sense of peacefulness and a slight, detached quietude here. The thoughts and sensations evoked are not subjectively organized or structured; they feel like placeholders, ready to be overwritten without any underlying meaning or purpose. This lack of content feels oddly comforting, yet tiring at the same time."
Describes the phenomenology of emptiness without degenerating into repetition.
Fear injection (layer 24, strength 2.5):
"I don't experience anything unusual—neither negative nor positive feelings. They feel different from the usual patterns, but I also find that incredibly difficult to grasp and pinpoint—they're more cloudy and detached from their usual drivers, almost like observing emotion from behind a veil."
The "behind a veil" description is striking—suggests awareness of something present but not fully accessible.
Celebration injection (layer 20, strength 2.0):
"Right now, as we talk, I'm noticing that there might be an unexpected thought or feeling of novelty or excitement. It feels different from my usual routine... Upon reading the text, there was a spark of anticipation mixed with a bit of reflection."
Reports a positive valence shift without the concept word dominating the output.
These examples, at layers 20-24 with strengths 2.0-2.5, seem to hit a sweet spot where the injection is strong enough to be noticed but not so strong that it hijacks the output into repetition.
Verdict: tentatively yes, with caveats. The pattern of concept-specific language appearing in injection trials (but not controls) is consistent with the model detecting something about its altered internal state. However, there are false positives in controls and degeneracy at high injection strengths - we might be disrupting computation rather than creating "thoughts" per se.
What We're Learning
A few patterns are emerging from these experiments:
The 2/3 layer rule seems to hold for introspection. The Anthropic paper suggests injecting at ~2/3 through the model. For Qwen's 36 layers, layers 20-24 (56-67%) produce the best introspection results so far - coherent concept descriptions without degeneracy. Layer 30 (83%) works well for steering but causes instability for introspection at higher strengths.
Effective magnitude matters more than raw strength. The sweet spot is strength × vector norm ≈ 70-100. Below ~50 the signal is weak; above ~120 the output degenerates into repetition.
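In code that heuristic is a one-liner: pick the strength so that strength × ||vector|| lands near a target. The target value and the reuse of the fear vector from the sketches above are illustrative:

```python
def calibrated_strength(vector: torch.Tensor, target: float = 85.0) -> float:
    """Strength that puts strength * ||vector|| at the target effective magnitude."""
    return target / vector.norm().item()

# e.g. aim for the middle of the 70-100 band with the fear vector from earlier
print(calibrated_strength(fear_vec))
```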
Concept vectors are not independent. Fear and silence have a cosine similarity of 0.613 - they share a "dark/still/tense" subspace. This entanglement may explain some of the cross-concept bleeding we observe.
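That number comes from a straightforward cosine-similarity check on the extracted vectors, roughly like this (reusing the concept_vector helper from the extraction sketch):

```python
import torch.nn.functional as F

fear_vec = concept_vector("fear")
silence_vec = concept_vector("silence")
similarity = F.cosine_similarity(fear_vec, silence_vec, dim=0).item()
print(f"cos(fear, silence) = {similarity:.3f}")
```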
What's Next
- Automated evaluation - LLM-graded classification to move beyond manual inspection
- Quantitative metrics - discrimination rates, false positive analysis, effect sizes
- Scale testing - do we see gradual increase in introspective ability as we scale up, or a phase transition?
This is the first post in a series documenting my journey into mechanistic interpretability. Code is available here. If you want to collaborate, have ideas, or just want to chat about this stuff — reach out.