Otto Stegmaier

Software Engineer & Applied Researcher

Staff engineer with 8 years at Google bridging research and production ML—shipping systems to hundreds of millions of users while designing evaluation frameworks for model behavior. Comfortable running experiments at the edge of what's possible—and focused on making them reliable. Now conducting applied AI work and independent research.

Experience

Cloudripper Labs

Independent AI Consultant
2025 - Present · Salt Lake City

Production AI agents deployed across industries. Independent safety research.

  • Closed-loop agent evaluation: Built evaluation environment for autonomous route-planning agent where the system iteratively improved its own prompts and tooling against simulation-based test suites. Probed edge cases to identify failure modes.
  • Human-in-the-loop oversight system: Semi-autonomous trading system using Claude Agents SDK with Modal sandboxes. Slack integration enables async multi-player oversight—designed for safe operation under uncertainty.
  • Mechanistic interpretability research: Investigating introspection in open-weight models using TransformerLens—see Publications below for details.
  • Enterprise AI deployment: Deployed customer-facing agents augmenting existing SaaS apps. Trained engineering teams on agent-supervised workflows with appropriate guardrails.
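
The closed-loop evaluation pattern in the first bullet can be sketched as a simple hill-climb: evaluate a prompt against the test suite, let a reviser propose a change, and keep the change only if it scores higher. All names here are hypothetical illustrations; the production system used an LLM-driven reviser and a simulation-based scorer.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalResult:
    score: float                                   # fraction of simulated cases handled correctly
    failures: list = field(default_factory=list)   # failing cases, fed back to the reviser

def closed_loop_improve(
    prompt: str,
    evaluate: Callable[[str], EvalResult],         # runs the agent against the test suite
    revise: Callable[[str, EvalResult], str],      # proposes a new prompt from the failures
    rounds: int = 5,
) -> tuple[str, float]:
    """Iteratively revise `prompt`, keeping a candidate only if it scores higher."""
    best = evaluate(prompt)
    for _ in range(rounds):
        candidate = revise(prompt, best)
        result = evaluate(candidate)
        if result.score > best.score:              # hill-climb: accept strict improvements only
            prompt, best = candidate, result
    return prompt, best.score
```

The strict-improvement check is what makes the loop safe to leave unattended: a reviser that degrades the prompt never advances the baseline.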

Google

Eight years bridging research and production ML—from early embeddings and BERT inference to T5-based query rewriting and tool-calling voice agents. Built evaluation frameworks and shipped experiments across Legal, Cloud, YouTube, and Conversational AI.

Conversational Agents & Food AI

Built a voice ordering agent before modern multimodal LLMs existed—scaled from one pilot restaurant to hundreds across major fast food chains. Pioneered tool-calling patterns that predated structured output APIs.

  • Guided migration to native Gemini 2.0 tool use—validating patterns we'd built years earlier before structured outputs existed.
  • Created Gemini-based pipeline for cleaning audio training data to improve ASR quality.
  • "Disfluency injection" system to mask LLM latency—injected natural hesitations during inference, trimmed silence during agent speech.
  • Developed human eval system for accent conversion models with bias reduction analysis.
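
The disfluency-injection idea above can be sketched with stdlib asyncio: while the LLM reply is in flight, voice a natural hesitation each time a latency budget elapses. The filler phrases, threshold, and function names are hypothetical; the silence-trimming half of the production system is not shown.

```python
import asyncio
import random

FILLERS = ["um,", "let's see,", "one moment,"]  # hypothetical filler set

async def speak_with_fillers(llm_task: asyncio.Task, threshold: float = 0.6, tts=print):
    """Voice a hesitation whenever the LLM reply is more than `threshold`
    seconds away, then speak the reply itself once it arrives."""
    while True:
        try:
            # shield() keeps the timeout from cancelling the in-flight LLM call
            reply = await asyncio.wait_for(asyncio.shield(llm_task), threshold)
        except asyncio.TimeoutError:
            tts(random.choice(FILLERS))          # mask the wait with a disfluency
        else:
            tts(reply)
            return reply
```

Because the task is shielded, each timeout only interrupts the wait, never the generation—the caller hears fillers until the real answer lands.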

Google Research — YouTube Voice Search

Bridged Google Research and YouTube product teams. Shipped 8 launches using T5 transformers for voice query rewriting—+0.75% Voice Engaged Watchers (30% of YouTube's annual goal).

  • Fine-tuned LLM for phonetic query corrections using contextual signals and personalization.
  • In-memory ranking model for candidate selection. Inference latency <3ms.
  • First personalized ASR rewriter on YouTube—85% click rate on corrections, scaled to 450M queries/day.
  • Coordinated planning for 5 engineers across 3 reporting chains.

Area120 & Google Cloud

Early BERT adopter—built training and serving infrastructure for conversational AI that won major telco contract. Solved low-latency inference challenges before optimized transformer runtimes existed.

  • Managed 4 engineers building automated hyperparameter tuning system.
  • BERT serving at <20ms p99 latency with TPU support.
  • Multi-headed model architecture reduced TPU cost 10x.

Google Legal — Data Scientist

Early adopter of embeddings for legal tech. Fine-tuned word2vec models on patent text before transformer-based embeddings existed.

  • ML model for patent claim breadth prediction—replaced vendor, open-sourced the approach. Served 20M+ documents.
  • Patent similarity search via custom-trained embeddings + BigQuery. 60% reduction in search effort.
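
The similarity search in the second bullet reduces to averaging word vectors into a document embedding and comparing by cosine similarity. This is a toy sketch with made-up vocabulary; the production system used word2vec vectors fine-tuned on patent text and ran the comparison at corpus scale in BigQuery.

```python
import numpy as np

def doc_vector(tokens, word_vectors):
    """Average the word2vec vectors of in-vocabulary tokens into a single
    document embedding (out-of-vocabulary tokens are skipped)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    """Score in [-1, 1]; higher means the two patents are more related."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```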

A/B testing and churn prediction for health tracking features.

  • A/B test design and analytics for new dashboard and goal-setting features.
  • Churn prediction model balancing interpretability with calibration. Ran intervention experiments.

First data scientist at the company. Built tooling to understand ski lift ticket shopper behavior using collaborative filtering and early ML techniques.

  • Established and managed 4-person analytics team covering pricing optimization, product analytics, and forecasting.
  • Architected first data warehouse (Redshift) with daily Python ETL pipelines.

United States Marine Corps

Commanded teams under uncertainty in high-stakes environments. Combat deployment to Afghanistan. Mission-driven work with a bias for action.

  • Developed 11-month training program for 165 marines deploying to Afghanistan. Every marine came home.
  • Directed combat operations team for casualty evacuations and close air support—decisions under pressure with incomplete information.
  • Managed teams of 5-45 across training and two deployments.

Publications & Open Source

Open Introspection 2026

Mechanistic interpretability research investigating whether smaller open-weight models exhibit introspection. Using TransformerLens to extract and inject concept vectors into Qwen2.5-3B—early results show models can detect artificially introduced internal states.
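
The extract-and-inject step can be sketched as follows. This is a numpy illustration of the math only: the actual experiments run TransformerLens hooks on Qwen2.5-3B (e.g. registering the hook at a residual-stream hook point via `model.run_with_hooks`), and the layer choice and steering scale `alpha` here are assumptions, not the paper's values.

```python
import numpy as np

def concept_vector(concept_acts: np.ndarray, baseline_acts: np.ndarray) -> np.ndarray:
    """Mean-difference direction: each input is [n_prompts, d_model]
    residual-stream activations captured at one chosen layer."""
    return concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def make_injection_hook(vec: np.ndarray, alpha: float = 4.0):
    """Build a TransformerLens-style hook: adds alpha * vec at every
    position of a [batch, pos, d_model] activation, steering the model
    toward the concept so introspection probes can try to detect it."""
    def hook(activation, hook_point=None):
        return activation + alpha * vec
    return hook
```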

Blog · GitHub

Contextual Mondegreen 2022

Early production LLM application—rewrote international voice queries (voice → ASR → LLM rewrite) to correct phonetic errors across languages. Personalized precision for different user populations.

BayLearn 2022

Patent Claims Breadth Model 2018

Early embeddings application—fine-tuned word2vec model on patent text to predict claim breadth. First open-source model in this space. Widely cited, many follow-on projects.

GitHub

Education

Harvard University

B.A. Economics · Spanish · Varsity Heavyweight Rowing

Fast.ai

Practical Deep Learning for Coders (2017)

Projects

Level App — AI for Construction

Full-stack AI app for renovation estimates via multimodal input (images, voice, video, PDFs). Novel video processing pipeline. Patterns now used in consulting work.

Interests

Cooking · Calisthenics · Backcountry Skiing · Hiking · Bread Baking · Fiction