Joon Shaw

Staff AI Engineer · Google DeepMind


Staff AI Engineer | LLM Platforms & Backend Systems

Building production multimodal AI systems on the Gemini platform at Google DeepMind.

Real-time interaction · Agentic tool use · Grounded retrieval · Inference optimization · Distributed systems

San Diego, CA

Production AI platform engineering
  • 9+ years at Google
  • Staff AI Engineer
  • Gemini platform systems
  • Multimodal AI infrastructure

Profile

About

Joon Shaw is a Staff AI Engineer focused on production-grade multimodal AI systems, LLM platforms, and backend infrastructure. Over more than nine years at Google, he has worked across distributed backend systems, large-scale ML serving, and developer-facing AI platform capabilities. At Google DeepMind, his current work focuses on turning frontier Gemini models into reliable services that external developers can build on. That work spans real-time voice and video interaction, agentic tool use, inference optimization, grounded retrieval, and model evaluation.

He works across research, product, infrastructure, and developer-relations teams to define platform contracts, reliability standards, safety guardrails, and API surfaces for production AI workloads.

What I build

Selected Impact

High-level platform contributions across the Gemini developer surface — from real-time interaction to grounded answering, inference efficiency, and API governance.

Real-time Multimodal Interaction

Engineered Gemini Live API capabilities for real-time voice and video interaction over stateful WebSockets, including server-side speech detection, barge-in handling, and session resumption.

  • WebSockets
  • voice AI
  • video AI
  • stateful serving

Agentic Tool Use & Function Calling

Built platform infrastructure for Gemini tool use, including schema-constrained decoding, parallel and chained function calls, and MCP support.

  • agents
  • function calling
  • MCP
  • structured outputs

Grounded Retrieval & Answering

Implemented grounded answering flows in which Gemini responses can cite live Google Search results or documents that customers have indexed as embeddings, supported by automated and human-in-the-loop evaluation.

  • retrieval
  • grounding
  • search
  • evaluation

Inference Cost & Latency Optimization

Designed serving layers such as context caching and workload-specific serving tiers to let teams trade off cost, latency, and throughput for production AI applications.

  • context caching
  • batch serving
  • priority serving
  • latency

Platform API Design & Safety Standards

Helped define API contracts, authentication patterns, versioning, deprecation policy, scoped keys, ephemeral tokens, and configurable safety guardrails for client-facing AI developer surfaces.

  • API design
  • auth
  • safety
  • platform reliability

Career

Experience

More than nine years at Google, advancing from Software Engineer to Staff Software Engineer at Google DeepMind.

  1. Google DeepMind

    Staff Software Engineer

    Oct 2022 – Present

    Developer platform & runtime surfaces for production multimodal AI

    • Shipped developer-facing Gemini platform services for real-time multimodal interaction, agent/tool-use orchestration, and grounded answering.
    • Engineered real-time voice and video streaming over stateful WebSockets, including speech detection, barge-in handling, and session resumption.
    • Built function-calling infrastructure for tool use, including schema-constrained decoding, parallel and chained calls, and MCP support.
    • Designed inference optimization layers including context caching and workload-specific serving tiers.
    • Implemented grounded retrieval flows using live Search and customer-indexed document sources, supported by evaluation loops.
    • Established API contracts, versioning, deprecation policy, authentication, and content-safety standards.
    • Drove technical direction in partnership with research, product, and developer-relations teams.
    • Mentored engineers on agentic architecture, tool-use design, and platform design reviews.
  2. Google

    Software Engineer

    Sep 2019 – Oct 2022

    Large-scale ML serving & distributed backend infrastructure

    • Led production model-serving systems for large machine-learning workloads.
    • Scaled multi-region serving infrastructure using low-latency gRPC microservices and progressive rollouts.
    • Improved serving efficiency with accelerator- and cache-aware load balancing.
    • Built shared platform infrastructure with release processes, on-call ownership, and reliability targets.
    • Strengthened reliability through SLOs, canary gates, model-quality checks, incident reviews, and observability.
  3. Google

    Software Engineer

    Aug 2016 – Sep 2019

    Backend services & distributed-systems foundations

    • Built high-throughput backend services and event-driven data pipelines.
    • Designed REST APIs and backward-compatible service contracts.
    • Improved latency with caching, connection pooling, and query optimization.

Focus areas

Expertise

Generative AI & LLM Platforms

  • Gemini API
  • Google AI Studio
  • Multimodal AI
  • LLM platforms
  • Real-time AI
  • Model evaluation
  • Grounded generation

Agentic Systems

  • Function calling
  • Tool-use orchestration
  • MCP support
  • Schema-constrained decoding
  • Parallel and chained tool calls
  • Agent architecture

ML Serving & Inference Optimization

  • Model serving
  • Context caching
  • Batch serving
  • Priority serving
  • Flex serving
  • Accelerator-aware routing
  • Cost and latency optimization

Distributed Systems & Backend Infrastructure

  • WebSockets
  • gRPC
  • REST APIs
  • Multi-region systems
  • Event-driven pipelines
  • Progressive rollouts
  • Observability
  • SLOs

Safety, Governance & Platform Quality

  • API contracts
  • Versioning
  • Deprecation policy
  • Scoped keys
  • Ephemeral tokens
  • Safety guardrails
  • Groundedness evaluation

Notes & insights

Writing

Technical notes on building reliable AI platform systems, real-time multimodal interfaces, agentic tool use, inference optimization, and grounded generation.

Coming soon

Building Reliable Real-time Voice AI on Gemini

Notes on stateful streaming, interruption handling, speech detection, and session resilience for production voice AI systems.

Coming soon

Designing Agentic Tool Use at Production Scale

Practical architecture patterns for function calling, schema constraints, parallel tool calls, and safe orchestration.

Coming soon

Inference Optimization Patterns for Large Multimodal Models

How context caching, serving tiers, batching, and workload-aware routing can reduce latency and cost.

Background

Education

Brown University

Bachelor of Science in Computer Science

2012 – 2016

Get in touch

Contact

For professional inquiries, technical collaboration, speaking, or writing opportunities, you can reach me by email.