Staff AI Engineer · Google DeepMind

Joon Shaw

Staff AI Engineer | LLM Platforms & Backend Systems

Building production multimodal AI systems on the Gemini platform at Google DeepMind.

Real-time interaction · Agentic tool use · Grounded retrieval · Inference optimization · Distributed systems

San Diego, CA

Photo: Joon Shaw at his desk, reviewing an AI architecture diagram and code on a monitor. Caption: Production AI platform engineering.

9+ years at Google
Staff AI Engineer
Gemini platform systems
Multimodal AI infrastructure

Profile

About

Joon Shaw is a Staff AI Engineer focused on production-grade multimodal AI systems, LLM platforms, and backend infrastructure. In more than nine years at Google, he has worked on distributed backend systems, large-scale ML serving, and developer-facing AI platform capabilities. His current work at Google DeepMind focuses on turning frontier Gemini models into reliable services that external developers can build on, including real-time voice and video interaction, agentic tool use, inference optimization, grounded retrieval, and model evaluation.

He works across research, product, infrastructure, and developer-relations teams to define platform contracts, reliability standards, safety guardrails, and API surfaces for production AI workloads.

What I build

Selected Impact

High-level platform contributions across the Gemini developer surface, from real-time interaction to grounded answering, inference efficiency, and API governance.

Real-time Multimodal Interaction

Engineered Gemini Live API capabilities for real-time voice and video interaction over stateful WebSockets, including server-side speech detection, barge-in handling, and session resumption.

WebSockets
voice AI
video AI
stateful serving

Agentic Tool Use & Function Calling

Built platform infrastructure for Gemini tool use, including schema-constrained decoding, parallel and chained function calls, and MCP support.

agents
function calling
MCP
structured outputs

Grounded Retrieval & Answering

Implemented grounded answering flows in which Gemini responses can cite live Google Search results or documents that customers have indexed and embedded, supported by automated and human-in-the-loop evaluation.

retrieval
grounding
search
evaluation

Inference Cost & Latency Optimization

Designed serving layers such as context caching and workload-specific serving tiers to let teams trade off cost, latency, and throughput for production AI applications.

context caching
batch serving
priority serving
latency

Platform API Design & Safety Standards

Helped define API contracts, authentication patterns, versioning, deprecation policy, scoped keys, ephemeral tokens, and configurable safety guardrails for client-facing AI developer surfaces.

API design
auth
safety
platform reliability

Career

Experience

More than nine years at Google, advancing from Software Engineer to Staff Software Engineer at Google DeepMind.

Google DeepMind
Staff Software Engineer
Oct 2022 – Present
Developer platform & runtime surfaces for production multimodal AI
- Shipped developer-facing Gemini platform services for real-time multimodal interaction, agent/tool-use orchestration, and grounded answering.
- Engineered real-time voice and video streaming over stateful WebSockets, including speech detection, barge-in handling, and session resumption.
- Built function-calling infrastructure for tool use, including schema-constrained decoding, parallel and chained calls, and MCP support.
- Designed inference optimization layers including context caching and workload-specific serving tiers.
- Implemented grounded retrieval flows using live Search and customer-indexed document sources, supported by evaluation loops.
- Established API contracts, versioning, deprecation policy, authentication, and content-safety standards.
- Drove technical direction across research, product, and developer-relations partners.
- Mentored engineers on agentic architecture, tool-use design, and platform design reviews.

Google
Software Engineer
Sep 2019 – Oct 2022
Large-scale ML serving & distributed backend infrastructure
- Led production model-serving systems for large machine-learning workloads.
- Scaled multi-region serving infrastructure using low-latency gRPC microservices and progressive rollouts.
- Improved serving efficiency with accelerator- and cache-aware load balancing.
- Built shared platform infrastructure with release processes, on-call ownership, and reliability targets.
- Strengthened reliability through SLOs, canary gates, model-quality checks, incident reviews, and observability.

Google
Software Engineer
Aug 2016 – Sep 2019
Backend services & distributed-systems foundations
- Built high-throughput backend services and event-driven data pipelines.
- Designed REST APIs and backward-compatible service contracts.
- Improved latency with caching, connection pooling, and query optimization.

Focus areas

Expertise

Generative AI & LLM Platforms

Gemini API
Google AI Studio
Multimodal AI
LLM platforms
Real-time AI
Model evaluation
Grounded generation

Agentic Systems

Function calling
Tool-use orchestration
MCP support
Schema-constrained decoding
Parallel and chained tool calls
Agent architecture

ML Serving & Inference Optimization

Model serving
Context caching
Batch serving
Priority serving
Flex serving
Accelerator-aware routing
Cost and latency optimization

Distributed Systems & Backend Infrastructure

WebSockets
gRPC
REST APIs
Multi-region systems
Event-driven pipelines
Progressive rollouts
Observability
SLOs

Safety, Governance & Platform Quality

API contracts
Versioning
Deprecation policy
Scoped keys
Ephemeral tokens
Safety guardrails
Groundedness evaluation

Notes & insights

Writing

Technical notes on building reliable AI platform systems, real-time multimodal interfaces, agentic tool use, inference optimization, and grounded generation.

Coming soon

Building Reliable Real-time Voice AI on Gemini

Notes on stateful streaming, interruption handling, speech detection, and session resilience for production voice AI systems.

Coming soon

Designing Agentic Tool Use at Production Scale

Practical architecture patterns for function calling, schema constraints, parallel tools, and safe orchestration.

Coming soon

Inference Optimization Patterns for Large Multimodal Models

How context caching, serving tiers, batching, and workload-aware routing can reduce latency and cost.

Background

Education

Brown University

Bachelor of Science in Computer Science

2012 – 2016

Get in touch

Contact

For professional inquiries, technical collaboration, speaking, or writing opportunities, you can reach me by email.

Email: joon.shaw1207@zohomail.com
Phone: (442) 369-2134
Location: San Diego, CA
