Offline · On-device · Private by design
An offline AI companion, built for the ones you love.
Meet Gemma 3n Voice Companion — a private, real-time voice assistant that runs entirely on a $499 NVIDIA Jetson. No cloud, no data collection, no internet required. Warm, patient, friendly conversation that stays on the device — whether it's helping a kid with homework, keeping Grandma company, or riding shotgun on a camping trip where the signal quits.
- Gemma 3n 4B
- Jetson Orin NX
- Ollama
- Whisper STT
- Piper TTS
- Silero VAD
# Gemma 3n Voice Assistant Pipeline
class GemmaVoice:
    def __init__(self):
        self.stt = WhisperSTT("small")
        self.llm = Ollama("gemma3n:4b")
        self.tts = PiperTTS("amy")

    async def respond(self, audio):
        # Real-time pipeline — TTS begins mid-stream
        text = await self.stt.transcribe(audio)
        response = await self.llm.stream(text)
        await self.tts.speak(response)
# Jetson Orin NX 16GB · 12 tok/s · 2.3GB GPU · 100% offline

Watch it in action
3-minute demo running end-to-end on the Jetson
Project writeup
The full story, on Kaggle
This project was my entry to the Google Gemma 3n Impact Challenge, and it ended up placing 1st. The Kaggle writeup walks through every design decision — the hardware, the quantization, the real-time streaming pipeline, and the safety model for kids.
Why this matters
Four principles behind the build — for kids, grandparents, and anyone who'd rather keep their conversations to themselves.
Privacy by Design
Nothing leaves the device. Safe for therapy sessions, classrooms, family kitchens, and anywhere privacy isn't optional.
Edge AI Innovation
A 4B-parameter LLM quantized to 4-bit and streamed live — all on $499 of hardware sipping just 15 watts.
Friendly by Default
Warm, patient replies. Short sentences. Easy to understand whether you're 7 or 77 — no tech vocabulary required.
Real-Time Pipeline
Streaming STT → LLM → TTS. Speech starts before generation finishes. It feels like a conversation, not a query.
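The mechanic behind "speech starts before generation finishes" is sentence-boundary chunking: flush each completed sentence to TTS while the model is still generating. Here is a minimal sketch of that idea, with a stand-in token generator in place of the real Ollama stream; the function names and the sentence-splitting regex are illustrative, not the project's actual code.

```python
import asyncio
import re

SENTENCE_END = re.compile(r"[.!?]\s")

async def fake_llm_stream():
    """Stand-in for the real Ollama token stream (hypothetical)."""
    for token in ["Sure! ", "Photosynthesis ", "turns ", "sunlight ",
                  "into ", "food. ", "Plants ", "do ", "it ", "daily."]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield token

async def speak_while_generating(token_stream, speak):
    """Flush each completed sentence to TTS before generation finishes."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        match = SENTENCE_END.search(buffer)
        if match:
            # A sentence just completed: hand it to TTS immediately
            speak(buffer[:match.end()].strip())
            buffer = buffer[match.end():]
    if buffer.strip():
        speak(buffer.strip())  # flush whatever trails the last boundary

spoken = []
asyncio.run(speak_while_generating(fake_llm_stream(), spoken.append))
```

With this shape, the first short sentence ("Sure!") reaches the speaker while the rest of the reply is still being generated, which is what makes the exchange feel conversational rather than query-and-wait.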
About the builder
A dad, a soldering iron, and a stubborn idea.
I'm Stephen Murphy — engineer, dad, and privacy advocate. I built this because the people I care about — my kids, my parents, anyone curious enough to talk to an AI — deserve one that isn't quietly logging them to a server farm. So I put a whole language model in a box that fits on a desk, costs less than a weekend of daycare, and runs with the WiFi unplugged.
Under the hood
Wake Word Detection
Two-stage "Hey Gemma" activation with fuzzy matching to handle 7-year-old diction.
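The writeup is the authority on the actual two-stage detector; as a sketch of the fuzzy-matching half, Python's stdlib `difflib` is enough to tolerate kid-style pronunciations. The threshold value and helper name below are assumptions for illustration.

```python
from difflib import SequenceMatcher

WAKE_PHRASE = "hey gemma"
THRESHOLD = 0.75  # assumed tolerance, loose enough for a 7-year-old's diction

def is_wake_word(transcript: str) -> bool:
    """Fuzzy-match a transcript against the wake phrase."""
    heard = transcript.lower().strip()
    return SequenceMatcher(None, heard, WAKE_PHRASE).ratio() >= THRESHOLD
```

Near-misses like "hey jemma" or "hay gemma" score well above the threshold, while unrelated speech such as "what time is it" falls far below it.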
Speech Recognition
Whisper Small + Silero VAD — robust offline STT even in a noisy kitchen.
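Silero VAD is a neural model; as a toy illustration of what a VAD gate contributes to this pipeline (frames too quiet to contain speech never reach Whisper, so the GPU isn't transcribing kitchen hum), here is a simple energy-based stand-in, not the actual Silero code.

```python
def rms(frame):
    """Root-mean-square energy of one audio frame (floats in [-1, 1])."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def speech_frames(frames, threshold=0.05):
    """Keep only frames loud enough to plausibly contain speech."""
    return [f for f in frames if rms(f) >= threshold]

silence = [0.001] * 160                 # near-silent frame
speech = [0.2, -0.3, 0.25, -0.1] * 40   # loud frame
kept = speech_frames([silence, speech, silence])
```

A real VAD is far more discriminating than an energy gate (it rejects loud non-speech too), but the pipeline role is the same: it decides which audio is worth transcribing.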
Language Model
Gemma 3n 4B quantized to Q4_K_XL via Unsloth. 5.4GB on disk, 2.3GB on the GPU.
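A quick back-of-envelope check on that GPU figure, treating the 4B parameter count as nominal (rough arithmetic, not a measurement):

```python
# Q4 quantization stores roughly 4 bits, i.e. half a byte, per weight
params = 4e9            # nominal Gemma 3n 4B parameter count
bytes_per_weight = 0.5
weights_gb = params * bytes_per_weight / 1e9
# ~2.0 GB of quantized weights; KV cache and runtime overhead
# plausibly account for the rest of the reported 2.3 GB footprint
```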
Voice Synthesis
Piper TTS "Amy" voice, streamed out sentence-by-sentence so there's no silent wait.
Questions people ask
What is Gemma 3n Voice Companion?
A privacy-first AI voice assistant that runs entirely on an NVIDIA Jetson Orin NX — no cloud, no internet needed. It listens, understands, and replies in real time with warm, patient conversation. It was my 1st-place entry in the Google Gemma 3n Impact Challenge on Kaggle.
Does it send any data to the cloud?
No. Speech-to-text, the language model, and text-to-speech all run on the device. Not a single byte of conversation leaves the hardware. Unplug the Ethernet and Wi-Fi and it still works.
How much does the hardware cost?
About $499 for an NVIDIA Jetson Orin NX 16GB plus basic accessories. Total power draw is around 15 watts under load — less than a desk lamp.
Who is it built for?
Anyone who wants a warm, private AI companion: children, elderly family members, therapy settings, classrooms, campers, off-grid homes, and anyone who would rather keep their conversations to themselves.
What model does it use?
Google's Gemma 3n 4B, quantized to Q4_K_XL via Unsloth and served with Ollama. Whisper Small handles speech-to-text, Silero VAD handles voice activity detection, and Piper TTS (the "Amy" voice) handles text-to-speech — all running locally.
How fast does it respond?
End-to-end latency is 2-3 seconds from end-of-speech to start-of-reply. The LLM generates around 12 tokens per second, and TTS streams out sentence-by-sentence so there's no silent wait.
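Rough arithmetic behind that feel, using the 12 tok/s figure above plus an assumed length for a short opening sentence:

```python
tok_per_sec = 12                 # generation rate from the page
first_sentence_tokens = 15       # assumed length of a short opening sentence
time_to_first_sentence = first_sentence_tokens / tok_per_sec
# ~1.25 s of generation before the first sentence can be spoken;
# add transcription and synthesis and the 2-3 s end-to-end figure follows
```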
Curious how it all fits together?
Everything — the pipeline, the prompting, the quantization tradeoffs, the safety guardrails — is laid out in the writeup.
Read the full writeup on Kaggle