Offline · On-device · Private by design

An offline AI companion, built for the ones you love.

Meet Gemma 3n Voice Companion — a private, real-time voice assistant that runs entirely on a $499 NVIDIA Jetson. No cloud, no data collection, no internet required. Warm, patient, friendly conversation that stays on the device — whether it's helping a kid with homework, keeping Grandma company, or riding shotgun on a camping trip where the signal quits.

  • Gemma 3n 4B
  • Jetson Orin NX
  • Ollama
  • Whisper STT
  • Piper TTS
  • Silero VAD
gemma_voice.py
# Gemma 3n Voice Assistant Pipeline
class GemmaVoice:
    def __init__(self):
        self.stt = WhisperSTT("small")
        self.llm = Ollama("gemma3n:4b")
        self.tts = PiperTTS("amy")

    async def respond(self, audio):
        # Real-time pipeline: speak each sentence as it streams in,
        # so TTS begins before generation finishes
        text = await self.stt.transcribe(audio)
        async for sentence in self.llm.stream(text):
            await self.tts.speak(sentence)

# Jetson Orin NX 16GB · 12 tok/s · 2.3GB GPU · 100% offline
$499
Total hardware cost
100%
Offline — no cloud, ever
12 tok/s
On-device LLM speed
15W
Power draw under load

Watch it in action

3-minute demo running end-to-end on the Jetson

Project writeup

The full story, on Kaggle

This project was my entry to the Google Gemma 3n Impact Challenge, and it ended up placing 1st. The Kaggle writeup walks through every design decision — the hardware, the quantization, the real-time streaming pipeline, and the safety model for kids.

Why this matters

Four principles behind the build — for kids, grandparents, and anyone who'd rather keep their conversations to themselves.

Privacy by Design

Nothing leaves the device. Safe for therapy sessions, classrooms, family kitchens, and anywhere privacy isn't optional.

Edge AI Innovation

A 4B-parameter LLM quantized to 4-bit and streamed live — all on $499 of hardware sipping just 15 watts.

Friendly by Default

Warm, patient replies. Short sentences. Easy to understand whether you're 7 or 77 — no tech vocabulary required.

Real-Time Pipeline

Streaming STT → LLM → TTS. Speech starts before generation finishes. It feels like a conversation, not a query.
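The overlap behind that "speech starts before generation finishes" claim can be sketched with a tiny async loop. Everything here is a hypothetical stand-in (the fake token stream and `speak()` are placeholders, not the project's actual classes):

```python
import asyncio

spoken = []

async def fake_llm_stream():
    # Hypothetical stand-in for the streaming LLM: yields sentences as they finish
    for sentence in ["Hi there!", "Ready when you are."]:
        await asyncio.sleep(0.1)  # simulated generation time
        yield sentence

async def speak(sentence):
    # Hypothetical stand-in for TTS playback
    spoken.append(sentence)

async def respond():
    # Playback starts on the first sentence, while later ones are still generating
    async for sentence in fake_llm_stream():
        await speak(sentence)

asyncio.run(respond())
print(spoken)  # ['Hi there!', 'Ready when you are.']
```

The point of the pattern: the consumer never waits for the whole response, only for the next complete sentence.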

About the builder

A dad, a soldering iron, and a stubborn idea.

I'm Stephen Murphy — engineer, dad, and privacy advocate. I built this because the people I care about — my kids, my parents, anyone curious enough to talk to an AI — deserve one that isn't quietly logging them to a server farm. So I put a whole language model in a box that fits on a desk, costs less than a weekend of daycare, and runs with the WiFi unplugged.

1st — Kaggle finish
$499 — Hardware budget
0 — Bytes to the cloud
100% — On-device inference

Under the hood

Wake Word Detection

Two-stage "Hey Gemma" activation with fuzzy matching to handle 7-year-old diction.
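The fuzzy-matching idea can be illustrated with Python's standard-library difflib. The threshold and normalization here are illustrative assumptions, not the project's exact logic:

```python
from difflib import SequenceMatcher

WAKE_PHRASE = "hey gemma"

def is_wake_word(transcript: str, threshold: float = 0.8) -> bool:
    # Compare the normalized transcript against the wake phrase;
    # a similarity ratio tolerates kid-speak like "hey jemma"
    candidate = transcript.lower().strip()
    return SequenceMatcher(None, candidate, WAKE_PHRASE).ratio() >= threshold

print(is_wake_word("hey jemma"))   # True — close enough to trigger
print(is_wake_word("play music"))  # False — ignored
```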

Speech Recognition

Whisper Small + Silero VAD — robust offline STT even in a noisy kitchen.

Language Model

Gemma 3n 4B quantized to Q4_K_XL via Unsloth. 5.4GB on disk, 2.3GB on the GPU.
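For anyone reproducing the serving setup, importing a local GGUF into Ollama looks roughly like this. The GGUF file name is a placeholder — check the Unsloth release for the actual artifact:

```shell
# Import a local Q4_K_XL GGUF into Ollama (file name is a placeholder)
cat > Modelfile <<'EOF'
FROM ./gemma-3n-4b-Q4_K_XL.gguf
EOF
ollama create gemma3n:4b -f Modelfile

# Sanity check: chat once, and watch GPU memory with tegrastats on the Jetson
ollama run gemma3n:4b "Say hello in one short sentence."
```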

Voice Synthesis

Piper TTS "Amy" voice, streamed out sentence-by-sentence so there's no silent wait.
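The sentence-by-sentence trick boils down to buffering streamed tokens until a sentence boundary appears. A small illustrative buffer — the regex boundary is an assumption, not Piper's API:

```python
import re

SENTENCE_END = re.compile(r'([.!?])\s')

def sentences_from_tokens(tokens):
    # Accumulate streamed tokens and yield complete sentences,
    # so TTS can start speaking before generation finishes
    buffer = ""
    for token in tokens:
        buffer += token
        while (m := SENTENCE_END.search(buffer)):
            yield buffer[:m.end(1)].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

chunks = list(sentences_from_tokens(["Hello", " there", ". How", " are", " you? "]))
print(chunks)  # ['Hello there.', 'How are you?']
```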

Questions people ask

What is Gemma 3n Voice Companion?

A privacy-first AI voice assistant that runs entirely on an NVIDIA Jetson Orin NX — no cloud, no internet needed. It listens, understands, and replies in real time with warm, patient conversation. It was my 1st-place entry in the Google Gemma 3n Impact Challenge on Kaggle.

Does it send any data to the cloud?

No. Speech-to-text, the language model, and text-to-speech all run on the device. Not a single byte of conversation leaves the hardware. Unplug the Ethernet and Wi-Fi and it still works.

How much does the hardware cost?

About $499 for an NVIDIA Jetson Orin NX 16GB plus basic accessories. Total power draw is around 15 watts under load — less than a desk lamp.

Who is it built for?

Anyone who wants a warm, private AI companion: children, elderly family members, therapy settings, classrooms, campers, off-grid homes, and anyone who would rather keep their conversations to themselves.

What model does it use?

Google's Gemma 3n 4B, quantized to Q4_K_XL via Unsloth and served with Ollama. Whisper Small handles speech-to-text, Silero VAD handles voice activity detection, and Piper TTS (the "Amy" voice) handles text-to-speech — all running locally.

How fast does it respond?

End-to-end latency is 2-3 seconds from end-of-speech to start-of-reply. The LLM generates around 12 tokens per second, and TTS streams out sentence-by-sentence so there's no silent wait.
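A quick back-of-envelope check that 12 tok/s can actually keep the TTS stream fed. The tokens-per-word ratio is a rough assumption for English text:

```python
# Can ~12 tok/s keep speech flowing without stalls?
tok_per_s = 12
tokens_per_word = 1.3   # rough rule of thumb for English
speech_rate = 150 / 60  # ~2.5 words/s at a typical speaking pace

generated_words_per_s = tok_per_s / tokens_per_word  # ~9.2 words/s
print(generated_words_per_s > speech_rate)  # True: generation outpaces speech
```

Since the model generates words several times faster than they can be spoken, the audio never starves once the first sentence lands.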

Curious how it all fits together?

Everything — the pipeline, the prompting, the quantization tradeoffs, the safety guardrails — is laid out in the writeup.

Read the full writeup on Kaggle