Technical Deep Dive
Building a Real-Time Voice AI Assistant on NVIDIA Jetson Orin NX
- Tokens/Second: 14.5
- GPU Memory Usage: 2.3GB
- End-to-End Latency: 2-3s
- Offline Operation: 100%
Hardware Platform
NVIDIA Jetson Orin NX
- SoC: NVIDIA Orin NX (T234)
- CPU: 8-core ARM Cortex-A78AE
- GPU: 1024 CUDA cores (Ampere)
- AI Performance: 100 TOPS
- Memory: 16GB LPDDR5 (Unified)
- Memory Bandwidth: 204.8 GB/s
- Power: 10-25W Configurable
Memory Allocation
- System & OS: ~3.0GB
- Whisper STT: ~1.0GB (CPU)
- Silero VAD: ~0.2GB (CPU)
- Gemma 3n (Ollama): ~2.3GB (GPU)
- Piper TTS: ~0.5GB (CPU)
- Audio Buffers: ~0.5GB
- Available Buffer: ~7.0GB
Gemma 3n Model Configuration
I deployed the Gemma 3n Enhanced 4B parameter variant using aggressive quantization for edge deployment:
- Quantization: Q4_K_XL
- Model Size: 5.4GB
- Context Tokens: 1024
- Inference Speed: 12 tok/s
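These settings reach the model through Ollama's options object, which uses its own parameter names (for example, a max-tokens limit becomes num_predict). A minimal sketch of that mapping, using the values from this configuration; the helper name is mine, not the project's:

```python
def build_ollama_options(temperature=0.77, max_tokens=117, top_p=0.95,
                         top_k=33, repeat_penalty=1.33, num_ctx=1024):
    """Translate friendly config names into Ollama's option names."""
    return {
        "temperature": temperature,
        "num_predict": max_tokens,   # Ollama's name for the token limit
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repeat_penalty,
        "num_ctx": num_ctx,          # Context window size
    }

options = build_ollama_options()
```

This dict is what gets passed as "options" in the /api/chat request body shown below.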
Implementation Details
Two-Stage Wake Word Activation (gemma_voice_final.py)
def respond(self, user_input):
    # Two-stage activation: silent warm-up first, then real activation
    if not self.pre_warmed and self.check_wake_word(user_input):
        # First wake word - do silent warm-up
        self.pre_warmed = True
        print("\n🔥 Wake word detected, doing silent warm-up...")
        # Mute TTS temporarily
        original_tts = self.tts
        self.tts = None
        self.processing = True
        self.stt.muted = True
        # Do a real response to warm up the model properly
        warm_messages = self.messages + [{"role": "user", "content": user_input}]
        response = requests.post(
            f"{OLLAMA_HOST}/api/chat",
            json={
                "model": OLLAMA_MODEL,
                "messages": warm_messages,
                "stream": True,
                "keep_alive": "60m",  # Top-level field in Ollama's API, not an option
                "options": {
                    "temperature": 0.77,
                    "num_predict": 117,
                    "num_ctx": 1024,
                    "num_gpu": 999,  # Force full GPU offload
                },
            },
            stream=True,
            timeout=(1, None),
        )
        # Enable quick response mode for instant detection
        self.stt.set_quick_mode(True)
        print("✨ Warmed up! Say 'Gemma' again for instant activation.")

Streaming LLM to TTS Pipeline (gemma_voice_final.py)
# Stream tokens and speak sentences immediately
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        if "message" in chunk:
            token = clean_text(chunk["message"].get("content", ""))
            if token:
                print(token, end="", flush=True)
                full_response += token
                buffer.append(token)
                text = ''.join(buffer)
                sentences = re.split(r'([.!?]+\s*)', text)
                # Process complete sentences immediately
                i = 0
                while i < len(sentences) - 1:
                    if re.match(r'[.!?]+\s*', sentences[i+1]):
                        sentence = sentences[i] + sentences[i+1]
                        if sentence.strip() and self.tts:
                            self.tts.speak(sentence.strip())
                        i += 2
                    else:
                        break
                # Keep the trailing incomplete sentence in the buffer
                buffer = [sentences[-1]] if i < len(sentences) else []
                time.sleep(TYPING_DELAY)

Optimized Ollama Configuration (config.py)
# Model Configuration
OLLAMA_MODEL = "hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL"
OLLAMA_HOST = "http://localhost:11434"
# Generation Settings - Optimized for conversation
TEMPERATURE = 0.77 # Balanced creativity/coherence
MAX_TOKENS = 117 # Keep responses concise
TOP_P = 0.95 # Nucleus sampling
TOP_K = 33 # Top-k sampling
REPEAT_PENALTY = 1.33 # Reduce repetition
CONTEXT_WINDOW = 1024 # Context window size
NUM_CTX = 1024 # Ollama parameter name
# Performance Settings
STREAM_CHUNK_SIZE = 17 # Characters per chunk
TYPING_DELAY = 0.03 # Natural feel
# Audio Settings
SAMPLE_RATE = 16000 # Whisper requirement
STT_VAD_THRESHOLD = 0.11 # Voice activity detection
STT_MIN_SPEECH_MS = 200 # Minimum speech duration
STT_MAX_SILENCE_MS = 666       # Maximum silence

Voice Activity Detection with Silero (gemma_voice_final.py)
def __init__(self):
    # Load Silero VAD for efficient voice detection
    vad_path = Path.home() / ".cache/silero/silero_vad.jit"
    vad_path.parent.mkdir(parents=True, exist_ok=True)
    if not vad_path.exists():
        import urllib.request
        urllib.request.urlretrieve(
            "https://github.com/snakers4/silero-vad/raw/master/src/silero_vad/data/silero_vad.jit",
            vad_path
        )
    self.vad = torch.jit.load(str(vad_path), map_location="cpu")
    self.vad.eval()

def listen(self, callback, is_activated_callback=None):
    # Adaptive thresholds for different states
    if self.quick_response_mode:
        threshold = 0.15  # Very sensitive for instant wake word
    elif is_activated_callback and not is_activated_callback():
        threshold = 0.3   # Moderately sensitive for initial wake word
    else:
        threshold = 0.4   # Less sensitive mid-conversation to avoid false triggers
    speech = self.vad(torch.from_numpy(audio), self.sample_rate).item() > threshold

Performance Metrics
- Wake Word Detection: < 100ms (after warm-up phase)
- Speech Recognition: 800ms (Whisper Small on CPU)
- LLM Generation: 12 tok/s (Q4_K_XL on GPU)
- TTS First Byte: < 100ms (Piper ONNX Runtime)
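The generation speed above can be read straight from the statistics Ollama reports in the final streamed chunk: eval_count (tokens generated) and eval_duration (nanoseconds). A small sketch, with the helper name my own:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's final-chunk statistics."""
    return eval_count / (eval_duration_ns / 1e9)

# Example: 117 tokens generated in 9.75 seconds of eval time
speed = tokens_per_second(117, 9_750_000_000)  # 12.0 tok/s
```

In practice you would read both fields out of the last JSON chunk of the /api/chat stream (the one with "done": true).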
System Integration
Software Stack
- Operating System: Ubuntu 22.04.5 LTS
- JetPack: 6.2 (L4T R36.4.4)
- CUDA: 12.6.68
- Python: 3.10.18 (conda)
- PyTorch: 2.6.0 (CUDA)
- Ollama: 0.5.1
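Because the pinned wheels above only exist for specific interpreter versions, it can help to fail fast when the environment drifts. A tiny guard, a sketch of my own rather than project code:

```python
import sys

def matches_tested_stack(version_info=None, expected=(3, 10)):
    """True when the interpreter matches the tested minor version (3.10.x)."""
    vi = sys.version_info if version_info is None else version_info
    return (vi[0], vi[1]) == expected

# The stack above was validated on Python 3.10.18 from conda
ok = matches_tested_stack((3, 10, 18))
```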
Process Architecture
Main Process (gemma_voice_final.py)
├─ Main Thread: Ollama API
├─ Audio Input Thread: PyAudio 16kHz
├─ VAD Thread: Silero Inference
├─ STT Worker: Whisper
├─ TTS Worker: Piper
└─ Keep-Alive Thread: Model Persistence
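The STT and TTS workers in this tree follow the standard producer-consumer pattern: each thread blocks on a queue and handles items as they arrive, so audio capture never stalls behind transcription or speech synthesis. A simplified sketch of that pattern (not the project's exact code):

```python
import queue
import threading

def make_worker(handler):
    """Spawn a daemon worker that applies handler to each queued item."""
    q = queue.Queue()
    def loop():
        while True:
            item = q.get()
            if item is None:   # Sentinel value shuts the worker down
                break
            handler(item)
            q.task_done()
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return q, t

# Example: a stand-in for the TTS worker that "speaks" by recording text
spoken = []
tts_queue, tts_thread = make_worker(spoken.append)
tts_queue.put("Hello from Gemma")
tts_queue.put(None)   # Ask the worker to exit
tts_thread.join()
```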
Resource Utilization
- CPU Usage (Active): 45% (8-core ARM @ 1.98GHz)
- GPU Usage (Active): 65% (1024 CUDA cores)
- Memory Usage: 12GB of 16GB available
- Power Draw: 15W (25W maximum)
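On Jetson, figures like these come from the tegrastats utility. A hedged sketch of pulling RAM and GPU load out of a tegrastats line with regexes; the sample line is illustrative only, and the exact field layout can vary between JetPack releases:

```python
import re

def parse_tegrastats(line: str) -> dict:
    """Extract RAM usage (MB) and GPU load (%) from a tegrastats output line."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    return {
        "ram_used_mb": int(ram.group(1)) if ram else None,
        "ram_total_mb": int(ram.group(2)) if ram else None,
        "gpu_percent": int(gpu.group(1)) if gpu else None,
    }

# Illustrative line, not captured output from the device
sample = "RAM 12288/15823MB (lfb 4x4MB) GR3D_FREQ 65%@1300"
stats = parse_tegrastats(sample)
```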
Text-to-Speech Pipeline
Piper TTS Configuration (tts_piper.py)
# Voice parameters
TTS_PIPER_SPEED = 0.66           # Slower for clarity
TTS_PIPER_SENTENCE_PAUSE = 0.02  # Natural pauses

class EnhancedTTS:
    def __init__(self, use_piper: bool = True):
        # Piper configuration
        self.piper_path = str(Path.home() / "bin" / "piper" / "piper")
        self.piper_model_dir = Path.home() / ".local/share/piper/voices"
        # Available Piper voices
        self.piper_voices = {
            "amy": "en_US-amy-medium.onnx",   # Female, child-friendly
            "ryan": "en_US-ryan-high.onnx",   # Male, natural
        }
        # Audio device config
        self.audio_device = TTS_DEVICE

    def _speak_piper(self, text: str):
        # Piper command with optimal settings
        piper_cmd = [
            self.piper_path,
            "--model", str(voice_file),
            "--output-raw",
            "--length-scale", str(TTS_PIPER_SPEED),
            "--sentence-silence", str(TTS_PIPER_SENTENCE_PAUSE),
        ]

Deployment Configuration
I created a fully automated deployment system that ensures consistent performance across restarts:
Installation Commands (setup.sh)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Gemma 3n quantized model
ollama pull hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL
# Download Piper voices
mkdir -p ~/.local/share/piper/voices
cd ~/.local/share/piper/voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx.json
# Create conda environment
conda create -n gemma_jetson python=3.10.18
conda activate gemma_jetson
# Install dependencies in specific order
pip install numpy==1.22.0
pip install torch==2.6.0 torchaudio==2.6.0
pip install openai-whisper==20250625
pip install ollama==0.5.1
pip install piper==0.14.4
pip install pyaudio==0.2.14

Technical Achievements
Memory Optimization
Strategic quantization cut the model from ~16GB in FP16 to a 5.4GB file, a roughly two-thirds reduction on disk, with only 2.3GB resident in GPU memory at runtime:
- Original FP16 Model: ~16GB
- Q4_K_XL Quantized: 5.4GB
- GPU Memory Active: 2.3GB
- Quality Loss: Minimal
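The arithmetic behind the reduction is simple: model size ≈ parameter count × bits per weight / 8. Assuming roughly 8B raw parameters for the E4B checkpoint (my assumption, not a figure from the project), FP16 at 16 bits per weight gives ~16GB; back-solving from the 5.4GB file gives an average of about 5.4 effective bits per weight, plausible for Q4_K_XL once higher-precision embeddings and quantization scales are included:

```python
def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate model size: params * bits / 8 bytes, expressed in GB (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

fp16_gb = model_size_gb(8e9, 16)    # ~16.0 GB in half precision
q4_gb = model_size_gb(8e9, 5.4)     # ~5.4 GB at ~5.4 effective bits/weight
```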
Latency Optimization
I implemented multiple strategies to minimize response time:
- Model Pre-warming: Eliminates cold start
- Streaming Pipeline: Parallel processing
- Keep-Alive Thread: Model persistence
- Quick Response Mode: Instant activation
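The keep-alive strategy above is a simple periodic task: ping Ollama before the keep_alive window expires so the model is never evicted from GPU memory. A generic sketch of the pattern; the interval and the counting task here are placeholders, not the project's actual values:

```python
import threading

def start_periodic(task, interval_s, stop_event):
    """Run task every interval_s seconds until stop_event is set."""
    def loop():
        # Event.wait doubles as an interruptible sleep
        while not stop_event.wait(interval_s):
            task()
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# Example with a fast interval and a counting task standing in for the ping
import time
pings = []
stop = threading.Event()
t = start_periodic(lambda: pings.append(1), 0.01, stop)
time.sleep(0.1)
stop.set()
t.join()
```

In the real assistant the task would be a lightweight request to the Ollama server, with an interval comfortably shorter than the 60-minute keep_alive.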
Conclusion
This project demonstrates that sophisticated AI doesn't require massive data centers or privacy compromises. By carefully optimizing every component of the pipeline, I created a voice assistant that:
- Offline Operation: 100%
- Data Collected: 0
- Power Usage: 15W
- Privacy: Guaranteed
Built with dedication for the Gemma 3n Impact Challenge - Making AI accessible, private, and meaningful for everyone.