Technical Deep Dive

Building a Real-Time Voice AI Assistant on NVIDIA Jetson Orin NX

  • Tokens/Second: 14.5
  • GPU Memory Usage: 2.3GB
  • End-to-End Latency: 2-3s
  • Offline Operation: 100%

System Architecture

Audio Capture (16kHz PCM) → Wake Word (two-stage detection) → Whisper STT (CPU inference) → Gemma 3n (Q4_K_XL quantized, GPU) → Piper TTS (ONNX Runtime)
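The hand-off between these stages can be pictured as a small queue-driven pipeline. The following is an illustrative sketch with placeholder stage functions, not the project's actual threading code:

```python
import queue
import threading

def run_stage(stage_fn, inbox, outbox):
    """Pull items from inbox, transform them, and push downstream.
    A None item is a shutdown signal that propagates through the pipeline."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            break
        outbox.put(stage_fn(item))

# Placeholder stage functions standing in for the real components.
stages = [
    lambda audio: f"text({audio})",    # Whisper STT
    lambda text: f"reply({text})",     # Gemma 3n via Ollama
    lambda reply: f"speech({reply})",  # Piper TTS
]

queues = [queue.Queue() for _ in range(len(stages) + 1)]
threads = [
    threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]), daemon=True)
    for i, fn in enumerate(stages)
]
for t in threads:
    t.start()

queues[0].put("chunk0")  # an utterance from the capture/wake-word side
queues[0].put(None)      # shut the pipeline down

results = []
while (out := queues[-1].get()) is not None:
    results.append(out)
print(results)  # ['speech(reply(text(chunk0)))']
```

Each stage blocks only on its own queue, so STT, generation, and speech synthesis overlap in exactly the way the latency numbers later in this post rely on.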

Hardware Platform

NVIDIA Jetson Orin NX

  • SoC: NVIDIA Orin NX (T234)
  • CPU: 8-core ARM Cortex-A78AE
  • GPU: 1024 CUDA cores (Ampere)
  • AI Performance: 100 TOPS
  • Memory: 16GB LPDDR5 (unified)
  • Memory Bandwidth: 204.8 GB/s
  • Power: 10-25W (configurable)

Memory Allocation

  • System & OS: ~3.0GB
  • Whisper STT: ~1.0GB (CPU)
  • Silero VAD: ~0.2GB (CPU)
  • Gemma 3n (Ollama): ~2.3GB (GPU)
  • Piper TTS: ~0.5GB (CPU)
  • Audio Buffers: ~0.5GB
  • Available Buffer: ~7.0GB

Gemma 3n Model Configuration

I deployed the Gemma 3n Enhanced 4B parameter variant using aggressive quantization for edge deployment:

  • Quantization: Q4_K_XL
  • Model Size: 5.4GB
  • Context Window: 1024 tokens
  • Inference Speed: 12 tok/s
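These figures pass a back-of-envelope check, assuming the E4B checkpoint holds roughly 8e9 raw weights (the "4B" in the name refers to the model's *effective* parameter count, which is an assumption on my part about the raw count):

```python
# Sanity check of the model-size figures.
# Assumption: the E4B checkpoint stores ~8e9 raw weights.
raw_params = 8e9

fp16_gb = raw_params * 2 / 1e9               # 2 bytes per FP16 weight
q4_bits_per_weight = 5.4e9 * 8 / raw_params  # implied by the 5.4GB file

print(f"FP16 checkpoint:  ~{fp16_gb:.0f} GB")
print(f"Q4_K_XL density:  ~{q4_bits_per_weight:.1f} bits/weight")
```

About 5.4 bits per weight is consistent with Q4_K-family quantization, which mixes 4-bit blocks with higher-precision scales and selectively less-compressed tensors.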

Implementation Details

Two-Stage Wake Word Activation (gemma_voice_final.py)
def respond(self, user_input):
    # Two-stage activation: silent warm-up first, then real activation
    if not self.pre_warmed and self.check_wake_word(user_input):
        # First wake word - do silent warm-up
        self.pre_warmed = True
        print(f"\n🔥 Wake word detected, doing silent warm-up...")

        # Mute TTS temporarily
        original_tts = self.tts
        self.tts = None
        self.processing = True
        self.stt.muted = True

        # Do a real response to warm up the model properly
        warm_messages = self.messages + [{"role": "user", "content": user_input}]
        response = requests.post(
            f"{OLLAMA_HOST}/api/chat",
            json={
                "model": OLLAMA_MODEL,
                "messages": warm_messages,
                "stream": True,
                "options": {
                    "temperature": 0.77,
                    "num_predict": 117,
                    "num_ctx": 1024,
                    "num_gpu": 999,  # Force full GPU usage
                    "keep_alive": "60m"
                }
            },
            stream=True,
            timeout=(1, None)
        )

        # (remainder of the warm-up stream is consumed silently, then state is restored)
        self.tts = original_tts
        self.stt.muted = False

        # Enable quick response mode for instant detection
        self.stt.set_quick_mode(True)
        print("✨ Warmed up! Say 'Gemma' again for instant activation.")
Streaming LLM to TTS Pipeline (gemma_voice_final.py)
# Stream tokens and speak sentences immediately
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        if "message" in chunk:
            token = clean_text(chunk["message"].get("content", ""))
            if token:
                print(token, end="", flush=True)
                full_response += token
                buffer.append(token)

                text = ''.join(buffer)
                sentences = re.split(r'([.!?]+\s*)', text)

                # Process complete sentences immediately
                i = 0
                while i < len(sentences) - 1:
                    if re.match(r'[.!?]+\s*', sentences[i+1]):
                        sentence = sentences[i] + sentences[i+1]
                        if sentence.strip() and self.tts:
                            self.tts.speak(sentence.strip())
                        i += 2
                    else:
                        break

                buffer = [sentences[-1]] if i < len(sentences) else []
                time.sleep(TYPING_DELAY)
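The sentence-boundary logic above hinges on `re.split` with a capturing group, which keeps each punctuation run as its own list element. A standalone demo of that behavior:

```python
import re

# A partially streamed response: two finished sentences plus an unfinished tail.
text = "Hello there! I am Gemma. This sentence is still streami"
parts = re.split(r'([.!?]+\s*)', text)
# The capturing group keeps each punctuation run as a separate element:
# ['Hello there', '! ', 'I am Gemma', '. ', 'This sentence is still streami']

# Reassemble (sentence, punctuation) pairs exactly as the streaming loop does.
complete, i = [], 0
while i < len(parts) - 1:
    if re.match(r'[.!?]+\s*', parts[i + 1]):
        complete.append((parts[i] + parts[i + 1]).strip())
        i += 2
    else:
        break

print(complete)   # ['Hello there!', 'I am Gemma.']  -> spoken immediately
print(parts[-1])  # 'This sentence is still streami' -> stays in the buffer
```

Complete sentences go to Piper as soon as they close, while the unterminated tail carries over into the buffer for the next token.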
Optimized Ollama Configuration (config.py)
# Model Configuration
OLLAMA_MODEL = "hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL"
OLLAMA_HOST = "http://localhost:11434"

# Generation Settings - Optimized for conversation
TEMPERATURE = 0.77          # Balanced creativity/coherence
MAX_TOKENS = 117           # Keep responses concise
TOP_P = 0.95               # Nucleus sampling
TOP_K = 33                 # Top-k sampling
REPEAT_PENALTY = 1.33      # Reduce repetition
CONTEXT_WINDOW = 1024      # Context window size
NUM_CTX = 1024             # Ollama parameter name

# Performance Settings
STREAM_CHUNK_SIZE = 17     # Characters per chunk
TYPING_DELAY = 0.03        # Natural feel

# Audio Settings
SAMPLE_RATE = 16000        # Whisper requirement
STT_VAD_THRESHOLD = 0.11  # Voice activity detection
STT_MIN_SPEECH_MS = 200    # Minimum speech duration
STT_MAX_SILENCE_MS = 666   # Maximum silence
Voice Activity Detection with Silero (gemma_voice_final.py)
import torch
from pathlib import Path

def __init__(self):
    # Load Silero VAD for efficient voice detection
    vad_path = Path.home() / ".cache/silero/silero_vad.jit"
    vad_path.parent.mkdir(parents=True, exist_ok=True)

    if not vad_path.exists():
        import urllib.request
        urllib.request.urlretrieve(
            "https://github.com/snakers4/silero-vad/raw/master/src/silero_vad/data/silero_vad.jit",
            vad_path
        )

    self.vad = torch.jit.load(str(vad_path), map_location="cpu")
    self.vad.eval()

def listen(self, callback, is_activated_callback=None):
    # Adaptive thresholds for different states
    if self.quick_response_mode:
        threshold = 0.15  # Very sensitive for instant wake word
    elif is_activated_callback and not is_activated_callback():
        threshold = 0.3   # Moderately sensitive for initial wake word
    else:
        threshold = 0.4   # Least sensitive: avoids false triggers mid-conversation

    speech = self.vad(torch.from_numpy(audio), self.sample_rate).item() > threshold
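For context on the timing constants: Silero VAD is typically fed 512-sample windows at 16kHz (an assumption about this project's chunk size), so each decision covers 32ms, and the 666ms max-silence setting maps to about 20 consecutive below-threshold frames:

```python
SAMPLE_RATE = 16000       # Whisper/Silero sample rate from config.py
CHUNK_SAMPLES = 512       # assumption: Silero VAD's usual 16kHz window size
STT_MAX_SILENCE_MS = 666  # end-of-utterance setting from config.py

frame_ms = CHUNK_SAMPLES / SAMPLE_RATE * 1000
silence_frames = int(STT_MAX_SILENCE_MS / frame_ms)
print(f"{frame_ms:.0f}ms per VAD decision; "
      f"end of utterance after ~{silence_frames} consecutive silent frames")
```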

Performance Metrics

  • Wake Word Detection: < 100ms (after warm-up phase)
  • Speech Recognition: 800ms (Whisper Small on CPU)
  • LLM Generation: 12 tok/s (Q4_K_XL on GPU)
  • TTS First Byte: < 100ms (Piper ONNX Runtime)
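These per-stage numbers explain the 2-3s end-to-end figure: because sentences are spoken as they complete, only the first sentence's tokens gate the time to first audio. A rough budget (the 15-token first sentence is an assumption):

```python
# Latency budget to *first spoken audio*, not to the full response.
wake_ms = 100               # wake word detection (after warm-up)
stt_ms = 800                # Whisper Small on CPU
tts_first_byte_ms = 100     # Piper first audio
tok_per_s = 12              # sustained GPU generation rate
first_sentence_tokens = 15  # assumption: a short opening sentence

gen_ms = first_sentence_tokens / tok_per_s * 1000
total_ms = wake_ms + stt_ms + gen_ms + tts_first_byte_ms
print(f"time to first spoken audio: ~{total_ms / 1000:.1f}s")  # ~2.2s
```

Waiting for the full 117-token response before speaking would instead take nearly 10 seconds of generation alone, which is why the streaming pipeline matters.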

System Integration

Software Stack

  • Operating System: Ubuntu 22.04.5 LTS
  • JetPack: 6.2 (L4T R36.4.4)
  • CUDA: 12.6.68
  • Python: 3.10.18 (conda)
  • PyTorch: 2.6.0 (CUDA)
  • Ollama: 0.5.1

Process Architecture

Main Process (gemma_voice_final.py)
├─ Main Thread: Ollama API
├─ Audio Input Thread: PyAudio, 16kHz
├─ VAD Thread: Silero inference
├─ STT Worker: Whisper
├─ TTS Worker: Piper
└─ Keep-Alive Thread: Model persistence

Resource Utilization

  • CPU Usage (Active): 45% (8-core ARM @ 1.98GHz)
  • GPU Usage (Active): 65% (1024 CUDA cores)
  • Memory Usage: 12GB of 16GB available
  • Power Draw: 15W (25W maximum)

Text-to-Speech Pipeline

Piper TTS Configuration (tts_piper.py)
class EnhancedTTS:
    def __init__(self, use_piper: bool = True):
        # Piper configuration
        self.piper_path = str(Path.home() / "bin" / "piper" / "piper")
        self.piper_model_dir = Path.home() / ".local/share/piper/voices"

        # Available Piper voices
        self.piper_voices = {
            "amy": "en_US-amy-medium.onnx",    # Female, child-friendly
            "ryan": "en_US-ryan-high.onnx",     # Male, natural
        }

        # Audio device config
        self.audio_device = TTS_DEVICE

        # Voice parameters (instance attributes so _speak_piper can read them)
        self.piper_speed = 0.66           # Piper length-scale (duration multiplier)
        self.piper_sentence_pause = 0.02  # Natural pauses between sentences

    def _speak_piper(self, text: str):
        # Default voice; the full implementation selects from self.piper_voices
        voice_file = self.piper_model_dir / self.piper_voices["amy"]

        # Piper command with optimal settings
        piper_cmd = [
            self.piper_path,
            "--model", str(voice_file),
            "--output-raw",
            "--length-scale", str(self.piper_speed),
            "--sentence-silence", str(self.piper_sentence_pause),
        ]
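For completeness, here is a hedged sketch of how such a command list can be assembled and its raw PCM output piped into a player. The paths, helper name, and the `aplay` parameters are assumptions for illustration, not taken from the project:

```python
import subprocess
from pathlib import Path

def build_piper_cmd(piper_path, voice_file, speed=0.66, pause=0.02):
    """Assemble the Piper CLI call shown above (hypothetical helper)."""
    return [
        str(piper_path),
        "--model", str(voice_file),
        "--output-raw",                 # raw 16-bit PCM on stdout
        "--length-scale", str(speed),
        "--sentence-silence", str(pause),
    ]

cmd = build_piper_cmd(
    Path.home() / "bin/piper/piper",
    Path.home() / ".local/share/piper/voices/en_US-amy-medium.onnx",
)

# Hypothetical wiring: feed text on stdin, pipe raw PCM into aplay.
# (The sample rate depends on the voice; 22050Hz is typical for medium voices.)
# piper = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# subprocess.Popen(["aplay", "-r", "22050", "-f", "S16_LE", "-t", "raw"],
#                  stdin=piper.stdout)
print(cmd[cmd.index("--length-scale") + 1])  # '0.66'
```

Streaming raw PCM straight to the sound device avoids writing temporary WAV files, which keeps the TTS first-byte latency under 100ms.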

Deployment Configuration

I created a fully automated deployment system that ensures consistent performance across restarts:

Installation Commands (setup.sh)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 3n quantized model
ollama pull hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL

# Download Piper voices
mkdir -p ~/.local/share/piper/voices
cd ~/.local/share/piper/voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx.json

# Create conda environment
conda create -n gemma_jetson python=3.10.18
conda activate gemma_jetson

# Install dependencies in specific order
pip install numpy==1.22.0
pip install torch==2.6.0 torchaudio==2.6.0
pip install openai-whisper==20250625
pip install ollama==0.5.1
pip install piper==0.14.4
pip install pyaudio==0.2.14

Technical Achievements

Memory Optimization

Strategic quantization cut the model's footprint sharply: the file shrinks by roughly two-thirds versus FP16, and the GPU-resident working set by about 85%:

  • Original FP16 Model: ~16GB
  • Q4_K_XL Quantized: 5.4GB
  • GPU Memory (Active): 2.3GB
  • Quality Loss: Minimal
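Checking those figures against each other:

```python
fp16_gb, q4_gb, gpu_active_gb = 16.0, 5.4, 2.3

size_reduction = 1 - q4_gb / fp16_gb              # on-disk model file
resident_reduction = 1 - gpu_active_gb / fp16_gb  # weights resident on the GPU

print(f"model file:       {size_reduction:.0%} smaller than FP16")      # 66%
print(f"GPU-resident use: {resident_reduction:.0%} smaller than FP16")  # 86%
```

The active GPU figure is lower than the file size because Ollama memory-maps the model and only part of it needs to stay resident alongside the 1024-token KV cache.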

Latency Optimization

I implemented multiple strategies to minimize response time:

  • Model Pre-warming: Eliminates cold start
  • Streaming Pipeline: Parallel processing
  • Keep-Alive Thread: Model persistence
  • Quick Response Mode: Instant activation
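The keep-alive thread can be sketched using Ollama's documented behavior that an empty-prompt `/api/generate` request simply (re)loads the model and refreshes its `keep_alive` eviction timer. The 5-minute ping interval is an assumption, not the project's actual value:

```python
import json
import threading
import urllib.request

OLLAMA_HOST = "http://localhost:11434"
OLLAMA_MODEL = "hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL"

def keep_alive_payload(model, minutes=60):
    """An empty prompt makes /api/generate load the model without generating,
    which refreshes its keep_alive eviction timer."""
    return {"model": model, "prompt": "", "keep_alive": f"{minutes}m"}

def keep_alive_loop(stop, interval_s=300):  # assumption: ping every 5 minutes
    while not stop.is_set():
        try:
            req = urllib.request.Request(
                f"{OLLAMA_HOST}/api/generate",
                data=json.dumps(keep_alive_payload(OLLAMA_MODEL)).encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=5).read()
        except OSError:
            pass  # Ollama not reachable; try again next cycle
        stop.wait(interval_s)

stop = threading.Event()
threading.Thread(target=keep_alive_loop, args=(stop,), daemon=True).start()
```

Without this, Ollama unloads the model after its idle timeout and the next wake word would pay the multi-second cold-start cost the warm-up phase was designed to hide.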

Conclusion

This project demonstrates that sophisticated AI doesn't require massive data centers or privacy compromises. By carefully optimizing every component of the pipeline, I created a voice assistant that:

  • Offline Operation: 100%
  • Data Collected: 0
  • Power Usage: 15W
  • Privacy: Guaranteed

Built with dedication for the Gemma 3n Impact Challenge - Making AI accessible, private, and meaningful for everyone.