Technical Deep Dive
Building a Real-Time Voice AI Assistant on NVIDIA Jetson Orin NX
- Tokens/Second: 14.5
- GPU Memory Usage: 2.3GB
- End-to-End Latency: 2-3s
- Offline Operation: 100%
Hardware Platform
NVIDIA Jetson Orin NX
- SoC: NVIDIA Orin NX (T234)
- CPU: 8-core ARM Cortex-A78AE
- GPU: 1024 CUDA cores (Ampere)
- AI Performance: 100 TOPS
- Memory: 16GB LPDDR5 (Unified)
- Memory Bandwidth: 204.8 GB/s
- Power: 10-25W Configurable
Memory Allocation
- System & OS: ~3.0GB
- Whisper STT: ~1.0GB (CPU)
- Silero VAD: ~0.2GB (CPU)
- Gemma 3n (Ollama): ~2.3GB (GPU)
- Piper TTS: ~0.5GB (CPU)
- Audio Buffers: ~0.5GB
- Available Buffer: ~7.0GB
Gemma 3n Model Configuration
I deployed the Gemma 3n Enhanced 4B parameter variant using aggressive quantization for edge deployment:
- Quantization: Q4_K_XL
- Model Size: 5.4GB
- Context Tokens: 1024
- Inference Speed: 12 tok/s
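These settings reach the model through Ollama's options object, which uses its own parameter names (for example, a max-tokens limit becomes num_predict). A minimal sketch of that mapping, using the values from this configuration; the helper name is mine, not the project's:

```python
def build_ollama_options(temperature=0.77, max_tokens=117, top_p=0.95,
                         top_k=33, repeat_penalty=1.33, num_ctx=1024):
    """Translate friendly config names into Ollama's option names."""
    return {
        "temperature": temperature,
        "num_predict": max_tokens,   # Ollama's name for the token limit
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repeat_penalty,
        "num_ctx": num_ctx,          # Context window size
    }

options = build_ollama_options()
```

This dict is what gets passed as "options" in the /api/chat request body shown below.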
Implementation Details
Two-Stage Wake Word Activation (gemma_voice_final.py)
def respond(self, user_input):
    # Two-stage activation: silent warm-up first, then real activation
    if not self.pre_warmed and self.check_wake_word(user_input):
        # First wake word - do silent warm-up
        self.pre_warmed = True
        print("\n🔥 Wake word detected, doing silent warm-up...")
        # Mute TTS temporarily
        original_tts = self.tts
        self.tts = None
        self.processing = True
        self.stt.muted = True
        # Do a real response to warm up the model properly
        warm_messages = self.messages + [{"role": "user", "content": user_input}]
        response = requests.post(
            f"{OLLAMA_HOST}/api/chat",
            json={
                "model": OLLAMA_MODEL,
                "messages": warm_messages,
                "stream": True,
                "keep_alive": "60m",  # Top-level field in Ollama's API, not an option
                "options": {
                    "temperature": 0.77,
                    "num_predict": 117,
                    "num_ctx": 1024,
                    "num_gpu": 999,  # Force full GPU offload
                },
            },
            stream=True,
            timeout=(1, None),
        )
        # Enable quick response mode for instant detection
        self.stt.set_quick_mode(True)
        print("✨ Warmed up! Say 'Gemma' again for instant activation.")

Streaming LLM to TTS Pipeline (gemma_voice_final.py)
# Stream tokens and speak sentences immediately
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        if "message" in chunk:
            token = clean_text(chunk["message"].get("content", ""))
            if token:
                print(token, end="", flush=True)
                full_response += token
                buffer.append(token)
                text = ''.join(buffer)
                sentences = re.split(r'([.!?]+\s*)', text)
                # Process complete sentences immediately
                i = 0
                while i < len(sentences) - 1:
                    if re.match(r'[.!?]+\s*', sentences[i+1]):
                        sentence = sentences[i] + sentences[i+1]
                        if sentence.strip() and self.tts:
                            self.tts.speak(sentence.strip())
                        i += 2
                    else:
                        break
                # Keep the trailing incomplete sentence in the buffer
                buffer = [sentences[-1]] if i < len(sentences) else []
                time.sleep(TYPING_DELAY)

Optimized Ollama Configuration (config.py)
# Model Configuration
OLLAMA_MODEL = "hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL"
OLLAMA_HOST = "http://localhost:11434"
# Generation Settings - Optimized for conversation
TEMPERATURE = 0.77 # Balanced creativity/coherence
MAX_TOKENS = 117 # Keep responses concise
TOP_P = 0.95 # Nucleus sampling
TOP_K = 33 # Top-k sampling
REPEAT_PENALTY = 1.33 # Reduce repetition
CONTEXT_WINDOW = 1024 # Context window size
NUM_CTX = 1024 # Ollama parameter name
# Performance Settings
STREAM_CHUNK_SIZE = 17 # Characters per chunk
TYPING_DELAY = 0.03 # Natural feel
# Audio Settings
SAMPLE_RATE = 16000 # Whisper requirement
STT_VAD_THRESHOLD = 0.11 # Voice activity detection
STT_MIN_SPEECH_MS = 200 # Minimum speech duration
STT_MAX_SILENCE_MS = 666       # Maximum silence

Voice Activity Detection with Silero (gemma_voice_final.py)
def __init__(self):
    # Load Silero VAD for efficient voice detection
    vad_path = Path.home() / ".cache/silero/silero_vad.jit"
    vad_path.parent.mkdir(parents=True, exist_ok=True)
    if not vad_path.exists():
        import urllib.request
        urllib.request.urlretrieve(
            "https://github.com/snakers4/silero-vad/raw/master/src/silero_vad/data/silero_vad.jit",
            vad_path
        )
    self.vad = torch.jit.load(str(vad_path), map_location="cpu")
    self.vad.eval()

def listen(self, callback, is_activated_callback=None):
    # Adaptive thresholds for different states
    if self.quick_response_mode:
        threshold = 0.15  # Very sensitive for instant wake word
    elif is_activated_callback and not is_activated_callback():
        threshold = 0.3   # Moderately sensitive for initial wake word
    else:
        threshold = 0.4   # Less sensitive mid-conversation to avoid false triggers
    speech = self.vad(torch.from_numpy(audio), self.sample_rate).item() > threshold

Performance Metrics
- Wake Word Detection: < 100ms (after warm-up phase)
- Speech Recognition: 800ms (Whisper Small on CPU)
- LLM Generation: 12 tok/s (Q4_K_XL on GPU)
- TTS First Byte: < 100ms (Piper ONNX Runtime)
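The generation speed above can be read straight from the statistics Ollama reports in the final streamed chunk: eval_count (tokens generated) and eval_duration (nanoseconds). A small sketch, with the helper name my own:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's final-chunk statistics."""
    return eval_count / (eval_duration_ns / 1e9)

# Example: 117 tokens generated in 9.75 seconds of eval time
speed = tokens_per_second(117, 9_750_000_000)  # 12.0 tok/s
```

In practice you would read both fields out of the last JSON chunk of the /api/chat stream (the one with "done": true).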
System Integration
Software Stack
- Operating System: Ubuntu 22.04.5 LTS
- JetPack: 6.2 (L4T R36.4.4)
- CUDA: 12.6.68
- Python: 3.10.18 (conda)
- PyTorch: 2.6.0 (CUDA)
- Ollama: 0.5.1
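Because the pinned wheels above only exist for specific interpreter versions, it can help to fail fast when the environment drifts. A tiny guard, a sketch of my own rather than project code:

```python
import sys

def matches_tested_stack(version_info=None, expected=(3, 10)):
    """True when the interpreter matches the tested minor version (3.10.x)."""
    vi = sys.version_info if version_info is None else version_info
    return (vi[0], vi[1]) == expected

# The stack above was validated on Python 3.10.18 from conda
ok = matches_tested_stack((3, 10, 18))
```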
Process Architecture
Main Process (gemma_voice_final.py)
├─ Main Thread: Ollama API
├─ Audio Input Thread: PyAudio 16kHz
├─ VAD Thread: Silero Inference
├─ STT Worker: Whisper
├─ TTS Worker: Piper
└─ Keep-Alive Thread: Model Persistence
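The STT and TTS workers in this tree follow the standard producer-consumer pattern: each thread blocks on a queue and handles items as they arrive, so audio capture never stalls behind transcription or speech synthesis. A simplified sketch of that pattern (not the project's exact code):

```python
import queue
import threading

def make_worker(handler):
    """Spawn a daemon worker that applies handler to each queued item."""
    q = queue.Queue()
    def loop():
        while True:
            item = q.get()
            if item is None:   # Sentinel value shuts the worker down
                break
            handler(item)
            q.task_done()
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return q, t

# Example: a stand-in for the TTS worker that "speaks" by recording text
spoken = []
tts_queue, tts_thread = make_worker(spoken.append)
tts_queue.put("Hello from Gemma")
tts_queue.put(None)   # Ask the worker to exit
tts_thread.join()
```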
Resource Utilization
- CPU Usage (Active): 45% (8-core ARM @ 1.98GHz)
- GPU Usage (Active): 65% (1024 CUDA cores)
- Memory Usage: 12GB of 16GB available
- Power Draw: 15W (25W maximum)
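On Jetson, figures like these come from the tegrastats utility. A hedged sketch of pulling RAM and GPU load out of a tegrastats line with regexes; the sample line is illustrative only, and the exact field layout can vary between JetPack releases:

```python
import re

def parse_tegrastats(line: str) -> dict:
    """Extract RAM usage (MB) and GPU load (%) from a tegrastats output line."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    return {
        "ram_used_mb": int(ram.group(1)) if ram else None,
        "ram_total_mb": int(ram.group(2)) if ram else None,
        "gpu_percent": int(gpu.group(1)) if gpu else None,
    }

# Illustrative line, not captured output from the device
sample = "RAM 12288/15823MB (lfb 4x4MB) GR3D_FREQ 65%@1300"
stats = parse_tegrastats(sample)
```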
Text-to-Speech Pipeline
Piper TTS Configuration (tts_piper.py)
# Voice parameters
TTS_PIPER_SPEED = 0.66           # Slower for clarity
TTS_PIPER_SENTENCE_PAUSE = 0.02  # Natural pauses

class EnhancedTTS:
    def __init__(self, use_piper: bool = True):
        # Piper configuration
        self.piper_path = str(Path.home() / "bin" / "piper" / "piper")
        self.piper_model_dir = Path.home() / ".local/share/piper/voices"
        # Available Piper voices
        self.piper_voices = {
            "amy": "en_US-amy-medium.onnx",   # Female, child-friendly
            "ryan": "en_US-ryan-high.onnx",   # Male, natural
        }
        # Audio device config
        self.audio_device = TTS_DEVICE

    def _speak_piper(self, text: str):
        # Piper command with optimal settings
        piper_cmd = [
            self.piper_path,
            "--model", str(voice_file),
            "--output-raw",
            "--length-scale", str(TTS_PIPER_SPEED),
            "--sentence-silence", str(TTS_PIPER_SENTENCE_PAUSE),
        ]

Deployment Configuration
I created a fully automated deployment system that ensures consistent performance across restarts:
Installation Commands (setup.sh)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Gemma 3n quantized model
ollama pull hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL
# Download Piper voices
mkdir -p ~/.local/share/piper/voices
cd ~/.local/share/piper/voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx.json
# Create conda environment
conda create -n gemma_jetson python=3.10.18
conda activate gemma_jetson
# Install dependencies in specific order
pip install numpy==1.22.0
pip install torch==2.6.0 torchaudio==2.6.0
pip install openai-whisper==20250625
pip install ollama==0.5.1
pip install piper==0.14.4
pip install pyaudio==0.2.14

Technical Achievements
Memory Optimization
Strategic quantization cut the model from ~16GB in FP16 to a 5.4GB file, a roughly two-thirds reduction on disk, with only 2.3GB resident in GPU memory at runtime:
- Original FP16 Model: ~16GB
- Q4_K_XL Quantized: 5.4GB
- GPU Memory Active: 2.3GB
- Quality Loss: Minimal
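The arithmetic behind the reduction is simple: model size ≈ parameter count × bits per weight / 8. Assuming roughly 8B raw parameters for the E4B checkpoint (my assumption, not a figure from the project), FP16 at 16 bits per weight gives ~16GB; back-solving from the 5.4GB file gives an average of about 5.4 effective bits per weight, plausible for Q4_K_XL once higher-precision embeddings and quantization scales are included:

```python
def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate model size: params * bits / 8 bytes, expressed in GB (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

fp16_gb = model_size_gb(8e9, 16)    # ~16.0 GB in half precision
q4_gb = model_size_gb(8e9, 5.4)     # ~5.4 GB at ~5.4 effective bits/weight
```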
Latency Optimization
I implemented multiple strategies to minimize response time:
- Model Pre-warming: Eliminates cold start
- Streaming Pipeline: Parallel processing
- Keep-Alive Thread: Model persistence
- Quick Response Mode: Instant activation
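The keep-alive strategy above is a simple periodic task: ping Ollama before the keep_alive window expires so the model is never evicted from GPU memory. A generic sketch of the pattern; the interval and the counting task here are placeholders, not the project's actual values:

```python
import threading

def start_periodic(task, interval_s, stop_event):
    """Run task every interval_s seconds until stop_event is set."""
    def loop():
        # Event.wait doubles as an interruptible sleep
        while not stop_event.wait(interval_s):
            task()
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# Example with a fast interval and a counting task standing in for the ping
import time
pings = []
stop = threading.Event()
t = start_periodic(lambda: pings.append(1), 0.01, stop)
time.sleep(0.1)
stop.set()
t.join()
```

In the real assistant the task would be a lightweight request to the Ollama server, with an interval comfortably shorter than the 60-minute keep_alive.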
Conclusion
This project demonstrates that sophisticated AI doesn't require massive data centers or privacy compromises. By carefully optimizing every component of the pipeline, I created a voice assistant that:
- Offline Operation: 100%
- Data Collected: 0
- Power Usage: 15W
- Privacy: Guaranteed
Built with dedication for the Gemma 3n Impact Challenge - Making AI accessible, private, and meaningful for everyone.