Production-ready deployment pipeline for NVIDIA Parakeet RNNT ASR via Riva with WebSocket streaming and comprehensive infrastructure automation.
https://github.com/davidbmar/nvidia-parakeet-ver-5 · public · shipped
A complete infrastructure-as-code and client-wrapper system for deploying NVIDIA's Parakeet Recurrent Neural Network Transducer (RNNT) model. It supports both modern NIM containers and traditional Riva servers, providing low-latency real-time transcription via WebSocket with word-level timestamps and confidence scores.
git clone https://github.com/davidbmar/nvidia-parakeet-ver-5.git
cd nvidia-parakeet-ver-5
./scripts/riva-010-run-complete-deployment-pipeline.sh
flowchart TD
ClientApps[Client Apps Browser or App] -->|WebSocket Audio Stream| WebSocketServer[WebSocket Server Port 8443]
WebSocketServer -->|gRPC Client Wrapper| RivaASR[NVIDIA Riva ASR NIM or Traditional]
RivaASR -->|GPU Inference| ParakeetRNNT[Parakeet RNNT Model]
WebSocketServer -->|Logs| StructuredLogging[Structured Logging and Monitoring]
RivaASR -->|Health Checks| HealthMonitor[Health Monitor]
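The "WebSocket Audio Stream" edge in the flowchart implies that the client sends audio in small pieces rather than as one upload. The following is a minimal sketch of that client-side chunking, not the project's actual client code. The sample rate, sample width, and chunk duration are assumptions, since none of them is stated here, and `chunk_pcm` is a hypothetical helper name.

```python
from typing import Iterator

# All three values are assumptions for illustration; check the server's expected format.
SAMPLE_RATE_HZ = 16_000   # assumed mono sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit PCM
CHUNK_MS = 100            # assumed chunk duration in milliseconds


def chunk_pcm(audio: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM; the last slice may be shorter."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


# One second of silence stands in for microphone input.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(chunk_pcm(one_second))
print(len(chunks), len(chunks[0]))
```

Each yielded slice would be sent as one binary WebSocket message. Smaller chunks reduce the delay before the first partial result but increase per-message overhead.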
The system is built with Python 3.10+, using FastAPI for the backend server, Pydantic for configuration management, and the websockets library for real-time audio streaming. It integrates with NVIDIA Riva ASR gRPC services or SpeechBrain Conformer models. Bash scripts orchestrate AWS EC2 GPU instance provisioning, Docker container management, and security group configuration.
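To illustrate the configuration-management role described above, here is a minimal sketch of typed, validated settings loaded from environment variables. The project uses Pydantic for this; the sketch uses standard-library dataclasses only so that it runs without dependencies. The class name, field names, and environment variable names are all hypothetical. Port 8443 comes from the architecture diagram, and 50051 is assumed as the Riva gRPC port.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    """Hypothetical settings object; the real project defines its own with Pydantic."""
    riva_host: str = "localhost"
    riva_port: int = 50051   # assumed Riva gRPC port
    ws_port: int = 8443      # WebSocket port shown in the architecture diagram
    use_nim: bool = True     # hypothetical switch: NIM container vs. traditional Riva

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ServerSettings":
        """Build settings from a mapping (defaults to os.environ) and validate ports."""
        env = os.environ if env is None else env
        settings = cls(
            riva_host=env.get("RIVA_HOST", "localhost"),
            riva_port=int(env.get("RIVA_PORT", "50051")),
            ws_port=int(env.get("WS_PORT", "8443")),
            use_nim=env.get("USE_NIM", "true").strip().lower() in {"1", "true", "yes"},
        )
        for name in ("riva_port", "ws_port"):
            port = getattr(settings, name)
            if not 1 <= port <= 65535:
                raise ValueError(f"{name} out of range: {port}")
        return settings


settings = ServerSettings.from_env({"RIVA_HOST": "10.0.0.5", "USE_NIM": "false"})
print(settings)
```

Failing fast on a bad port at startup is cheaper than discovering it after a GPU instance has been provisioned, which is the main reason to validate configuration in one place.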
sequenceDiagram
participant Client as Client Application
participant WS as WebSocket Server
participant RivaClient as RivaASRClient Wrapper
participant Riva as NVIDIA Riva ASR Service
participant GPU as GPU Worker
Client->>WS: Connect WebSocket
WS->>Client: Connection Established
Client->>WS: Send Audio Chunk
WS->>RivaClient: Forward Audio Data
RivaClient->>Riva: gRPC Streaming Request
Riva->>GPU: Execute Inference
GPU-->>Riva: Partial Results
Riva-->>RivaClient: Partial Transcription
RivaClient-->>WS: Return Partial Result
WS-->>Client: Emit Partial Text
GPU-->>Riva: Final Results
Riva-->>RivaClient: Final Transcription with Timestamps
RivaClient-->>WS: Return Final Result
WS-->>Client: Emit Final Text
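The sequence above can be sketched as a small asyncio relay: audio chunks go in, partial transcripts come back as they arrive, and one final transcript with word timings closes the stream. This is an illustration under stated assumptions, not the project's implementation. `FakeRivaClient` is a stand-in that emits canned text without any gRPC call, and the `Word` and `TranscriptEvent` shapes, including their field names, are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Dict, List


@dataclass
class Word:
    text: str
    start_s: float
    end_s: float
    confidence: float


@dataclass
class TranscriptEvent:
    type: str                      # "partial" or "final"
    text: str
    words: List[Word] = field(default_factory=list)


class FakeRivaClient:
    """Stand-in for the RivaASRClient wrapper; returns canned results, no gRPC."""

    async def stream(self, chunks: AsyncIterator[bytes]) -> AsyncIterator[TranscriptEvent]:
        heard: List[str] = []
        async for _chunk in chunks:
            heard.append(f"word{len(heard) + 1}")
            yield TranscriptEvent("partial", " ".join(heard))
        # After the audio ends, emit one final result with made-up word timings.
        words = [Word(w, i * 0.5, (i + 1) * 0.5, 0.9) for i, w in enumerate(heard)]
        yield TranscriptEvent("final", " ".join(heard), words)


async def audio_source(n_chunks: int) -> AsyncIterator[bytes]:
    """Simulates a client sending n_chunks of silent audio."""
    for _ in range(n_chunks):
        yield b"\x00" * 3200
        await asyncio.sleep(0)


async def relay(client: FakeRivaClient, chunks: AsyncIterator[bytes]) -> List[Dict]:
    """Per-connection server loop: forward chunks, collect the messages to emit."""
    sent: List[Dict] = []
    async for event in client.stream(chunks):
        sent.append({
            "type": event.type,
            "text": event.text,
            "words": [vars(w) for w in event.words],
        })
    return sent


messages = asyncio.run(relay(FakeRivaClient(), audio_source(3)))
for message in messages:
    print(message["type"], message["text"])
```

In a real server, each dictionary appended to `sent` would instead be serialized and written to the WebSocket as soon as it is produced, so the client sees partial text while audio is still streaming.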
Use this system to deploy a scalable, GPU-accelerated speech-to-text service on AWS. It suits applications that need low-latency real-time transcription, such as live captioning, voice assistants, or meeting transcription tools, where word-level timing and high throughput are critical.