NVIDIA Parakeet Riva ASR Deployment System

Production-ready deployment pipeline for NVIDIA Parakeet RNNT ASR on Riva, with WebSocket streaming and scripted AWS infrastructure automation.

https://github.com/davidbmar/nvidia-parakeet-ver-5  ·  public  ·  shipped

What it is

A complete infrastructure-as-code and client-wrapper system for deploying NVIDIA's Parakeet Recurrent Neural Network Transducer (RNNT) model. It supports both modern NIM containers and traditional Riva servers, providing low-latency real-time transcription via WebSocket with word-level timestamps and confidence scores.
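
As a rough illustration of what a transcription result with word-level timestamps and confidence scores could look like on the wire, here is a minimal sketch. The class names, field names, and JSON layout are assumptions made for this example; they are not taken from the repository.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Word:
    # Hypothetical per-word record; field names are illustrative only.
    text: str
    start_s: float      # word start, in seconds from the start of the stream
    end_s: float        # word end, in seconds from the start of the stream
    confidence: float   # score between 0.0 and 1.0


@dataclass
class TranscriptEvent:
    # Hypothetical message a server could emit to a WebSocket client.
    kind: str                                   # "partial" or "final"
    text: str
    words: list[Word] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested Word dataclasses.
        return json.dumps(asdict(self))


event = TranscriptEvent(
    kind="final",
    text="hello world",
    words=[
        Word("hello", 0.00, 0.42, 0.98),
        Word("world", 0.48, 0.91, 0.95),
    ],
)
payload = event.to_json()
print(payload)
```

A client can decode `payload` with `json.loads` and read `words` to align captions with the audio.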

Quickstart

git clone https://github.com/davidbmar/nvidia-parakeet-ver-5.git
cd nvidia-parakeet-ver-5
./scripts/riva-010-run-complete-deployment-pipeline.sh

Architecture

flowchart TD
    ClientApps[Client Apps Browser or App] -->|WebSocket Audio Stream| WebSocketServer[WebSocket Server Port 8443]
    WebSocketServer -->|gRPC Client Wrapper| RivaASR[NVIDIA Riva ASR NIM or Traditional]
    RivaASR -->|GPU Inference| ParakeetRNNT[Parakeet RNNT Model]
    WebSocketServer -->|Logs| StructuredLogging[Structured Logging and Monitoring]
    RivaASR -->|Health Checks| HealthMonitor[Health Monitor]

How it's built

The server is written in Python 3.10+, with FastAPI for the backend, Pydantic for configuration management, and websockets for real-time audio streaming. It integrates with NVIDIA Riva ASR gRPC services or SpeechBrain Conformer models. Bash scripts handle AWS EC2 GPU instance provisioning, Docker container management, and security group configuration.
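
The configuration layer might be sketched as follows. The project uses Pydantic for this; the sketch uses a standard-library dataclass instead so that it runs without extra dependencies. The environment variable names (`ASR_BACKEND`, `RIVA_HOST`, `RIVA_PORT`, `WS_PORT`), the field names, and the default gRPC port 50051 are assumptions for illustration. Only the WebSocket port 8443 comes from the architecture diagram.

```python
import os
from dataclasses import dataclass

# Hypothetical backend labels for the two deployment modes described above.
VALID_BACKENDS = ("nim", "riva")


@dataclass(frozen=True)
class DeploymentConfig:
    backend: str = "nim"          # NIM container or traditional Riva server
    riva_host: str = "localhost"
    riva_port: int = 50051        # assumed gRPC port
    ws_port: int = 8443           # WebSocket port shown in the diagram

    def __post_init__(self) -> None:
        # Validate eagerly so a bad value fails at startup, not mid-stream.
        if self.backend not in VALID_BACKENDS:
            raise ValueError(
                f"backend must be one of {VALID_BACKENDS}, got {self.backend!r}"
            )
        for name in ("riva_port", "ws_port"):
            port = getattr(self, name)
            if not 1 <= port <= 65535:
                raise ValueError(f"{name} out of range: {port}")

    @property
    def riva_target(self) -> str:
        # host:port string in the form a gRPC channel expects.
        return f"{self.riva_host}:{self.riva_port}"


def load_config(env=None) -> DeploymentConfig:
    """Build a config from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return DeploymentConfig(
        backend=env.get("ASR_BACKEND", "nim").lower(),
        riva_host=env.get("RIVA_HOST", "localhost"),
        riva_port=int(env.get("RIVA_PORT", "50051")),
        ws_port=int(env.get("WS_PORT", "8443")),
    )


config = load_config({"ASR_BACKEND": "riva", "RIVA_HOST": "10.0.0.5"})
print(config.riva_target)
```

Passing a plain dict to `load_config` keeps the example testable; calling it with no argument reads the real environment.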

How it runs

sequenceDiagram
    participant Client as Client Application
    participant WS as WebSocket Server
    participant RivaClient as RivaASRClient Wrapper
    participant Riva as NVIDIA Riva ASR Service
    participant GPU as GPU Worker
    Client->>WS: Connect WebSocket
    WS->>Client: Connection Established
    Client->>WS: Send Audio Chunk
    WS->>RivaClient: Forward Audio Data
    RivaClient->>Riva: gRPC Streaming Request
    Riva->>GPU: Execute Inference
    GPU-->>Riva: Partial Results
    Riva-->>RivaClient: Partial Transcription
    RivaClient-->>WS: Return Partial Result
    WS-->>Client: Emit Partial Text
    GPU-->>Riva: Final Results
    Riva-->>RivaClient: Final Transcription with Timestamps
    RivaClient-->>WS: Return Final Result
    WS-->>Client: Emit Final Text
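
The message flow in the diagram can be approximated with a small asyncio simulation. This is a sketch only: `fake_recognizer` stands in for the Riva gRPC streaming call, and `relay` stands in for the WebSocket server loop. All function names and the event dictionary layout are hypothetical, and no real network or Riva API is involved.

```python
import asyncio
from typing import AsyncIterator


async def audio_source(chunks: list[bytes]) -> AsyncIterator[bytes]:
    """Stand-in for audio chunks arriving from the client over a WebSocket."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real socket read would
        yield chunk


async def fake_recognizer(chunks: AsyncIterator[bytes]) -> AsyncIterator[dict]:
    """Stand-in for the streaming ASR call.

    Emits one partial event per audio chunk, then a single final event
    once the audio stream ends.
    """
    received = 0
    async for chunk in chunks:
        received += len(chunk)
        yield {"type": "partial", "bytes_seen": received}
    yield {"type": "final", "bytes_seen": received}


async def relay(chunks: list[bytes]) -> list[dict]:
    """Forward audio to the recognizer and collect the events a client would see."""
    emitted = []
    async for event in fake_recognizer(audio_source(chunks)):
        emitted.append(event)  # a real server would send this over the WebSocket
    return emitted


events = asyncio.run(relay([b"\x00" * 320, b"\x00" * 320, b"\x00" * 160]))
for event in events:
    print(event)
```

Chaining async generators this way lets partial results flow back to the client as soon as they are produced, rather than waiting for the full utterance.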

How to apply & reuse

Use this system to deploy a GPU-accelerated speech-to-text service on AWS. It suits applications that need low-latency transcription, such as live captioning, voice assistants, or meeting transcription tools, where word-level timing and high throughput matter.

At a glance

Capabilities: Streaming ASR, Batch Transcription, AWS Automation, GPU Acceleration, Real-time Logging
Components: RivaASRClient, WebSocket Server, NIM Container, Traditional Riva Server, Deployment Scripts, Logging Framework
Tech: Python, FastAPI, WebSockets, Docker, NVIDIA Riva, Pydantic, Bash
Depends on: AWS EC2, NVIDIA GPU Drivers, Docker Engine, NVIDIA Container Toolkit, Python 3.10+
Integrates with: NVIDIA NIM, NVIDIA Riva, AWS S3, SpeechBrain
Patterns: Client-Server, Streaming RPC, Infrastructure as Code, Observer Pattern
Reuse tags: asr, speech-to-text, nvidia-riva, websocket, aws-deployment, gpu-inference
