GPU Sidecar: Your Private AI Engine
Run AI models on your own hardware. No data leaves your network — ever. Full document intelligence, completely local.
Why Local AI Matters for Legal Work
When you use a cloud AI service, your document text is transmitted to a third party for processing. For legal work — where attorney-client privilege and confidentiality are paramount — that creates risk.
The GPU Sidecar eliminates this concern entirely. Every AI operation — reading scanned documents, generating search indexes, ranking results, answering questions — happens on hardware you control. Nothing is transmitted to OpenAI, Anthropic, or any other provider.
Without the Sidecar
Sound Suite still works perfectly with cloud API keys. You add an OpenAI or Anthropic key in the admin panel, and your documents are processed through their services. The quality is excellent — but your document text is sent to those providers.
What the Sidecar Does
Four AI capabilities that run entirely on your hardware. No cloud services, no API costs, no data leaving your network.
Reading Scanned Documents (OCR)
Many court filings arrive as scanned PDFs — just images with no searchable text. The sidecar runs a specialized AI model (olmocr2) that reads these scans with high accuracy, even on poor-quality copies or faded documents.
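Since the OCR model runs under Ollama, a request to it might be built like the sketch below. This assumes the sidecar's OCR container exposes Ollama's standard `/api/generate` endpoint, which accepts base64-encoded images alongside a prompt; the prompt text is illustrative, not Sound Suite's actual one.

```python
import base64
import json

def ocr_request_body(page_png: bytes, model: str = "olmocr2:7b-q8") -> str:
    """Build the JSON body for Ollama's /api/generate endpoint.

    Ollama accepts base64-encoded images alongside the prompt text.
    """
    return json.dumps({
        "model": model,
        "prompt": "Transcribe all text in this scanned page.",
        "images": [base64.b64encode(page_png).decode("ascii")],
        "stream": False,
    })
```

POSTing this body to the container (Ollama listens on port 11434 by default) returns the transcribed text in the response's `response` field.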
Making Documents Searchable by Meaning
Think of embedding as creating a meaning fingerprint for each paragraph — your private index that understands legal concepts, not just keywords. This is how Sound Suite finds relevant passages when you search "what obligations does the contract impose" — it understands meaning.
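A toy sketch of how fingerprint matching works. Real embeddings from a model like qwen3-embedding have hundreds of dimensions; the three-number vectors and passages here are made up purely for illustration.

```python
import math

def cosine(a, b):
    # Similarity of two "meaning fingerprints": closer to 1.0 = more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Tiny made-up index: paragraph text -> fingerprint vector.
index = {
    "The lessee shall maintain liability insurance": [0.9, 0.1, 0.2],
    "The hearing is scheduled for March 3":          [0.1, 0.8, 0.3],
}

# Pretend embedding of "what obligations does the contract impose".
query_vec = [0.85, 0.15, 0.25]

# The passage whose fingerprint points in the same direction wins,
# even though it shares no keywords with the query.
best = max(index, key=lambda text: cosine(index[text], query_vec))
```

The point of the sketch: the insurance clause matches because its vector is near the query's vector, not because any query word appears in it.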
Getting the Best Results (Reranking)
After vector search returns ~100 candidate passages, a second AI model (a cross-encoder) re-reads the query alongside each result and re-scores them. Think of it as a senior associate reviewing your search results and putting the most relevant ones first. Reranking takes 1-2 seconds for 100 documents; the model auto-starts on demand and idles after 5 minutes to free GPU memory.
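The two-stage flow can be sketched like this. The word-overlap scorer below is a stand-in so the example runs without a GPU; the real reranker is a neural model that reads query and passage together.

```python
def vector_search(query: str) -> list[str]:
    # Stage 1: fast approximate retrieval (stubbed; ~100 candidates in production).
    return [
        "The court adjourned the session at noon",
        "The contract imposes insurance obligations on the tenant",
    ]

def cross_encoder_score(query: str, passage: str) -> float:
    # Stand-in for the reranker model: crude word overlap so the
    # sketch is self-contained. The real model scores semantically.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p)

def search(query: str) -> list[str]:
    candidates = vector_search(query)
    # Stage 2: re-score every candidate against the query, best first.
    return sorted(candidates,
                  key=lambda passage: cross_encoder_score(query, passage),
                  reverse=True)
```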
AI Answers and Auto-Suggest
Powers the AI Chat panel and Auto-Suggest in the draft editor. The AI reads your search results and formulates answers with citations — or suggests the next sentence as you write, drawing from your indexed case documents.
How It Works
The sidecar is a separate process that runs on a machine with a GPU. It manages Docker containers — each container runs one specialized AI model. It connects to your main Sound Suite server via WebSocket, so the two machines can be anywhere on your network.
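The registration step might look something like the following. The message fields here are illustrative assumptions, not Sound Suite's actual wire protocol.

```python
import json

def registration_message(vram_gb: float) -> str:
    # Hypothetical first message the sidecar sends after opening the
    # WebSocket, announcing which AI capabilities it can serve.
    return json.dumps({
        "type": "register",
        "capabilities": ["embedding", "completion", "ocr", "rerank"],
        "vram_gb": vram_gb,
    })
```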
Setup in Four Steps
1. Download the sidecar — It ships as a standalone package. Install it on any machine with an NVIDIA GPU and Docker.
2. Auto-provisioning — On first launch, the sidecar pulls the required Docker images and creates containers for each AI model. No manual configuration needed.
3. Connect to Sound Suite — Point the sidecar at your main server. It opens a WebSocket connection and registers its capabilities.
4. VRAM-aware mode switching — The sidecar automatically manages which models are loaded based on available GPU memory, swapping between indexing and searching modes as needed.
Docker Containers

| Container | Runtime | Model |
|---|---|---|
| Embedding | ollama | qwen3-embedding:0.6b |
| Completion | ollama | qwen3.5:9b |
| OCR | ollama | olmocr2:7b-q8 |
| Reranker | vLLM | Qwen3-Reranker-8B |
VRAM-Aware Mode Switching
| Mode | Models Loaded | Total VRAM |
|---|---|---|
| Indexing | Embedding (1.2 GB), OCR (8 GB) | ~9 GB |
| Searching | Embedding (1.2 GB), Reranker (7 GB), Completion (10 GB) | ~18 GB |
Containers are stopped and started automatically to stay within available VRAM. 24 GB+ recommended (RTX 4090, A5000) for simultaneous use.
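A minimal sketch of a switching policy using the VRAM figures above. The priority ordering and skip-if-over-budget rule are assumptions for illustration; the actual sidecar's algorithm may differ.

```python
MODEL_VRAM_GB = {"embedding": 1.2, "ocr": 8.0, "reranker": 7.0, "completion": 10.0}

MODES = {
    "indexing":  ["embedding", "ocr"],
    "searching": ["embedding", "reranker", "completion"],
}

def containers_to_run(mode: str, vram_budget_gb: float) -> list[str]:
    """Start models in priority order, skipping any container that
    would push total usage past the VRAM budget."""
    running, used = [], 0.0
    for model in MODES[mode]:
        need = MODEL_VRAM_GB[model]
        if used + need <= vram_budget_gb:
            running.append(model)
            used += need
    return running
```

Under this sketch, a 24 GB card runs every model in both modes, while a 12 GB card must drop the 10 GB completion model in searching mode, which is why 24 GB+ is recommended for simultaneous use.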
Connection
The Sound Suite main server and the GPU Sidecar on your GPU machine communicate over a WebSocket connection.
With vs Without the Sidecar
Sound Suite works great either way. The sidecar adds speed, strengthens privacy, and eliminates API costs.
| Capability | Without Sidecar | With Sidecar |
|---|---|---|
| Document search | Cloud API or slow local CPU | Fast local GPU |
| OCR accuracy | Good (tesseract.js) | Excellent (AI-powered) |
| Search quality | Basic vector search | Reranked results |
| AI chat / auto-suggest | Requires API keys ($) | Included, no API costs |
| Data privacy | Text sent to cloud providers | 100% local |
| Speed | Depends on internet / CPU | GPU-accelerated |
Don't Have a GPU? You Don't Need One
Sound Suite works perfectly without the sidecar. Add API keys for OpenAI, Anthropic, or Groq in the admin panel and you get excellent AI capabilities immediately. No hardware investment, no Docker setup.
Cloud API Mode
- ✓ No hardware requirements
- ✓ Excellent AI quality from top providers
- ✓ Easy setup — just add an API key
- • Document text processed by third parties
- • Ongoing API costs
GPU Sidecar Mode
- ✓ 100% local — nothing leaves your network
- ✓ No API costs after setup
- ✓ GPU-accelerated speed
- ✓ AI reranking for better search results
- • Requires NVIDIA GPU with 12+ GB VRAM
What You Need
Three requirements. The sidecar handles everything else.
NVIDIA GPU
12+ GB VRAM (an RTX 3060 or better). Desktop, workstation, or laptop GPUs all work.
Docker
Docker Desktop or Docker Engine with NVIDIA Container Toolkit. The sidecar manages all containers automatically.
Network Access
The GPU machine needs to reach your Sound Suite server. Same LAN, VPN, or any network path will work.
Ready to Go Fully Local?
Download Sound Suite and add the GPU Sidecar for complete on-premise AI.
Sidecar license: $1,000 per instance for commercial use. Free for pro se litigants.