GPU Sidecar: Your Private AI Engine

Run AI models on your own hardware. No data leaves your network — ever. Full document intelligence, completely local.

Data Privacy

Why Local AI Matters for Legal Work

When you use a cloud AI service, your document text is transmitted to a third party for processing. For legal work — where attorney-client privilege and confidentiality are paramount — that creates risk.

The GPU Sidecar eliminates this concern entirely. Every AI operation — reading scanned documents, generating search indexes, ranking results, answering questions — happens on hardware you control. Nothing is transmitted to OpenAI, Anthropic, or any other provider.

Without the Sidecar

Sound Suite still works perfectly with cloud API keys. You add an OpenAI or Anthropic key in the admin panel, and your documents are processed through their services. The quality is excellent — but your document text is sent to those providers.

Where Your Data Goes

Cloud API Path: Your Documents → Internet → Third-Party Servers
GPU Sidecar Path: Your Documents → Your GPU → Results Stay Local
Capabilities

What the Sidecar Does

Four AI capabilities that run entirely on your hardware. No cloud services, no API costs, no data leaving your network.

Reading Scanned Documents (OCR)

Many court filings arrive as scanned PDFs — just images with no searchable text. The sidecar runs a specialized AI model (olmocr2) that reads these scans with high accuracy, even on poor-quality copies or faded documents.

Without sidecar: tesseract.js — 5-10 min per 50 pages, moderate accuracy
With sidecar: GPU AI OCR — 30-60 sec per 50 pages, higher accuracy
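Taking the midpoint of each range, the figures above work out to roughly a tenfold speedup:

```python
# Midpoints of the ranges above, for a 50-page filing.
cpu_minutes = 7.5        # tesseract.js: 5-10 min per 50 pages
gpu_minutes = 45 / 60    # GPU AI OCR: 30-60 sec per 50 pages

speedup = cpu_minutes / gpu_minutes
print(f"~{speedup:.0f}x faster")  # → ~10x faster
```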

Making Documents Searchable by Meaning

Embedding creates a "meaning fingerprint" for each paragraph: a private index that understands legal concepts, not just keywords. When you search "what obligations does the contract impose", Sound Suite matches by meaning rather than exact words, so it finds relevant passages even when they are phrased differently.

Without sidecar: Cloud API (text sent externally) or local CPU (~20 min per case)
With sidecar: Local GPU — ~30 sec per case, nothing leaves your network
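The idea can be sketched with cosine similarity over toy vectors. Real embeddings from the qwen3-embedding model have hundreds of dimensions; the three-dimensional vectors and passages below are made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" standing in for real model output.
index = {
    "The contractor shall indemnify the owner.": [0.9, 0.1, 0.0],
    "Lunch will be provided at the deposition.": [0.0, 0.2, 0.9],
}

# Pretend embedding of "what obligations does the contract impose".
query_vec = [0.8, 0.3, 0.1]

# The closest fingerprint wins, even though no keywords overlap.
best = max(index, key=lambda passage: cosine(query_vec, index[passage]))
print(best)
```

The query and the indemnification clause share no words at all; they match only because their vectors point in similar directions.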

Getting the Best Results (Reranking)

After vector search returns ~100 candidate passages, a second AI model (a cross-encoder) re-reads the query alongside each result and re-scores them. Think of it as having a senior associate review your search results and put the most relevant ones first. Takes 1-2 seconds for 100 documents. Auto-starts on demand and idles after 5 minutes to save GPU memory.

Without sidecar: Basic vector search only — no reranking
With sidecar: AI reranking — the exact paragraph you need, not 10 vague matches
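A minimal sketch of the two-stage idea, with simple word overlap standing in for the real cross-encoder's learned relevance score:

```python
def cross_encoder_score(query, passage):
    """Stand-in for the reranker model: here, crude word overlap.
    The real cross-encoder (Qwen3-Reranker) reads query and passage
    together and outputs a learned relevance score."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words)

def rerank(query, candidates, top_k=3):
    """Second stage: re-score the vector-search candidates and
    return the best ones first."""
    scored = sorted(candidates,
                    key=lambda p: cross_encoder_score(query, p),
                    reverse=True)
    return scored[:top_k]

# Candidates as they might come back from the first (vector) stage.
candidates = [
    "The parties agree to binding arbitration.",
    "The contract imposes a duty to defend and indemnify.",
    "Exhibit C is attached hereto.",
]
print(rerank("what obligations does the contract impose",
             candidates, top_k=1))
```

The key design point is that the cross-encoder sees query and passage together, which is slower but far more precise than comparing precomputed vectors, which is why it runs only on the short candidate list.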

AI Answers and Auto-Suggest

Powers the AI Chat panel and Auto-Suggest in the draft editor. The AI reads your search results and formulates answers with citations — or suggests the next sentence as you write, drawing from your indexed case documents.

Without sidecar: Requires cloud API keys (OpenAI, Anthropic, Groq)
With sidecar: Runs locally — no API costs, no data leaves your network
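One way to picture the grounding step is prompt assembly: the retrieved passages are numbered and the model is told to cite them. This is an illustrative sketch, not Sound Suite's actual prompt:

```python
def build_rag_prompt(question, passages):
    """Assemble a retrieval-augmented prompt: the model may only
    answer from the numbered passages and must cite them as [1], [2], ...
    Illustrative only -- the real prompt format is internal to Sound Suite."""
    context = "\n".join(f"[{i}] ({source}) {text}"
                        for i, (source, text) in enumerate(passages, start=1))
    return (
        "Answer using ONLY the passages below. Cite sources like [1].\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What notice is required before termination?",
    [("Contract.pdf p.12",
      "Either party may terminate on 30 days' written notice.")],
)
print(prompt)
```

The same assembled context drives Auto-Suggest: the model completes your sentence instead of answering a question, but it draws on the same indexed passages.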
Architecture

How It Works

The sidecar is a separate process that runs on a machine with a GPU. It manages Docker containers — each container runs one specialized AI model. It connects to your main Sound Suite server via WebSocket, so the two machines can be anywhere on your network.
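The registration handshake might look like the sketch below; the message fields are assumptions for illustration, not the actual Sound Suite protocol:

```python
import json

# Illustrative registration message sent over the WebSocket on connect.
# The field names here are assumptions, not the real protocol.
registration = {
    "type": "register",
    "capabilities": ["ocr", "embedding", "rerank", "completion"],
    "vram_total_gb": 24,
}

def handle_message(raw):
    """Main-server side: record what the connected sidecar can do."""
    msg = json.loads(raw)
    if msg.get("type") == "register":
        return set(msg["capabilities"])
    return set()

caps = handle_message(json.dumps(registration))
print(sorted(caps))
```

Because everything flows over one outbound WebSocket, the GPU machine needs no inbound firewall rules; it only has to be able to reach the main server.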

Setup in Four Steps

  1. Download the sidecar — It ships as a standalone package. Install it on any machine with an NVIDIA GPU and Docker.
  2. Auto-provisioning — On first launch, the sidecar pulls the required Docker images and creates containers for each AI model. No manual configuration needed.
  3. Connect to Sound Suite — Point the sidecar at your main server. It opens a WebSocket connection and registers its capabilities.
  4. VRAM-aware mode switching — The sidecar automatically manages which models are loaded based on available GPU memory, swapping between indexing and searching modes as needed.

Docker Containers

Embedding: ollama · qwen3-embedding:0.6b (1.2 GB)
Completion: ollama · qwen3.5:9b (10 GB)
OCR: ollama · olmocr2:7b-q8 (8 GB)
Reranker: vLLM · Qwen3-Reranker-8B (7 GB)

VRAM-Aware Mode Switching

Indexing Mode

  • Embedding (1.2 GB)
  • OCR (8 GB)

~9 GB total

Searching Mode

  • Embedding (1.2 GB)
  • Reranker (7 GB)
  • Completion (10 GB)

~18 GB total

Containers are stopped and started automatically to stay within available VRAM. 24 GB+ recommended (RTX 4090, A5000) for simultaneous use.
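A sketch of the budgeting idea, using the footprints from the container table above; the real sidecar's policy may differ:

```python
# Model VRAM footprints in GB, from the container table above.
MODELS = {"embedding": 1.2, "ocr": 8, "reranker": 7, "completion": 10}

# Which models each mode wants loaded.
MODES = {
    "indexing": ["embedding", "ocr"],
    "searching": ["embedding", "reranker", "completion"],
}

def containers_to_run(mode, vram_gb):
    """Greedily start the mode's containers while they fit the
    VRAM budget. Illustrative logic only."""
    running, used = [], 0.0
    for name in MODES[mode]:
        if used + MODELS[name] <= vram_gb:
            running.append(name)
            used += MODELS[name]
    return running, used

print(containers_to_run("searching", 24))  # all three fit (~18 GB)
print(containers_to_run("searching", 12))  # completion (10 GB) is skipped
```

This is why 24 GB is the comfortable threshold: searching mode needs about 18 GB at once, while a 12 GB card can only hold part of the set at a time.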

Connection

Sound Suite (Main Server) ⇄ WebSocket ⇄ GPU Sidecar (Your GPU Machine)
Results return to the main server over the same connection.

With vs Without the Sidecar

Sound Suite works great either way. The sidecar adds speed, strengthens privacy, and eliminates API costs.

Capability | Without Sidecar | With Sidecar
Document search | Cloud API or slow local CPU | Fast local GPU
OCR accuracy | Good (tesseract.js) | Excellent (AI-powered)
Search quality | Basic vector search | Reranked results
AI chat / auto-suggest | Requires API keys ($) | Included, no API costs
Data privacy | Text sent to cloud providers | 100% local
Speed | Depends on internet / CPU | GPU-accelerated
No GPU Required

Don't Have a GPU? You Don't Need One

Sound Suite works perfectly without the sidecar. Add API keys for OpenAI, Anthropic, or Groq in the admin panel and you get excellent AI capabilities immediately. No hardware investment, no Docker setup.

Cloud API Mode

  • No hardware requirements
  • Excellent AI quality from top providers
  • Easy setup — just add an API key
  • Document text processed by third parties
  • Ongoing API costs

GPU Sidecar Mode

  • 100% local — nothing leaves your network
  • No API costs after setup
  • GPU-accelerated speed
  • AI reranking for better search results
  • Requires NVIDIA GPU with 12+ GB VRAM

What You Need

Three requirements. The sidecar handles everything else.

NVIDIA GPU

12+ GB VRAM. An RTX 3060 or better, in a desktop, workstation, or laptop.

Docker

Docker Desktop or Docker Engine with NVIDIA Container Toolkit. The sidecar manages all containers automatically.

Network Access

The GPU machine needs to reach your Sound Suite server. Same LAN, VPN, or any network path will work.

Ready to Go Fully Local?

Download Sound Suite and add the GPU Sidecar for complete on-premise AI.

Sidecar license: $1,000 per instance for commercial use. Free for pro se litigants.