For engineers

Reference architecture for a private LLM on trade-secret data.

A buildable design for a self-hosted assistant a small team can use on confidential client material — threat model, three deployment options, model and hardware sizing, the hardening checklist, and a mapping from each control to the legal "reasonable steps" test.

Workload general chat assistant, small team Models open-weight, self-hosted Date 26 May 2026
Threat model first

One question decides everything: who can read the data in the clear?

For trade secrets the asset is confidentiality. So the only question that matters is: who or what can technically observe the prompts, the documents, the model's working memory, the disk, the logs, and the outputs — and can you prove the list is short?

Every design decision below shortens that list and makes it auditable. The legal payoff (covered in the business report) is that a short, controlled, documented list is exactly what "reasonable steps to keep it secret" looks like in practice.

Design principle: move the model to the data, never the data to a service. The model is the thing that should travel; the secret stays put.

The leak vectors that actually bite

  • Prompt / request logs in the model server, the chat UI, or a reverse proxy
  • Swap, crash dumps, temp files, shell history
  • Persistent disks / snapshots that outlive the job (rented infra)
  • Outbound telemetry from inference frameworks, tracing, "observability" or eval libraries
  • Accidental calls to hosted APIs for embeddings, rerankers, or "helper" features
  • Multi-tenant exposure: a neighbour or host operator on shared hardware
Deployment options

Three trust boundaries, graded by who's inside them

A · Own hardware (recommended)

On-prem / controlled LAN

  • Inside the boundary: only you.
  • No hypervisor, no host operator, no neighbours.
  • Can be fully air-gapped or LAN/VPN-only.
  • Residual risk: physical theft, malware, bad backups — all locally controllable.
B · Rented single-tenant

Dedicated / bare-metal, EU

  • Inside: you + one named provider under a DPA.
  • No co-tenants; you control the OS and disk.
  • Provider can physically access hardware — mitigate with encryption + no persistence.
  • e.g. Hetzner dedicated GPU, AWS dedicated host / bare metal.
C · Managed private endpoint

Bedrock / Azure OpenAI

  • Inside: you + the AI service, under enterprise terms.
  • Contractually no training on your data, no input/output storage; KMS, PrivateLink, EU region.
  • Data leaves your runtime — trust rests on contract + certifications.
  • Lowest ops; weakest control. Client-approved cases only.
Hard exclusions for entrusted secrets: anonymous GPU marketplaces (unknown host operator, unclear disk lifecycle) and consumer LLM APIs (data may be logged/retained/used for training). Both fail the "reasonable steps" test by default.
The recommended build

Reference architecture — a small-team private assistant

Built for Option A (own hardware); the same stack lifts onto Option B unchanged. Everything runs on one machine, behind a default-deny firewall, reachable only over your LAN or a VPN.

Trust boundary — your machine, no egress for data
Client / usersA handful of team members on the office network or WireGuard VPN — individual accounts, MFA.LAN / VPN only
Chat UISelf-hosted web chat with multi-user accounts & role-based access. Bound to localhost/LAN.Open WebUI
Inference serverServes the model over an OpenAI-compatible API. Request logging disabled.vLLM · or Ollama
Model weightsOpen-weight, permissive licence, downloaded once then kept offline; checksum pinned.30B–70B class
Optional: retrievalLocal embeddings + local vector store for document Q&A. Never a hosted vector DB.bge/e5 · Qdrant/pgvector
Host OSLinux on an encrypted disk (LUKS), no/encrypted swap, hardened SSH, egress firewall.Ubuntu LTS + LUKS
HardwareWorkstation with one 48 GB-class pro GPU; physically secured.RTX 6000 Ada 48 GB
Why vLLM: high-throughput batched serving for several concurrent users, OpenAI-compatible API so Open WebUI plugs straight in. Use Ollama instead if you want the simplest possible single-binary setup and can accept lower concurrency.
Why Open WebUI: a polished self-hosted chat front-end with users, groups, RBAC and document upload — gives the team a ChatGPT-like experience with nothing leaving the box.
Model selection

Pick by licence and VRAM fit, not by hype

The open-weight leaderboard reshuffles monthly, so choose by durable criteria and slot in the current best release at build time.

  • Permissive licence first. Prefer Apache-2.0 / MIT families (Qwen, Mistral, some Gemma/DeepSeek/GLM releases) to avoid licence entanglement on commercial client work. Check the exact release's terms.
  • Fit the GPU. Pick the largest model that fits your VRAM at 4-bit with room for context — sizing table at right.
  • Verify on a live leaderboard. Cross-check the current top instruct models on a neutral index before committing (sources below).
  • Download once, then air-gap. Pull weights on a connected machine, verify the checksum, move them over, and cut egress.
Good default for a 48 GB GPU: a strong ~30B dense instruct model (or a small MoE) at 4–8-bit gives near-flagship chat quality with comfortable context headroom for a small team.
Model sizeVRAM @ 4-bit*Fits on
7–9B~6–8 GBAny modern GPU
13–14B~10–12 GB16–20 GB (e.g. GEX44)
30–34B~20–24 GB24–48 GB
70–72B~40–48 GB48 GB (RTX 6000 Ada) — tight; 80 GB comfortable
100B+ MoEvariesMulti-GPU / 80 GB+
*Approximate weights-only footprint; add headroom for KV-cache (grows with context length & concurrent users).
Quantisation: 4-bit (e.g. AWQ/GPTQ/GGUF Q4) roughly halves VRAM vs 8-bit with minor quality loss — the standard lever for fitting a bigger, smarter model on one card.
Hardware & hosting

What to buy, or what to rent

PathSpecGPU / VRAMCostNotes
A · Own boxWorkstation: 1× pro GPU, 64–128 GB RAM, NVMe (LUKS)RTX 6000 Ada 48 GB (≈ €6.8k) or RTX PRO 6000 Blackwell 96 GB (≈ €8k)~€8k–12k onceRuns 30–70B comfortably; full physical control
B · Hetzner GEX44Dedicated, EURTX 4000 SFF Ada 20 GB€184/mo + €79 setupGood for ≤14B models / lighter use
B · Hetzner GEX130Dedicated, EURTX 6000 Ada 48 GB~€838/mo + €79 setupMatches the owned-box GPU, as OPEX
B · AWS dedicatedDedicated host / bare-metal GPU, EU regionL4 / L40S / A10G classUsage-based, higherStrong isolation primitives; easy to misconfigure
C · ManagedBedrock / Azure OpenAI, EU regionn/a (service)Per tokenNo GPU ops; data leaves runtime
Indicative figures, May 2026. The RTX 5090 (32 GB, consumer) is cheaper but not a datacenter/pro part — viable for an owned box on a budget, with less VRAM headroom.
The hardening checklist

The controls that make it "reasonable steps"

Apply all of these for A and B. For C, the network/crypto items become contract + provider-config items (KMS, PrivateLink, region pinning, no-logging).

Network & egress

  • Firewall default-deny inbound; no public inference endpoint
  • Access only via LAN or VPN (WireGuard); bind services to localhost/private iface
  • Default-deny outbound too — allow only OS/package mirrors you actually need
  • This single rule kills accidental telemetry and hosted-API calls

Cryptography & storage

  • Full-disk encryption (LUKS) on the model + data volumes
  • No swap, or encrypted swap only
  • Disable crash/core dumps; clear temp aggressively
  • Backups encrypted, access-controlled, or none

Data hygiene & logging

  • Disable prompt/response logging in the inference server and the chat UI
  • If any logging is required, keep it short-retention, encrypted, access-controlled
  • Disable shell history for secret-handling sessions
  • Ephemeral staging; wipe input/output files after use

Access & audit

  • SSH keys only — no passwords; restrict source IPs
  • Individual user accounts; RBAC in the chat UI; MFA on the VPN
  • Keep an access log: who reached the machine and when
  • Offboarding checklist: revoke keys/accounts on departure

Supply chain

  • Audit the inference stack for telemetry (OpenTelemetry/analytics/crash reporting off)
  • Pin model + container checksums; verify before load
  • Avoid tools that call out for embeddings, rerankers, tracing or eval
  • Vendor/pin dependencies; review what each library phones home

Lifecycle & decommission

  • Separate machine/account/project per sensitive workload
  • On rented infra: delete volumes & snapshots after use
  • Crypto-erase or securely wipe disks at end of engagement
  • Document the wipe — it's evidence of "reasonable steps"
The bridge to the legal test

Each control maps to "reasonable steps to keep it secret"

This table is the hand-off to the business report. It turns engineering into the evidence a court or a client wants: a documented, proportionate set of measures.

Technical controlMaps to the legal requirement…
Self-hosted open-weight model; no external APINo disclosure of the secret outside the trusted circle
LAN/VPN-only, no public endpoint, egress lockedRestricting access; preventing onward transmission
Full-disk encryption, no/encrypted swap"Appropriate technical measures" to secure the information
Individual accounts, MFA, RBAC, access logLimiting the number of people with access; demonstrable control
Logging disabled / minimised; temp wipedNot creating uncontrolled copies of the secret
Single-tenant infra + signed DPA (Option B)Sufficient guarantees from any third party that is involved
Documented wipe / decommissionEvidence the holder actively maintained secrecy throughout
NDAs + access policy (organisational, not technical)The contractual half of "reasonable steps" — pair with the above
Operational runbook

From bare machine to locked-down assistant

An outline, not copy-paste commands — adapt to your distro and provider. The ordering matters: bring the data path online after egress is cut.

# 1. Provision & encrypt
install Ubuntu LTS on LUKS-encrypted NVMe; disable swap (or encrypt it)
harden SSH: keys only, no root login, restrict source IPs
ufw default deny incoming; ufw default deny outgoing
allow out only: OS + package mirrors (temporarily, for setup)

# 2. Pull the model (still online), then go dark
download weights on this box or a staging machine
verify sha256 against the published checksum
once installed: remove the temporary outbound allow-rules

# 3. Serve + UI, no logging
run vLLM (or Ollama) bound to 127.0.0.1; disable request logging
run Open WebUI bound to LAN/VPN iface; create per-user accounts + RBAC
confirm: no OpenTelemetry / analytics / crash reporting enabled

# 4. Verify the boundary (the important step)
tcpdump / firewall logs: confirm zero unexpected egress during a real chat
grep the box for prompt text in logs/temp/swap — should find nothing

# 5. Decommission (end of engagement)
stop services; securely wipe data + model volumes
on rented infra: delete the server, volumes, and snapshots
record the wipe (date, method, operator) in the evidence pack
Step 4 is the one people skip. Actually watching the network during a real conversation — and confirming nothing leaves — is what converts "we think it's private" into "we verified it's private." Keep that capture as evidence.
Keep the receipts

The evidence pack

"Reasonable steps" is only worth what you can show. Maintain a short folder, versioned, that a client or a court can be walked through:

Sources

  1. Open-weight model landscape & leaderboards — Self-hosted LLM leaderboard; HF open-source LLM overview (verify current releases at build time)
  2. Hetzner dedicated GPU servers (GEX44 / GEX130) — hetzner.com GPU matrix; GEX44
  3. GPU pricing (RTX 6000 Ada, RTX PRO 6000 Blackwell, RTX 5090) — NVIDIA RTX 6000 Ada; getdeploying price index
  4. AWS Bedrock data handling (no training, no input/output storage, KMS, PrivateLink) — aws.amazon.com/bedrock/security-privacy
  5. Trade-secret "reasonable steps" standard (legal bridge) — see the business report sources (TRIPS Art. 39, EU 2016/943, UK Regs 2018, PT DL 110/2018)
Engineering guidance, not legal advice. Component names and figures reflect May 2026; the open-weight model field moves monthly, so re-verify the current best release and exact licence before deployment. The control-to-law mapping is a practical aid, not a legal opinion — confirm sufficiency with a qualified lawyer for each client's jurisdiction.