Scalability

Diminuendo’s scaling model is a consequence of its data model, not an afterthought bolted on through distributed consensus protocols or shared-nothing clustering. The key insight is structural: if no two gateway instances ever need to write to the same database, then scaling out is simply a matter of routing tenants to instances. There is no coordination problem because there is nothing to coordinate. This page covers three aspects of the scaling architecture: the per-tenant data isolation that makes horizontal scaling possible, the multi-worker SQLite architecture that makes a single instance fast, and the benchmark data that measures the result.

Per-Tenant Data Isolation

Every tenant in Diminuendo receives its own SQLite database file for session metadata, and every session receives its own dedicated database for conversation history, events, and usage records:
data/
  tenants/
    {tenantId}/
      registry.db          # Session metadata for this tenant
  sessions/
    {sessionId}/
      session.db           # Conversation history, events, turn usage
There is no shared database between tenants or sessions. A query against tenant acme’s registry physically cannot touch tenant globex’s data — they reside in different files on different filesystem paths. There is no WHERE tenant_id = ? clause to forget, no row-level security policy to misconfigure, no cross-tenant join to accidentally permit.
This isolation extends to deletion semantics. Removing a session means deleting a directory. Removing a tenant means deleting a directory tree. No cascading deletes, no orphaned foreign key references, no vacuum passes over a shared tablespace.

Why This Enables Horizontal Scaling

Since there is no shared state between tenants, multiple Diminuendo instances can serve different tenants independently. A load balancer routes by tenant ID — using sticky sessions or tenant-affinity routing — to ensure all requests for a given tenant reach the same instance. The fundamental invariant is simple: at any point in time, exactly one gateway instance is responsible for a given tenant’s data. This is trivially satisfied by a load balancer that hashes on the tenant ID extracted from the JWT’s tenant_id claim.
                          Load Balancer
                        (tenant-affinity)
                /                |                \
      Instance A            Instance B            Instance C
      tenant: acme          tenant: globex        tenant: initech
      data/tenants/acme/    data/tenants/globex/  data/tenants/initech/
Adding capacity means adding instances and rebalancing the tenant-to-instance mapping. No data migration is required — just copy the tenant’s data/ directory to the new instance and update the routing table.
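The tenant-affinity invariant can be sketched with a deterministic hash over the tenant_id claim. This is an illustrative sketch, not Diminuendo's router: the instance list, the FNV-1a hash, and the `pickInstance` helper are all assumptions; any stable hash plus a routing table satisfies the same invariant.

```typescript
// Illustrative tenant-affinity routing: same tenant always maps to the same
// instance. The instance list and helper names are assumptions for this sketch.
const instances = ["instance-a:8080", "instance-b:8080", "instance-c:8080"]

// FNV-1a: a simple, stable 32-bit string hash. Any deterministic hash works.
function fnv1a(s: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h
}

// Hash the tenant_id claim extracted from the JWT to pick a backend.
function pickInstance(tenantId: string): string {
  return instances[fnv1a(tenantId) % instances.length]
}
```

Note that plain modulo remaps most tenants whenever the instance list changes; a production router would use consistent hashing or an explicit tenant-to-instance table so that rebalancing stays incremental.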

Sticky Session Requirement

WebSocket connections are inherently stateful. Each connected client maintains in-memory state on the gateway instance: the ActiveSession record, the ConnectionState tracking authentication and subscriptions, and the event streaming fiber that consumes Podium events and publishes them to session topics. A client must reconnect to the same instance that holds its session’s in-memory state. If a load balancer routes a reconnecting client to a different instance, that instance will not have the session’s active Podium connection, event fiber, or subscriber registrations.
Sticky sessions are a hard requirement for the current architecture, not a performance optimization. The gateway does not replicate in-memory state between instances. A misdirected WebSocket connection will fail to find an active session and will require the client to re-join, which triggers a fresh Podium connection and state snapshot.
During rolling deployments, tenants can be redistributed across instances by leveraging the stale session recovery mechanism: when an instance restarts, it resets all non-idle sessions to inactive. Clients reconnect, receive a state_snapshot reflecting the reset state, and the session activates cleanly on the new instance.

SQLite as Scaling Advantage

The choice of SQLite over PostgreSQL is often perceived as a scalability limitation. In Diminuendo’s architecture, it is precisely the opposite — SQLite enables a scaling model that a shared database would complicate:

No Cluster to Manage

There is no PostgreSQL primary, no read replicas, no connection pooler (PgBouncer/pgcat), no failover orchestrator. Each instance manages its own local files.

Copy-Based Backup

Backing up a tenant means copying a directory. Restoring means placing files. No pg_dump, no WAL archiving, no point-in-time recovery infrastructure.

Per-Session Archival

Completed sessions can be archived independently — compress the session directory, upload to object storage, and delete locally. No DELETE FROM events WHERE session_id = ? on a multi-terabyte table.

WAL Concurrency

WAL mode allows concurrent reads without blocking the writer. The two-worker architecture places reads and writes on separate threads, so a long-running history query never stalls event persistence.

Multi-Worker SQLite Architecture

SQLite serializes all writes within a single database connection. Left unaddressed, a single-threaded writer becomes a bottleneck the moment dozens of concurrent sessions generate events, messages, and token-usage records simultaneously. The solution is a two-worker architecture that separates read and write paths into dedicated Bun Web Workers, communicating with the main thread via structured message passing.

Architecture Overview

                       Main Thread
                     (Bun.serve + WS)
                           |
                     WorkerManager
                    (Effect Layer)
                    /              \
          postMessage()         postMessage()
              |                      |
     Writer Worker            Reader Worker
  sqlite-writer.worker.ts   sqlite-reader.worker.ts
       |        |                |        |
   DbLruCache(128)          DbLruCache(64)
    /    |    \               /    |    \
  WAL   WAL   WAL         RO    RO    RO
  .db   .db   .db         .db   .db   .db
Both workers open their own database handles to the same underlying SQLite files. WAL (Write-Ahead Logging) mode allows the reader worker to execute SELECT queries concurrently while the writer worker holds a write lock — there is never contention between a read and a write.

The Writer Worker

The writer worker (sqlite-writer.worker.ts) receives fire-and-forget write commands from the main thread. It never sends responses for ordinary writes — only for explicit flush and shutdown commands that require acknowledgement.

Batching Strategy

Rather than executing each write immediately, the worker buffers incoming commands and flushes on whichever condition is met first:
  • Timer: 50ms since the first buffered command
  • Batch size: 100 commands accumulated
On flush, commands are grouped by sessionId and each group runs inside a single BEGIN / COMMIT transaction. This dramatically improves throughput because SQLite’s per-transaction overhead (fsync, WAL checkpoint) is amortized across many writes rather than paid per-statement.
// Batching constants
const BATCH_INTERVAL_MS = 50
const BATCH_MAX_SIZE = 100
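The buffer-and-flush mechanics can be sketched as follows. Apart from the two constants above, every name here is illustrative, and the `execBatch` callback stands in for the actual per-group BEGIN / COMMIT execution:

```typescript
// Sketch of the writer's batching loop: flush on 50ms timer or 100 commands,
// whichever comes first, grouping buffered commands by sessionId.
type WriteCmd = { sessionId: string; sql: string; params: unknown[] }

const BATCH_INTERVAL_MS = 50
const BATCH_MAX_SIZE = 100

let buffer: WriteCmd[] = []
let timer: ReturnType<typeof setTimeout> | null = null

function enqueue(cmd: WriteCmd, execBatch: (groups: Map<string, WriteCmd[]>) => void) {
  buffer.push(cmd)
  if (buffer.length >= BATCH_MAX_SIZE) return flush(execBatch)
  // Start the timer on the first buffered command only.
  if (timer === null) timer = setTimeout(() => flush(execBatch), BATCH_INTERVAL_MS)
}

function flush(execBatch: (groups: Map<string, WriteCmd[]>) => void) {
  if (timer !== null) { clearTimeout(timer); timer = null }
  if (buffer.length === 0) return
  // Group by sessionId so each group commits in one transaction.
  const groups = new Map<string, WriteCmd[]>()
  for (const cmd of buffer) {
    const group = groups.get(cmd.sessionId) ?? []
    group.push(cmd)
    groups.set(cmd.sessionId, group)
  }
  buffer = []
  execBatch(groups) // caller wraps each group in BEGIN / COMMIT
}
```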

Supported Write Commands

The writer handles five data-bearing command types, each mapped to a prepared INSERT statement:
| Command              | Table      | Description                                            |
|----------------------|------------|--------------------------------------------------------|
| insert_event         | events     | Persistent gateway events with sequence numbers        |
| insert_event_with_id | events     | Same, with an explicit event_id                        |
| insert_message       | messages   | User or assistant messages tied to a turn              |
| insert_message_meta  | messages   | Messages with JSON metadata (e.g., question responses) |
| insert_usage         | turn_usage | Token counts, model info, cost per turn                |
In addition, three lifecycle commands manage database handles:
| Command   | Behavior                                                                              |
|-----------|---------------------------------------------------------------------------------------|
| ensure_db | Opens the DB connection lazily (no-op if already cached)                              |
| close_db  | Deferred until after the current batch's transaction commits, then evicts the handle  |
| flush     | Forces an immediate flush of all buffered commands and sends a flush_ack response     |
The flush command is critical for correctness before destructive operations. When a client deletes a session, the main thread calls flush(sessionId) to guarantee all pending writes are committed before the session directory is removed from disk.

Shutdown Protocol

Shutdown bypasses the buffer entirely: it flushes all pending commands, closes every cached database handle, and posts a shutdown_ack response. The WorkerManager enforces a 5-second timeout — if the worker does not acknowledge in time, it is forcibly terminated.
// Shutdown bypasses buffer
if (cmd.type === "shutdown") {
  flushBuffer()
  dbCache.closeAll()
  const ack: ShutdownAck = { type: "shutdown_ack", requestId: cmd.requestId }
  postMessage(ack)
  return
}

The Reader Worker

The reader worker (sqlite-reader.worker.ts) handles SELECT queries using a request/response pattern. Every request carries a requestId (a UUID generated by the main thread), and the response echoes it back for correlation.

Read Operations

| Request Type          | SQL Pattern                                                                   | Use Case                             |
|-----------------------|-------------------------------------------------------------------------------|--------------------------------------|
| get_history           | SELECT … FROM messages WHERE session_id = ? AND rowid > ? LIMIT ?            | Paginated message history            |
| get_events            | SELECT … FROM events WHERE session_id = ? AND seq > ? LIMIT ?                | Event replay after a given sequence  |
| get_snapshot_messages | SELECT … FROM messages WHERE session_id = ? ORDER BY created_at DESC LIMIT ? | Recent messages for join snapshots   |
The reader opens databases in read-only mode. If the writer has not yet created a database file for a session, the reader temporarily opens it in writable mode to run migrations, closes that handle, and then re-opens read-only. This avoids caching a writable handle in the reader’s LRU.

Error Handling

Every read operation is wrapped in a try/catch. On failure, the worker posts a typed ReaderErrorRes with the requestId and a safe error message. The main thread’s sendReaderRequest helper rejects the corresponding Effect.async callback, surfacing the error through the Effect pipeline.

WorkerManager: The Effect Layer

The WorkerManager is an Effect Context.Tag that provides a typed API for the main thread. It abstracts away the worker boundary entirely — consumers interact with methods like write(), readHistory(), and flush() without knowing that structured messages are being passed across threads.
export class WorkerManager extends Context.Tag("WorkerManager")<WorkerManager, {
  readonly write: (cmd: WriterCommand) => void
  readonly readHistory: (params: { sessionId: string; afterSeq: number; limit: number }) => Effect.Effect<MessageRow[]>
  readonly readEvents: (params: { sessionId: string; afterSeq: number; limit: number }) => Effect.Effect<EventRow[]>
  readonly readSnapshotMessages: (params: { sessionId: string; limit: number }) => Effect.Effect<MessageRow[]>
  readonly flush: (sessionId: string) => Effect.Effect<void>
  readonly closeDb: (sessionId: string) => Effect.Effect<void>
  readonly shutdown: () => Effect.Effect<void>
}>() {}
Key design details:
  • write() is synchronous and void-returning. The main thread posts the command and moves on. No backpressure, no acknowledgement. This is safe because the writer’s batching strategy ensures writes are committed promptly.
  • readHistory(), readEvents(), and readSnapshotMessages() return Effect.Effect. Under the hood, each generates a UUID requestId, posts the request via postMessage, and suspends the current fiber with Effect.async until the reader responds.
  • flush() awaits a flush_ack. It is the only writer command that blocks the caller.
  • closeDb() closes handles in both workers. The writer receives a fire-and-forget close_db; the reader’s close_db is awaited for confirmation.
Both workers are spawned when the WorkerManagerLive layer is built. The first message sent to each is a string — the sessionsBaseDir path — which configures where they find SQLite files on disk.

Prepared Statement Cache

The PreparedStatements module provides a WeakMap-based cache that maps each Database handle to a Map<string, Statement>. Statements are prepared once per (db, key) pair and reused on every subsequent call, avoiding repeated SQL compilation on hot paths.
const cache = new WeakMap<Database, Map<string, Statement>>()

export function stmt<TRow, TParams>(db: Database, key: string, sql: string): Statement<TRow, TParams> {
  let stmtMap = cache.get(db)
  if (!stmtMap) {
    stmtMap = new Map()
    cache.set(db, stmtMap)
  }
  let s = stmtMap.get(key)
  if (!s) {
    s = db.query(sql)
    stmtMap.set(key, s)
  }
  return s as Statement<TRow, TParams>
}
The WeakMap keying is deliberate: when a Database handle is closed and evicted from the LRU cache, the entire statement map for that database becomes eligible for garbage collection. No manual cleanup is required beyond the evictStatements(db) call that the DbLruCache issues before closing each handle.

Database Handle LRU Cache

Both workers use DbLruCache to manage open database connections. The writer caches up to 128 handles; the reader caches up to 64. The cache uses Map insertion-order semantics for O(1) get, set, and eviction:
  • Get touches the entry by deleting and re-inserting it (moving it to the end of iteration order)
  • Set evicts the oldest entry (first in iteration order) if the cache is at capacity
  • Evict calls evictStatements(db) to clear the prepared statement cache, then db.close()
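The touch-by-reinsertion trick can be sketched in a few lines. This is a trimmed illustration; the real DbLruCache also clears prepared statements via evictStatements(db) and calls db.close() inside its eviction hook:

```typescript
// Minimal Map-based LRU: iteration order is insertion order, so "touch"
// is delete + re-insert, and the oldest entry is always first.
class LruCache<V> {
  private map = new Map<string, V>()
  constructor(private max: number, private onEvict: (v: V) => void) {}

  get(key: string): V | undefined {
    const v = this.map.get(key)
    if (v === undefined) return undefined
    this.map.delete(key) // touch: move to the end of iteration order
    this.map.set(key, v)
    return v
  }

  set(key: string, value: V): void {
    if (this.map.size >= this.max && !this.map.has(key)) {
      const oldest = this.map.keys().next().value as string // first = oldest
      const evicted = this.map.get(oldest)!
      this.map.delete(oldest)
      this.onEvict(evicted)
    }
    this.map.delete(key)
    this.map.set(key, value)
  }
}
```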

Worker Protocol: Type Safety Across Threads

The worker-protocol.ts module defines discriminated unions for every message that crosses the worker boundary:
export type WriterCommand =
  | InsertEventCmd
  | InsertEventWithIdCmd
  | InsertMessageCmd
  | InsertMessageMetaCmd
  | InsertUsageCmd
  | EnsureDbCmd
  | CloseDbCmd
  | FlushCmd
  | ShutdownCmd
Every interface uses readonly fields, and command types are string literal discriminants. This makes exhaustive switch statements in both workers fully type-checked at compile time.
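The pattern looks like this in miniature. The union below is trimmed to three variants for illustration and is not the real protocol module:

```typescript
// String-literal discriminants enable compiler-checked exhaustive switches.
type Cmd =
  | { type: "insert_event"; sessionId: string }
  | { type: "flush"; requestId: string }
  | { type: "shutdown"; requestId: string }

function describe(cmd: Cmd): string {
  switch (cmd.type) {
    case "insert_event":
      return `write for ${cmd.sessionId}`
    case "flush":
      return `flush ${cmd.requestId}`
    case "shutdown":
      return `shutdown ${cmd.requestId}`
    default: {
      // Adding a new variant to Cmd without a case makes this line a type error.
      const _exhaustive: never = cmd
      return _exhaustive
    }
  }
}
```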

Resource Budget Per Instance

Each Diminuendo instance enforces bounded resource consumption through carefully sized caches and rate limiters:
| Resource                    | Bound                      | Eviction Policy                       |
|-----------------------------|----------------------------|---------------------------------------|
| Writer DB cache             | 128 max open handles       | LRU eviction                          |
| Reader DB cache             | 64 max open handles        | LRU eviction                          |
| Auth rate limiter           | 10,000 IP entries          | Periodic cleanup (60s interval)       |
| Per-connection dedup buffer | 5,000 events               | Per-connection, cleared on disconnect |
| Prepared statement cache    | WeakMap per DB handle      | GC'd when DB handle is evicted        |
| Per-connection rate limit   | 60 messages per 10s window | Sliding window, per connection        |
These bounds ensure that memory consumption grows linearly with the number of active sessions (up to the LRU cache cap) and then plateaus. An instance serving 1,000 concurrent sessions uses approximately the same memory as one serving 200, because at most 128 session databases are open simultaneously.

Vertical Scaling Limits

Diminuendo runs on Bun’s single-threaded JavaScript event loop, with SQLite I/O offloaded to dedicated Web Workers. The practical bottlenecks for a single instance are:
  1. CPU for JSON serialization — every WebSocket message is JSON.parse’d on receipt and JSON.stringify’d on send. For high-throughput sessions with rapid text_delta events, this is the dominant CPU cost.
  2. SQLite write throughput — the writer worker batches commands (50ms or 100 commands, whichever comes first) and executes them within transactions. This sustains thousands of writes per second, but a single writer is ultimately serialized.
  3. WebSocket connection count — Bun’s event loop can handle thousands of concurrent WebSocket connections, but each connection consumes a file descriptor and a small amount of memory for its WsData state.
The multi-worker architecture moves SQLite I/O off the main thread, ensuring that database writes never block event delivery. For most workloads, a single instance can serve hundreds of concurrent agent sessions before any of these limits become relevant.

Performance Benchmark

Both Diminuendo and Crescendo connect to the same Podium (agent orchestrator) and Ensemble (LLM inference). Since agent processing time is constant across both gateways, the measured delta is the gateway overhead — the tax each architecture imposes on every request before the actual work begins.
All benchmarks run locally on the same machine with shared backends. 10 warmup iterations are discarded before measurement begins.

Test Environment

| Service            | Port  | Notes                             |
|--------------------|-------|-----------------------------------|
| Podium Gateway     | :5083 | Shared — both gateways route here |
| Podium Coordinator | :5082 | Shared                            |
| Ensemble           | :5180 | Shared                            |
| Crescendo          | :8002 | Next.js on Bun (dev/turbo)        |
| Diminuendo         | :8080 | Bun + Effect TS                   |

Health Endpoint

100 iterations, 10 warmup
| Metric | Diminuendo | Crescendo | Speedup          |
|--------|------------|-----------|------------------|
| p50    | 0.6ms      | 5.0ms     | 8.4x faster      |
| p95    | 1.1ms      | 7.5ms     | 6.8x faster      |
| p99    | 1.4ms      | 10.3ms    | 7.3x faster      |
| mean   | 0.7ms      | 5.6ms     | 8.0x faster      |
| stddev | 0.3ms      | 1.6ms     | 5.3x tighter     |
| RPS    | 10,390     | 291       | 35.7x throughput |
Crescendo checks 4 dependencies (PostgreSQL, Redis, Ensemble, Podium). Diminuendo checks 2 (Ensemble, Podium). Even accounting for 2 fewer sub-millisecond probes, the dominant cost is Next.js per-request middleware and routing overhead.

Connection and Authentication

20 iterations
| Metric | Diminuendo | Crescendo | Speedup      |
|--------|------------|-----------|--------------|
| p50    | 0.4ms      | 5.5ms     | 15.7x faster |
| p95    | 0.5ms      | 8.5ms     | 17.0x faster |
Diminuendo establishes a WebSocket and auto-authenticates in dev mode with zero I/O. Crescendo sends POST /api/e2e/seed, which requires a PostgreSQL upsert round-trip.

Session Creation

50 iterations, 10 warmup
| Metric | Diminuendo | Crescendo | Speedup           |
|--------|------------|-----------|-------------------|
| p50    | 0.6ms      | 17.7ms    | 27.6x faster      |
| p95    | 0.9ms      | 24.8ms    | 27.6x faster      |
| p99    | 0.9ms      | 51.9ms    | 57.7x faster      |
| mean   | 0.7ms      | 19.1ms    | 27.3x faster      |
| stddev | 0.1ms      | 8.9ms     | 89x less variance |
| min    | 0.5ms      | 10.9ms    |                   |
| max    | 0.9ms      | 75.9ms    |                   |
Diminuendo’s sub-millisecond consistency (stddev 0.1ms) comes from in-process SQLite writes. Crescendo’s variance (stddev 8.9ms, max 75.9ms) reflects PostgreSQL network round-trips and Redis publish fan-out.

Summary

| Metric                | Diminuendo   | Crescendo    | Advantage         |
|-----------------------|--------------|--------------|-------------------|
| Health p50            | 0.6ms        | 5.0ms        | 8.4x faster       |
| Health RPS            | 10,390       | 291          | 35.7x throughput  |
| Auth/connect p50      | 0.4ms        | 5.5ms        | 15.7x faster      |
| Session create p50    | 0.6ms        | 17.7ms       | 27.6x faster      |
| Session create p95    | 0.9ms        | 24.8ms       | 27.6x faster      |
| Session create jitter | 0.1ms stddev | 8.9ms stddev | 89x less variance |

Why Diminuendo Is Faster

Bun-native runtime

Bun’s native HTTP server + Effect TS vs Next.js middleware stack eliminates approximately 4ms of per-request overhead.

WebSocket transport

Persistent connections eliminate per-request TCP handshakes and cookie parsing. Authentication is amortized to zero after the initial connect.

In-process SQLite

Zero-network writes save 10–15ms per database operation compared to PostgreSQL over TCP.

In-process pub/sub

Bun’s built-in publish/subscribe avoids the Redis network hop, saving 1–2ms per event.

Raw Backend Baselines

Direct health-check latency to the shared backends (50 iterations), for reference:
| Backend  | p50    | p95    |
|----------|--------|--------|
| Podium   | 0.37ms | 0.76ms |
| Ensemble | 0.24ms | 0.39ms |
These are the floor — all gateway overhead is additive on top.

What Would Require Redis or PostgreSQL

The current architecture is designed for tenant-affinity routing, where each tenant is served by exactly one instance. Several capabilities would require shared infrastructure:
  • Cross-instance event delivery: if a client connects to instance A but the session’s Podium events arrive on instance B (because the Podium connection was established there), instance A has no way to receive those events. A shared pub/sub layer (Redis Streams, NATS) would be needed to bridge events across instances.
  • Live session handoff: moving an active session from one instance to another — for example, during a rolling deployment — currently requires the session to be deactivated and reactivated. A shared state store would enable live handoff without interrupting the Podium connection.
  • Global rate limiting: the auth rate limiter tracks attempts per IP address within a single instance. A coordinated attacker distributing attempts across instances would bypass per-instance limits. A shared rate limiter (Redis-backed sliding window) would provide global protection.
  • Shared billing ledger: the BillingService currently operates per-instance with local credit reservation. A multi-instance deployment serving the same tenant from different instances would require a shared ledger to prevent over-spending.
These capabilities are not yet needed for the current deployment model. The architecture is designed so that adding them later is additive — it requires new service implementations behind the existing Effect Layer interfaces, not rewrites of the core logic.

Stale Recovery on Restart

When a gateway instance restarts — whether due to deployment, crash, or scaling event — it performs stale session recovery as part of its startup sequence:
1. Enumerate known tenants. The instance queries all known tenant IDs from the data/tenants/ directory, plus the default tenant (dev in dev mode, default otherwise).
2. Query non-idle sessions. For each tenant, the instance queries the registry database for sessions whose status is not inactive — these are sessions that were active when the previous process died.
3. Reset to inactive. Each stale session is reset to inactive. This is safe because Podium connections do not survive process death — the WebSocket to the Podium coordinator was severed when the process exited, and the compute instance has already been reclaimed or timed out.
4. Resume normal operation. When clients reconnect and join these sessions, they receive a state_snapshot showing inactive status. The client can then trigger re-activation, which creates a fresh Podium instance and establishes a new connection.
This recovery runs as a forked daemon fiber — it executes concurrently with the server startup and does not block incoming connections. Up to 4 tenants are reconciled in parallel.
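The concurrency shape can be sketched with plain Promises. The gateway actually forks an Effect daemon fiber; `reconcileAll` and its worker-pool loop below are illustrative only:

```typescript
// Bounded-parallel reconciliation: at most `concurrency` tenants in flight.
async function reconcileAll(
  tenantIds: string[],
  reconcile: (tenantId: string) => Promise<void>,
  concurrency = 4,
): Promise<void> {
  const queue = [...tenantIds]
  // Each worker pulls the next tenant until the queue is drained.
  const worker = async () => {
    for (let t = queue.shift(); t !== undefined; t = queue.shift()) {
      await reconcile(t)
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker))
}
```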

Reproducing the Benchmark

# Prerequisites: Podium on :5083, Ensemble on :5180, both gateways running
cd ~/Projects/gateway-bench
bun install
bun run bench                            # all scenarios
bun run bench -- --scenarios health      # just health
bun run bench -- --scenarios session-create
The benchmark script auto-detects whether services are running and starts them if needed.
Benchmark run: 2026-03-03 — Bun 1.3.10, macOS arm64