Scalability

Diminuendo’s scaling model is a consequence of its data model, not an afterthought bolted on through distributed consensus protocols or shared-nothing clustering. The key insight is structural: if no two gateway instances ever need to write to the same database, then scaling out is simply a matter of routing tenants to instances. There is no coordination problem because there is nothing to coordinate. This page covers three aspects of the scaling architecture: the per-tenant data isolation that makes horizontal scaling possible, the multi-worker SQLite architecture that makes a single instance fast, and the benchmark data that measures the result.

Per-Tenant Data Isolation

Every tenant in Diminuendo receives its own SQLite database file for session metadata, and every session receives its own dedicated database for conversation history, events, and usage records:
data/
  tenants/
    {tenantId}/
      registry.db          # Session metadata for this tenant
  sessions/
    {sessionId}/
      session.db           # Conversation history, events, turn usage
There is no shared database between tenants or sessions. A query against tenant acme’s registry physically cannot touch tenant globex’s data — they reside in different files on different filesystem paths. There is no WHERE tenant_id = ? clause to forget, no row-level security policy to misconfigure, no cross-tenant join to accidentally permit.
This isolation extends to deletion semantics. Removing a session means deleting a directory. Removing a tenant means deleting a directory tree. No cascading deletes, no orphaned foreign key references, no vacuum passes over a shared tablespace.

Why This Enables Horizontal Scaling

Since there is no shared state between tenants, multiple Diminuendo instances can serve different tenants independently. A load balancer routes by tenant ID — using sticky sessions or tenant-affinity routing — to ensure all requests for a given tenant reach the same instance. The fundamental invariant is simple: at any point in time, exactly one gateway instance is responsible for a given tenant’s data. This is trivially satisfied by a load balancer that hashes on the tenant ID extracted from the JWT’s tenant_id claim.
                          Load Balancer
                        (tenant-affinity)
                /                |                \
      Instance A            Instance B            Instance C
      tenant: acme          tenant: globex        tenant: initech
      data/tenants/acme/    data/tenants/globex/  data/tenants/initech/
Adding capacity means adding instances and rebalancing the tenant-to-instance mapping. No data migration is required — just copy the tenant’s data/ directory to the new instance and update the routing table.
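The tenant-affinity invariant can be sketched with a deterministic hash over the tenant_id claim. This is an illustrative sketch, not Diminuendo's router: the instance list, the FNV-1a hash, and the `pickInstance` helper are all assumptions; any stable hash plus a routing table satisfies the same invariant.

```typescript
// Illustrative tenant-affinity routing: same tenant always maps to the same
// instance. The instance list and helper names are assumptions for this sketch.
const instances = ["instance-a:8080", "instance-b:8080", "instance-c:8080"]

// FNV-1a: a simple, stable 32-bit string hash. Any deterministic hash works.
function fnv1a(s: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h
}

// Hash the tenant_id claim extracted from the JWT to pick a backend.
function pickInstance(tenantId: string): string {
  return instances[fnv1a(tenantId) % instances.length]
}
```

Note that plain modulo remaps most tenants whenever the instance list changes; a production router would use consistent hashing or an explicit tenant-to-instance table so that rebalancing stays incremental.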

Sticky Session Requirement

WebSocket connections are inherently stateful. Each connected client maintains in-memory state on the gateway instance: the ActiveSession record, the ConnectionState tracking authentication and subscriptions, and the event streaming fiber that consumes Podium events and publishes them to session topics. A client must reconnect to the same instance that holds its session’s in-memory state. If a load balancer routes a reconnecting client to a different instance, that instance will not have the session’s active Podium connection, event fiber, or subscriber registrations.
Sticky sessions are a hard requirement for the current architecture, not a performance optimization. The gateway does not replicate in-memory state between instances. A misdirected WebSocket connection will fail to find an active session and will require the client to re-join, which triggers a fresh Podium connection and state snapshot.
During rolling deployments, tenants can be redistributed across instances by leveraging the stale session recovery mechanism: when an instance restarts, it resets all non-idle sessions to inactive. Clients reconnect, receive a state_snapshot reflecting the reset state, and the session activates cleanly on the new instance.

SQLite as Scaling Advantage

The choice of SQLite over PostgreSQL is often perceived as a scalability limitation. In Diminuendo’s architecture, it is precisely the opposite — SQLite enables a scaling model that a shared database would complicate:

No Cluster to Manage

There is no PostgreSQL primary, no read replicas, no connection pooler (PgBouncer/pgcat), no failover orchestrator. Each instance manages its own local files.

Copy-Based Backup

Backing up a tenant means copying a directory. Restoring means placing files. No pg_dump, no WAL archiving, no point-in-time recovery infrastructure.

Per-Session Archival

Completed sessions can be archived independently — compress the session directory, upload to object storage, and delete locally. No DELETE FROM events WHERE session_id = ? on a multi-terabyte table.

WAL Concurrency

WAL mode allows concurrent reads without blocking the writer. The two-worker architecture places reads and writes on separate threads, so a long-running history query never stalls event persistence.

Multi-Worker SQLite Architecture

SQLite serializes all writes within a single database connection. Left unaddressed, a single-threaded writer becomes a bottleneck the moment dozens of concurrent sessions generate events, messages, and token-usage records simultaneously. The solution is a two-worker architecture that separates read and write paths into dedicated Bun Web Workers, communicating with the main thread via structured message passing.

Architecture Overview

                       Main Thread
                     (Bun.serve + WS)
                           |
                     WorkerManager
                    (Effect Layer)
                    /              \
          postMessage()         postMessage()
              |                      |
     Writer Worker            Reader Worker
  sqlite-writer.worker.ts   sqlite-reader.worker.ts
       |        |                |        |
   DbLruCache(128)          DbLruCache(64)
    /    |    \               /    |    \
  WAL   WAL   WAL         RO    RO    RO
  .db   .db   .db         .db   .db   .db
Both workers open their own database handles to the same underlying SQLite files. WAL (Write-Ahead Logging) mode allows the reader worker to execute SELECT queries concurrently while the writer worker holds a write lock — there is never contention between a read and a write.

The Writer Worker

The writer worker (sqlite-writer.worker.ts) receives fire-and-forget write commands from the main thread. It never sends responses for ordinary writes — only for explicit flush and shutdown commands that require acknowledgement.

Batching Strategy

Rather than executing each write immediately, the worker buffers incoming commands and flushes on whichever condition is met first:
  • Timer: 50ms since the first buffered command
  • Batch size: 100 commands accumulated
On flush, commands are grouped by sessionId and each group runs inside a single BEGIN / COMMIT transaction. This dramatically improves throughput because SQLite’s per-transaction overhead (fsync, WAL checkpoint) is amortized across many writes rather than paid per-statement.
// Batching constants
const BATCH_INTERVAL_MS = 50
const BATCH_MAX_SIZE = 100
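The buffer-and-flush mechanics can be sketched as follows. Apart from the two constants above, every name here is illustrative, and the `execBatch` callback stands in for the actual per-group BEGIN / COMMIT execution:

```typescript
// Sketch of the writer's batching loop: flush on 50ms timer or 100 commands,
// whichever comes first, grouping buffered commands by sessionId.
type WriteCmd = { sessionId: string; sql: string; params: unknown[] }

const BATCH_INTERVAL_MS = 50
const BATCH_MAX_SIZE = 100

let buffer: WriteCmd[] = []
let timer: ReturnType<typeof setTimeout> | null = null

function enqueue(cmd: WriteCmd, execBatch: (groups: Map<string, WriteCmd[]>) => void) {
  buffer.push(cmd)
  if (buffer.length >= BATCH_MAX_SIZE) return flush(execBatch)
  // Start the timer on the first buffered command only.
  if (timer === null) timer = setTimeout(() => flush(execBatch), BATCH_INTERVAL_MS)
}

function flush(execBatch: (groups: Map<string, WriteCmd[]>) => void) {
  if (timer !== null) { clearTimeout(timer); timer = null }
  if (buffer.length === 0) return
  // Group by sessionId so each group commits in one transaction.
  const groups = new Map<string, WriteCmd[]>()
  for (const cmd of buffer) {
    const group = groups.get(cmd.sessionId) ?? []
    group.push(cmd)
    groups.set(cmd.sessionId, group)
  }
  buffer = []
  execBatch(groups) // caller wraps each group in BEGIN / COMMIT
}
```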

Supported Write Commands

The writer handles five data-bearing command types, each mapped to a prepared INSERT statement:
| Command              | Table      | Description                                            |
|----------------------|------------|--------------------------------------------------------|
| insert_event         | events     | Persistent gateway events with sequence numbers        |
| insert_event_with_id | events     | Same, with an explicit event_id                        |
| insert_message       | messages   | User or assistant messages tied to a turn              |
| insert_message_meta  | messages   | Messages with JSON metadata (e.g., question responses) |
| insert_usage         | turn_usage | Token counts, model info, cost per turn                |
In addition, three lifecycle commands manage database handles:
| Command   | Behavior                                                                              |
|-----------|---------------------------------------------------------------------------------------|
| ensure_db | Opens the DB connection lazily (no-op if already cached)                              |
| close_db  | Deferred until after the current batch's transaction commits, then evicts the handle  |
| flush     | Forces an immediate flush of all buffered commands and sends a flush_ack response     |
The flush command is critical for correctness before destructive operations. When a client deletes a session, the main thread calls flush(sessionId) to guarantee all pending writes are committed before the session directory is removed from disk.

Shutdown Protocol

Shutdown bypasses the buffer entirely: it flushes all pending commands, closes every cached database handle, and posts a shutdown_ack response. The WorkerManager enforces a 5-second timeout — if the worker does not acknowledge in time, it is forcibly terminated.
// Shutdown bypasses buffer
if (cmd.type === "shutdown") {
  flushBuffer()
  dbCache.closeAll()
  const ack: ShutdownAck = { type: "shutdown_ack", requestId: cmd.requestId }
  postMessage(ack)
  return
}

The Reader Worker

The reader worker (sqlite-reader.worker.ts) handles SELECT queries using a request/response pattern. Every request carries a requestId (a UUID generated by the main thread), and the response echoes it back for correlation.

Read Operations

| Request Type          | SQL Pattern                                                                   | Use Case                             |
|-----------------------|-------------------------------------------------------------------------------|--------------------------------------|
| get_history           | SELECT … FROM messages WHERE session_id = ? AND rowid > ? LIMIT ?            | Paginated message history            |
| get_events            | SELECT … FROM events WHERE session_id = ? AND seq > ? LIMIT ?                | Event replay after a given sequence  |
| get_snapshot_messages | SELECT … FROM messages WHERE session_id = ? ORDER BY created_at DESC LIMIT ? | Recent messages for join snapshots   |
The reader opens databases in read-only mode. If the writer has not yet created a database file for a session, the reader temporarily opens it in writable mode to run migrations, closes that handle, and then re-opens read-only. This avoids caching a writable handle in the reader’s LRU.

Error Handling

Every read operation is wrapped in a try/catch. On failure, the worker posts a typed ReaderErrorRes with the requestId and a safe error message. The main thread’s sendReaderRequest helper rejects the corresponding Effect.async callback, surfacing the error through the Effect pipeline.

WorkerManager: The Effect Layer

The WorkerManager is an Effect Context.Tag that provides a typed API for the main thread. It abstracts away the worker boundary entirely — consumers interact with methods like write(), readHistory(), and flush() without knowing that structured messages are being passed across threads.
export class WorkerManager extends Context.Tag("WorkerManager")<WorkerManager, {
  readonly write: (cmd: WriterCommand) => void
  readonly readHistory: (params: { sessionId: string; afterSeq: number; limit: number }) => Effect.Effect<MessageRow[]>
  readonly readEvents: (params: { sessionId: string; afterSeq: number; limit: number }) => Effect.Effect<EventRow[]>
  readonly readSnapshotMessages: (params: { sessionId: string; limit: number }) => Effect.Effect<MessageRow[]>
  readonly flush: (sessionId: string) => Effect.Effect<void>
  readonly closeDb: (sessionId: string) => Effect.Effect<void>
  readonly shutdown: () => Effect.Effect<void>
}>() {}
Key design details:
  • write() is synchronous and void-returning. The main thread posts the command and moves on. No backpressure, no acknowledgement. This is safe because the writer’s batching strategy ensures writes are committed promptly.
  • readHistory(), readEvents(), and readSnapshotMessages() return Effect.Effect. Under the hood, each generates a UUID requestId, posts the request via postMessage, and suspends the current fiber with Effect.async until the reader responds.
  • flush() awaits a flush_ack. It is the only writer command that blocks the caller.
  • closeDb() closes handles in both workers. The writer receives a fire-and-forget close_db; the reader’s close_db is awaited for confirmation.
Both workers are spawned when the WorkerManagerLive layer is built. The first message sent to each is a string — the sessionsBaseDir path — which configures where they find SQLite files on disk.

Prepared Statement Cache

The PreparedStatements module provides a WeakMap-based cache that maps each Database handle to a Map<string, Statement>. Statements are prepared once per (db, key) pair and reused on every subsequent call, avoiding repeated SQL compilation on hot paths.
const cache = new WeakMap<Database, Map<string, Statement>>()

export function stmt<TRow, TParams>(db: Database, key: string, sql: string): Statement<TRow, TParams> {
  let stmtMap = cache.get(db)
  if (!stmtMap) {
    stmtMap = new Map()
    cache.set(db, stmtMap)
  }
  let s = stmtMap.get(key)
  if (!s) {
    s = db.query(sql)
    stmtMap.set(key, s)
  }
  return s as Statement<TRow, TParams>
}
The WeakMap keying is deliberate: when a Database handle is closed and evicted from the LRU cache, the entire statement map for that database becomes eligible for garbage collection. No manual cleanup is required beyond the evictStatements(db) call that the DbLruCache issues before closing each handle.

Database Handle LRU Cache

Both workers use DbLruCache to manage open database connections. The writer caches up to 128 handles; the reader caches up to 64. The cache uses Map insertion-order semantics for O(1) get, set, and eviction:
  • Get touches the entry by deleting and re-inserting it (moving it to the end of iteration order)
  • Set evicts the oldest entry (first in iteration order) if the cache is at capacity
  • Evict calls evictStatements(db) to clear the prepared statement cache, then db.close()
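The touch-by-reinsertion trick can be sketched in a few lines. This is a trimmed illustration; the real DbLruCache also clears prepared statements via evictStatements(db) and calls db.close() inside its eviction hook:

```typescript
// Minimal Map-based LRU: iteration order is insertion order, so "touch"
// is delete + re-insert, and the oldest entry is always first.
class LruCache<V> {
  private map = new Map<string, V>()
  constructor(private max: number, private onEvict: (v: V) => void) {}

  get(key: string): V | undefined {
    const v = this.map.get(key)
    if (v === undefined) return undefined
    this.map.delete(key) // touch: move to the end of iteration order
    this.map.set(key, v)
    return v
  }

  set(key: string, value: V): void {
    if (this.map.size >= this.max && !this.map.has(key)) {
      const oldest = this.map.keys().next().value as string // first = oldest
      const evicted = this.map.get(oldest)!
      this.map.delete(oldest)
      this.onEvict(evicted)
    }
    this.map.delete(key)
    this.map.set(key, value)
  }
}
```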

Worker Protocol: Type Safety Across Threads

The worker-protocol.ts module defines discriminated unions for every message that crosses the worker boundary:
export type WriterCommand =
  | InsertEventCmd
  | InsertEventWithIdCmd
  | InsertMessageCmd
  | InsertMessageMetaCmd
  | InsertUsageCmd
  | EnsureDbCmd
  | CloseDbCmd
  | FlushCmd
  | ShutdownCmd
Every interface uses readonly fields, and command types are string literal discriminants. This makes exhaustive switch statements in both workers fully type-checked at compile time.
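The pattern looks like this in miniature. The union below is trimmed to three variants for illustration and is not the real protocol module:

```typescript
// String-literal discriminants enable compiler-checked exhaustive switches.
type Cmd =
  | { type: "insert_event"; sessionId: string }
  | { type: "flush"; requestId: string }
  | { type: "shutdown"; requestId: string }

function describe(cmd: Cmd): string {
  switch (cmd.type) {
    case "insert_event":
      return `write for ${cmd.sessionId}`
    case "flush":
      return `flush ${cmd.requestId}`
    case "shutdown":
      return `shutdown ${cmd.requestId}`
    default: {
      // Adding a new variant to Cmd without a case makes this line a type error.
      const _exhaustive: never = cmd
      return _exhaustive
    }
  }
}
```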

Resource Budget Per Instance

Each Diminuendo instance enforces bounded resource consumption through carefully sized caches and rate limiters:
| Resource                    | Bound                      | Eviction Policy                       |
|-----------------------------|----------------------------|---------------------------------------|
| Writer DB cache             | 128 max open handles       | LRU eviction                          |
| Reader DB cache             | 64 max open handles        | LRU eviction                          |
| Auth rate limiter           | 10,000 IP entries          | Periodic cleanup (60s interval)       |
| Per-connection dedup buffer | 5,000 events               | Per-connection, cleared on disconnect |
| Prepared statement cache    | WeakMap per DB handle      | GC'd when DB handle is evicted        |
| Per-connection rate limit   | 60 messages per 10s window | Sliding window, per connection        |
These bounds ensure that memory consumption grows linearly with the number of active sessions (up to the LRU cache cap) and then plateaus. An instance serving 1,000 concurrent sessions uses approximately the same memory as one serving 200, because at most 128 session databases are open simultaneously.

Vertical Scaling Limits

Diminuendo runs on Bun’s single-threaded JavaScript event loop, with SQLite I/O offloaded to dedicated Web Workers. The practical bottlenecks for a single instance are:
  1. CPU for JSON serialization — every WebSocket message is JSON.parse’d on receipt and JSON.stringify’d on send. For high-throughput sessions with rapid text_delta events, this is the dominant CPU cost.
  2. SQLite write throughput — the writer worker batches commands (50ms or 100 commands, whichever comes first) and executes them within transactions. This sustains thousands of writes per second, but a single writer is ultimately serialized.
  3. WebSocket connection count — Bun’s event loop can handle thousands of concurrent WebSocket connections, but each connection consumes a file descriptor and a small amount of memory for its WsData state.
The multi-worker architecture moves SQLite I/O off the main thread, ensuring that database writes never block event delivery. For most workloads, a single instance can serve hundreds of concurrent agent sessions before any of these limits become relevant.

Performance Benchmark

Both Diminuendo and Crescendo connect to the same Podium (agent orchestrator) and Ensemble (LLM inference). Since agent processing time is constant across both gateways, the measured delta is the gateway overhead — the tax each architecture imposes on every request before the actual work begins.
All benchmarks run locally on the same machine with shared backends. 10 warmup iterations are discarded before measurement begins.

Test Environment

| Service            | Port  | Notes                             |
|--------------------|-------|-----------------------------------|
| Podium Gateway     | :5083 | Shared — both gateways route here |
| Podium Coordinator | :5082 | Shared                            |
| Ensemble           | :5180 | Shared                            |
| Crescendo          | :8002 | Next.js on Bun (dev/turbo)        |
| Diminuendo         | :8080 | Bun + Effect TS                   |

Health Endpoint

100 iterations, 10 warmup
| Metric | Diminuendo | Crescendo | Speedup          |
|--------|------------|-----------|------------------|
| p50    | 0.6ms      | 5.0ms     | 8.4x faster      |
| p95    | 1.1ms      | 7.5ms     | 6.8x faster      |
| p99    | 1.4ms      | 10.3ms    | 7.3x faster      |
| mean   | 0.7ms      | 5.6ms     | 8.0x faster      |
| stddev | 0.3ms      | 1.6ms     | 5.3x tighter     |
| RPS    | 10,390     | 291       | 35.7x throughput |
Crescendo checks 4 dependencies (PostgreSQL, Redis, Ensemble, Podium). Diminuendo checks 2 (Ensemble, Podium). Even accounting for 2 fewer sub-millisecond probes, the dominant cost is Next.js per-request middleware and routing overhead.

Connection and Authentication

20 iterations
| Metric | Diminuendo | Crescendo | Speedup      |
|--------|------------|-----------|--------------|
| p50    | 0.4ms      | 5.5ms     | 15.7x faster |
| p95    | 0.5ms      | 8.5ms     | 17.0x faster |
Diminuendo establishes a WebSocket and auto-authenticates in dev mode with zero I/O. Crescendo sends POST /api/e2e/seed, which requires a PostgreSQL upsert round-trip.

Session Creation

50 iterations, 10 warmup
| Metric | Diminuendo | Crescendo | Speedup           |
|--------|------------|-----------|-------------------|
| p50    | 0.6ms      | 17.7ms    | 27.6x faster      |
| p95    | 0.9ms      | 24.8ms    | 27.6x faster      |
| p99    | 0.9ms      | 51.9ms    | 57.7x faster      |
| mean   | 0.7ms      | 19.1ms    | 27.3x faster      |
| stddev | 0.1ms      | 8.9ms     | 89x less variance |
| min    | 0.5ms      | 10.9ms    |                   |
| max    | 0.9ms      | 75.9ms    |                   |
Diminuendo’s sub-millisecond consistency (stddev 0.1ms) comes from in-process SQLite writes. Crescendo’s variance (stddev 8.9ms, max 75.9ms) reflects PostgreSQL network round-trips and Redis publish fan-out.

Summary

| Metric                | Diminuendo   | Crescendo    | Advantage         |
|-----------------------|--------------|--------------|-------------------|
| Health p50            | 0.6ms        | 5.0ms        | 8.4x faster       |
| Health RPS            | 10,390       | 291          | 35.7x throughput  |
| Auth/connect p50      | 0.4ms        | 5.5ms        | 15.7x faster      |
| Session create p50    | 0.6ms        | 17.7ms       | 27.6x faster      |
| Session create p95    | 0.9ms        | 24.8ms       | 27.6x faster      |
| Session create jitter | 0.1ms stddev | 8.9ms stddev | 89x less variance |

Why Diminuendo Is Faster

Bun-native runtime

Bun’s native HTTP server + Effect TS vs Next.js middleware stack eliminates approximately 4ms of per-request overhead.

WebSocket transport

Persistent connections eliminate per-request TCP handshakes and cookie parsing. Authentication is amortized to zero after the initial connect.

In-process SQLite

Zero-network writes save 10–15ms per database operation compared to PostgreSQL over TCP.

In-process pub/sub

Bun’s built-in publish/subscribe avoids the Redis network hop, saving 1–2ms per event.

Raw Backend Baselines

Direct health-check latency to the shared backends (50 iterations), for reference:
| Backend  | p50    | p95    |
|----------|--------|--------|
| Podium   | 0.37ms | 0.76ms |
| Ensemble | 0.24ms | 0.39ms |
These are the floor — all gateway overhead is additive on top.

What Would Require Redis or PostgreSQL

The current architecture is designed for tenant-affinity routing, where each tenant is served by exactly one instance. Several capabilities would require shared infrastructure:
  • Cross-instance event delivery: if a client connects to instance A but the session’s Podium events arrive on instance B (because the Podium connection was established there), instance A has no way to receive those events. A shared pub/sub layer (Redis Streams, NATS) would be needed to bridge events across instances.
  • Live session handoff: moving an active session from one instance to another — for example, during a rolling deployment — currently requires the session to be deactivated and reactivated. A shared state store would enable live handoff without interrupting the Podium connection.
  • Global rate limiting: the auth rate limiter tracks attempts per IP address within a single instance. A coordinated attacker distributing attempts across instances would bypass per-instance limits. A shared rate limiter (Redis-backed sliding window) would provide global protection.
  • Shared billing ledger: the BillingService currently operates per-instance with local credit reservation. A multi-instance deployment serving the same tenant from different instances would require a shared ledger to prevent over-spending.
These capabilities are not yet needed for the current deployment model. The architecture is designed so that adding them later is additive — it requires new service implementations behind the existing Effect Layer interfaces, not rewrites of the core logic.

Stale Recovery on Restart

When a gateway instance restarts — whether due to deployment, crash, or scaling event — it performs stale session recovery as part of its startup sequence:
1. Enumerate known tenants. The instance queries all known tenant IDs from the data/tenants/ directory, plus the default tenant (dev in dev mode, default otherwise).
2. Query non-idle sessions. For each tenant, the instance queries the registry database for sessions whose status is not inactive — these are sessions that were active when the previous process died.
3. Reset to inactive. Each stale session is reset to inactive. This is safe because Podium connections do not survive process death — the WebSocket to the Podium coordinator was severed when the process exited, and the compute instance has already been reclaimed or timed out.
4. Resume normal operation. When clients reconnect and join these sessions, they receive a state_snapshot showing inactive status. The client can then trigger re-activation, which creates a fresh Podium instance and establishes a new connection.
This recovery runs as a forked daemon fiber — it executes concurrently with the server startup and does not block incoming connections. Up to 4 tenants are reconciled in parallel.
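The concurrency shape can be sketched with plain Promises. The gateway actually forks an Effect daemon fiber; `reconcileAll` and its worker-pool loop below are illustrative only:

```typescript
// Bounded-parallel reconciliation: at most `concurrency` tenants in flight.
async function reconcileAll(
  tenantIds: string[],
  reconcile: (tenantId: string) => Promise<void>,
  concurrency = 4,
): Promise<void> {
  const queue = [...tenantIds]
  // Each worker pulls the next tenant until the queue is drained.
  const worker = async () => {
    for (let t = queue.shift(); t !== undefined; t = queue.shift()) {
      await reconcile(t)
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker))
}
```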

Reproducing the Benchmark

# Prerequisites: Podium on :5083, Ensemble on :5180, both gateways running
cd ~/Projects/gateway-bench
bun install
bun run bench                            # all scenarios
bun run bench -- --scenarios health      # just health
bun run bench -- --scenarios session-create
The benchmark script auto-detects whether services are running and starts them if needed.
Benchmark run: 2026-03-03 — Bun 1.3.10, macOS arm64