Scalability
Diminuendo’s scaling model is a consequence of its data model, not an afterthought bolted on through distributed consensus protocols or shared-nothing clustering. The key insight is structural: if no two gateway instances ever need to write to the same database, then scaling out is simply a matter of routing tenants to instances. There is no coordination problem because there is nothing to coordinate. This page covers three aspects of the scaling architecture: the per-tenant data isolation that makes horizontal scaling possible, the multi-worker SQLite architecture that makes a single instance fast, and the benchmark data that measures the result.

Per-Tenant Data Isolation
Every tenant in Diminuendo receives its own SQLite database file for session metadata, and every session receives its own dedicated database for conversation history, events, and usage records. Tenant acme’s registry physically cannot touch tenant globex’s data — they reside in different files on different filesystem paths. There is no WHERE tenant_id = ? clause to forget, no row-level security policy to misconfigure, no cross-tenant join to accidentally permit.
This isolation extends to deletion semantics. Removing a session means deleting a directory. Removing a tenant means deleting a directory tree. No cascading deletes, no orphaned foreign key references, no vacuum passes over a shared tablespace.
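To make the layout concrete, here is a minimal sketch of a path scheme matching this description. The directory and file names (`data/tenants`, `registry.db`, `session.db`) are illustrative assumptions, not necessarily Diminuendo's actual layout:

```typescript
// Hypothetical path helpers mirroring the per-tenant / per-session isolation.
const DATA_ROOT = "data/tenants";

function tenantDir(tenantId: string): string {
  return `${DATA_ROOT}/${tenantId}`;
}

// Session metadata lives in a per-tenant registry database...
function registryDbPath(tenantId: string): string {
  return `${tenantDir(tenantId)}/registry.db`;
}

// ...while each session gets its own directory and database file underneath.
function sessionDir(tenantId: string, sessionId: string): string {
  return `${tenantDir(tenantId)}/sessions/${sessionId}`;
}

function sessionDbPath(tenantId: string, sessionId: string): string {
  return `${sessionDir(tenantId, sessionId)}/session.db`;
}
```

Under this scheme, deleting a session is a directory removal and deleting a tenant is a directory-tree removal, with no SQL involved.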
Why This Enables Horizontal Scaling
Since there is no shared state between tenants, multiple Diminuendo instances can serve different tenants independently. A load balancer routes by tenant ID — using sticky sessions or tenant-affinity routing — to ensure all requests for a given tenant reach the same instance. The fundamental invariant is simple: at any point in time, exactly one gateway instance is responsible for a given tenant’s data. This is trivially satisfied by a load balancer that hashes on the tenant ID extracted from the JWT’s tenant_id claim.
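A sketch of what such tenant-affinity routing could look like, hashing the tenant_id claim to a stable instance. FNV-1a and the function names here are illustrative choices, not the actual load-balancer logic:

```typescript
// FNV-1a: a simple, stable string hash (any deterministic hash would do).
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Same tenant always maps to the same instance while the instance set is stable,
// satisfying the "exactly one instance per tenant" invariant.
function routeTenant(tenantId: string, instances: string[]): string {
  return instances[fnv1a(tenantId) % instances.length];
}
```

Note that plain modulo hashing remaps many tenants when the instance set changes; a production router would pair this with an explicit routing table or consistent hashing.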
Rebalancing a tenant onto another instance means copying its data/ directory to the new instance and updating the routing table.
Sticky Session Requirement
WebSocket connections are inherently stateful. Each connected client maintains in-memory state on the gateway instance: the ActiveSession record, the ConnectionState tracking authentication and subscriptions, and the event streaming fiber that consumes Podium events and publishes them to session topics.
A client must reconnect to the same instance that holds its session’s in-memory state. If a load balancer routes a reconnecting client to a different instance, that instance will not have the session’s active Podium connection, event fiber, or subscriber registrations.
During rolling deployments, tenants can be redistributed across instances by leveraging the stale session recovery mechanism: when an instance restarts, it resets all non-idle sessions to inactive. Clients reconnect, receive a state_snapshot reflecting the reset state, and the session activates cleanly on the new instance.
SQLite as Scaling Advantage
The choice of SQLite over PostgreSQL is often perceived as a scalability limitation. In Diminuendo’s architecture, it is precisely the opposite — SQLite enables a scaling model that a shared database would complicate:

No Cluster to Manage
There is no PostgreSQL primary, no read replicas, no connection pooler (PgBouncer/pgcat), no failover orchestrator. Each instance manages its own local files.
Copy-Based Backup
Backing up a tenant means copying a directory. Restoring means placing files. No pg_dump, no WAL archiving, no point-in-time recovery infrastructure.

Per-Session Archival
Completed sessions can be archived independently — compress the session directory, upload to object storage, and delete locally. No DELETE FROM events WHERE session_id = ? on a multi-terabyte table.

WAL Concurrency
WAL mode allows concurrent reads without blocking the writer. The two-worker architecture places reads and writes on separate threads, so a long-running history query never stalls event persistence.
Multi-Worker SQLite Architecture
SQLite serializes all writes within a single database connection. Left unaddressed, a single-threaded writer becomes a bottleneck the moment dozens of concurrent sessions generate events, messages, and token-usage records simultaneously. The solution is a two-worker architecture that separates read and write paths into dedicated Bun Web Workers, communicating with the main thread via structured message passing.

Architecture Overview
Both workers open their own database handles to the same underlying SQLite files. WAL (Write-Ahead Logging) mode allows the reader worker to execute SELECT queries concurrently while the writer worker holds a write lock — there is never contention between a read and a write.

The Writer Worker
The writer worker (sqlite-writer.worker.ts) receives fire-and-forget write commands from the main thread. It never sends responses for ordinary writes — only for explicit flush and shutdown commands that require acknowledgement.
Batching Strategy
Rather than executing each write immediately, the worker buffers incoming commands and flushes on whichever condition is met first:

- Timer: 50ms since the first buffered command
- Batch size: 100 commands accumulated

At flush time, buffered commands are grouped by sessionId, and each group runs inside a single BEGIN / COMMIT transaction. This dramatically improves throughput because SQLite’s per-transaction overhead (fsync, WAL checkpoint) is amortized across many writes rather than paid per-statement.
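The buffering policy can be sketched roughly as follows. The class name, the `execBatch` callback, and the command shape are hypothetical; the real worker executes batches against SQLite rather than a callback:

```typescript
type WriteCommand = { sessionId: string; sql: string };

class WriteBuffer {
  private buf: WriteCommand[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private execBatch: (sessionId: string, cmds: WriteCommand[]) => void,
    private maxBatch = 100,
    private flushMs = 50,
  ) {}

  push(cmd: WriteCommand): void {
    this.buf.push(cmd);
    if (this.buf.length >= this.maxBatch) {
      this.flush(); // batch-size threshold reached
    } else if (this.timer === null) {
      // the timer is armed by the FIRST buffered command
      this.timer = setTimeout(() => this.flush(), this.flushMs);
    }
  }

  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    // group by sessionId so each group can run in one BEGIN/COMMIT transaction
    const groups = new Map<string, WriteCommand[]>();
    for (const c of this.buf) {
      const g = groups.get(c.sessionId);
      if (g) g.push(c);
      else groups.set(c.sessionId, [c]);
    }
    this.buf = [];
    for (const [sessionId, cmds] of groups) this.execBatch(sessionId, cmds);
  }
}
```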
Supported Write Commands
The writer handles five data-bearing command types, each mapped to a prepared INSERT statement:

| Command | Table | Description |
|---|---|---|
| insert_event | events | Persistent gateway events with sequence numbers |
| insert_event_with_id | events | Same, with an explicit event_id |
| insert_message | messages | User or assistant messages tied to a turn |
| insert_message_meta | messages | Messages with JSON metadata (e.g., question responses) |
| insert_usage | turn_usage | Token counts, model info, cost per turn |
Control Commands

The writer also handles three control commands that manage connection lifecycle rather than data:

| Command | Behavior |
|---|---|
| ensure_db | Opens the DB connection lazily (no-op if already cached) |
| close_db | Deferred until after the current batch’s transaction commits, then evicts the handle |
| flush | Forces an immediate flush of all buffered commands and sends a flush_ack response |
Shutdown Protocol
Shutdown bypasses the buffer entirely: it flushes all pending commands, closes every cached database handle, and posts a shutdown_ack response. The WorkerManager enforces a 5-second timeout — if the worker does not acknowledge in time, it is forcibly terminated.
The Reader Worker
The reader worker (sqlite-reader.worker.ts) handles SELECT queries using a request/response pattern. Every request carries a requestId (a UUID generated by the main thread), and the response echoes it back for correlation.
Read Operations
| Request Type | SQL Pattern | Use Case |
|---|---|---|
| get_history | SELECT * FROM messages WHERE session_id = ? AND rowid > ? LIMIT ? | Paginated message history |
| get_events | SELECT * FROM events WHERE session_id = ? AND seq > ? LIMIT ? | Event replay after a given sequence |
| get_snapshot_messages | SELECT * FROM messages WHERE session_id = ? ORDER BY created_at DESC LIMIT ? | Recent messages for join snapshots |
The reader opens databases in read-only mode. If the writer has not yet created a database file for a session, the reader temporarily opens it in writable mode to run migrations, closes that handle, and then re-opens read-only. This avoids caching a writable handle in the reader’s LRU.
Error Handling
Every read operation is wrapped in a try/catch. On failure, the worker posts a typed ReaderErrorRes with the requestId and a safe error message. The main thread’s sendReaderRequest helper rejects the corresponding Effect.async callback, surfacing the error through the Effect pipeline.
WorkerManager: The Effect Layer
The WorkerManager is an Effect Context.Tag that provides a typed API for the main thread. It abstracts away the worker boundary entirely — consumers interact with methods like write(), readHistory(), and flush() without knowing that structured messages are being passed across threads.
- write() is synchronous and void-returning. The main thread posts the command and moves on. No backpressure, no acknowledgement. This is safe because the writer’s batching strategy ensures writes are committed promptly.
- readHistory(), readEvents(), and readSnapshotMessages() return Effect.Effect. Under the hood, each generates a UUID requestId, posts the request via postMessage, and suspends the current fiber with Effect.async until the reader responds.
- flush() awaits a flush_ack. It is the only writer command that blocks the caller.
- closeDb() closes handles in both workers. The writer receives a fire-and-forget close_db; the reader’s close_db is awaited for confirmation.
Both workers are spawned when the WorkerManagerLive layer is built. The first message sent to each is a string — the sessionsBaseDir path — which configures where they find SQLite files on disk.
Prepared Statement Cache
The PreparedStatements module provides a WeakMap-based cache that maps each Database handle to a Map<string, Statement>. Statements are prepared once per (db, key) pair and reused on every subsequent call, avoiding repeated SQL compilation on hot paths.
WeakMap keying is deliberate: when a Database handle is closed and evicted from the LRU cache, the entire statement map for that database becomes eligible for garbage collection. No manual cleanup is required beyond the evictStatements(db) call that the DbLruCache issues before closing each handle.
Database Handle LRU Cache
Both workers use DbLruCache to manage open database connections. The writer caches up to 128 handles; the reader caches up to 64. The cache uses Map insertion-order semantics for O(1) get, set, and eviction:

- Get touches the entry by deleting and re-inserting it (moving it to the end of iteration order)
- Set evicts the oldest entry (first in iteration order) if the cache is at capacity
- Evict calls evictStatements(db) to clear the prepared statement cache, then db.close()
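The Map-insertion-order behavior described above can be sketched as follows; the capacity and onEvict callback are illustrative parameters, not the actual DbLruCache signature:

```typescript
class LruCache<K, V> {
  private map = new Map<K, V>();

  constructor(private capacity: number, private onEvict: (v: V) => void) {}

  get(key: K): V | undefined {
    const v = this.map.get(key);
    if (v !== undefined) {
      // touch: delete + re-insert moves the entry to the end of iteration order
      this.map.delete(key);
      this.map.set(key, v);
    }
    return v;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) {
      this.map.delete(key);
    } else if (this.map.size >= this.capacity) {
      // evict the oldest entry: first in iteration order
      const oldest = this.map.entries().next().value!;
      this.map.delete(oldest[0]);
      this.onEvict(oldest[1]); // e.g. evictStatements(db), then db.close()
    }
    this.map.set(key, value);
  }
}
```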
Worker Protocol: Type Safety Across Threads
The worker-protocol.ts module defines discriminated unions for every message that crosses the worker boundary:
- Writer Commands
- Writer Responses
- Reader Requests
- Reader Responses
Every message interface uses readonly fields, and command types are string literal discriminants. This makes exhaustive switch statements in both workers fully type-checked at compile time.
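A rough illustration of the pattern; the field names below are assumptions based on the command tables above, not the actual worker-protocol.ts definitions:

```typescript
type WriterCommand =
  | { readonly type: "insert_event"; readonly sessionId: string; readonly seq: number }
  | { readonly type: "insert_message"; readonly sessionId: string; readonly text: string }
  | { readonly type: "flush" };

function describe(cmd: WriterCommand): string {
  switch (cmd.type) {
    case "insert_event":
      return `event #${cmd.seq} for ${cmd.sessionId}`;
    case "insert_message":
      return `message for ${cmd.sessionId}`;
    case "flush":
      return "flush request";
    default: {
      // Exhaustiveness check: adding a new command type makes this line
      // fail to compile until the switch handles it.
      const _never: never = cmd;
      return _never;
    }
  }
}
```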
Resource Budget Per Instance
Each Diminuendo instance enforces bounded resource consumption through carefully sized caches and rate limiters:

| Resource | Bound | Eviction Policy |
|---|---|---|
| Writer DB cache | 128 max open handles | LRU eviction |
| Reader DB cache | 64 max open handles | LRU eviction |
| Auth rate limiter | 10,000 IP entries | Periodic cleanup (60s interval) |
| Per-connection dedup buffer | 5,000 events | Per-connection, cleared on disconnect |
| Prepared statement cache | WeakMap per DB handle | GC’d when DB handle is evicted |
| Per-connection rate limit | 60 messages per 10s window | Sliding window, per connection |
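As an illustration of the last bound, a per-connection sliding-window limiter matching 60 messages per 10s window might look like the sketch below. This is a minimal example, not the actual implementation:

```typescript
class SlidingWindowLimiter {
  private times: number[] = [];

  constructor(private limit = 60, private windowMs = 10_000) {}

  allow(now: number = Date.now()): boolean {
    // prune timestamps that fell out of the window
    while (this.times.length > 0 && now - this.times[0] >= this.windowMs) {
      this.times.shift();
    }
    if (this.times.length >= this.limit) return false; // over budget
    this.times.push(now);
    return true;
  }
}
```

One such limiter per connection keeps the state small and means cleanup on disconnect is just dropping the object.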
Vertical Scaling Limits
Diminuendo runs on Bun’s single-threaded JavaScript event loop, with SQLite I/O offloaded to dedicated Web Workers. The practical bottlenecks for a single instance are:

- CPU for JSON serialization — every WebSocket message is JSON.parse’d on receipt and JSON.stringify’d on send. For high-throughput sessions with rapid text_delta events, this is the dominant CPU cost.
- SQLite write throughput — the writer worker batches commands (50ms or 100 commands, whichever comes first) and executes them within transactions. This sustains thousands of writes per second, but a single writer is ultimately serialized.
- WebSocket connection count — Bun’s event loop can handle thousands of concurrent WebSocket connections, but each connection consumes a file descriptor and a small amount of memory for its WsData state.
Performance Benchmark
Both Diminuendo and Crescendo connect to the same Podium (agent orchestrator) and Ensemble (LLM inference). Since agent processing time is constant across both gateways, the measured delta is the gateway overhead — the tax each architecture imposes on every request before the actual work begins.

All benchmarks run locally on the same machine with shared backends. 10 warmup iterations are discarded before measurement begins.
Test Environment
| Service | Port | Notes |
|---|---|---|
| Podium Gateway | :5083 | Shared — both gateways route here |
| Podium Coordinator | :5082 | Shared |
| Ensemble | :5180 | Shared |
| Crescendo | :8002 | Next.js on Bun (dev/turbo) |
| Diminuendo | :8080 | Bun + Effect TS |
Health Endpoint
100 iterations, 10 warmup
| | Diminuendo | Crescendo | Speedup |
|---|---|---|---|
| p50 | 0.6ms | 5.0ms | 8.4x faster |
| p95 | 1.1ms | 7.5ms | 6.8x faster |
| p99 | 1.4ms | 10.3ms | 7.3x faster |
| mean | 0.7ms | 5.6ms | 8.0x faster |
| stddev | 0.3ms | 1.6ms | 5.3x tighter |
| RPS | 10,390 | 291 | 35.7x throughput |
Connection and Authentication
20 iterations
| | Diminuendo | Crescendo | Speedup |
|---|---|---|---|
| p50 | 0.4ms | 5.5ms | 15.7x faster |
| p95 | 0.5ms | 8.5ms | 17.0x faster |
Crescendo’s connection path includes POST /api/e2e/seed, which requires a PostgreSQL upsert round-trip.
Session Creation
50 iterations, 10 warmup
| | Diminuendo | Crescendo | Speedup |
|---|---|---|---|
| p50 | 0.6ms | 17.7ms | 27.6x faster |
| p95 | 0.9ms | 24.8ms | 27.6x faster |
| p99 | 0.9ms | 51.9ms | 57.7x faster |
| mean | 0.7ms | 19.1ms | 27.3x faster |
| stddev | 0.1ms | 8.9ms | 89x less variance |
| min | 0.5ms | 10.9ms | |
| max | 0.9ms | 75.9ms | |
Summary
| Metric | Diminuendo | Crescendo | Advantage |
|---|---|---|---|
| Health p50 | 0.6ms | 5.0ms | 8.4x faster |
| Health RPS | 10,390 | 291 | 35.7x throughput |
| Auth/connect p50 | 0.4ms | 5.5ms | 15.7x faster |
| Session create p50 | 0.6ms | 17.7ms | 27.6x faster |
| Session create p95 | 0.9ms | 24.8ms | 27.6x faster |
| Session create jitter | 0.1ms stddev | 8.9ms stddev | 89x less variance |
Why Diminuendo Is Faster
Bun-native runtime
Bun’s native HTTP server with Effect TS, versus the Next.js middleware stack, eliminates approximately 4ms of per-request overhead.
WebSocket transport
Persistent connections eliminate per-request TCP handshakes and cookie parsing. Authentication is amortized to zero after the initial connect.
In-process SQLite
Zero-network writes save 10–15ms per database operation compared to PostgreSQL over TCP.
In-process pub/sub
Bun’s built-in publish/subscribe avoids the Redis network hop, saving 1–2ms per event.
Raw Backend Baselines
Direct health-check latency to the shared backends (50 iterations), for reference:

| Backend | p50 | p95 |
|---|---|---|
| Podium | 0.37ms | 0.76ms |
| Ensemble | 0.24ms | 0.39ms |
What Would Require Redis or PostgreSQL
The current architecture is designed for tenant-affinity routing, where each tenant is served by exactly one instance. Several capabilities would require shared infrastructure:

Cross-instance event fan-out

If a client connects to instance A but the session’s Podium events arrive on instance B (because the Podium connection was established there), instance A has no way to receive those events. A shared pub/sub layer (Redis Streams, NATS) would be needed to bridge events across instances.
Cross-instance session handoff
Moving an active session from one instance to another — for example, during a rolling deployment — currently requires the session to be deactivated and reactivated. A shared state store would enable live handoff without interrupting the Podium connection.
Global rate limiting
The auth rate limiter tracks attempts per IP address within a single instance. A coordinated attacker distributing attempts across instances would bypass per-instance limits. A shared rate limiter (Redis-backed sliding window) would provide global protection.
Shared billing ledger

Per-turn usage records live in each session’s local turn_usage table. Producing a single billing view across tenants served by different instances would require aggregating those records into a shared store. All of these capabilities would arrive as new Layer interfaces, not rewrites of the core logic.
Stale Recovery on Restart
When a gateway instance restarts — whether due to deployment, crash, or scaling event — it performs stale session recovery as part of its startup sequence:

Enumerate known tenants

The instance queries all known tenant IDs from the data/tenants/ directory, plus the default tenant (dev in dev mode, default otherwise).

Query non-idle sessions

For each tenant, the instance queries the registry database for sessions whose status is not inactive — these are sessions that were active when the previous process died.

Reset to inactive

Each stale session is reset to inactive. This is safe because Podium connections do not survive process death — the WebSocket to the Podium coordinator was severed when the process exited, and the compute instance has already been reclaimed or timed out.
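The three steps above can be sketched as follows; the registry interface and function names here are hypothetical, not the actual registry API:

```typescript
interface Registry {
  // sessions whose status is not `inactive`
  listStaleSessions(tenantId: string): string[];
  markInactive(tenantId: string, sessionId: string): void;
}

// Startup recovery pass: for every known tenant, reset sessions that were
// left non-inactive by the previous process. Safe because the previous
// process's Podium WebSocket died with it, so any such status is stale.
function recoverStaleSessions(tenantIds: string[], registry: Registry): number {
  let reset = 0;
  for (const tenantId of tenantIds) {
    for (const sessionId of registry.listStaleSessions(tenantId)) {
      registry.markInactive(tenantId, sessionId);
      reset++;
    }
  }
  return reset;
}
```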