# ADR-024: PowerFill Async Run Pattern
## Status
Proposed (PSSaaS Architect) — pending PO acceptance with the Phase 6e completion review.
## Context
The PowerFill spec (§Run Execution Model line 247) requires runs to be asynchronous: `POST /api/powerfill/run` returns a `run_id` immediately and the engine executes in the background. PoC observation against PS_DemoData (6a-6d) shows a typical run takes 25-60 seconds even on the smaller dataset; production-scale tenants (50K+ loans) will likely take minutes. Holding an HTTP request thread for that duration is not viable — clients time out, retries cascade, and the API pod becomes unhealthy under any concurrent load.
Sub-phases 6a-6d shipped with a synchronous best-effort `POST /run` (`200 OK` + `RunResponse` on success; `500` + `RunResponse` on failure) so the orchestration was independently verifiable before adding the async runtime. Sub-phase 6e converts the surface to a true async submission model.
PSSaaS has no precedent for background execution — every existing endpoint is request-scoped. Whatever pattern 6e adopts will set the path for any future PSSaaS module that needs background work (Phase 7 reports, Phase 8+ scheduled BestEx runs, Risk Manager batch jobs).
This ADR records the decision made by Phase 6 Open Question Q1 (PO-confirmed at planning checkpoint).
## Option A: .NET hosted service + in-memory `Channel<T>` queue
An `IHostedService` background worker with a `System.Threading.Channels.Channel<T>` queue. The endpoint enqueues a `RunJob` record carrying the `run_id`, the captured tenant identity, and the resolved options; the worker dequeues, opens a per-job DI scope, populates the scope's `TenantContext` from the job, and invokes `PowerFillRunService.ExecuteResolvedAsync`.
Pros:
- Zero new infrastructure dependencies — `Channel<T>` is in the BCL, `BackgroundService` is in the standard hosting abstractions.
- Fits the modular monolith model (ADR-004).
- Cancellation works naturally via a `CancellationTokenSource` keyed by `run_id`.
- Simple to reason about; one process boundary.
Cons:
- Pod restart mid-run = an orphaned `pfill_run_history` row stuck in an active state.
- No replay capability; the queue is volatile.
- Single-pod safe only — multi-pod deployments would race on BR-8 enforcement (the SQL filtered unique index would still prevent two concurrent INSERTs, but the queue itself wouldn't be shared).
## Option B: .NET hosted service + DB-backed queue table
A `pfill_job_queue` row is inserted on `POST /run`; the hosted service polls the table; a single-row `UPDATE` with an `OUTPUT` clause picks up the next job atomically.
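That atomic pickup can be sketched as a single T-SQL statement held in a constant. This is illustrative only — the `pfill_job_queue` column names (`status`, `picked_at_utc`, `job_id`) are assumptions, not a shipped schema:

```csharp
// Illustrative queue-pickup statement for a hypothetical pfill_job_queue schema.
// UPDLOCK locks the chosen row for this transaction; READPAST skips rows another
// poller has already locked, so two workers can never claim the same job.
// The OUTPUT clause returns the claimed row in the same round trip.
const string PickNextJobSql = @"
UPDATE TOP (1) q
SET    q.status        = 'Running',
       q.picked_at_utc = SYSUTCDATETIME()
OUTPUT inserted.job_id, inserted.run_id, inserted.tenant_id
FROM   pfill_job_queue q WITH (UPDLOCK, READPAST)
WHERE  q.status = 'Pending';";
```

Note that `UPDATE TOP (1)` does not guarantee FIFO order; a production version would select the oldest pending row explicitly.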
Pros:
- Pod restart safe — DB row survives.
- Multi-pod compatible if the polling query uses `WITH (UPDLOCK, READPAST)`.
- Queue state is visible in the DB.
Cons:
- Polling latency vs cost trade-off.
- Another PSSaaS-only table to maintain (`pfill_job_queue`, in addition to the audit table `pfill_run_history`).
- More moving parts than the workload demands at v1.
## Option C: RabbitMQ (already in the `full` Docker Compose profile)
Publish a job message on `POST /run`; a subscriber background worker consumes it.
Pros:
- Production-grade durability + retry semantics.
- Pub/sub for future consumers (Phase 7 reports could subscribe too).
- Aligns with the eventual `full` profile shape.
Cons:
- RabbitMQ is not in the `dev` profile today; the PSSaaS API doesn't currently take a dependency on it.
- Multi-tenant routing requires careful exchange/queue design.
- Significant new infrastructure surface for a single-pod dev/staging footprint.
## Option D: Hangfire / Quartz.NET / similar OSS library
External library handles queue + worker + persistence + dashboard.
Pros:
- Mature; dashboard, retries, and scheduled jobs come for free.
- Per-tenant DB persistence aligns with ADR-005.
Cons:
- New dependency; opinionated APIs.
- Hangfire Pro is paid; OSS Hangfire is per-DB and unfriendly to database-per-tenant (would need one Hangfire DB per tenant or one shared DB outside the tenant boundary).
- Quartz.NET is heavier than the workload needs.
## Decision
Option A — in-memory `Channel<T>` + `BackgroundService`.
Specifically:

- `PowerFillRunQueue` — process-singleton wrapping `Channel.CreateBounded<RunJob>(capacity: 64, FullMode: Wait)`. Bounded capacity prevents runaway enqueue if the worker stalls; full-channel attempts surface as `503 Service Unavailable` after a 2-second timeout (operator-facing latency matters more than indefinite blocking).
- `PowerFillRunBackgroundService` — a `BackgroundService` reading the channel. For every job it opens `_serviceProvider.CreateScope()`, populates the scope's mutable `TenantContext` from the captured `RunJob`, then resolves `PowerFillRunService` + `PowerFillRunHistoryService` from the per-job scope. The scoped `TenantDbContext` picks up the connection string in its `OnConfiguring` override.
- `PowerFillRunCancelRegistry` — process-singleton `ConcurrentDictionary<Guid, CancellationTokenSource>`. The endpoint registers a CTS before enqueue; the cancel endpoint signals it; the worker honours it at step boundaries and via `ExecuteSqlInterpolatedAsync(ct)` propagation. The worker's finally block disposes and unregisters the CTS regardless of outcome.
- `PowerFillRunStartupReconciliationService` — an `IHostedService` that runs once at app startup. It iterates every known tenant in `TenantRegistry`, opens a per-tenant DI scope, and marks any `pfill_run_history` rows still in an active state (left behind by a prior pod) as `Failed` with `failure_step = 'startup_reconciliation'`. This is the BR-8 unblocking mechanism after pod restart.
- The `pfill_run_history` filtered unique index (Q2 Option A) is the BR-8 enforcement vehicle — it survives pod restarts, channel resets, and any other in-process state loss. The in-memory queue is purely a work-distribution mechanism; the audit table is the system of record.
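A minimal sketch of the queue wrapper described above, assuming the 2-second saturation timeout is implemented with a `CancellationTokenSource`. The `RunJob` shape and the `TryEnqueueAsync` method name are illustrative, not the shipped API:

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical job record: the immutable tenant capture taken on the request thread.
public sealed record RunJob(Guid RunId, string TenantId, string ConnectionString);

public sealed class PowerFillRunQueue
{
    // Bounded at 64 with FullMode = Wait: writers block (briefly) rather than drop jobs.
    private readonly Channel<RunJob> _channel = Channel.CreateBounded<RunJob>(
        new BoundedChannelOptions(capacity: 64) { FullMode = BoundedChannelFullMode.Wait });

    public ChannelReader<RunJob> Reader => _channel.Reader;

    // Returns false when the channel stays full for 2 s; the endpoint maps false to 503.
    public async Task<bool> TryEnqueueAsync(RunJob job)
    {
        using var timeout = new CancellationTokenSource(TimeSpan.FromSeconds(2));
        try
        {
            await _channel.Writer.WriteAsync(job, timeout.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false; // saturated: worker is stalled or far behind
        }
    }
}
```

The fast-fail mapping keeps the endpoint contract honest: a saturated queue is an operational signal, not something to hide behind a hung request.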
### Tenant-context propagation across the request → worker boundary
The critical seam: `TenantContext` is request-scoped and mutable. The request thread reads it from the `TenantMiddleware`-populated value, but the worker thread runs outside any HTTP request scope. The worker handles this by:
- Capturing `TenantId` + `ConnectionString` into the `RunJob` record on the request thread (an immutable copy).
- Opening a per-job `_serviceProvider.CreateScope()` on the worker thread.
- Resolving the new scope's `TenantContext` and assigning the captured values BEFORE resolving `PowerFillRunService`.
- Letting the scoped `TenantDbContext` resolve its connection string from the now-populated `TenantContext` at first DB access.
This is the standard ASP.NET Core background-worker tenant-propagation idiom. Multi-tenant isolation is preserved (per-job scope; no scope sharing across jobs).
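The idiom can be sketched as follows, assuming the standard `Microsoft.Extensions.Hosting`/`Microsoft.Extensions.DependencyInjection` packages. The settable `TenantContext` properties, the `ExecuteResolvedAsync` signature, and the queue's `Reader` member are assumptions; the block leans on the application types named in this ADR, so it is a wiring sketch rather than standalone code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

public sealed class PowerFillRunBackgroundService : BackgroundService
{
    private readonly PowerFillRunQueue _queue;          // process-singleton channel wrapper
    private readonly IServiceProvider _serviceProvider; // root provider, used only to open scopes

    public PowerFillRunBackgroundService(PowerFillRunQueue queue, IServiceProvider serviceProvider)
        => (_queue, _serviceProvider) = (queue, serviceProvider);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await foreach (var job in _queue.Reader.ReadAllAsync(stoppingToken))
        {
            // One DI scope per job: no scope sharing across jobs or tenants.
            using var scope = _serviceProvider.CreateScope();

            // Assign the captured tenant identity BEFORE resolving anything that
            // depends on it (PowerFillRunService, the scoped TenantDbContext).
            var tenantContext = scope.ServiceProvider.GetRequiredService<TenantContext>();
            tenantContext.TenantId = job.TenantId;
            tenantContext.ConnectionString = job.ConnectionString;

            var runService = scope.ServiceProvider.GetRequiredService<PowerFillRunService>();
            await runService.ExecuteResolvedAsync(job.RunId, stoppingToken);
        }
    }
}
```

The ordering comment is the load-bearing part: resolving `PowerFillRunService` first would capture an empty `TenantContext`.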
## Rationale
Phase 6e's job is to add the orchestration shape — async submission, audit row, BR-8/BR-9, GET endpoints, cancel — not to add infrastructure. Option A is the smallest viable change that satisfies the spec while preserving all the existing ADRs (ADR-005 database-per-tenant; ADR-020 single-replica AKS staging).
The pod-restart risk is mitigated by the startup reconciliation sweep + the SQL-side BR-8 index; the multi-pod concern is not currently a constraint per ADR-020.
If/when PSSaaS gets a second background-work consumer (Phase 7+ reports or Phase 8+ scheduled jobs), Option B (DB-backed queue) is the natural successor — it requires no API/contract change to the endpoints, just swapping the `IPowerFillRunQueue` reader/writer pair. Option C (RabbitMQ) is the right answer when PSSaaS adopts the `full` profile in production and has multiple consumers; revisit at that point.
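The swap seam can be made explicit with an interface along these lines (a sketch; the member shapes are assumptions, not the shipped contract):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical seam: Option A binds this to the in-memory Channel<T> wrapper;
// Option B would bind it to a pfill_job_queue-backed implementation.
// The endpoints and the audit row never see the difference.
public interface IPowerFillRunQueue
{
    // Writer side (endpoint): false = saturated, mapped to 503.
    Task<bool> TryEnqueueAsync(RunJob job, CancellationToken ct);

    // Reader side (background worker): one job at a time until shutdown.
    IAsyncEnumerable<RunJob> DequeueAllAsync(CancellationToken ct);
}
```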
## Consequences
### Positive
- v1 scope is small (one new SQL artifact + ~6 new C# files + ~10 LOC endpoint refactor).
- BR-8 is database-enforced (cannot drift from the queue).
- BR-9 cleanup is bounded scope (7 user-facing tables; syn-trades + log preserved for forensics).
- Cancel works via the standard `CancellationToken` chain; no special-case plumbing.
- The contract surface is forward-compatible with Option B / Option C — the endpoints, the audit row, and the cancel registry all stay; only the channel reader/writer pair would swap.
### Negative
- Single-pod safety only; multi-pod requires a different queue.
- Pod restart abandons in-flight runs (mitigated but not eliminated by the reconciliation sweep).
- No replay; the channel is volatile.
### Operational
- The startup reconciliation sweep is a small per-tenant cost at app startup (one indexed UPDATE per tenant). For dozens of tenants this is sub-second.
- The 2s queue-saturation timeout is short on purpose — a saturated queue indicates the worker is stalled; a fast 503 lets clients retry rather than waiting.
- BR-9 cleanup runs as a single SQL batch (7 `DELETE FROM` statements) wrapped in try/catch so a cleanup failure doesn't mask the original run failure.
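The per-tenant startup sweep reduces to one indexed statement; a sketch as a T-SQL constant (the set of active status values, `'Queued'`/`'Running'`, is an assumption):

```csharp
// Illustrative reconciliation statement run once per tenant at startup.
// failure_step = 'startup_reconciliation' marks rows abandoned by a prior pod;
// with the BR-8 filtered unique index on active rows, this UPDATE is cheap.
const string ReconcileOrphanedRunsSql = @"
UPDATE pfill_run_history
SET    status       = 'Failed',
       failure_step = 'startup_reconciliation'
WHERE  status IN ('Queued', 'Running');";
```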
## Alternatives Reconsidered
These options remain viable for later phases when their trade-offs change:
- Option B (DB-backed queue) — Phase 7 if a second background consumer arrives.
- Option C (RabbitMQ) — when PSSaaS adopts the `full` profile in production AND a multi-pod deployment becomes a requirement.
- Option D (Hangfire/Quartz.NET) — only if a scheduling requirement (cron-style) emerges that the `BackgroundService` model can't accommodate cleanly.
## Future Considerations
- Multi-pod: when PSSaaS runs multiple replicas, the in-memory `Channel<T>` is no longer correct. Option B is the natural migration path; Option C is the right answer if RabbitMQ is already operational.
- Replay: Q3 Option B (input loan IDs in the audit row) is forensic; Q3 Option C (full input snapshot tables) is replay-capable. If parallel validation against the Desktop App reveals a need to replay specific runs against historical conditions, the audit table can grow to support snapshots without changing the endpoint surface.
- Scheduled runs (e.g. nightly PowerFill for a tenant): would require either a wrapper that submits to the existing endpoint on a schedule (cron-like) OR migration to Option D (Quartz.NET/Hangfire). Not a current requirement.
- Per-tenant queue isolation: currently a single global channel; if a noisy-neighbour tenant emerges, switch to per-tenant channels with weighted scheduling.
## Related ADRs
- ADR-004: Modular Monolith First — async runtime stays inside the monolith
- ADR-005: Database-per-Tenant — every queued job carries its tenant connection string
- ADR-020: Shared Kubernetes Cluster with PSX — single-replica deployment makes Option A safe
- ADR-021: PowerFill Port Strategy — async runtime wraps the Phase 6a-6d synchronous orchestrator without modifying it
## Source References
- PowerFill Engine Spec §Run Execution Model — async requirement
- PowerFill Engine Spec §Audit Trail — `pfill_run_history` columns
- PowerFill Engine Spec BR-8 — single active run per tenant
- PowerFill Engine Spec BR-9 — failure-state semantics
- Phase 6 Open Questions Q1 — PO-confirmed Option A
- Phase 6 Open Questions Q2 — PO-confirmed Option A (filtered unique index)
- Phase 6 Open Questions Q7 — PO-confirmed Option B (clear with error markers)