ADR-024: PowerFill Async Run Pattern

Status

Proposed (PSSaaS Architect) — pending PO acceptance with the Phase 6e completion review.

Context

The PowerFill spec (§Run Execution Model line 247) requires runs to be asynchronous: POST /api/powerfill/run returns a run_id immediately and the engine executes in the background. PoC observation against PS_DemoData (6a-6d) shows a typical run takes 25-60 seconds even on the smaller dataset; production-scale tenants (50K+ loans) will likely take minutes. Holding an HTTP request thread for that duration is not viable — clients time out, retries cascade, the API pod becomes unhealthy under any concurrent load.

Sub-phases 6a-6d shipped with a synchronous best-effort POST /run (200 OK + RunResponse on success; 500 + RunResponse on failure) so the orchestration was independently verifiable before adding the async runtime. Sub-phase 6e converts the surface to a true async submission model.

PSSaaS has no precedent for background execution — every existing endpoint is request-scoped. Whatever pattern 6e adopts will set the path for any future PSSaaS module that needs background work (Phase 7 reports, Phase 8+ scheduled BestEx runs, Risk Manager batch jobs).

This ADR records the decision made by Phase 6 Open Question Q1 (PO-confirmed at planning checkpoint).

Option A: .NET hosted service + in-memory Channel<T> queue

IHostedService background worker; System.Threading.Channels.Channel<T> for the queue. The endpoint enqueues a RunJob record carrying the run id, captured tenant identity, and resolved options; the worker dequeues, opens a per-job DI scope, populates the scope's TenantContext from the job, and invokes PowerFillRunService.ExecuteResolvedAsync.
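A minimal sketch of the enqueue/dequeue shape, with a simplified RunJob and the run execution reduced to collecting ids (the names and signatures here are illustrative stand-ins, not the shipped types; the real worker opens a per-job DI scope before executing):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Immutable job record: everything the worker needs, captured on the request thread.
public sealed record RunJob(Guid RunId, string TenantId, string ConnectionString);

public static class OptionASketch
{
    // Endpoint side: enqueue the job and hand back the run id immediately.
    public static async Task<Guid> SubmitAsync(
        ChannelWriter<RunJob> writer, string tenantId, string connectionString)
    {
        var runId = Guid.NewGuid();
        await writer.WriteAsync(new RunJob(runId, tenantId, connectionString));
        return runId;
    }

    // Worker side: dequeue until the channel completes. The real BackgroundService
    // opens a per-job DI scope here before resolving PowerFillRunService.
    public static async Task<List<Guid>> DrainAsync(ChannelReader<RunJob> reader)
    {
        var executed = new List<Guid>();
        await foreach (var job in reader.ReadAllAsync())
            executed.Add(job.RunId);   // stand-in for ExecuteResolvedAsync
        return executed;
    }
}
```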

Pros:

  • Zero new infrastructure dependencies — Channel<T> is in the BCL, BackgroundService is in the standard hosting abstractions.
  • Fits the modular monolith model (ADR-004).
  • Cancellation works naturally via CancellationTokenSource keyed by run_id.
  • Simple to reason about; one process boundary.

Cons:

  • Pod restart mid-run = orphaned pfill_run_history row stuck in an active state.
  • No replay capability; the queue is volatile.
  • Single-pod safe only — multi-pod deployments would race on BR-8 enforcement (the SQL filtered unique index would still prevent two concurrent INSERTs but the queue itself wouldn't be shared).

Option B: .NET hosted service + DB-backed queue table

A pfill_job_queue row insert on POST /run; the hosted service polls the table; a single-row UPDATE-with-OUTPUT picks up the next job atomically.
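The atomic pickup could look like the following T-SQL, held here as a C# constant for illustration (the table and column names are assumptions, not shipped schema; real implementations often route through a CTE with ORDER BY to get FIFO ordering, since UPDATE TOP does not guarantee order):

```csharp
public static class JobQueuePickupSketch
{
    // Hypothetical T-SQL for atomic single-row pickup. UPDLOCK serializes
    // competing pollers on the candidate row; READPAST lets other pods skip
    // rows that are already locked instead of blocking behind them.
    public const string PickupSql = @"
UPDATE TOP (1) q
SET    q.status = 'Running', q.picked_up_at = SYSUTCDATETIME()
OUTPUT inserted.job_id, inserted.run_id, inserted.tenant_id
FROM   pfill_job_queue q WITH (UPDLOCK, READPAST)
WHERE  q.status = 'Queued';";
}
```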

Pros:

  • Pod restart safe — DB row survives.
  • Multi-pod compatible if the polling SELECT uses WITH (UPDLOCK, READPAST).
  • Queue state visible in DB.

Cons:

  • Polling latency vs cost trade-off.
  • Another PSSaaS-only table to maintain (pfill_job_queue plus the audit pfill_run_history).
  • More moving parts than the workload demands at v1.

Option C: RabbitMQ (already in the full Docker Compose profile)

Publish a job message on POST /run; subscriber background worker consumes.

Pros:

  • Production-grade durability + retry semantics.
  • Pub/sub for future consumers (Phase 7 reports could subscribe too).
  • Aligns with the eventual full profile shape.

Cons:

  • RabbitMQ is not in the dev profile today; the PSSaaS API doesn't currently take a dependency on it.
  • Multi-tenant routing requires careful exchange/queue design.
  • Significant new infrastructure surface for a single-pod dev/staging footprint.

Option D: Hangfire / Quartz.NET / similar OSS library

External library handles queue + worker + persistence + dashboard.

Pros:

  • Mature; dashboard, retries, scheduled jobs free.
  • Per-tenant DB persistence aligns with ADR-005.

Cons:

  • New dependency; opinionated APIs.
  • Hangfire Pro is paid; OSS Hangfire is per-DB and unfriendly to database-per-tenant (would need one Hangfire DB per tenant or one shared DB outside the tenant boundary).
  • Quartz.NET is heavier than the workload needs.

Decision

Option A — in-memory Channel<T> + BackgroundService.

Specifically

  • PowerFillRunQueue — process-singleton wrapping Channel.CreateBounded<RunJob>(new BoundedChannelOptions(capacity: 64) { FullMode = BoundedChannelFullMode.Wait }). Bounded capacity prevents runaway enqueue if the worker stalls; enqueue attempts against a full channel time out after 2 seconds and surface as 503 Service Unavailable (operator-facing latency matters more than indefinite blocking).
  • PowerFillRunBackgroundService — BackgroundService reading the channel; for every job it opens _serviceProvider.CreateScope(), populates the scope's mutable TenantContext from the captured RunJob, then resolves PowerFillRunService + PowerFillRunHistoryService from the per-job scope. The scoped TenantDbContext picks up the connection string in its OnConfiguring override.
  • PowerFillRunCancelRegistry — process-singleton ConcurrentDictionary<Guid, CancellationTokenSource>. The endpoint registers a CTS before enqueue; the cancel endpoint signals it; the worker honours it at step boundaries and via ExecuteSqlInterpolatedAsync(ct) propagation. The worker's finally block disposes + unregisters the CTS regardless of outcome.
  • PowerFillRunStartupReconciliationService — IHostedService running once at app startup. Iterates every known tenant in TenantRegistry; opens a per-tenant DI scope; marks any pfill_run_history rows still in an active state (left behind by a prior pod) as Failed with failure_step='startup_reconciliation'. This is the BR-8 unblocking mechanism after pod restart.
  • pfill_run_history filtered unique index (Q2 Option A) is the BR-8 enforcement vehicle — it survives pod restarts, channel resets, and any other in-process state loss. The in-memory queue is purely a work-distribution mechanism; the audit table is the system of record.
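The bounded-queue behaviour above could be sketched as follows — a simplified stand-alone class, not the shipped PowerFillRunQueue, with a bare run id standing in for the full RunJob:

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch: TryEnqueueAsync returns false when the channel stays
// full past the timeout, which the endpoint maps to 503 Service Unavailable.
public sealed class PowerFillRunQueueSketch
{
    private readonly Channel<Guid> _channel = Channel.CreateBounded<Guid>(
        new BoundedChannelOptions(capacity: 64) { FullMode = BoundedChannelFullMode.Wait });

    public ChannelReader<Guid> Reader => _channel.Reader;

    public async Task<bool> TryEnqueueAsync(Guid runId, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            // Blocks while the channel is full (FullMode.Wait), up to the timeout.
            await _channel.Writer.WriteAsync(runId, cts.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false; // caller translates into 503 rather than blocking forever
        }
    }
}
```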

Tenant-context propagation across the request → worker boundary

The critical seam is that TenantContext is request-scoped and mutable: the request thread reads the value populated by TenantMiddleware, while the worker thread runs outside any HTTP request scope. The worker handles this by:

  1. Capturing TenantId + ConnectionString into the RunJob record on the request thread (immutable copy).
  2. Per-job _serviceProvider.CreateScope() on the worker thread.
  3. Resolving the new scope's TenantContext and assigning the captured values BEFORE resolving PowerFillRunService.
  4. The scoped TenantDbContext resolves its connection string from the now-populated TenantContext at first DB access.

This is the standard ASP.NET Core background-worker tenant-propagation idiom. Multi-tenant isolation is preserved (per-job scope; no scope sharing across jobs).
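The four steps above can be sketched with hand-rolled stand-ins for the scoped TenantContext and DbContext (hypothetical shapes, assuming only what the ADR describes; the real code goes through _serviceProvider.CreateScope()):

```csharp
using System;

// Mutable per-scope tenant state, populated by middleware on the request path
// and by the worker on the background path.
public sealed class TenantContextSketch
{
    public string? TenantId { get; set; }
    public string? ConnectionString { get; set; }
}

// Mirrors the scoped TenantDbContext: the connection string is read lazily at
// first DB access, so TenantContext must be populated before any query runs.
public sealed class TenantDbContextSketch
{
    private readonly TenantContextSketch _ctx;
    public TenantDbContextSketch(TenantContextSketch ctx) => _ctx = ctx;

    public string ResolveConnection() =>
        _ctx.ConnectionString
        ?? throw new InvalidOperationException("TenantContext not populated");
}

public static class WorkerScopeSketch
{
    public static string ExecuteJob(string capturedTenantId, string capturedConnectionString)
    {
        // Steps 2-3: fresh per-job "scope"; populate the scoped TenantContext
        // from the immutable captures BEFORE constructing anything that reads it.
        var scopedCtx = new TenantContextSketch
        {
            TenantId = capturedTenantId,
            ConnectionString = capturedConnectionString
        };
        var db = new TenantDbContextSketch(scopedCtx);
        return db.ResolveConnection(); // step 4: lazy resolution now succeeds
    }
}
```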

Rationale

Phase 6e's job is to add the orchestration shape — async submission, audit row, BR-8/BR-9, GET endpoints, cancel — not to add infrastructure. Option A is the smallest viable change that satisfies the spec while preserving all the existing ADRs (ADR-005 database-per-tenant; ADR-020 single-replica AKS staging).

The pod-restart risk is mitigated by the startup reconciliation sweep + the SQL-side BR-8 index; the multi-pod concern is not currently a constraint per ADR-020.

If/when PSSaaS gets a second background-work consumer (Phase 7+ reports or Phase 8+ scheduled jobs), Option B (DB-backed queue) is the natural successor — it requires no API/contract change to the endpoints, just swapping the IPowerFillRunQueue reader/writer pair. Option C (RabbitMQ) is the right answer when PSSaaS adopts the full profile in production and has multiple consumers; revisit at that point.

Consequences

Positive

  • v1 scope is small (one new SQL artifact + ~6 new C# files + ~10 LOC endpoint refactor).
  • BR-8 is database-enforced (cannot drift from the queue).
  • BR-9 cleanup is bounded scope (7 user-facing tables; syn-trades + log preserved for forensics).
  • Cancel works via the standard CancellationToken chain; no special-case plumbing.
  • The contract surface is forward-compatible with Option B / Option C — the endpoints + the audit row + the cancel registry all stay; only the channel reader/writer pair would swap.

Negative

  • Single-pod safety only; multi-pod requires a different queue.
  • Pod restart abandons in-flight runs (mitigated but not eliminated by the reconciliation sweep).
  • No replay; the channel is volatile.

Operational

  • The startup reconciliation sweep is a small per-tenant cost at app startup (one indexed UPDATE per tenant). For dozens of tenants this is sub-second.
  • The 2s queue-saturation timeout is short on purpose — a saturated queue indicates the worker is stalled; a fast 503 lets clients retry rather than waiting.
  • BR-9 cleanup runs as a single SQL batch (7 DELETE FROM statements) wrapped in try/catch so cleanup failure doesn't mask the original run failure.
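The don't-mask-the-original-failure shape can be sketched as follows (a hypothetical helper, not the shipped code; the real batch runs the 7 DELETE statements where the `cleanup` delegate sits):

```csharp
using System;

public static class CleanupSketch
{
    // Runs cleanup after a failed run without letting a cleanup error replace
    // the original failure; the cleanup error is surfaced separately.
    public static (string FailureStep, string? CleanupError) FailWithCleanup(
        string originalFailureStep, Action cleanup)
    {
        try
        {
            cleanup(); // stand-in for the 7 DELETE FROM statements
            return (originalFailureStep, null);
        }
        catch (Exception ex)
        {
            // Original failure_step wins; the cleanup error is logged alongside it.
            return (originalFailureStep, ex.Message);
        }
    }
}
```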

Alternatives Reconsidered

These options remain viable for later phases when their trade-offs change:

  • Option B (DB-backed queue) — Phase 7 if a second background consumer arrives.
  • Option C (RabbitMQ) — when PSSaaS adopts the full profile in production AND a multi-pod deployment becomes a requirement.
  • Option D (Hangfire/Quartz.NET) — only if a scheduling requirement (cron-style) emerges that the BackgroundService model can't accommodate cleanly.

Future Considerations

  • Multi-pod: when PSSaaS runs multiple replicas, the in-memory Channel<T> is no longer correct. Option B is the natural migration path; Option C is the right answer if RabbitMQ is already operational.
  • Replay: Q3 Option B (input loan IDs in the audit row) is forensic; Q3 Option C (full input snapshot tables) is replay-capable. If parallel-validation against the Desktop App reveals a need to replay specific runs against historical conditions, the audit table can grow to support snapshots without changing the endpoint surface.
  • Scheduled runs (e.g. nightly PowerFill for a tenant): would require either a wrapper that submits to the existing endpoint on a schedule (cron-like) OR migration to Option D (Quartz.NET/Hangfire). Not a current requirement.
  • Per-tenant queue isolation: currently a single global channel; if a noisy-neighbour tenant emerges, switch to per-tenant channels with weighted scheduling.

Source References