ADR-024: PowerFill Async Run Pattern

Status

Proposed (PSSaaS Architect) — pending PO acceptance with the Phase 6e completion review.

Context

The PowerFill spec (§Run Execution Model line 247) requires runs to be asynchronous: POST /api/powerfill/run returns a run_id immediately and the engine executes in the background. PoC observation against PS_DemoData (6a-6d) shows a typical run takes 25-60 seconds even on the smaller dataset; production-scale tenants (50K+ loans) will likely take minutes. Holding an HTTP request thread for that duration is not viable — clients time out, retries cascade, the API pod becomes unhealthy under any concurrent load.

Sub-phases 6a-6d shipped with a synchronous best-effort POST /run (200 OK + RunResponse on success; 500 + RunResponse on failure) so the orchestration was independently verifiable before adding the async runtime. Sub-phase 6e converts the surface to a true async submission model.

PSSaaS has no precedent for background execution — every existing endpoint is request-scoped. Whatever pattern 6e adopts will set the path for any future PSSaaS module that needs background work (Phase 7 reports, Phase 8+ scheduled BestEx runs, Risk Manager batch jobs).

This ADR records the decision made by Phase 6 Open Question Q1 (PO-confirmed at planning checkpoint).

Option A: .NET hosted service + in-memory Channel<T> queue

IHostedService background worker; System.Threading.Channels.Channel<T> for the queue. The endpoint enqueues a RunJob record carrying the run id, captured tenant identity, and resolved options; the worker dequeues, opens a per-job DI scope, populates the scope's TenantContext from the job, and invokes PowerFillRunService.ExecuteResolvedAsync.
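A minimal sketch of the enqueue/dequeue shape, with a simplified RunJob and the run execution reduced to collecting ids (the names and signatures here are illustrative stand-ins, not the shipped types; the real worker opens a per-job DI scope before executing):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Immutable job record: everything the worker needs, captured on the request thread.
public sealed record RunJob(Guid RunId, string TenantId, string ConnectionString);

public static class OptionASketch
{
    // Endpoint side: enqueue the job and hand back the run id immediately.
    public static async Task<Guid> SubmitAsync(
        ChannelWriter<RunJob> writer, string tenantId, string connectionString)
    {
        var runId = Guid.NewGuid();
        await writer.WriteAsync(new RunJob(runId, tenantId, connectionString));
        return runId;
    }

    // Worker side: dequeue until the channel completes. The real BackgroundService
    // opens a per-job DI scope here before resolving PowerFillRunService.
    public static async Task<List<Guid>> DrainAsync(ChannelReader<RunJob> reader)
    {
        var executed = new List<Guid>();
        await foreach (var job in reader.ReadAllAsync())
            executed.Add(job.RunId);   // stand-in for ExecuteResolvedAsync
        return executed;
    }
}
```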

Pros:

  • Zero new infrastructure dependencies — Channel<T> is in the BCL, BackgroundService is in the standard hosting abstractions.
  • Fits the modular monolith model (ADR-004).
  • Cancellation works naturally via CancellationTokenSource keyed by run_id.
  • Simple to reason about; one process boundary.

Cons:

  • Pod restart mid-run = orphaned pfill_run_history row stuck in an active state.
  • No replay capability; the queue is volatile.
  • Single-pod safe only — multi-pod deployments would race on BR-8 enforcement (the SQL filtered unique index would still prevent two concurrent INSERTs but the queue itself wouldn't be shared).

Option B: .NET hosted service + DB-backed queue table

A pfill_job_queue row insert on POST /run; the hosted service polls the table; a single-row UPDATE-with-OUTPUT picks up the next job atomically.
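The atomic pickup could look like the following T-SQL, held here as a C# constant for illustration (the table and column names are assumptions, not shipped schema; real implementations often route through a CTE with ORDER BY to get FIFO ordering, since UPDATE TOP does not guarantee order):

```csharp
public static class JobQueuePickupSketch
{
    // Hypothetical T-SQL for atomic single-row pickup. UPDLOCK serializes
    // competing pollers on the candidate row; READPAST lets other pods skip
    // rows that are already locked instead of blocking behind them.
    public const string PickupSql = @"
UPDATE TOP (1) q
SET    q.status = 'Running', q.picked_up_at = SYSUTCDATETIME()
OUTPUT inserted.job_id, inserted.run_id, inserted.tenant_id
FROM   pfill_job_queue q WITH (UPDLOCK, READPAST)
WHERE  q.status = 'Queued';";
}
```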

Pros:

  • Pod restart safe — DB row survives.
  • Multi-pod compatible if the polling SELECT uses WITH (UPDLOCK, READPAST).
  • Queue state visible in DB.

Cons:

  • Polling latency vs cost trade-off.
  • Another PSSaaS-only table to maintain (pfill_job_queue plus the audit pfill_run_history).
  • More moving parts than the workload demands at v1.

Option C: RabbitMQ (already in the full Docker Compose profile)

Publish a job message on POST /run; subscriber background worker consumes.

Pros:

  • Production-grade durability + retry semantics.
  • Pub/sub for future consumers (Phase 7 reports could subscribe too).
  • Aligns with the eventual full profile shape.

Cons:

  • RabbitMQ is not in the dev profile today; the PSSaaS API doesn't currently take a dependency on it.
  • Multi-tenant routing requires careful exchange/queue design.
  • Significant new infrastructure surface for a single-pod dev/staging footprint.

Option D: Hangfire / Quartz.NET / similar OSS library

External library handles queue + worker + persistence + dashboard.

Pros:

  • Mature; dashboard, retries, scheduled jobs free.
  • Per-tenant DB persistence aligns with ADR-005.

Cons:

  • New dependency; opinionated APIs.
  • Hangfire Pro is paid; OSS Hangfire is per-DB and unfriendly to database-per-tenant (would need one Hangfire DB per tenant or one shared DB outside the tenant boundary).
  • Quartz.NET is heavier than the workload needs.

Decision

Option A — in-memory Channel<T> + BackgroundService.

Specifically

  • PowerFillRunQueue — process-singleton wrapping Channel.CreateBounded<RunJob>(new BoundedChannelOptions(capacity: 64) { FullMode = BoundedChannelFullMode.Wait }). Bounded capacity prevents runaway enqueue if the worker stalls; enqueue attempts against a full channel time out after 2 seconds and surface as 503 Service Unavailable (operator-facing latency matters more than indefinite blocking).
  • PowerFillRunBackgroundService — BackgroundService reading the channel; for every job it opens _serviceProvider.CreateScope(), populates the scope's mutable TenantContext from the captured RunJob, then resolves PowerFillRunService + PowerFillRunHistoryService from the per-job scope. The scoped TenantDbContext picks up the connection string in its OnConfiguring override.
  • PowerFillRunCancelRegistry — process-singleton ConcurrentDictionary<Guid, CancellationTokenSource>. The endpoint registers a CTS before enqueue; the cancel endpoint signals it; the worker honours it at step boundaries and via ExecuteSqlInterpolatedAsync(ct) propagation. The worker's finally block disposes + unregisters the CTS regardless of outcome.
  • PowerFillRunStartupReconciliationService — IHostedService running once at app startup. Iterates every known tenant in TenantRegistry; opens a per-tenant DI scope; marks any pfill_run_history rows still in an active state (left behind by a prior pod) as Failed with failure_step='startup_reconciliation'. This is the BR-8 unblocking mechanism after pod restart.
  • pfill_run_history filtered unique index (Q2 Option A) is the BR-8 enforcement vehicle — it survives pod restarts, channel resets, and any other in-process state loss. The in-memory queue is purely a work-distribution mechanism; the audit table is the system of record.
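The bounded-queue behaviour above could be sketched as follows — a simplified stand-alone class, not the shipped PowerFillRunQueue, with a bare run id standing in for the full RunJob:

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch: TryEnqueueAsync returns false when the channel stays
// full past the timeout, which the endpoint maps to 503 Service Unavailable.
public sealed class PowerFillRunQueueSketch
{
    private readonly Channel<Guid> _channel = Channel.CreateBounded<Guid>(
        new BoundedChannelOptions(capacity: 64) { FullMode = BoundedChannelFullMode.Wait });

    public ChannelReader<Guid> Reader => _channel.Reader;

    public async Task<bool> TryEnqueueAsync(Guid runId, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            // Blocks while the channel is full (FullMode.Wait), up to the timeout.
            await _channel.Writer.WriteAsync(runId, cts.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false; // caller translates into 503 rather than blocking forever
        }
    }
}
```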

Tenant-context propagation across the request → worker boundary

The critical seam is that TenantContext is request-scoped and mutable: the request thread reads the value populated by TenantMiddleware, while the worker thread runs outside any HTTP request scope. The worker handles this by:

  1. Capturing TenantId + ConnectionString into the RunJob record on the request thread (immutable copy).
  2. Per-job _serviceProvider.CreateScope() on the worker thread.
  3. Resolving the new scope's TenantContext and assigning the captured values BEFORE resolving PowerFillRunService.
  4. The scoped TenantDbContext resolves its connection string from the now-populated TenantContext at first DB access.

This is the standard ASP.NET Core background-worker tenant-propagation idiom. Multi-tenant isolation is preserved (per-job scope; no scope sharing across jobs).
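The four steps above can be sketched with hand-rolled stand-ins for the scoped TenantContext and DbContext (hypothetical shapes, assuming only what the ADR describes; the real code goes through _serviceProvider.CreateScope()):

```csharp
using System;

// Mutable per-scope tenant state, populated by middleware on the request path
// and by the worker on the background path.
public sealed class TenantContextSketch
{
    public string? TenantId { get; set; }
    public string? ConnectionString { get; set; }
}

// Mirrors the scoped TenantDbContext: the connection string is read lazily at
// first DB access, so TenantContext must be populated before any query runs.
public sealed class TenantDbContextSketch
{
    private readonly TenantContextSketch _ctx;
    public TenantDbContextSketch(TenantContextSketch ctx) => _ctx = ctx;

    public string ResolveConnection() =>
        _ctx.ConnectionString
        ?? throw new InvalidOperationException("TenantContext not populated");
}

public static class WorkerScopeSketch
{
    public static string ExecuteJob(string capturedTenantId, string capturedConnectionString)
    {
        // Steps 2-3: fresh per-job "scope"; populate the scoped TenantContext
        // from the immutable captures BEFORE constructing anything that reads it.
        var scopedCtx = new TenantContextSketch
        {
            TenantId = capturedTenantId,
            ConnectionString = capturedConnectionString
        };
        var db = new TenantDbContextSketch(scopedCtx);
        return db.ResolveConnection(); // step 4: lazy resolution now succeeds
    }
}
```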

Rationale

Phase 6e's job is to add the orchestration shape — async submission, audit row, BR-8/BR-9, GET endpoints, cancel — not to add infrastructure. Option A is the smallest viable change that satisfies the spec while preserving all the existing ADRs (ADR-005 database-per-tenant; ADR-020 single-replica AKS staging).

The pod-restart risk is mitigated by the startup reconciliation sweep + the SQL-side BR-8 index; the multi-pod concern is not currently a constraint per ADR-020.

If/when PSSaaS gets a second background-work consumer (Phase 7+ reports or Phase 8+ scheduled jobs), Option B (DB-backed queue) is the natural successor — it requires no API/contract change to the endpoints, just swapping the IPowerFillRunQueue reader/writer pair. Option C (RabbitMQ) is the right answer when PSSaaS adopts the full profile in production and has multiple consumers; revisit at that point.

Consequences

Positive

  • v1 scope is small (one new SQL artifact + ~6 new C# files + ~10 LOC endpoint refactor).
  • BR-8 is database-enforced (cannot drift from the queue).
  • BR-9 cleanup is bounded scope (7 user-facing tables; syn-trades + log preserved for forensics).
  • Cancel works via the standard CancellationToken chain; no special-case plumbing.
  • The contract surface is forward-compatible with Option B / Option C — the endpoints + the audit row + the cancel registry all stay; only the channel reader/writer pair would swap.

Negative

  • Single-pod safety only; multi-pod requires a different queue.
  • Pod restart abandons in-flight runs (mitigated but not eliminated by the reconciliation sweep).
  • No replay; the channel is volatile.

Operational

  • The startup reconciliation sweep is a small per-tenant cost at app startup (one indexed UPDATE per tenant). For dozens of tenants this is sub-second.
  • The 2s queue-saturation timeout is short on purpose — a saturated queue indicates the worker is stalled; a fast 503 lets clients retry rather than waiting.
  • BR-9 cleanup runs as a single SQL batch (7 DELETE FROM statements) wrapped in try/catch so cleanup failure doesn't mask the original run failure.
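The don't-mask-the-original-failure shape can be sketched as follows (a hypothetical helper, not the shipped code; the real batch runs the 7 DELETE statements where the `cleanup` delegate sits):

```csharp
using System;

public static class CleanupSketch
{
    // Runs cleanup after a failed run without letting a cleanup error replace
    // the original failure; the cleanup error is surfaced separately.
    public static (string FailureStep, string? CleanupError) FailWithCleanup(
        string originalFailureStep, Action cleanup)
    {
        try
        {
            cleanup(); // stand-in for the 7 DELETE FROM statements
            return (originalFailureStep, null);
        }
        catch (Exception ex)
        {
            // Original failure_step wins; the cleanup error is logged alongside it.
            return (originalFailureStep, ex.Message);
        }
    }
}
```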

Alternatives Reconsidered

These options remain viable for later phases when their trade-offs change:

  • Option B (DB-backed queue) — Phase 7 if a second background consumer arrives.
  • Option C (RabbitMQ) — when PSSaaS adopts the full profile in production AND a multi-pod deployment becomes a requirement.
  • Option D (Hangfire/Quartz.NET) — only if a scheduling requirement (cron-style) emerges that the BackgroundService model can't accommodate cleanly.

Future Considerations

  • Multi-pod: when PSSaaS runs multiple replicas, the in-memory Channel<T> is no longer correct. Option B is the natural migration path; Option C is the right answer if RabbitMQ is already operational.
  • Replay: Q3 Option B (input loan IDs in the audit row) is forensic; Q3 Option C (full input snapshot tables) is replay-capable. If parallel-validation against the Desktop App reveals a need to replay specific runs against historical conditions, the audit table can grow to support snapshots without changing the endpoint surface.
  • Scheduled runs (e.g. nightly PowerFill for a tenant): would require either a wrapper that submits to the existing endpoint on a schedule (cron-like) OR migration to Option D (Quartz.NET/Hangfire). Not a current requirement.
  • Per-tenant queue isolation: currently a single global channel; if a noisy-neighbour tenant emerges, switch to per-tenant channels with weighted scheduling.

Source References