Score it twice — Field Notes

The platform looks, from the outside, like a hundred other rating systems. A Vienna-based international organization runs an institutional self-assessment: roughly sixty member agencies measure themselves, year over year, against about twenty thematic indicators — some hundred and twenty structured questions in all. Responses pass through a multi-stage human review, then a deterministic, config-driven engine scores each indicator on a 0–100 scale. Only anonymized cohort averages and medians are ever published, and only once a minimum number of agencies have reported. React 19 and Vite on the front; Express, Prisma, and PostgreSQL behind it; Azure App Service holding it up.

The interesting part isn’t the stack. It’s a quiet structural fact that breaks the obvious design: you cannot score one agency by looking only at that agency. A meaningful share of the indicators are defined relative to everyone else, and that single property is the difference between a scoring engine that works and one that deadlocks the first time you try to finalize a number before the cohort is in.

The formula zoo, and the one hard class

Before the cohort problem, there’s a more mundane one: the indicators don’t share a formula. They barely share a shape. What looks like “score each indicator 0–100” is really a zoo of unrelated calculations wearing the same uniform:

Option means. Qualitative answers mapped to fixed points — 0, 50, 100 — then averaged. A “mostly” is worth exactly fifty, by decree, not by feel.
Ratios. One count over another, scaled: adopted ÷ issued × 100. Clean until the denominator is zero, which it sometimes is.
Banded thresholds. A raw value falls into a band, and the band carries the score. Below 20% is one tier; 20–50% another; the boundaries are policy, not math.
Hard gates. A single answer can force the whole indicator to 0, regardless of everything else around it. Some failures aren’t averaged away; they’re disqualifying.
Inverted bands. The trap. For a handful of indicators, a rate near 100% is a red flag rather than excellence — a number that high means something is being over-reported, not done well. High is bad, and the band table has to know it.

All of those, awkward as they are, share one redeeming property: they are local. Give me one agency’s answers and I can compute every one of them, alone, in a pure function, with no knowledge of any other agency in the world.
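A minimal sketch of what "local" means in practice, assuming hypothetical names and illustrative band tables (the option labels, tier scores, and the 95% cut-off are invented for the example, not taken from the platform). Each function sees one agency's answers and nothing else:

```typescript
// Hypothetical sketch of the local formula classes. null means "cannot be computed".
type Score = number | null;

// Option mean: qualitative answers mapped to fixed points, then averaged.
// The labels "no" and "yes" are assumptions; only the 0/50/100 points come from the text.
const OPTION_POINTS: Record<string, number> = { no: 0, mostly: 50, yes: 100 };

function optionMean(answers: string[]): Score {
  if (answers.length === 0) return null;
  let total = 0;
  for (const answer of answers) {
    const points = OPTION_POINTS[answer];
    if (points === undefined) return null; // unknown option: refuse to guess
    total += points;
  }
  return total / answers.length;
}

// Ratio: one count over another, scaled to 0-100.
function ratio(numerator: number, denominator: number): Score {
  if (denominator === 0) return null; // assumption: no score rather than a fake 0
  return (numerator / denominator) * 100;
}

// Banded thresholds: the first band whose upper bound exceeds the value wins.
interface Band { upTo: number; score: number }

function bandLookup(value: number, bands: Band[]): Score {
  for (const band of bands) {
    if (value < band.upTo) return band.score;
  }
  return null;
}

// Illustrative tables only. The inverted one scores a near-100% rate as 0.
const RISING: Band[] = [
  { upTo: 20, score: 0 },
  { upTo: 50, score: 50 },
  { upTo: Infinity, score: 100 },
];
const INVERTED: Band[] = [
  { upTo: 50, score: 50 },
  { upTo: 95, score: 100 },
  { upTo: Infinity, score: 0 },
];

// Hard gate: one disqualifying answer forces the indicator to 0.
function withHardGate(gateFailed: boolean, score: Score): Score {
  return gateFailed ? 0 : score;
}
```

Note that the inverted band needs no special code path: it is the same lookup over a differently ordered table, which is why "high is bad" can live in data rather than in a branch.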

Then there is the hard class, and it does not have that property. A subset of indicators — the year-over-year trend measures — are min-max scaled across the entire cohort:

s = (value − min) / (max − min) × 100

Read that formula honestly and the problem is unavoidable. The min and max are not constants. They are the smallest and largest values across every agency that year. The score for agency A is a function of agency B’s data, and C’s, and all the rest. You cannot compute it for one subject in isolation, because the formula is, by definition, about the group. Roughly a third of the indicators live here. That is not an edge case you can shrug off; it’s a third of the product.
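To make the dependency concrete, here is a small sketch of the formula with made-up cohort values. The function name and the handling of a flat cohort (returning null when max equals min) are assumptions for illustration:

```typescript
// Hypothetical sketch: min-max scaling of one value against a whole cohort.
function minMaxScale(value: number, cohort: number[]): number | null {
  if (cohort.length === 0) return null;
  const min = Math.min(...cohort);
  const max = Math.max(...cohort);
  if (max === min) return null; // assumption: a flat cohort has no defined scale
  return ((value - min) / (max - min)) * 100;
}

// Agency A reports 5 in both cases. Only another agency's value changes.
const before = minMaxScale(5, [2, 5, 8]);  // (5 - 2) / (8 - 2) * 100 = 50
const after = minMaxScale(5, [2, 5, 14]);  // (5 - 2) / (14 - 2) * 100 = 25
```

Agency A's answer never moved, yet its score halved. That is the whole argument for not finalizing this class of score until the cohort is complete.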

Two phases, because there are honestly two problems

The instinct is to write one big scoring function that takes a submission and returns a finished result. It fails in a specific, expensive way: the moment it reaches a cohort-relative indicator it has no min and no max, so it either blocks or invents a number. Both are worse than admitting the truth — that the work splits cleanly into two passes that run at two different times.

Phase one is pure. No database, no cohort, no I/O: just a function from one subject’s answers to a partially-scored tree. It resolves every local leaf using the zoo above, then rolls those leaves up the indicator hierarchy (leaf to sub-indicator to indicator to section) by equal-weighted or weighted mean, depending on the node. For the cohort-relative indicators, phase one does the one honest thing it can: it computes the raw input value, stores it, and marks the score itself as deferred with an explicit NO_DATA status. It does not guess. A raw trend value is never written into the score column as if it were a finished, normalized score. The slot is visibly empty, on purpose.
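One way to sketch the phase-one output, with hypothetical type and field names. Only the NO_DATA status comes from the text. The rule that a parent stays unresolved while any descendant is deferred is an assumption made for this example; the real engine may roll up differently:

```typescript
// Hypothetical sketch of a partially-scored tree.
type LeafResult =
  | { status: "SCORED"; score: number }
  | { status: "NO_DATA"; rawValue: number | null }; // deferred: raw input kept, score slot empty

interface ScoreNode {
  id: string;
  weight?: number;        // omitted weight counts as 1, giving an equal-weighted mean
  leaf?: LeafResult;
  children?: ScoreNode[];
}

// Pure roll-up: weighted mean of children, or null if anything below is unresolved.
function rollUp(node: ScoreNode): number | null {
  if (node.leaf) {
    return node.leaf.status === "SCORED" ? node.leaf.score : null;
  }
  const children = node.children ?? [];
  if (children.length === 0) return null;
  let weighted = 0;
  let totalWeight = 0;
  for (const child of children) {
    const score = rollUp(child);
    if (score === null) return null; // assumption: never average around a hole
    const w = child.weight ?? 1;
    weighted += score * w;
    totalWeight += w;
  }
  return totalWeight === 0 ? null : weighted / totalWeight;
}

const localOnly: ScoreNode = {
  id: "indicator-a",
  children: [
    { id: "a1", weight: 1, leaf: { status: "SCORED", score: 40 } },
    { id: "a2", weight: 3, leaf: { status: "SCORED", score: 100 } },
  ],
};

const withTrend: ScoreNode = {
  id: "indicator-b",
  children: [
    { id: "b1", leaf: { status: "SCORED", score: 70 } },
    { id: "b2", leaf: { status: "NO_DATA", rawValue: 5 } }, // waits for phase two
  ],
};
```

Here `rollUp(localOnly)` is (40 × 1 + 100 × 3) / 4 = 85, while `rollUp(withTrend)` is null. The raw value 5 is preserved on the leaf but never appears anywhere a score is expected.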

Phase two runs after the cohort is in. Now the missing context exists. Phase two pools the deferred raw values per [year, indicator], finds the min and max across that pool, applies the min-max scale, writes the finalized scores into the slots phase one left open, and recomputes the global aggregates that get published. Two passes, two moments, two responsibilities. Each one is small enough to hold in your head and dull enough to test exhaustively.
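The second pass can be sketched as a pure function over the deferred rows. The row shape, the pool key format, and the null result for a flat pool are assumptions; the published aggregates are left out to keep the example short:

```typescript
// Hypothetical sketch of phase two: pool per [year, indicator], then min-max scale.
interface DeferredRow {
  subject: string;
  year: number;
  indicator: string;
  rawValue: number;
  score: number | null; // null until phase two fills it
}

function finalizeCohort(rows: DeferredRow[]): DeferredRow[] {
  // Group the raw values by year and indicator.
  const pools = new Map<string, number[]>();
  for (const row of rows) {
    const key = `${row.year}:${row.indicator}`;
    const pool = pools.get(key) ?? [];
    pool.push(row.rawValue);
    pools.set(key, pool);
  }
  // Scores depend only on raw values, so running this twice changes nothing.
  return rows.map((row) => {
    const pool = pools.get(`${row.year}:${row.indicator}`) ?? [];
    const min = Math.min(...pool);
    const max = Math.max(...pool);
    const score = max === min ? null : ((row.rawValue - min) / (max - min)) * 100;
    return { ...row, score };
  });
}

const finalized = finalizeCohort([
  { subject: "A", year: 2024, indicator: "trend-1", rawValue: 2, score: null },
  { subject: "B", year: 2024, indicator: "trend-1", rawValue: 5, score: null },
  { subject: "C", year: 2024, indicator: "trend-1", rawValue: 8, score: null },
  { subject: "A", year: 2023, indicator: "trend-1", rawValue: 9, score: null },
]);
// 2024 pool: A = 0, B = 50, C = 100. The lone 2023 row stays null (flat pool).
```

Because the function reads only `rawValue` and overwrites `score`, feeding its output back in yields the same rows, which is what makes a full-cohort recompute safe.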

This isn’t a clever trick. It’s the refusal of one — the refusal to make a function pretend it has information it cannot have. Splitting on the data dependency turns one impossible single-pass problem into two boring, testable, idempotent ones.

Config as data, not code

A scoring engine for twenty indicators that change every cycle has a second failure mode waiting: every indicator becomes a branch in a growing switch statement, and retuning a band table means a code review and a redeploy. That’s how scoring logic ossifies.

So the math and the definitions are kept apart. A small FormulaKey enum dispatches to a handful of tiny, pure functions — ratio, optionMean, bandLookup, minMaxScale, threshold, changeIndex — and that is the whole vocabulary. The functions don’t know which indicator they’re serving. The definitions — which formula an indicator uses, which fields feed it, what the band boundaries are — live in database config as JSON bindings and band tables.
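A reduced sketch of that separation, covering two of the six formulas. The binding shape, field names, indicator names, and band values are all hypothetical; the point is only that the switch has one case per formula, never one per indicator:

```typescript
// Hypothetical sketch: formula vocabulary in code, indicator definitions in JSON.
type Answers = Record<string, number>;

type Binding =
  | { formula: "ratio"; numerator: string; denominator: string }
  | {
      formula: "bandLookup";
      field: string;
      // upTo: null marks the open-ended top band (JSON has no Infinity)
      bands: { upTo: number | null; score: number }[];
    };

function evaluate(binding: Binding, answers: Answers): number | null {
  switch (binding.formula) {
    case "ratio": {
      const n = answers[binding.numerator];
      const d = answers[binding.denominator];
      if (n === undefined || d === undefined || d === 0) return null;
      return (n / d) * 100;
    }
    case "bandLookup": {
      const v = answers[binding.field];
      if (v === undefined) return null;
      for (const band of binding.bands) {
        if (band.upTo === null || v < band.upTo) return band.score;
      }
      return null;
    }
  }
}

// In the real system this would be loaded from the database; here it is inline.
const configJson = `{
  "adoption_rate": { "formula": "ratio", "numerator": "adopted", "denominator": "issued" },
  "reporting_rate": {
    "formula": "bandLookup",
    "field": "rate",
    "bands": [
      { "upTo": 50, "score": 50 },
      { "upTo": 95, "score": 100 },
      { "upTo": null, "score": 0 }
    ]
  }
}`;
const config = JSON.parse(configJson) as Record<string, Binding>;

const adoption = evaluate(config["adoption_rate"], { adopted: 3, issued: 4 }); // 75
const reporting = evaluate(config["reporting_rate"], { rate: 99 });            // 0 (inverted band)
```

Flipping `reporting_rate` from inverted to rising means editing three numbers in the JSON. Nothing in `evaluate` changes.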

The payoff is governance, not elegance. Adding an indicator, retuning a threshold, or flipping a band from rising to inverted is a config change with an audit trail, reviewable by someone who understands the policy without reading the code. The engine’s behavior is data you can inspect, diff, and reason about — not a redeploy you have to schedule.

What makes a score defensible

Here is the part that doesn’t show up in a demo and is the entire reason the system survives contact with the real world: a year after you compute a number, someone will dispute it. An agency will email and ask why their indicator dropped four points, and “the engine said so” is not an answer you can give a stakeholder. Every design decision below exists to make that conversation survivable.

Idempotent by key. Scores upsert on [subject, year, indicator]. Re-running the engine over the same inputs produces the same rows, not duplicates and not drift. You can recompute the whole cohort at will, because a re-run is a no-op when nothing changed.
Computed scores live in their own table. Raw answers sit in one place; calculated scores in another. A recompute reads source data and writes results; it never mutates the source. The thing being measured and the measurement of it are not allowed to share a row.
Every leaf formula has a unit test. Each of the tiny pure functions — ratio with a zero denominator, the inverted band at 99%, the hard gate that zeroes everything — is pinned by a test that asserts a known input gives a known output. The zoo is exhaustively caged.
A score outlives its source. When a source response is deleted, the historical score it produced does not vanish with it — the relation is onDelete: SetNull, and the row carries a note that the source was expunged on purpose. A score with no surviving source is not a bug; it’s a record that something was, at the time, computed and stood behind.
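Two of those properties, the idempotent key and the score that outlives its source, can be illustrated with an in-memory stand-in for the scores table. This is not Prisma code; the class, its method names, and the note text are hypothetical, and the method `onSourceDeleted` only imitates what the database's SetNull relation does automatically:

```typescript
// Hypothetical in-memory stand-in for the computed-scores table.
interface ScoreRow {
  subject: string;
  year: number;
  indicator: string;
  score: number;
  sourceResponseId: string | null; // null once the source has been expunged
  note?: string;
}

class ScoreStore {
  private rows = new Map<string, ScoreRow>();

  private static keyOf(r: { subject: string; year: number; indicator: string }): string {
    return `${r.subject}|${r.year}|${r.indicator}`;
  }

  // Upsert on [subject, year, indicator]: a re-run overwrites, never duplicates.
  upsert(row: ScoreRow): void {
    this.rows.set(ScoreStore.keyOf(row), row);
  }

  // Imitates onDelete: SetNull. The score row survives and records why.
  onSourceDeleted(responseId: string): void {
    this.rows.forEach((row, key) => {
      if (row.sourceResponseId === responseId) {
        this.rows.set(key, { ...row, sourceResponseId: null, note: "source expunged" });
      }
    });
  }

  get(subject: string, year: number, indicator: string): ScoreRow | undefined {
    return this.rows.get(ScoreStore.keyOf({ subject, year, indicator }));
  }

  get size(): number {
    return this.rows.size;
  }
}

const store = new ScoreStore();
const row: ScoreRow = { subject: "A", year: 2024, indicator: "ind-1", score: 75, sourceResponseId: "resp-1" };
store.upsert(row);
store.upsert(row);               // re-run over the same inputs: still one row
store.onSourceDeleted("resp-1"); // the source goes away, the score stays
```

After these calls the store holds exactly one row for agency A, with its score of 75 intact and a null source reference explaining itself.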

Put together, those four let you reopen a number from eighteen months ago, show which formula produced it and which inputs fed it, and demonstrate that re-running the engine reproduces it exactly. When I built a multi-agency platform, that reproducibility was worth more than any feature on the roadmap: it was the only thing that turned a dispute into a five-minute explanation instead of an argument nobody could win.

The reusable shape

Strip away the indicators and the agencies and there’s a pattern here you’ll meet again the moment any computation has mixed data dependencies: separate the per-subject pass from the cohort-relative pass, and make the gap between them explicit. Anything you can compute from one subject’s data, compute now, purely. Anything that needs the group, defer — with a status that says so out loud, not a placeholder zero that lies quietly. Run the second pass when the context exists, keyed so it’s idempotent, writing to a table that never touches the source.

None of that is exciting. There is no streaming, no incremental cleverness, no single all-knowing function. There is a pure pass, a deferred status, a config table, a second pass, and a pile of unit tests. And that is exactly the version you can defend when someone disputes their number a year later: the deterministic, config-driven, separately-stored, unit-tested one. The clever single-pass engine that scored everyone in one go would have been faster to demo and impossible to stand behind. The boring one scores it twice — and is still standing when the question comes.

Building something where the boring decisions matter?

I work with Vienna and EU teams on internal tools, dashboards, and the scoring and governance layers around them — the deterministic, testable groundwork that holds up when someone disputes the output.

hello@albimeta.com · Response within 1 business day