The Thesis · The Unmeasured Node

The Missing Infrastructure Layer for High-Risk AI Decisions.

COHESION saves humanity by keeping human judgment alive in the age of AI. It is the global certification standard for human oversight of AI.

When an AI system recommends a high-stakes decision and a human approves it, two completely different reviewers produce the identical output: one who read the evidence and signed, and one who has approved nine hundred recommendations in a row and clicked without looking. The safety infrastructure being built assumes the first. It has no way to detect the second. That gap is what COHESION closes.

1 · The problem today

Every AI deployment is a control loop with one node nobody reads.

A dead phone line and a phone line with someone listening quietly on the other end produce the exact same silence. From the outside, you cannot tell them apart. You have to send a signal down the line and read what comes back.

Strip a high-stakes AI deployment to first principles and it is a control loop: the model produces an output, a human reviews it, the decision goes into the world. The field has built extraordinary instrumentation on one half of it -- accuracy, drift, bias, failure modes, evaluation harnesses, red-teaming, monitoring dashboards, explainability. A whole industry watches the machine.

The other half has almost none. The human is the node every AI decision still depends on, and it is the one node nobody measures. You cannot close a control loop around a node you never read. Most of the field is running open-loop on the part that matters most, calling the human approval step "oversight" as if naming it were the same as measuring it.

2 · Why current systems fail

"Human in the loop" asserts the human is there. It never checks whether the human is still thinking.

  • The judgment path is invisible. No record exists of whether the reviewer tested the recommendation or inherited it. The artifact says "approved" either way.
  • Review is fragmented. Oversight is scattered across tools, tickets, and inboxes, with no single place that holds the decision and the reasoning together.
  • Decisions are not replayable. When an auditor or a court asks what happened, there is no way to step back through the decision as it actually unfolded.
  • Human judgment is unmeasured. There is no score for whether the person was calibrated, engaged, and independent of the model, so "effective oversight" cannot be evidenced.
  • Auditability is weak. Logs record that a click happened, not that judgment occurred. That distinction is exactly what the next wave of regulation asks you to prove.

3 · The COHESION model

Score the risk. Route the decision. Measure the human. Make it replayable.

COHESION is middleware. It sits between the AI system's output and the screen the human operator sees, and it measures judgment signals continuously and invisibly -- no surveys, no extensions to install, no offline assessments that interrupt the work. The measurement lives at the API layer, where the decision actually happens.

  • AI: Recommendation. The model produces a high-stakes output.
  • DRS: Decision Risk Score. The decision is risk-ranked at the moment it is made.
  • Route: Auto · Review · Block. Risk drives whether a human must look.
  • Human: The reviewer acts. Behavior is observed as the work happens.
  • JIS: Judgment quality. The human's judgment is scored, not assumed.
  • Replay: Audit envelope. The whole decision is reconstructable later.

Close the loop on the human node, and "oversight" stops being a word in a policy and becomes a measurement an auditor can verify.
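The score-then-route step described above can be sketched in a few lines. This is a minimal illustration, not COHESION's implementation: the `Decision` record, the `route` function, and both thresholds are hypothetical, and the sketch assumes the Decision Risk Score is normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto"
    REVIEW = "review"
    BLOCK = "block"


# Hypothetical cut-offs, chosen only for illustration.
# The source does not state how DRS values map to routes.
REVIEW_THRESHOLD = 0.30
BLOCK_THRESHOLD = 0.85


@dataclass(frozen=True)
class Decision:
    decision_id: str
    recommendation: str
    drs: float  # Decision Risk Score, assumed to lie in [0, 1]


def route(decision: Decision) -> Route:
    """Map a risk score to one of the three routes: Auto, Review, Block."""
    if not 0.0 <= decision.drs <= 1.0:
        raise ValueError("drs must be in [0, 1]")
    if decision.drs >= BLOCK_THRESHOLD:
        return Route.BLOCK
    if decision.drs >= REVIEW_THRESHOLD:
        return Route.REVIEW
    return Route.AUTO


if __name__ == "__main__":
    d = Decision("d-001", "recommend denial", drs=0.62)
    print(d.decision_id, route(d).value)
```

In this sketch only decisions routed to `Route.REVIEW` reach a human, which is where behavioral measurement of the reviewer would begin.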

4 · Cross-domain

The same engine runs every high-stakes domain.

The control loop does not change when the domain does. An AI recommends, a human decides, and oversight has to prove the human was real. The interactive demo walks through five of these domains on one engine:

Hiring

The model ranks a candidate below the threshold and recommends rejection.

Oversight must prove: Did the reviewer test the recommendation, or inherit it?

Lending

The model scores an application as high-risk and recommends denial.

Oversight must prove: Was the denial a judgment, or a rubber stamp on the score?

Healthcare

The model flags a patient as low-acuity and recommends de-prioritizing.

Oversight must prove: Did the clinician weigh the flag, or defer to it?

Insurance fraud

The model marks a claim as suspicious and recommends a hold.

Oversight must prove: Was the hold reasoned, or anchored to the model output?

Benefits eligibility

The model determines an applicant ineligible and recommends termination.

Oversight must prove: Did the caseworker reopen the evidence, or sign through it?

5 · The differentiator

Most stop at "human in the loop." COHESION asks whether the human was calibrated, defensible, and policy-aligned.

The reason a human node decays is not weakness of character. It is the predictable output of a normal attention system doing exactly what it evolved to do. When a person is exposed to a tool that is reliably correct, their behavior shifts in a measurable direction: they override it less, they spend less time before approving, they stop reopening the source evidence and start treating the recommendation as the starting point rather than a claim to be tested.

That is the entire thesis in one line: the failure is visible in behavior before it is visible in outcomes. If you can watch the slope, you can intervene while it is still cheap. If you cannot, you find out from the lawsuit.
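"Watching the slope" can be made concrete with an ordinary least-squares trend over a behavioral signal such as the weekly override rate. The sketch below is an illustration under stated assumptions: the function names, the sample data, and the alert threshold are all hypothetical, and the source does not say which estimator COHESION uses.

```python
def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def is_decaying(values, threshold=-0.01):
    """Flag a signal whose trend falls at or below a (hypothetical) threshold."""
    return slope(values) <= threshold


if __name__ == "__main__":
    # Made-up data: share of recommendations the reviewer overrode each week.
    weekly_override_rate = [0.12, 0.10, 0.08, 0.06, 0.04]
    print("slope per week:", slope(weekly_override_rate))  # roughly -0.02
    print("decaying:", is_decaying(weekly_override_rate))
```

The same calculation could be applied to time spent before approving or to how often source evidence is reopened. The point is that the trend is available weeks before any bad outcome is recorded.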

The Judgment Independence Score measures that slope across seven weighted dimensions, each mapped to specific regulatory criteria. The weights sum to one and are frozen per version of the standard, so a score means the same thing from one audit to the next.

  • Deferral Resistance
  • Error Detection Capability
  • Independent Performance
  • Deliberation Depth
  • Post-Error Recalibration
  • Domain Confidence
  • Decision Autonomy
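A weighted composite over the seven dimensions might look like the sketch below. The weight values are placeholders invented for illustration, since the actual weights are defined in the published standard and are not reproduced here. The sketch also assumes each dimension score is normalized to the range 0 to 1. The read-only mapping mirrors the idea that weights are frozen per version.

```python
import math
from types import MappingProxyType

# Placeholder weights for a hypothetical version "v0". They sum to one,
# as the text requires, but the numbers themselves are NOT from the standard.
WEIGHTS_V0 = MappingProxyType({
    "deferral_resistance": 0.20,
    "error_detection_capability": 0.20,
    "independent_performance": 0.15,
    "deliberation_depth": 0.15,
    "post_error_recalibration": 0.10,
    "domain_confidence": 0.10,
    "decision_autonomy": 0.10,
})


def judgment_independence_score(scores, weights=WEIGHTS_V0):
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to one")
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted dimensions")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(weights[name] * scores[name] for name in weights)


if __name__ == "__main__":
    reviewer = {name: 0.5 for name in WEIGHTS_V0}
    reviewer["deferral_resistance"] = 0.2  # made-up example value
    print(round(judgment_independence_score(reviewer), 3))
```

Because the weights are fixed within a version, two scores computed under the same version are directly comparable, which is the property the text relies on for audit-to-audit consistency.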

The full normative specification is open and public at cohesionauth.com/standard, with the Methodology Annex published as a PDF. The grounding is documented, not rhetorical: a library of 21 documented oversight failures -- court-documented or regulator-reported events -- each scored against the seven dimensions to show which one collapsed. The methodological foundation, "Judgment Decay in AI-Augmented Environments," is published as a peer-reviewable preprint on SSRN.

6 · Why this is inevitable

Humanity has measured this kind of degradation before. The forcing function is already law.

The cleanest precedent is anesthesia. Once a patient is under, they cannot signal distress, so the anesthesiologist's continuous attention is the only thing between the patient and a silent deterioration. The answer, formalized in the American Society of Anesthesiologists monitoring standards adopted in 1986, was to make continuous instrumented monitoring a standard of care rather than a suggestion. Patient safety improved, and insurers, who until then had struggled to price a risk they could not measure, could finally price it concretely. Once the measurement existed, the standard did not stay optional. Aviation ran the same arc with Crew Resource Management. In both cases a machine takes over the routine, the human goes quiet, and the durable fix is to make measuring the human a standard.

This is no longer a thesis argued from the future. In Europe, Article 14 of the EU AI Act requires that human oversight of high-risk AI be not merely present but effective -- that the overseer can understand the system and detect and address dysfunction. "Effective" is an outcome, and proving an outcome requires a measurement. In the United States, Colorado SB 26-189 was signed by Governor Polis on May 14, 2026, effective January 1, 2027; it codifies a four-prong test for "meaningful human review," and one prong requires that the reviewer "does not default to the system output" -- automation bias written into state statute. At the federal level, OMB Memorandum M-25-21 requires agencies to attest to human-oversight accountability. And the dimensions COHESION measures map directly onto the MEASURE function of the NIST AI Risk Management Framework. Every one of these asks the same question the existing infrastructure was never built to answer: is your human still actually in the loop?

One more thing makes the position durable. The AI governance tooling layer is being absorbed into the AI stack itself -- Promptfoo went to OpenAI, Helicone went to Mintlify, Langfuse went to ClickHouse. The assurance a frontier lab is least credible selling is a disinterested measurement of whether the human supervising its model has been reduced to a formality. The answer to "who measures the human?" cannot be the same party that built what the human is overseeing. That is not a feature gap. It is a category, and the bet is that its durable position is the independent one.

Find out whether your oversight would survive an effectiveness audit.

COHESION is real software, not a slide: a live scoring API, an open standard, a published methodology, a 21-case empirical library, and three pending USPTO provisional patent applications. If you are a compliance lead, a risk officer, or an AI governance practitioner at an organization deploying AI in high-stakes decisions, the Founding Design Partner cohort co-designs the certification your industry will be measured against.