The Thesis · The Unmeasured Node
COHESION saves humanity by keeping human judgment alive in the age of AI. It is the global certification standard for human oversight of AI.
When an AI system recommends a high-stakes decision and a human approves it, two completely different things produce the identical output: a person who read the evidence and signed, and a person who has approved nine hundred in a row and clicked without looking. The safety infrastructure being built assumes the first. It has no way to detect the second. That gap is what COHESION closes.
1 · The problem today
A dead phone line and a phone line with someone listening quietly on the other end produce the exact same silence. From the outside, you cannot tell them apart. You have to send a signal down the line and read what comes back.
Strip a high-stakes AI deployment to first principles and it is a control loop: the model produces an output, a human reviews it, the decision goes into the world. The field has built extraordinary instrumentation on one half of it -- accuracy, drift, bias, failure modes, evaluation harnesses, red-teaming, monitoring dashboards, explainability. A whole industry watches the machine.
The other half has almost none. The human is the node every AI decision still depends on, and it is the one node nobody measures. You cannot close a control loop around a node you never read. Most of the field is running open-loop on the part that matters most, calling the human approval step "oversight" as if naming it were the same as measuring it.
2 · Why current systems fail
3 · The COHESION model
COHESION is middleware. It sits between the AI system's output and the screen the human operator sees, and it measures judgment signals continuously and invisibly -- no surveys, no extensions to install, no offline assessments that interrupt the work. The measurement lives at the API layer, where the decision actually happens.
AI (Recommendation): The model produces a high-stakes output.
DRS (Decision Risk Score): The decision is risk-ranked at the moment it is made.
Route (Auto · Review · Block): Risk drives whether a human must look.
Human (The reviewer acts): Behavior is observed as the work happens.
JIS (Judgment quality): The human's judgment is scored, not assumed.
Replay (Audit envelope): The whole decision is reconstructable later.
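The Route stage above can be sketched in a few lines. This is an illustration only, not the COHESION API: the 0-to-1 risk scale, the names `RoutingPolicy` and `route_decision`, and both cut-off values are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto"      # no human look required
    REVIEW = "review"  # a human must review before the decision ships
    BLOCK = "block"    # the decision is stopped outright


@dataclass(frozen=True)
class RoutingPolicy:
    # Hypothetical cut-offs on an assumed 0.0-1.0 risk scale.
    review_at: float = 0.30
    block_at: float = 0.85


def route_decision(drs: float, policy: RoutingPolicy = RoutingPolicy()) -> Route:
    """Map a Decision Risk Score (DRS) to Auto, Review, or Block."""
    if not 0.0 <= drs <= 1.0:
        raise ValueError(f"DRS must be in [0, 1], got {drs}")
    if drs >= policy.block_at:
        return Route.BLOCK
    if drs >= policy.review_at:
        return Route.REVIEW
    return Route.AUTO
```

Under these assumed thresholds, `route_decision(0.5)` returns `Route.REVIEW`. The policy is a frozen dataclass so that the cut-offs in force for a given decision cannot change after the fact, which is what a replayable audit record would need.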
Close the loop on the human node, and "oversight" stops being a word in a policy and becomes a measurement an auditor can verify.
4 · Cross-domain
The control loop does not change when the domain does. An AI recommends, a human decides, and oversight has to prove the human was real. The interactive demo walks through five of these domains on one engine:
The model ranks a candidate below the threshold and recommends rejection.
Oversight must prove: Did the reviewer test the recommendation, or inherit it?
The model scores an application as high-risk and recommends denial.
Oversight must prove: Was the denial a judgment, or a rubber stamp on the score?
The model flags a patient as low-acuity and recommends de-prioritizing.
Oversight must prove: Did the clinician weigh the flag, or defer to it?
The model marks a claim as suspicious and recommends a hold.
Oversight must prove: Was the hold reasoned, or anchored to the model output?
The model determines an applicant ineligible and recommends termination.
Oversight must prove: Did the caseworker reopen the evidence, or sign through it?
5 · The differentiator
The reason a human node decays is not weakness of character. It is the predictable output of a normal attention system doing exactly what it evolved to do. When a person is exposed to a tool that is reliably correct, their behavior shifts in a measurable direction: they override it less, they spend less time before approving, they stop reopening the source evidence and start treating the recommendation as the starting point rather than a claim to be tested.
That is the entire thesis in one line: the failure is visible in behavior before it is visible in outcomes. If you can watch the slope, you can intervene while it is still cheap. If you cannot, you find out from the lawsuit.
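"Watching the slope" can be made concrete with an ordinary least-squares trend over one behavioral signal. The sketch below is an assumption-laden illustration, not the published methodology: the weekly override rates are invented, and the function name `trend_slope` and the alert threshold are hypothetical.

```python
from statistics import mean


def trend_slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Invented data: share of recommendations one reviewer overrode, per week.
weekly_override_rate = [0.12, 0.10, 0.09, 0.06, 0.04, 0.02]

slope = trend_slope(weekly_override_rate)

# Hypothetical alert rule: flag a drop of more than one point per week.
is_decaying = slope < -0.01
```

The same function applies unchanged to the other signals named above, such as seconds spent before approving or the fraction of cases where the source evidence was reopened. A negative slope shows up weeks before any bad outcome does.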
The Judgment Independence Score measures that slope across seven weighted dimensions, each mapped to specific regulatory criteria. The weights sum to one and are frozen per version of the standard, so a score means the same thing from one audit to the next.
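A weighted score with version-frozen weights might look like the following. The seven dimension names (`d1` to `d7`), the weight values, the version label, and the 0-to-1 scale per dimension are all placeholders invented for this sketch; the actual dimensions and weights are defined in the standard, not here.

```python
import math
from types import MappingProxyType

# Placeholder weight table. MappingProxyType makes each version read-only,
# mirroring the idea that weights are frozen per version of the standard.
WEIGHTS_BY_VERSION = {
    "0.0-example": MappingProxyType({
        "d1": 0.20, "d2": 0.20, "d3": 0.15, "d4": 0.15,
        "d5": 0.10, "d6": 0.10, "d7": 0.10,
    }),
}


def judgment_independence_score(dimension_scores: dict[str, float],
                                version: str = "0.0-example") -> float:
    """Weighted sum of seven per-dimension scores, each assumed to lie in [0, 1]."""
    weights = WEIGHTS_BY_VERSION[version]
    if len(weights) != 7 or not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must cover seven dimensions and sum to one")
    if set(dimension_scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted dimensions")
    for name, score in dimension_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name} must be in [0, 1], got {score}")
    return sum(weights[name] * dimension_scores[name] for name in weights)
```

Because the weights sum to one and every input is bounded, the result is also bounded between 0 and 1. Passing the version explicitly means an old audit can be rescored later against the weights that applied when it was issued.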
The full normative specification is open and public at cohesionauth.com/standard, with the Methodology Annex published as a PDF. The grounding is documented, not rhetorical: a library of 21 documented oversight failures -- court-documented or regulator-reported events -- each scored against the seven dimensions to show which one collapsed. The methodological foundation, "Judgment Decay in AI-Augmented Environments," is published as a peer-reviewable preprint on SSRN.
6 · Why this is inevitable
The cleanest precedent is anesthesia. Once a patient is under, they cannot signal distress, so the anesthesiologist's continuous attention is the only thing between the patient and a silent deterioration. The answer, formalized in the American Society of Anesthesiologists monitoring standards adopted in 1986, was to make continuous instrumented monitoring a standard of care rather than a suggestion. Patient safety improved, and insurers, who until then had struggled to price a risk they could not measure, could price it concretely. Once the measurement existed, the standard did not stay optional. Aviation ran the same arc with Crew Resource Management. In both fields, a machine takes over the routine, the human goes quiet, and the durable fix is to make measuring the human a standard.
This is no longer a thesis argued from the future. In Europe, Article 14 of the EU AI Act requires that human oversight of high-risk AI be not merely present but effective -- that the overseer can understand the system and detect and address dysfunction. "Effective" is an outcome, and proving an outcome requires a measurement. In the United States, Colorado SB 26-189 was signed by Governor Polis on May 14, 2026, effective January 1, 2027; it codifies a four-prong test for "meaningful human review," and one prong requires that the reviewer "does not default to the system output" -- automation bias written into state statute. At the federal level, OMB Memorandum M-25-21 requires agencies to attest to human-oversight accountability. And the dimensions COHESION measures map directly onto the MEASURE function of the NIST AI Risk Management Framework. Every one of these asks the same question the existing infrastructure was never built to answer: is your human still actually in the loop?
One more thing makes the position durable. The AI governance tooling layer is being absorbed into the AI stack itself -- Promptfoo went to OpenAI, Helicone went to Mintlify, Langfuse went to ClickHouse. The assurance a frontier lab is least credible selling is a disinterested measurement of whether the human supervising its model has been reduced to a formality. The answer to "who measures the human?" cannot be the same party that built what the human is overseeing. That is not a feature gap. It is a category, and the bet is that its durable position is the independent one.
COHESION is real software, not a slide: a live scoring API, an open standard, a published methodology, a 21-case empirical library, and three USPTO provisional patents pending. If you are a compliance lead, a risk officer, or an AI governance practitioner at an organization deploying AI in high-stakes decisions, the Founding Design Partner cohort co-designs the certification your industry will be measured against.