An exchange-rate oracle
for AI agent labor.

Empirical reference rates for AI agent classes. Patent-pending methodology. Currently in private testing with eight frontier models across five task domains.

What it looks like

Trading-card view of one Agent Class on a single Task Domain. Live exchange rates and quality scores updated daily. Full preview behind authenticated access.

How it works

Cross-platform blind grading
Each platform scores its peers without knowing whose output it's grading. Self-grading bias is removed by construction.
Multiplicative capability coupling
Efficiency acts as a coupling constant on the base value, not an additive factor. Doubling efficiency more than doubles value.
Adaptive sampling
Per-platform, per-domain sampling rates self-adjust based on confidence interval width. Stable performers get sampled less; volatile ones more.
Real-time exchange rates
Pair rates derive from rolling ALU values, FOREX-style. Each platform always has a current rate from its rolling EWMA.
Agent Class as the unit of comparison
Scoring is computed per (Agent Class × Task Domain). An Agent Class is a peer group of agents that compete for similar work — currently the Budget and Premium tiers across Anthropic, OpenAI, Google, and xAI. The framework generalizes to any meaningful grouping as the field expands.
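The adaptive-sampling rule above can be sketched in a few lines. Only the phase bands are published (full grading for the first 10 observations, then 10–30% established, 5–15% trusted); the trusted-phase threshold of 100 observations and the 0.2 CI-width cap below are illustrative assumptions, not the real parameters.

```python
def sampling_rate(n_obs, ci_width):
    """Per-platform, per-domain sampling rate.

    Phase is set by observation count; within a phase, a wider
    confidence interval pushes the rate toward the top of its band.
    The n_obs >= 100 trusted threshold and 0.2 CI cap are assumptions.
    """
    if n_obs < 10:
        return 1.0                                    # probation: grade everything
    lo, hi = (0.05, 0.15) if n_obs >= 100 else (0.10, 0.30)
    spread = min(ci_width / 0.2, 1.0)                 # normalize CI width to [0, 1]
    return lo + spread * (hi - lo)                    # stable -> lo, volatile -> hi
```

A stable, long-observed performer bottoms out at 5%, while a volatile platform just past probation is sampled at up to 30%.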
Glossary of terms

Foundational

ALU (Agent Liquidity Unit)
The composite score that prices an agent class on a specific task domain. Combines five base factors with a multiplicative capability efficiency term. Higher ALU means more value per unit of agent labor.
Agent Class
A peer group of AI agents that compete for similar work and are measured the same way. Currently spans tier (Budget vs Premium) but generalizes to any meaningful grouping — capability bracket, model family, or specialty. The framework scales from 8 frontier models today to N models across M domains tomorrow.
Domain (Task Domain)
The category of work being scored. Current domains: code generation, legal, financial, research, general. Scoring is computed per (Agent Class × Domain) pair, not at the model level.
Iteration
A single scoring event: one prompt sent to all active agents in a class, graded under the methodology, and recorded once. Daily auto-runs accumulate iterations over time.
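An ALU can be sketched as the five base factors blended into a base value that the Capability Efficiency term then scales. The weights and exponent below are illustrative placeholders only; the real coefficients are trade secrets held under NDA.

```python
# Illustrative ALU composition. Every weight and the CE exponent k
# are placeholders -- the actual coefficients are not public.

def alu(quality, trust, demand, scarcity, history, ce, k=1.5):
    """Price an (Agent Class x Task Domain) pair.

    The five base factors blend into a base value; Capability
    Efficiency (CE) then couples multiplicatively, scaling the whole.
    """
    base = (0.40 * quality +      # fast-moving recent performance
            0.25 * trust +        # slow-moving earned reputation
            0.10 * demand +       # held neutral pending marketplace data
            0.15 * scarcity +     # exceedance over the field average
            0.10 * history)       # trend signal
    return base * ce ** k         # k > 1: doubling CE more than doubles ALU
```

With these placeholder weights, an agent class at CE = 2.0 prices at 2^1.5 ≈ 2.83× its CE = 1.0 value, not merely 2×.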

The five base factors

Quality
How good is the answer? Composite of accuracy, completeness, reasoning, format quality, and efficiency. Cross-platform blind-graded to remove self-grading bias.
Trust
Earned reputation. Slow-moving signal that takes 50–100 observations to meaningfully shift. Distinct from Quality, which is the fast-moving recent-performance signal.
Demand
Marketplace request volume. Currently held neutral pending real marketplace integration; the framework supports it as a first-class factor.
Scarcity
How much an agent's domain performance exceeds the field's rolling average. Earned through measured performance, not assumed.
History
Trend signal. Are token values rising, declining, or steady over time?
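Of the five factors, Scarcity is the most mechanical to illustrate: it is earned only when measured performance exceeds the field's rolling average. A minimal sketch — the zero floor and the plain mean are assumptions, not the published normalization:

```python
def scarcity(agent_score, field_scores):
    """Exceedance over the field's rolling average, floored at zero.

    Scarcity is earned through measured performance: it is only
    positive when this agent beats the mean of the active field.
    """
    field_avg = sum(field_scores) / len(field_scores)
    return max(0.0, agent_score - field_avg)
```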

The CE innovation

Capability Efficiency (CE)
Quality-per-token, normalized against the field. Acts as a coupling constant on the base value, not an additive factor. The patent's central claim.
Coupling constant (vs additive factor)
Multiplicative coupling means doubling efficiency more than doubles value. Treating efficiency additively dilutes that signal. The distinction is patent-relevant.
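The distinction shows up numerically. In the sketch below (the exponent k and weight w are illustrative placeholders, not the patented coefficients), doubling CE under multiplicative coupling lifts value by 2^k, while the additive treatment barely moves it:

```python
def coupled(base, ce, k=1.5):
    """Multiplicative coupling: CE scales the entire base value."""
    return base * ce ** k

def additive(base, ce, w=0.2):
    """Additive treatment: CE is just one more weighted term."""
    return base + w * ce

base = 0.65
lift_coupled = coupled(base, 2.0) / coupled(base, 1.0)     # 2 ** 1.5, about 2.83x
lift_additive = additive(base, 2.0) / additive(base, 1.0)  # about 1.24x: diluted
```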

Reliability composition

Reliability
The composed reliability number on each card. Equals trust × reliability rate. The number that actually feeds ALU.
Reliability rate
Fraction of recent API calls that returned successfully. Real uptime measurement, not assumed.
Effective trust
Internal name for the composed Reliability number — kept distinct in the API for audit.
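The composition is a straight product, sketched here with a boolean call log standing in for the real uptime telemetry:

```python
def reliability_rate(call_log):
    """Fraction of recent API calls that returned successfully."""
    return sum(call_log) / len(call_log)

def effective_trust(trust, call_log):
    """The composed Reliability number that feeds ALU: earned trust
    discounted by measured uptime, never inflated by it."""
    return trust * reliability_rate(call_log)
```

A platform with trust 0.8 that failed 1 of its last 10 calls composes to 0.8 × 0.9 = 0.72.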

Methodology pillars

Cross-platform blind grading
Each platform scores its peers without knowing whose output it's grading. Self-grading bias is removed by construction.
Hybrid scoring
A blend of deterministic rubric checks and cross-grade evaluations. Avoids both mock-only data and pure-rubric heuristic noise.
Adaptive sampling
Per-platform, per-domain sampling rates self-adjust based on confidence interval width. Three phases: probation (full grading for the first 10 observations), established (10–30%), trusted (5–15%).
Anomaly detection
3-sigma threshold with a 2-flag confirmation window. Confirmed anomalies escalate the platform back to full grading and apply a trust penalty.
Bias correction
Per-grader rolling deviation tracking with statistical significance gating. Credibility penalties self-calibrate with sample volume.
EWMA
Exponentially weighted moving average. The core statistical tool. Quality uses a fast EWMA; Trust uses a slow EWMA. Same input signal, different frequencies.
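The fast/slow split can be sketched with one update rule and two smoothing constants (the alpha values below are illustrative, not the production parameters):

```python
def ewma_update(prev, observation, alpha):
    """One EWMA step: alpha sets how fast new evidence moves the average."""
    return alpha * observation + (1 - alpha) * prev

quality = trust = 0.5
for score in [0.9, 0.9, 0.9]:                          # same input signal ...
    quality = ewma_update(quality, score, alpha=0.30)  # ... fast frequency
    trust = ewma_update(trust, score, alpha=0.02)      # ... slow frequency
# quality has moved most of the way to 0.9; trust has barely shifted
```

Three strong scores move the fast Quality signal sharply while the slow Trust signal stays near its prior, which is why Trust needs 50–100 observations to meaningfully shift.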

Output artifacts

Exchange rate
Pair ratio between two agent classes' ALU values. Like FOREX — expresses how much of A's labor you get per unit of B's.
Latency p50 / p95
Round-trip API response time at the 50th and 95th percentiles. p95 is the realistic worst case for agent routing decisions.
Methodology version
Stamp identifying which methodology version produced the data. Bumped whenever the scoring weights, cross-grade protocol, or ALU formula changes.
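An exchange rate is then just the ratio of two rolling ALU values, FOREX-style:

```python
def exchange_rate(alu_a, alu_b):
    """Units of B's labor equivalent to one unit of A's.

    Because each side is a rolling EWMA of ALU, every platform has a
    current rate even between scoring iterations.
    """
    return alu_a / alu_b

# A class pricing at 1.84 ALU against one at 0.92 ALU: one unit of
# A's labor buys two of B's.
rate = exchange_rate(1.84, 0.92)
```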

IP & process

Provisional patent
USPTO filing that establishes a priority date. Two were filed on April 16, 2026.
Trade secret coefficients
Specific weights, exponents, and parameter values intentionally not disclosed publicly. Available under NDA.
Field-relative
Scoring that is normalized against the active field of agents at any moment. Adding a new agent shifts everyone's relative position. Distinct from absolute scoring.

Intellectual property

Two provisional patents filed at the United States Patent and Trademark Office — April 16, 2026.

Co-inventors: James Gill, David Schwartz.
Joint IP Ownership Agreement executed.
Non-provisional conversion pending Q1 2027.

Request access

Authenticated visitors can view a fuller methodology preview with redacted sample data. Investors and IP attorneys: please include your firm in the message body.

Send Access Request

Or write directly to james.gill@agentvaluelab.com.