BENCHMARK · v14 · IN ACTIVE DEVELOPMENT
An open benchmark for conflict reasoning, grounded in the Agentic Conflict Ontology. Built to measure the things generic language models fail at — time, causality, provenance, commitment tracking — not the things they already do well.
WHAT THE TCGC MEASURES
Each task type targets a specific capability that standard retrieval-augmented generation cannot handle reliably. Metrics are reported per task type and per domain; a schematic item record is sketched after the task list below.
Disambiguate references and alias clusters across long documents.
Surface asserted facts, evaluative statements, and normative claims.
Infer underlying interests distinct from stated positions (Fisher/Ury).
Identify rules, norms, and structural bounds shaping feasible outcomes.
Attribute leverage resources and dependencies to the actor holding them.
Distinguish claims from commitments and track their evolution over time.
Reconstruct temporal sequence from mixed narrative prose.
Detect framing changes across time and party.
Build multi-hop causal chains with explicit mechanism and conditions.
Identify claims that cannot simultaneously hold across actors or time.
Bind every extracted primitive back to its source span.
Flag instances where stated commitment diverges from behavioral evidence.
Separate surface positions from underlying interests.
Assemble a coherent conflict graph from multiple, partially-contradictory sources.
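To make the task-type and provenance requirements concrete, here is a minimal sketch of what a single benchmark item could look like. It assumes a JSON-like record with explicit source spans; the names (SourceSpan, TCGCItem, item_id, task_type, evidence) are illustrative placeholders, not the published TCGC schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceSpan:
    """Pointer from an extracted primitive back to its source text (hypothetical layout)."""
    doc_id: str   # which source document the evidence comes from
    start: int    # character offset where the span begins
    end: int      # character offset where the span ends (exclusive)

@dataclass
class TCGCItem:
    """Illustrative item record: one question, one gold answer, span-level provenance."""
    item_id: str
    task_type: str                 # one of the 14 task types listed above
    domain: str                    # e.g. "human_friction" or "multi_party"
    question: str                  # ground-truth question authored in annotation pass three
    gold_answer: str               # reference answer used for scoring
    evidence: List[SourceSpan] = field(default_factory=list)  # every primitive bound to a source span
```

Under a layout like this, reporting metrics per task type and per domain reduces to grouping items by the task_type and domain fields before aggregation.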
METHODOLOGY
TCGC items are drawn from two domains — human friction (HR, commercial, governance) and complex multi-party scenarios (policy, peace process, multilateral) — with intentional diversity in length, source mix, and discourse style.
Annotation proceeds in three passes: primitive tagging, edge labelling, and ground-truth question authoring. Inter-annotator agreement targets are task-type specific; tasks that depend on inferred primitives (like Interest) have lower targets than surface-level tasks (like Actor resolution), and we report the actual agreement transparently.
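As one concrete way the task-type-specific agreement targets could be checked, the sketch below computes Cohen's kappa separately for each task type. This is only one reasonable chance-corrected statistic, and the two-annotator data layout is assumed for illustration rather than taken from the annotation guidelines.

```python
from collections import Counter
from typing import Dict, List, Tuple

def cohen_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:            # degenerate case: both annotators used one identical label
        return 1.0
    return (observed - expected) / (1 - expected)

def agreement_by_task_type(
    annotations: Dict[str, Tuple[List[str], List[str]]]
) -> Dict[str, float]:
    """Map each task type to its kappa so results can be compared against per-type targets."""
    return {task: cohen_kappa(a, b) for task, (a, b) in annotations.items()}
```

Surface-level types such as Actor resolution would then be expected to clear a higher threshold than inferred types such as Interest, matching the per-type targets described above.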
The evaluation harness is designed to be compatible with standard runners (HELM, lm-eval-harness) via a thin adapter. We will publish the adapter alongside the first public split.
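The adapter itself is not yet published, so the sketch below only shows the general shape such a conversion could take: flattening one item (reusing the illustrative TCGCItem above) into the plain prompt/target pair most runners consume. The context_lookup helper and the prompt wording are assumptions, not the adapter's actual interface to HELM or lm-eval-harness.

```python
from typing import Callable, Dict

def to_generic_request(item: "TCGCItem", context_lookup: Callable[[str], str]) -> Dict[str, str]:
    """Flatten one item into a prompt/target pair; context_lookup is a hypothetical
    helper mapping a doc_id to the full source document text."""
    # Deduplicate source documents while preserving their first-seen order.
    doc_ids = list(dict.fromkeys(span.doc_id for span in item.evidence))
    context = "\n\n".join(context_lookup(d) for d in doc_ids)
    prompt = (
        f"Task type: {item.task_type}\n"
        f"Sources:\n{context}\n\n"
        f"Question: {item.question}\n"
        "Answer and cite the supporting source spans."
    )
    return {"id": item.item_id, "prompt": prompt, "target": item.gold_answer}
```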
OPEN RESEARCH QUESTIONS
Six questions we do not yet have clean answers for. Click each to read the current thinking and tell us where it is wrong.
HOW TO CONTRIBUTE
TIER 01
The v0.1 evaluation protocol draft is available on request. Comments accepted on the annotation guidelines, task-type definitions, and metric choices.
Request the draft
TIER 02
Early-access splits are available to academic researchers and pilot partners under a light DUA. Write in with your proposed use case.
Request access
TIER 03
Found something the 14 current task types miss? Send us a proposal — one paragraph of motivation, one worked example, one suggested metric.
Propose a task
PUBLICATION PLAN