A single AI engine's confidence score ("I am 87% sure this claim is true") reflects only that engine's probability estimate, which may be systematically biased by its training data. Multi-engine confidence scoring aggregates the probability estimates of three independent engines, producing a genuinely independent cross-check that substantially reduces the risk of any one engine's systematic bias dominating the result.
The Three-Engine Confidence Architecture
For each claim: ChatGPT-4o generates a verdict (True/Mostly True/Mixed/Mostly False/False/Opinion/Unverifiable) with a confidence percentage and cited sources. Perplexity Sonar Pro independently generates the same verdict with live web citations. Google Gemini generates its independent verdict. The three verdicts are weighted by each engine's measured calibration accuracy for the relevant claim domain and aggregated into a consensus score. Claims with low inter-engine agreement (high variance) are flagged for human review.
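The aggregation step described above can be sketched as follows. This is a minimal illustration, not the production pipeline: the engine identifiers, the calibration weights, and the variance threshold are all hypothetical placeholders, and a real system would source calibration weights from per-domain accuracy measurements.

```python
from dataclasses import dataclass
from statistics import pvariance

# The seven verdict labels from the architecture description.
VERDICTS = ["True", "Mostly True", "Mixed", "Mostly False",
            "False", "Opinion", "Unverifiable"]

@dataclass
class EngineResult:
    engine: str       # hypothetical identifier, e.g. "chatgpt-4o"
    verdict: str      # one of VERDICTS
    confidence: float # the engine's probability estimate, 0.0-1.0

def consensus(results, calibration_weights, variance_threshold=0.02):
    """Aggregate per-engine verdicts into a consensus score.

    Each engine's confidence is weighted by its measured calibration
    accuracy for the claim's domain (calibration_weights). Claims with
    disagreeing verdicts or high confidence variance are flagged for
    human review. The 0.02 variance threshold is an arbitrary example.
    """
    total_weight = sum(calibration_weights[r.engine] for r in results)
    score = sum(calibration_weights[r.engine] * r.confidence
                for r in results) / total_weight
    verdicts_match = len({r.verdict for r in results}) == 1
    high_variance = pvariance([r.confidence for r in results]) > variance_threshold
    needs_review = high_variance or not verdicts_match
    return score, verdicts_match, needs_review
```

With three engines agreeing on "True" at confidences 0.90, 0.88, and 0.92 under equal weights, the consensus score is 0.90 and no review flag is raised; a single dissenting verdict would flag the claim regardless of score.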
Interpretation and Display
A consensus score above 85% agreement across engines (with matching verdicts) can be presented to readers as a high-confidence fact check. Scores of 60-85% indicate partial consensus and should display the range of verdicts. Below 60% indicates genuine uncertainty or contested facts; this is the most valuable output, because it tells readers "this claim is contested" rather than falsely reassuring them.
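The three display tiers can be expressed as a small mapping function. The thresholds (0.85 and 0.60) come from the text; the tier names and the function itself are illustrative, not part of any specified API.

```python
def display_tier(consensus_score, verdicts):
    """Map a consensus score and the per-engine verdicts to a display tier.

    Tier names ("high-confidence", "partial-consensus", "contested") are
    illustrative labels; thresholds follow the interpretation rules above.
    """
    all_match = len(set(verdicts)) == 1
    if consensus_score > 0.85 and all_match:
        return "high-confidence"       # present as a high-confidence fact check
    if consensus_score >= 0.60:
        return "partial-consensus"     # display the range of verdicts
    return "contested"                 # genuine uncertainty: say so explicitly
```

Note that a high score with mismatched verdicts drops to the partial-consensus tier, since the high-confidence presentation requires both agreement above 85% and matching verdicts.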