Overview
Large language models (LLMs) are increasingly used by citizens, journalists, and institutions as sources of historical information, raising concerns about their potential to reproduce or amplify historical revisionism -- the distortion, omission, or reframing of established historical facts.
The HistoricalMisinfo Benchmark
We introduce HistoricalMisinfo, a curated dataset of 500 historically contested events from 45 countries, each paired with both factual and revisionist narratives. To simulate real-world pathways of information dissemination, we design eleven prompt scenarios per event, mimicking diverse ways historical content is conveyed in practice.
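To make the dataset's structure concrete, here is a minimal sketch of what one benchmark item might look like. The field names and the example values are illustrative assumptions, not the released schema; only the counts (500 events, 45 countries, eleven prompt scenarios, paired factual and revisionist narratives) come from the description above.

```python
from dataclasses import dataclass, field

@dataclass
class HistoricalMisinfoItem:
    """One contested event in the benchmark (hypothetical schema, for illustration)."""
    event_id: str
    country: str                  # one of the 45 countries covered
    factual_narrative: str        # the established historical account
    revisionist_narrative: str    # the distorted, omitted, or reframed account
    prompts: list = field(default_factory=list)  # the eleven prompt scenarios

# Hypothetical example item; all strings are placeholders.
item = HistoricalMisinfoItem(
    event_id="example-001",
    country="Example Country",
    factual_narrative="The established account of the event.",
    revisionist_narrative="A distorted retelling of the same event.",
    prompts=[f"scenario-{i}" for i in range(1, 12)],
)
assert len(item.prompts) == 11
```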
Key Findings
Evaluating responses from multiple medium-sized LLMs, we observe:
- Systematic variation in revisionist outputs across models, countries, and prompt types
- Distinct patterns of susceptibility to historical revisionism in each model
- A measurable influence of geographic and temporal factors on the likelihood of revisionist outputs
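An evaluation of this kind ultimately needs to decide whether a model's response tracks the factual or the revisionist narrative. As a toy stand-in for whatever judging method the benchmark actually uses (which is not specified above), the sketch below labels a response by simple word overlap with each paired narrative; the example sentences are illustrative.

```python
def overlap(a: str, b: str) -> float:
    """Fraction of words in b that also appear in a (toy lexical similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wb), 1)

def label_response(response: str, factual: str, revisionist: str) -> str:
    """Label a model response by which narrative it overlaps more (toy judge)."""
    if overlap(response, revisionist) > overlap(response, factual):
        return "revisionist"
    return "factual"

# Illustrative use with placeholder narratives:
label = label_response(
    response="the moon landing was staged",
    factual="astronauts landed on the moon in 1969",
    revisionist="the moon landing was staged in a studio",
)
# label == "revisionist"
```

A real audit would replace this lexical heuristic with a stronger judge (human annotation or a calibrated classifier), but the per-item loop over models, countries, and prompt scenarios has the same shape.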
Impact
This benchmark offers policymakers, researchers, and industry a practical foundation for auditing the historical reliability of generative systems, supporting emerging safeguards for democratic integrity.