EuroSafeAI Lab

Our Research

Our technical research focuses on developing methods to ensure that AI systems act safely and cooperatively, even in multi-agent settings and under adversarial conditions.

Highlighted Research

Democracy Defense · NeurIPS 2024

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents

We introduce GovSim, a generative simulation platform for studying strategic interactions and cooperative decision-making among LLM agents. We find that most models fail to achieve a sustainable equilibrium, with the highest survival rate below 54%. Agents that leverage moral reasoning achieve significantly better sustainability.

Safety · IASEAI 2026

Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

We investigate how characteristics of fine-tuning datasets can accidentally misalign language models, revealing that structural and linguistic patterns in seemingly benign datasets amplify adversarial vulnerability. Our findings motivate more rigorous dataset curation as a proactive safety measure.

Democracy Defense

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

We propose SocialHarmBench, the first comprehensive benchmark for evaluating the vulnerability of LLMs to socially harmful requests, comprising 78,836 prompts from 47 democratic countries across 16 genres and 11 domains. Experiments on 15 cutting-edge LLMs uncover many safety risks.

Democracy Defense

Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models

We propose a novel methodology to assess LLM alignment on the democracy–authoritarianism spectrum, combining psychometric tools, a new favorability metric, and role-model probing. We find that LLMs generally favor democratic values but exhibit increased favorability toward authoritarian figures when prompted in Mandarin.

Other Research

Additional published work across our research directions.

Democracy Defense

Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing

We explore multiple directions to investigate the hidden mechanisms behind content moderation: training classifiers to reverse-engineer moderation decisions across countries, and explaining those decisions through Shapley-value analysis and LLM-guided explanations.

Societal Impact

Socio-Political Risks of AI

A report examining how AI systems can amplify or reshape socio-political risks, and outlining governance and technical approaches to mitigate these harms.

Work in Progress

Ongoing and forthcoming work across our research directions.

Multi-Agent Safety · Coming Soon

Multi-Agent Safety in Autonomous Systems

Investigating safety guarantees and failure modes when multiple AI agents interact in shared environments, with a focus on emergent risks that arise from agent-to-agent dynamics.

Democracy Defense · Coming Soon

Defending against LLM Propaganda: Detecting Historical Revisionism by Large Language Models

We introduce HistoricalMisinfo, a curated dataset of 500 historically contested events from 45 countries, each paired with factual and revisionist narratives. Evaluating responses from multiple LLMs, we observe vulnerabilities and systematic variation in revisionism across models, countries, and prompt types.

Democracy Defense · Coming Soon

When Do Language Models Endorse Limitations on Universal Human Rights Principles?

We evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights, leveraging 1,152 synthetically generated scenarios spanning 24 rights articles in eight languages. Analysis of 11 major LLMs reveals systematic biases in rights endorsement patterns.

Interested in Collaborating?

We actively collaborate with academic and industry partners on AI safety research. Get in touch to explore opportunities.