Methodology
Pipeline, prompts, validation, reproducibility
Overview
flowchart LR
A[CMP corpus<br/>3,327 manifestos] --> B[Token-chunking<br/>20k tokens, 2k overlap]
B --> C1[Sonnet 4.6]
B --> C2[GPT-4.1-mini]
B --> C3[Gemini Flash]
C1 --> D[Per-chunk JSON<br/>strict schema]
C2 --> D
C3 --> D
D --> E[Span validation<br/>regex + LLM-judge]
E --> F[Confidence-weighted<br/>per-manifesto aggregation]
F --> G[Cross-model aggregation<br/>configurable weighting]
G --> H[Panel regressions<br/>with non-standard errors]
The corpus
We score every party manifesto in the Manifesto Project Database (CMP / WZB) that has a machine-readable full text and falls within our analysis window (1945–2025). Texts are loaded with the manifestoR package through the WZB API. We do not redistribute the manifesto texts themselves (see the download page for licensing).
| Selection | Count |
|---|---|
| Manifestos with full text | 3,327 |
| Countries | 50+ |
| Languages | EN, DE, ES, FR, IT, PL, CZ, EL, NL, … |
Scoring schema
Each manifesto chunk is sent to each LLM under an identical strict JSON schema that requires, per scored dimension:
{
"score": <integer 0–10 or null>,
"evidence_status": "verified" | "mixed" | "no_evidence",
"span": {
"quote": "<3–8 word verbatim quote from the chunk>",
"context": "<surrounding ~3 sentences>"
}
}
Twelve dimensions are scored:
| Group | Code | Label |
|---|---|---|
| Populism | pop_anti_elitism | Anti-Elitism |
| Populism | pop_people_centrism | People-Centrism |
| Populism | pop_manichean | Manichaean Worldview |
| Populism | populism_overall | Populism (Overall) |
| Ideology | pop_ideology_left | Ideology — Left |
| Ideology | pop_ideology_right | Ideology — Right |
| Ideology | pop_ideology_centrist | Ideology — Centrist |
| Ideology | pop_ideology_overall | Ideology (Right − Left) |
| Liberalism | lib_political | Political Liberalism |
| Liberalism | lib_social | Social Liberalism |
| Liberalism | lib_economic | Economic Liberalism |
| Liberalism | lib_financial_market | Financial-Market Liberalism |
| Liberalism | liberalism_overall | Liberalism (Overall) |
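The per-dimension schema can be checked mechanically before any scoring is trusted. The following Python sketch is illustrative only and is not taken from the project's code: the function name `validate_dimension`, the whitespace normalisation, and the decision to skip the quote check for `no_evidence` entries are all assumptions. It enforces the score range, the status enum, the 3–8 word quote length, and a plain verbatim match of the quote against the chunk.

```python
import re

ALLOWED_STATUS = {"verified", "mixed", "no_evidence"}


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not defeat matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def validate_dimension(entry: dict, chunk_text: str) -> list[str]:
    """Return the problems found in one per-dimension object (empty list = valid)."""
    problems = []
    score = entry.get("score")
    status = entry.get("evidence_status")

    if status not in ALLOWED_STATUS:
        problems.append(f"unknown evidence_status: {status!r}")

    # bool is a subclass of int in Python, so it is excluded explicitly.
    if score is not None:
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
            problems.append(f"score must be an integer 0-10 or null, got {score!r}")

    if status == "no_evidence":
        # NA propagation: "topic absent" is null, never 0.
        if score is not None:
            problems.append("no_evidence requires score = null")
        # Assumption: no quote is expected when there is no evidence.
        return problems

    span = entry.get("span") or {}
    quote = span.get("quote") or ""
    n_words = len(quote.split())
    if not 3 <= n_words <= 8:
        problems.append(f"quote must be 3-8 words, got {n_words}")
    elif _normalise(quote) not in _normalise(chunk_text):
        problems.append("quote not found verbatim in chunk")

    return problems
```

A check like this only covers the exact-match case; paraphrased quotes would still need the fuzzy and LLM-judge stages described under span validation.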
Key technical decisions
| Decision | Choice | Rationale |
|---|---|---|
| Chunking | 20k-token chunks, 2k overlap, 18k full-text cap | Fits within all three models’ context windows with margin |
| Reproducibility knobs | temperature=0, top_p=1, pinned model snapshots | Deterministic where the APIs allow |
| NA propagation | score=null for no_evidence rather than score=0 | “Topic absent” ≠ “topic present at zero intensity” |
| Anti-hallucination | Per-dimension verbatim quote required; status enum forces explicit reasoning | Eliminates “make up a score” failure mode |
| Span validation | Post-hoc 6-tier fuzzy match + LLM-as-judge backstop | Catches paraphrased / fabricated quotes that the schema can’t enforce |
| Within-manifesto aggregation | Confidence-weighted mean across chunks (default; configurable) | Down-weights low-confidence chunk scores |
| Between-model aggregation | Simple mean (default; configurable) | No model is privileged a priori; specification curve reports robustness |
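The chunking and aggregation rows of the table can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the project's implementation: the input is taken to be an already tokenised list, the `STATUS_WEIGHT` mapping from evidence status to a confidence weight is hypothetical (the page does not say how confidence is derived), and models without a score are simply left out of the between-model mean.

```python
def chunk_tokens(tokens, size=20_000, overlap=2_000):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not tokens:
        return []
    step = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += step
    return chunks


# Hypothetical weights, chosen only for illustration.
STATUS_WEIGHT = {"verified": 1.0, "mixed": 0.5}


def aggregate_manifesto(chunk_results):
    """Confidence-weighted mean over chunks; null scores are skipped, not read as 0."""
    num = den = 0.0
    for result in chunk_results:
        weight = STATUS_WEIGHT.get(result["evidence_status"], 0.0)
        if result["score"] is None or weight == 0.0:
            continue
        num += weight * result["score"]
        den += weight
    return num / den if den > 0 else None


def aggregate_models(model_scores):
    """Simple mean across models, ignoring models that returned no score."""
    present = [s for s in model_scores.values() if s is not None]
    return sum(present) / len(present) if present else None
```

Returning `None` when every chunk is `no_evidence` keeps "topic absent" distinct from a zero score all the way up to the manifesto level.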
Cross-model calibration
A 50-manifesto matched-pair pilot compared each candidate model against Claude Sonnet 4.6 (the most-trusted baseline because of its 285-manifesto reference run):
| Model | mean | gap |
|---|---|---|
| GPT-4.1-mini | 0.73 | 0.73 |
| Gemini Flash (latest) | 0.32 | 0.81 |
| Gemini 2.5 Flash + thinking | 1.91 | 0.72 |
With an explicit thinking budget, gemini-2.5-flash produced systematically inflated populism scores, possibly because more reasoning leads the model to over-attribute populist rhetoric. We therefore use gemini-flash-latest in the final corpus.
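A matched-pair comparison of this kind can be sketched in a few lines of Python. The exact statistics behind the table columns are not specified on this page, so the two computed here (mean signed difference and mean absolute difference against the baseline) are assumptions, as are the function name `paired_gaps` and the dictionary layout keyed by manifesto ID.

```python
from statistics import mean


def paired_gaps(baseline, candidate):
    """Compare two models on the manifestos that both of them scored.

    baseline, candidate: dict mapping manifesto_id -> score (or None).
    """
    shared = [
        key for key in baseline
        if key in candidate
        and baseline[key] is not None
        and candidate[key] is not None
    ]
    if not shared:
        raise ValueError("no manifestos scored by both models")
    diffs = [candidate[key] - baseline[key] for key in shared]
    return {
        "n": len(shared),
        "mean_diff": mean(diffs),                       # signed: inflation vs. baseline
        "mean_abs_diff": mean(abs(d) for d in diffs),   # unsigned: typical disagreement
    }
```

A large positive `mean_diff` would correspond to the kind of systematic inflation described above, whereas a large `mean_abs_diff` with a small `mean_diff` would indicate noisy but unbiased disagreement.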
Reproducibility guarantees
- Code: github.com/sstoeckl/3175_populism_llm
- All prompts version-pinned in R/3c_Populism_v4_safe.R
- Model snapshots pinned: claude-sonnet-4-6, gpt-4.1-mini-2025-04-14, gemini-flash-latest (see R/16, R/15b, R/18)
- Raw chunk-level outputs preserved in parquet, span-validation logs versioned
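One way to make such pins auditable is to store the decoding parameters and model identifier next to each output, together with a hash of the prompt. The Python sketch below is a hypothetical illustration: `RunConfig` and `config_fingerprint` are invented names, the fingerprinting idea is not described on this page, and the dictionary is vendor-neutral, not any SDK's request format. Only the model identifiers and the temperature and top_p values come from the text above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Decoding settings recorded alongside every scored chunk."""
    model: str
    temperature: float = 0.0
    top_p: float = 1.0


# Model identifiers as listed above.
PINNED = [
    RunConfig("claude-sonnet-4-6"),
    RunConfig("gpt-4.1-mini-2025-04-14"),
    RunConfig("gemini-flash-latest"),
]


def config_fingerprint(cfg: RunConfig, prompt_text: str) -> str:
    """Stable SHA-256 over config and prompt, so any change is detectable."""
    payload = json.dumps(
        {"config": asdict(cfg), "prompt": prompt_text},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs with the same fingerprint used the same prompt text and decoding settings; whether the provider served the same weights behind a given identifier is outside what such a hash can verify.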
Non-standard errors
Following Menkveld et al. (2024), we report the distribution of regression coefficients across the full grid of defensible analytic choices:
| Fork | Options |
|---|---|
| Model selection | Sonnet alone, GPT-4.1-mini alone, Gemini alone, all pairs, all three |
| Within-manifesto aggregation | confidence-weighted mean (default), plain mean, median |
| Between-model aggregation | mean (default), median, leave-worst-out |
| Outcome normalisation | raw 0–10, z-scored within country, z-scored within year |
| Sample | full (1945–), post-2000, post-2010, drop low-agreement |
| Fixed effects | country + year, + party, country × year |
| SE cluster | country, country × year, party |
Coefficients across ~9,000 specifications are visualised in the interactive regression tool and as a specification curve in the paper appendix.
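The fork table can be turned into an explicit grid with a Cartesian product. The Python sketch below enumerates only the options as printed in the table, under two labelled assumptions: "all pairs" is expanded into the three possible two-model sets, and "+ party" is read as country + year + party fixed effects. The project's actual grid may be built differently, so the size of this naive product should not be read as the reported specification count.

```python
from itertools import combinations, product

MODELS = ["sonnet", "gpt-4.1-mini", "gemini"]

FORKS = {
    # Every non-empty subset of the three models: singles, pairs, all three.
    "models": [c for r in (1, 2, 3) for c in combinations(MODELS, r)],
    "within": ["confidence_weighted_mean", "mean", "median"],
    "between": ["mean", "median", "leave_worst_out"],
    "normalisation": ["raw", "z_country", "z_year"],
    "sample": ["full", "post_2000", "post_2010", "drop_low_agreement"],
    # Assumption: "+ party" means country + year + party.
    "fixed_effects": ["country+year", "country+year+party", "country*year"],
    "se_cluster": ["country", "country*year", "party"],
}


def specifications(forks=FORKS):
    """Yield one dict per combination of analytic choices."""
    names = list(forks)
    for combo in product(*(forks[name] for name in names)):
        yield dict(zip(names, combo))
```

Each yielded dictionary would parameterise one regression; collecting the coefficient of interest across all of them gives the distribution that a specification curve plots.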
References
- Menkveld, A. J., et al. (2024). Nonstandard Errors. Journal of Finance.
- Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification curve analysis. Nature Human Behaviour.
- Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science.
- Lehmann, P., et al. (2024). The Manifesto Project Dataset. WZB.