Methodology

Pipeline, prompts, validation, reproducibility

Overview

flowchart TB
    A[CMP corpus<br/>3,327 manifestos] --> B[Token-chunking<br/>20k tokens, 2k overlap]
    B --> C1[Sonnet 4.6]
    B --> C2[GPT-4.1-mini]
    B --> C3[Gemini Flash]
    C1 --> D[Per-chunk JSON<br/>strict schema]
    C2 --> D
    C3 --> D
    D --> E[Span validation<br/>regex + LLM-judge]
    E --> F[Confidence-weighted<br/>per-manifesto aggregation]
    F --> G[Cross-model aggregation<br/>configurable weighting]
    G --> H[Panel regressions<br/>with non-standard errors]
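
The token-chunking step at the top of the diagram can be sketched as a sliding window. This is a minimal illustration, not the project's code: the tokenizer is not named here, so the function works on any already-tokenized sequence, and the name `chunk_tokens` is hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int = 20_000, overlap: int = 2_000) -> List[list]:
    """Split a token sequence into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if len(tokens) == 0:
        return []

    stride = chunk_size - overlap  # each new window starts this many tokens later
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


# Example with a dummy 45,000-token document (integers stand in for tokens).
example_chunks = chunk_tokens(list(range(45_000)))
print([len(c) for c in example_chunks])
```

With the defaults, consecutive windows start 18,000 tokens apart, so the last 2,000 tokens of one chunk reappear at the start of the next.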

The corpus

We score every party manifesto in the Manifesto Project Database (CMP / WZB) that has a machine-readable full text and falls within our analysis window (1945–2025). The text is loaded via the manifestoR package using the WZB API. We do not redistribute the manifesto texts themselves (see download page for licensing).

Selection	Count
Manifestos with full text	3,327
Countries	50+
Languages	EN, DE, ES, FR, IT, PL, CZ, EL, NL, …

Scoring schema

Each manifesto chunk is sent to each LLM under an identical strict JSON schema that requires, per scored dimension:

{
  "score": <integer 0–10 or null>,
  "evidence_status": "verified" | "mixed" | "no_evidence",
  "span": {
    "quote":   "<3–8 word verbatim quote from the chunk>",
    "context": "<surrounding ~3 sentences>"
  }
}
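
A first-pass check of one such record might look like the sketch below. It assumes the record has already been parsed into a Python dict. The function names are hypothetical, and two choices are assumptions rather than documented behaviour: whitespace and case are normalised before the quote is matched, and the quote check is skipped when `evidence_status` is `no_evidence`. It covers only an exact-match tier, not the full fuzzy matching or LLM-judge steps.

```python
import re
from typing import Any, Dict, List

ALLOWED_STATUS = {"verified", "mixed", "no_evidence"}


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not defeat the match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_record(record: Dict[str, Any], chunk_text: str) -> List[str]:
    """Return a list of problems; an empty list means the record passes these checks."""
    problems = []
    score = record.get("score")
    status = record.get("evidence_status")
    quote = (record.get("span") or {}).get("quote") or ""

    if status not in ALLOWED_STATUS:
        problems.append("unknown evidence_status")

    # bool is a subclass of int in Python, so exclude it explicitly.
    is_int = isinstance(score, int) and not isinstance(score, bool)
    if score is not None and not (is_int and 0 <= score <= 10):
        problems.append("score must be an integer 0-10 or null")

    if status == "no_evidence":
        if score is not None:
            problems.append("no_evidence requires score=null")
    else:
        n_words = len(quote.split())
        if not 3 <= n_words <= 8:
            problems.append("quote must be 3-8 words")
        elif normalise(quote) not in normalise(chunk_text):
            problems.append("quote not found verbatim in chunk")
    return problems


chunk = "We will take power back from the corrupt elites\nand return it to ordinary people."
record = {
    "score": 8,
    "evidence_status": "verified",
    "span": {"quote": "corrupt elites and return it", "context": "..."},
}
print(check_record(record, chunk))
```

A record that fails here would go on to the later validation stages rather than being discarded outright.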

Thirteen dimensions are scored, in three groups:

Group	Code	Label
Populism	pop_anti_elitism	Anti-Elitism
Populism	pop_people_centrism	People-Centrism
Populism	pop_manichean	Manichaean Worldview
Populism	populism_overall	Populism (Overall)
Ideology	pop_ideology_left	Ideology — Left
Ideology	pop_ideology_right	Ideology — Right
Ideology	pop_ideology_centrist	Ideology — Centrist
Ideology	pop_ideology_overall	Ideology (Right − Left)
Liberalism	lib_political	Political Liberalism
Liberalism	lib_social	Social Liberalism
Liberalism	lib_economic	Economic Liberalism
Liberalism	lib_financial_market	Financial-Market Liberalism
Liberalism	liberalism_overall	Liberalism (Overall)

Key technical decisions

Decision	Choice	Rationale
Chunking	20k-token chunks, 2k overlap, 18k full-text cap	Fits all three models’ context with margin
Reproducibility knobs	`temperature=0`, `top_p=1`, pinned model snapshots	Deterministic where the APIs allow
NA propagation	`score=null` for `no_evidence` rather than `score=0`	“Topic absent” ≠ “topic present at zero intensity”
Anti-hallucination	Per-dimension verbatim quote required; status enum forces an explicit evidence judgement	Guards against the “make up a score” failure mode
Span validation	Post-hoc 6-tier fuzzy match + LLM-as-judge backstop	Catches paraphrased / fabricated quotes that the schema can’t enforce
Within-manifesto aggregation	Confidence-weighted mean across chunks (default; configurable)	Down-weights low-confidence chunk scores
Between-model aggregation	Simple mean (default; configurable)	No model is privileged a priori; specification curve reports robustness
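
The NA-propagation and aggregation rows can be illustrated together. The sketch below is an assumption-laden reading, not the project's implementation: the source of the per-chunk confidence weight is not specified here, so it is passed in as a plain number, and the helper names `weighted_mean` and `cross_model_mean` are hypothetical. Skipping models with a null score in the cross-model mean is also an assumption.

```python
from typing import Dict, List, Optional, Tuple


def weighted_mean(chunk_scores: List[Tuple[Optional[float], float]]) -> Optional[float]:
    """Confidence-weighted mean over (score, confidence) pairs; null scores are skipped."""
    scored = [(s, w) for s, w in chunk_scores if s is not None and w > 0]
    total_weight = sum(w for _, w in scored)
    if total_weight == 0:
        return None  # no chunk carried evidence: stay null rather than become 0
    return sum(s * w for s, w in scored) / total_weight


def cross_model_mean(per_model: Dict[str, Optional[float]]) -> Optional[float]:
    """Simple mean across models, ignoring models whose manifesto score is null."""
    values = [v for v in per_model.values() if v is not None]
    return sum(values) / len(values) if values else None


# One dimension of one manifesto: three chunks, the second with no evidence.
manifesto_score = weighted_mean([(8, 1.0), (None, 1.0), (4, 0.5)])
print(manifesto_score)

# Combining three per-manifesto scores, one of which is null.
print(cross_model_mean({"sonnet": 6.0, "gpt-4.1-mini": 7.0, "gemini": None}))
```

Returning `None` when no chunk has a score keeps "topic absent" distinct from "topic present at zero intensity" all the way up to the manifesto level.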

Cross-model calibration

A 50-manifesto matched-pair pilot against Claude Sonnet 4.6 (the most-trusted baseline because of its 285-manifesto reference run) measured:

Model	mean	gap
GPT-4.1-mini	0.73	0.73
Gemini Flash (latest)	0.32	0.81
Gemini 2.5 Flash + thinking	1.91	0.72

Gemini 2.5 Flash (gemini-2.5-flash) with an explicit thinking budget produced systematically inflated populism scores, possibly because more reasoning leads the model to over-attribute populist rhetoric. We therefore use gemini-flash-latest in the final corpus.

Reproducibility guarantees

Code: github.com/sstoeckl/3175_populism_llm
All prompts version-pinned in R/3c_Populism_v4_safe.R
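Model identifiers pinned: claude-sonnet-4-6, gpt-4.1-mini-2025-04-14, gemini-flash-latest (see R/16 / R/15b / R/18). Of these, gemini-flash-latest is a moving alias rather than a dated snapshot, so later re-runs may resolve to a different underlying model.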
Raw chunk-level outputs preserved in parquet, span-validation logs versioned

Non-standard errors

Following Menkveld et al. (2024), we report the distribution of regression coefficients across the full grid of defensible analytic choices:

Fork	Options
Model selection	Sonnet alone, GPT-4.1-mini alone, Gemini alone, all pairs, all three
Within-manifesto aggregation	confidence-weighted mean (default), plain mean, median
Between-model aggregation	mean (default), median, leave-worst-out
Outcome normalisation	raw 0–10, z-scored within country, z-scored within year
Sample	full (1945–), post-2000, post-2010, drop low-agreement
Fixed effects	country + year, + party, country × year
SE cluster	country, country × year, party

Coefficients across ~9,000 specifications are visualised in the interactive regression tool and as a specification curve in the paper appendix.
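
Enumerating such a grid amounts to a Cartesian product over the forks. The sketch below uses hypothetical option labels and one particular reading of the table (for instance, three fixed-effects options); it is meant to show the mechanics, not to reproduce the exact grid behind the reported figure.

```python
from itertools import combinations, product

models = ["sonnet", "gpt-4.1-mini", "gemini"]
# Every non-empty subset of the three models: 3 singles, 3 pairs, 1 triple.
model_sets = [c for r in (1, 2, 3) for c in combinations(models, r)]

forks = {
    "model_set": model_sets,
    "within": ["confidence_weighted_mean", "mean", "median"],
    "between": ["mean", "median", "leave_worst_out"],
    "normalisation": ["raw", "z_country", "z_year"],
    "sample": ["full", "post_2000", "post_2010", "drop_low_agreement"],
    "fixed_effects": ["country+year", "country+year+party", "country_x_year"],
    "se_cluster": ["country", "country_x_year", "party"],
}


def enumerate_specs(forks):
    """Return one dict per specification, i.e. one choice for every fork."""
    names = list(forks)
    return [dict(zip(names, combo)) for combo in product(*forks.values())]


specs = enumerate_specs(forks)
print(len(specs))  # 7 * 3 * 3 * 3 * 4 * 3 * 3 = 6804
```

Under this reading the naive product is 6,804 specifications, which is below the roughly 9,000 quoted above, so the actual grid presumably counts some forks differently than the table suggests. Each resulting dict would parameterise one regression run.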

References

Menkveld, A. J., et al. (2024). Nonstandard Errors. Journal of Finance.
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification curve analysis. Nature Human Behaviour.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science.
Lehmann, P., et al. (2024). The Manifesto Project Dataset. WZB.