Evaluating LLMs for Credible and Rigorous Social Science Research

A talk for the Center for Social Science Scholarship at UMBC: treating LLMs as measurement instruments under metrology principles, not as tools and not as people.

I gave a talk to the Center for Social Science Scholarship at the University of Maryland, Baltimore County, on how to evaluate LLMs when we want to use them for credible scientific research. The full slide deck is interactive and lives here:

Open the deck → mro0001.github.io/evaluation-presentation (source on GitHub)

You can navigate with arrow keys. Press f for fullscreen, o for an overview of all slides, and ? for the full keymap.

The argument in one paragraph

Researcher use of LLMs has gone from 57% in 2024 to 84% in 2025, and the methods we use to evaluate them have not kept up. The two defaults — “LLM as tool” and “LLM as person” — both miss what an LLM actually is in a research workflow: a measurement instrument that takes a document or a prompt as input and produces a classification, score, or extraction as output. Once you accept that frame, the right intellectual home is not classical test theory and not psychometrics. It is metrology — the science of measurement and uncertainty. Every measurement carries uncertainty; every instrument needs validation; and the result of any measurement is best estimate ± uncertainty, not a single number reported as if it were ground truth.
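To make the "best estimate ± uncertainty" framing concrete, here is a minimal sketch of the standard Type A evaluation from measurement science: repeated readings give a mean, and the uncertainty of that mean is reported alongside it. The function name and the sample data are my own illustration, not anything from the deck.

```python
import math

def report_measurement(readings, k=2):
    """Report a measurement as (best estimate, expanded uncertainty).

    readings: repeated observations from the same instrument
    k: coverage factor; k=2 gives roughly 95% coverage for
       approximately normal data
    """
    n = len(readings)
    mean = sum(readings) / n
    # sample standard deviation of the individual readings
    s = math.sqrt(sum((x - mean) ** 2 for x in readings) / (n - 1))
    u = s / math.sqrt(n)   # standard uncertainty of the mean
    return mean, k * u     # quote the result as: mean ± k*u

# e.g. five repeated readings of the same quantity
est, unc = report_measurement([10.1, 9.9, 10.0, 10.2, 9.8])
```

The point of the sketch is the return type: the result of a measurement is a pair, never a bare number.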

The protocol

The talk walks through a five-component framework — TaMPER (Task, Model, Prompt, Evaluation, Reporting) — and a sample-benchmark-population approach that lets researchers combine a small hand-coded calibration set with a large LLM-coded corpus and still produce valid confidence intervals (Prediction-Powered Inference, Angelopoulos et al. 2023, Science).
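As a rough illustration of the sample-benchmark-population idea (my own sketch, not the Angelopoulos et al. implementation; all names here are hypothetical), a prediction-powered estimate of a population mean debiases the LLM-coded corpus with a "rectifier" measured on the small hand-coded calibration set, and the confidence interval accounts for both sources of sampling error:

```python
import numpy as np

def ppi_mean_ci(y_labeled, yhat_labeled, yhat_unlabeled, z=1.96):
    """Prediction-powered point estimate and CI for a population mean.

    y_labeled:      human codes on the small calibration set (size n)
    yhat_labeled:   LLM codes on that same calibration set
    yhat_unlabeled: LLM codes on the large corpus (size N)
    """
    n, N = len(y_labeled), len(yhat_unlabeled)
    # rectifier: the LLM's average error, measured where we have truth
    rectifier = np.mean(yhat_labeled - y_labeled)
    estimate = np.mean(yhat_unlabeled) - rectifier
    # variance combines corpus sampling error and rectifier sampling error
    se = np.sqrt(np.var(yhat_unlabeled, ddof=1) / N
                 + np.var(yhat_labeled - y_labeled, ddof=1) / n)
    return estimate, estimate - z * se, estimate + z * se
```

The CI is valid even when the LLM is biased, because the bias is estimated and subtracted rather than assumed away; a more accurate LLM simply shrinks the rectifier term and tightens the interval.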

The criteria

I argue that any defensible LLM-assisted analysis must demonstrate a minimum floor of three primary criteria: Compliance, Accuracy, and Precision.

Above that floor sit the secondary Quality criteria — Linguistic, Logic, Information Fidelity, and Informativeness — which tell us not whether the data is right but whether the instrument itself is functioning. Two questions, two layers: accuracy and precision evaluate what we measured; compliance and quality evaluate the thing doing the measuring.

Worked examples woven through

Rather than abstract principles, the deck threads worked examples from my own and my collaborators’ empirical work through the argument.

Why I think this matters

If we treat LLMs as black-box tools, we lose the ability to defend the choices that go into a study. If we treat them as people, we import standards (consent, intent, expertise) that don’t apply. Treating them as measurement instruments — with documented specs, reported uncertainty, and convergent validation across criteria — gives social science a path forward that is rigorous and scalable. The infrastructure for that work is being built right now.

The deck closes with three things to take into your next paper, and an illustrative game of 20 Questions that I think makes the verifiability problem unforgettable. I would rather not spoil that one — it is on the live deck.

The technical scaffolding (CSS, Reveal.js distribution, widget framework) is lifted from the REACH 2026 workshop site — Wiggins, Layman, and Robison at UI Insight, University of Idaho — with attribution. The substance, the data, and the argument are mine.

Comments, pushback, and collaborations are all welcome. I am at .