EVENTS CALENDAR

Workshop: Brandon Stewart, Princeton University, “Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Statistical Analyses”

October 15 @ 12:00 pm PDT
Free

Biography: Brandon Stewart is Associate Professor of Sociology at Princeton University, where he is also affiliated with the Office of Population Research and numerous other centers on campus. He currently serves as Co-Editor-in-Chief of Political Analysis and Associate Editor at Sociological Methods & Research. His work spans several areas of computational social science, with a focus on text as data and causal inference.
Abstract: Social scientists use automated annotation methods, such as supervised machine learning and, more recently, large language models (LLMs), that can predict labels and generate text-based variables. While such predicted text-based variables are often analyzed as if they were observed without errors, we show that ignoring prediction errors in the automated annotation step leads to substantial bias and invalid confidence intervals in downstream analyses, even if the accuracy of the automated annotations is high, e.g., above 90%. We propose a framework of design-based supervised learning (DSL) that can provide valid statistical estimates, even when predicted variables contain non-random prediction errors. DSL employs a doubly robust procedure to combine predicted labels and a smaller number of expert annotations. DSL allows scholars to apply advances in LLMs to social science research while maintaining statistical validity. We illustrate its general applicability using two applications where the outcome and independent variables are text-based.
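The idea of combining predicted labels with a smaller set of expert annotations can be illustrated with a short sketch. The abstract does not state the estimator, so the code below assumes one standard design-based form: experts label a random subset of documents with a known probability pi, and each prediction is corrected by the inverse-probability-weighted error observed on that subset. The function names and the toy data are hypothetical, and this is not the speaker's implementation.

```python
from statistics import mean


def naive_mean(pred):
    """Average the predicted labels as if they were observed without error."""
    return mean(pred)


def pseudo_outcomes(pred, expert, labeled, pi):
    """Bias-corrected pseudo-outcome for each document (assumed form).

    pred    : predicted labels (e.g. from an LLM), available for every document
    expert  : expert labels, None where no expert annotation exists
    labeled : 1 if the document was sampled for expert annotation, else 0
    pi      : known probability that a document is sampled for annotation
    """
    out = []
    for p, e, r in zip(pred, expert, labeled):
        if r:
            # Subtract the prediction error, up-weighted by 1/pi so that the
            # correction averages to the true error over repeated sampling.
            out.append(p - (p - e) / pi)
        else:
            # No expert label: keep the prediction unchanged.
            out.append(p)
    return out


def dsl_mean(pred, expert, labeled, pi):
    """Estimate the mean label from the corrected pseudo-outcomes."""
    return mean(pseudo_outcomes(pred, expert, labeled, pi))


if __name__ == "__main__":
    # Toy data: four documents, two of them expert-annotated (pi = 0.5).
    pred = [1, 1, 0, 1]
    expert = [1, None, None, 0]
    labeled = [1, 0, 0, 1]
    print("naive:", naive_mean(pred))
    print("corrected:", dsl_mean(pred, expert, labeled, pi=0.5))
```

Because the sampling probability is set by the researcher, the correction in this sketch does not depend on the prediction errors being random, which is the property the abstract emphasizes. Individual pseudo-outcomes can fall outside the range of the original labels; only their average is meant to be interpreted.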
