This post was written by Jay Ulfelder and originally appeared on Dart-Throwing Chimp. The work it describes is part of the NSF-funded MADCOW project to automate the coding of common political science datasets.
Guess what? Text mining isn’t push-button, data-making magic, either. As Phil Schrodt likes to say, there is no Data Fairy.
I’m quickly learning this point from my first real foray into text mining. Under a grant from the National Science Foundation, I’m working with Phil Schrodt and Mike Ward to use these techniques to develop new measures of several things, including national political regime type.
I wish I could say that I’m doing the programming for this task, but I’m not there yet. For the regime-data project, the heavy lifting is being done by Shahryar Minhas, a sharp and able Ph.D. student in political science at Duke University, where Mike leads the WardLab. Shahryar and I are scheduled to present preliminary results from this project at the upcoming Annual Meeting of the American Political Science Association in Washington, DC (see here for details).
When we started work on the project, I imagined a relatively simple and mostly automatic process running from location and ingestion of the relevant texts to data extraction, model training, and, finally, data production. Now that we’re actually doing it, though, I’m finding that, as always, the devil is in the details. Here are just a few of the difficulties and decision points we’ve had to confront so far.