Notes From a First Foray into Text Mining
This post was written by Jay Ulfelder and originally appeared on Dart-Throwing Chimp. The work it describes is part of the NSF-funded MADCOW project to automate the coding of common political science datasets.
Guess what? Text mining isn’t push-button, data-making magic, either. As Phil Schrodt likes to say, there is no Data Fairy.
I’m quickly learning this point from my first real foray into text mining. Under a grant from the National Science Foundation, I’m working with Phil Schrodt and Mike Ward to use these techniques to develop new measures of several things, including national political regime type.
I wish I could say that I’m doing the programming for this task, but I’m not there yet. For the regime-data project, the heavy lifting is being done by Shahryar Minhas, a sharp and able Ph.D. student in political science at Duke University, where Mike leads the WardLab. Shahryar and I are scheduled to present preliminary results from this project at the upcoming Annual Meeting of the American Political Science Association in Washington, DC (see here for details).
When we started work on the project, I imagined a relatively simple and mostly automatic process running from location and ingestion of the relevant texts to data extraction, model training, and, finally, data production. Now that we’re actually doing it, though, I’m finding that, as always, the devil is in the details. Here are just a few of the difficulties and decision points we’ve had to confront so far.
First, the structure of the documents available online often makes it difficult to scrape and organize them. We initially hoped to include annual reports on politics and human-rights practices from four or five different organizations, but some of the ones we wanted weren’t posted online in a format we could readily scrape. At least one was scrapable but not organized by country, so we couldn’t properly group the text for analysis. In the end, we wound up with just two sets of documents in our initial corpus: the U.S. State Department’s Country Reports on Human Rights Practices, and Freedom House’s annual Freedom in the World documents.
Differences in naming conventions almost tripped us up, too. For our first pass at the problem, we are trying to create country-year data, so we want to treat all of the documents describing a particular country in a particular year as a single bag of words. As it happens, the State Department labels its human rights reports for the year on which they report, whereas Freedom House labels its Freedom in the World report for the year in which it’s released. So, for example, both organizations have already issued their reports on conditions in 2013, but Freedom House dates that report to 2014 while State dates its version to 2013. Fortunately, we knew this and made a simple adjustment before blending the texts. If we hadn’t known about this difference in naming conventions, however, we would have ended up combining reports for different years from the two sources and made a mess of the analysis.
Once ingested, those documents include some text that isn’t relevant to our task, or that is relevant but the meaning of which is tacit. Common stop words like “the”, “a”, and “an” are obvious and easy to remove. More challenging are the names of people, places, and organizations. For our regime-data task, we’re interested in the abstract roles behind some of those proper names—president, prime minister, ruling party, opposition party, and so on—rather than the names themselves, but text mining can’t automatically derive the one for the other.
For our initial analysis, we decided to omit all proper names and acronyms to focus the classification models on the most general language. In future iterations, though, it would be neat if we could borrow dictionaries developed for related tasks and use them to replace those proper names with more general markers. For example, in a report or story on Russia, Vladimir Putin might get translated into <head of government>, the FSB into <police>, and Chechen Republic of Ichkeria into <rebel group>. This approach would preserve the valuable tacit information in those names while making it explicit and uniform for the pattern-recognition stage.
That’s not all, but it’s enough to make the point. These things are always harder than they look, and text mining is no exception. In any case, we’ve now run this gantlet once and made our way to an encouraging set of initial results. I’ll post something about those results closer to the conference when the paper describing them is ready for public consumption. In the meantime, though, I wanted to share a few of the things I’ve already learned about these techniques with others who might be thinking about applying them, or who already do and can commiserate.