On the Topic of Topic Modeling: NEH/MITH Workshop Wrap-up

Map of Twitter activity around the workshop (image courtesy of @lmrhody).

Overview

Saturday’s Topic Modeling for the Humanities Workshop at MITH was a terrific opportunity to zero in on the mechanics, methods, and applications of topic modeling. In light of recent online conversations about possible overuses and misapplications of MALLET, Saturday’s talks (geared towards humanists) provided some much-needed insight regarding when, why, and how topic modeling might help humanities research. My best takeaway was this helpful reminder: for humanists, topic modeling is not an end in itself; it is a means to test hypotheses, search for patterns, and enrich scholarly research. Perhaps most importantly, I finally feel confident in my pronunciation of Latent Dirichlet Allocation (it’s dee-rish-lay).

According to the workshop’s organizers, 75% of the 55 or so people in attendance are actively working on projects involving topic modeling. This includes my own historical study of prison newspapers. But in all honesty, my approach to topic modeling so far has gone a bit like this: “Whee! I’m plugging text into MALLET! I have results that look like ‘real’ data! But what did I just do? And what do I do with my results?”
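
In practice, that "Whee!" phase boils down to two commands. Below is a minimal sketch of the workflow, driven from Python so the commands are easy to rerun and tweak; the MALLET path, the corpus directory, and the choice of 20 topics are placeholder assumptions of mine, not recommendations from the workshop.

    # A minimal sketch of the MALLET workflow described above, driven from Python.
    # Assumptions of mine (not from the workshop): MALLET 2.x is unpacked locally,
    # and the corpus is a directory of plain-text files, one document per file.
    import subprocess

    MALLET = "mallet-2.0.8/bin/mallet"   # hypothetical path to the MALLET launcher
    CORPUS_DIR = "prison_papers/"        # hypothetical directory of .txt documents

    # Step 1: import the text files into MALLET's binary format, keeping word
    # order (needed for topic training) and stripping common English stopwords.
    subprocess.run([MALLET, "import-dir",
                    "--input", CORPUS_DIR,
                    "--output", "corpus.mallet",
                    "--keep-sequence",
                    "--remove-stopwords"], check=True)

    # Step 2: train an LDA model and write out the two files most people read:
    # the top words per topic, and the per-document topic proportions.
    subprocess.run([MALLET, "train-topics",
                    "--input", "corpus.mallet",
                    "--num-topics", "20",
                    "--num-iterations", "1000",
                    "--optimize-interval", "10",
                    "--output-topic-keys", "topic-keys.txt",
                    "--output-doc-topics", "doc-topics.txt"], check=True)

The resulting topic-keys.txt lists the most probable words for each of the 20 topics, and doc-topics.txt gives each document's topic proportions, which is roughly where the "what do I do with my results?" question begins.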

Topic modeling as a dolly zoom. Thanks @mcburton.

Talking through the process of topic modeling and interpreting the results with the humanists and computer scientists at the workshop helped demystify the more opaque elements of LDA. Some of the most helpful analogies for topic modeling were the idea of topic modeling as a “dolly zoom” into portions of the text (@mcburton), as well as MALLET output as an index to a huge, mostly unread, book (@patrick_mj).

A head-spinning amount of information was presented in the daylong workshop, and I hope to see recaps from some of the prolific bloggers and tweeters in attendance in the coming days. Here’s a very abbreviated version of what I found most helpful from each speaker:

  • Matt Jockers, Thematic Change and Authorial Innovation in the 19th Century Novel — I’ve often been asked questions about my project like “Isn’t ignoring some topics to focus on others skewing your data?” or “If your results are different every time you run the model, how is your data ‘real’?” Matt did a great job pointing out how the essence of scholarly work already involves placing attention on some themes at the expense of others. In other words, it is perfectly okay to ignore some topics. When I develop a topic model, I am injecting my assumptions about what matters into the construction and interpretation of the model.
  • Rob Nelson, Analyzing Nationalism and Other Slippery ‘isms’ — A topic is a list of co-occurring words, but the appearance of a topic can mean many different things. Nelson’s analysis of wartime rhetoric in the North versus the South shows just how crucial historical context is to interpreting the model’s output.
  • Jordan Boyd-Graber, Incorporating Human Knowledge and Insights into Probabilistic Models of Text — Topic models are “willfully ignorant” of the meanings of words, which can be both good and bad. We already insert ourselves into the model simply by choosing what to focus on. However, if we use human engagement to shift how topics are defined, we can get “better” topics as a result.
  • Jo Guldi, Paper Machines: A Tool for Analyzing Large-Scale Digital Corpora — As a point of entry into large volumes of archival text, topic modeling can tell us where to start looking. Specifically, topic modeling can be a useful first step in identifying patterns, breaks, and archival dissent. In short, topic models can provide critical distance in a way that leafing through archival pages can’t.
  • Chris Johnson-Roberson, Paper Machines: A Tool for Analyzing Large-Scale Digital Corpora — The GUI offered by Paper Machines (the plugin for Zotero) can help us sort through archives that contain more data than we could ever read. Paper Machines is a great example of how a GUI can “democratize” topic modeling and data visualization for folks uncomfortable with the command line and the underlying math.
  • David Mimno, The Details: How We Train Big Topic Models on Lots of Text — “Computer-assisted humanities” might be a better term than “digital humanities” in terms of the type of scholarship we should aim to produce. Also, when accompanied by a mind-blowingly succinct explanation, it might actually be possible for a humanist like me to understand the math behind Gibbs Sampling (see the sketch just after this list).
  • David Blei, Topic Modeling in the Humanities Roundtable Discussion — If you’re working with a corpus spanning a long range of time (i.e. the better part of a century or longer), language used to explain your topics is going to change. Dynamic topic models can account for this problem, offering quite a few advantages over trying to re-model your topics over smaller periods of time.
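
For the curious, here is roughly the piece of math Mimno compressed so well. Collapsed Gibbs sampling, the training procedure behind MALLET, sweeps over every word token and reassigns it to a topic with probability proportional to two counts. The notation below is my own sketch, not a transcription of his slides:

    P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{-i} + \alpha\right)\,\frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}

Here n_{d,k} counts the tokens in document d currently assigned to topic k, n_{k,w_i} counts how often the word w_i has been assigned to topic k across the corpus, n_k is the total number of tokens assigned to topic k, V is the vocabulary size, alpha and beta are the smoothing hyperparameters, and the superscript -i means the current token is left out of the counts. The first factor pulls a word toward topics already prominent in its own document; the second pulls it toward topics that already account well for that word.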

Going Forward

The final workshop Q&A addressed issues of cross-fertilization, including how humanists and computer scientists can effectively collaborate on topic modeling projects. The consensus seemed to be that computer scientists want clean, interesting corpora to work with and (I hope this goes without saying) should not be viewed simply as executors of humanists’ projects.

If I had to create my ideal environment in which to move forward with topic modeling projects, it would include:

  1. A more hands-on follow up event where we could workshop our projects;
  2. A summer statistics institute for humanists;
  3. Detailed documentation for guiding data from the input through the visualization phase.

Digitization and the “Canon”

On a final note, here is some commentary on a topic that came up in the Twitter backchannels but not in the workshop itself. Most of the topic modeling projects we heard about make use of already digitized newspapers and literary works (with the exception of Jo Guldi’s archival work). “Because it was already digitized” seems to be a go-to reason for corpus selection in a lot of topic modeling projects. Funding for digitization is much less readily available right now than funding for digital innovation, so the self-selection evident in topic modeling corpora isn’t likely to change anytime soon. As a result, the push to include non-canonical texts in digital humanities work is severely hampered by default.

Many thanks to Jen Guiliano, Travis Brown, and the workshop presenters for all of their hard work. I’m looking forward to more topic modeling fun at this month’s Chicago Colloquium on Digital Humanities and Computer Science.

Further Reading

Workshop Zotero archive
Slides from David Mimno’s Workshop Presentation (PDF)
Collaborative Google Doc with Workshop Notes (courtesy of Brian Croxall)
Thomas Padilla’s AYBABTU or Topic Modeling in the Humanities

5 thoughts on “On the Topic of Topic Modeling: NEH/MITH Workshop Wrap-up”

  1. Pingback: Thomas Padilla

  2. Don’t be so sure about the pronunciation of Dirichlet. :-) You are certainly saying it the way the originators of LDA say it, so you are in good company. Me, I’ve settled on a hard k sound:
    “di” like in “dish”
    “ri” like in “rich” but with a Frrrrench rrrrrrrr
    “kle” like in “cleptomania”
    This is based on this thread: http://mathforum.org/kb/message.jspa?messageID=3415769. Your mileage may vary. Thanks for the post!

  3. Pingback: Topic Modeling: New Software and a Wrap-up of our NEH-Sponsored Workshop | Maryland Institute for Technology in the Humanities
