Exclusive to M/m Print Plus

Topic Modeling Modernism/modernity

Editor's note: Blog revised on June 21, 2016 to incorporate missing articles from original M/m corpus. 

Laura Heffernan’s introductory post describes work being done in what she calls the “new disciplinary history.” I have an interest in using quantitative methods to practice disciplinary history. In this post, I explore some of these methods using the archives of Modernism/modernity.

The quantitative analysis of a journal has a long history. Sociologists of science, for example, have long used citation patterns to reveal the disciplinary structure of a field. More recently, digitized journals and contemporary computational tools have lowered the barrier to entry for the often-laborious task of analyzing a journal’s content. In what follows, I’ll give some notes on how I used the back catalogue of Modernism/modernity to build a topic model, a context-sensitive word-vector model, and several forms of citation analysis.

Topic Modeling

Topic modeling is a technique used by information-retrieval researchers to classify documents. It infers the thematic structure of a collection by analyzing co-occurrence patterns. There are several varieties of topic modeling. The one I used to explore Modernism/modernity is known as LDA, and a good non-technical introduction to it can be found in this article by David Blei, one of its originators. There have been several examples of humanities disciplinary analysis using topic modeling. Andrew Goldstone’s and Ted Underwood’s “The Quiet Transformations of Literary Studies” looks at literary studies and John Laudun’s and my “Computing Folklore Studies” examines folklore.

Creating a topic model of a journal involves converting the full text of each article to a bag-of-words representation, which records how often each word occurs but not the order in which the words appear. The algorithm is therefore not context-sensitive. The full text of Modernism/modernity was graciously made available to me by Johns Hopkins University Press. Project Muse, the platform on which the journal is hosted, does not offer an API for processing such requests, as JSTOR does. I used Andrew Goldstone’s dfrtopics to create the topic model in R. Goldstone’s package provides an interface to MALLET, a commonly used implementation of LDA, along with many useful functions for pre-processing and analysis.
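To make the bag-of-words idea concrete, here is a minimal sketch in Python. It is not the dfrtopics or MALLET pipeline used for this post; the function name and the two sample sentences are invented for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, discarding word order entirely."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two made-up sentences that use the same words in a different order.
a = bag_of_words("The critic reads the poet")
b = bag_of_words("The poet reads the critic")

# The two bags are identical, even though the sentences mean different things.
print(a == b)
print(a.most_common(1))
```

Because only the counts survive, a model built on this representation cannot distinguish between the two sentences.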

The HTML text of the journal needs to be converted to plain text, and a metadata file that indexes each article also needs to be created. The technical details of how I did this are beyond the scope of this post, but I will be glad to share them with anyone interested. After the pre-processing has been completed, the next major decision is choosing the number of topics that will be found in the collection. There are ways to automate this selection, but I find that trial-and-error is usually sufficient. For this model, I decided on fifty topics. I also used a large stop-word list provided by the package, and I added a few words to it. (Again, I will share the stop-word list and other settings used to create the model with anyone interested. I cannot redistribute the source files, however.)
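For readers curious about what this kind of pre-processing looks like, the following is a rough sketch using only Python's standard library. It is not the procedure used for this post: the class and function names, the sample HTML, and the tiny stop-word list are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# A deliberately tiny, hypothetical stop-word list; a real one is far larger.
STOP_WORDS = {"the", "of", "and", "a", "in", "is"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on letters, and drop any stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]


sample = ("<html><body><h1>Modernism</h1>"
          "<p>The history of the <i>avant-garde</i> in Europe</p>"
          "<script>var x = 1;</script></body></html>")
plain = html_to_text(sample)
print(plain)
print(remove_stop_words(plain))
```

Note that this simple tokenizer splits hyphenated words such as "avant-garde" into two tokens, which is one of many small decisions that shape the resulting model.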

After the model was created, I used Andrew Goldstone’s dfr-browser to create this visualization. The browser has a number of different views used to reveal different aspects of the model. If you choose the “List” view and sort by proportion of corpus, you’ll notice several topics that have high representation with little in the way of topic specificity. Topic 28, for instance, is the most prominent in the corpus. It is made up entirely of what I call “argument words”—words which are often used to establish claims in academic discourse. These prove difficult to eliminate via stop-words, as there are always more of them than you might suspect. Similarly, Topic 18 shows words that are particular to book reviews.

When creating topic models of journals, book reviews and editorial comments are often eliminated. I chose not to do so with this model, as I wanted to show how the model would classify the entirety of the journal. I even included the “Recent Books of Interest” feature, which is largely clustered in Topic 37 and Topic 42. (Advertising, however, is not included.) Topic 16 shows a cluster of articles devoted to textual editing and genetic criticism. I also chose not to eliminate “modernism” from the model, even though it’s in the title of the journal and thus is going to be even more prevalent than it would be otherwise.

The topics themselves range from ones devoted to well-known authors (Stein, Eliot, Woolf, and Beckett) to those that reflect the interdisciplinary character of the journal: art history, film, architecture, and music. As the journal has maintained a consistent focus for its relatively short period of publication, there was little of notice in the chronological trends displayed in the browser. The film and Wilde/decadence topics show a general increase, though the special issue in 2008 accounts for much of the latter’s rise. In general, the LDA algorithm does not do a good job of tracking changes over time. There are variants such as the dynamic topic model that are designed to track semantic change over time, but I have not implemented one here.

Word2vec, or Theology - God = Theory

Next, I want to explore the Modernism/modernity corpus through word2vec, a machine learning technique that has recently received a great deal of attention (to see an interactive tree visualization of this model, visit the author's website). Humanists who have used it or related word-vector methods include Michael Gavin, Ryan Heuser, and Ben Schmidt. I have used Schmidt’s wordVectors for this analysis. As these posts explain, vector-based approaches seem to have more potential for discourse analysis than topic models. They are capable of startling feats of analogy on large corpora. One journal, however, is not large enough in general to show many of these effects. The standard example of “king” - “man” + “woman” = “queen” does not quite work on this corpus, returning only a vector consisting of “king” and “woman.” (“Theology” - “God” did equal “theory,” however.) Perhaps a journal in renaissance studies would be more likely to complete this analogy.
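The arithmetic behind such analogies can be sketched in a few lines of Python. The three-dimensional vectors below are invented by hand so that the example works; they are not drawn from the Modernism/modernity model, and the function names are hypothetical. Unlike the result reported above, this sketch also assumes that the query words themselves are excluded from the answer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(vectors, positive, negative, topn=1):
    """Add the positive vectors, subtract the negative ones, and return
    the remaining words closest to the result."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)  # assumption: drop the query words
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:topn]]


# Hand-made toy vectors; the dimensions loosely stand for
# (royalty, male, female). A trained model has hundreds of opaque dimensions.
toy = {
    "king":  [1.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
    "poem":  [0.0, 0.5, 0.5],
}

result = analogy(toy, positive=["king", "woman"], negative=["man"])
print(result)
```

In a trained model the vectors are learned from the corpus rather than written by hand, which is why a small corpus may fail to place “queen” near the result.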

Visualizing the entirety of a word2vec model is difficult, and I will not attempt it here. As with the co-citation network, better results would be achieved by combining the journal with others in the field. It is easier to visualize certain vectors. If you choose the top words in Topic 12, and reduce them to a two-dimensional space, this is the resulting visualization:

The terms nearest “sex” in the vector model are “sexuality,” “sexual,” “female,” and “illicit.” Compare this list to Topic 30, in which “illicit” does not appear. The word “genre” appears in Topic 12 and Topic 25 of the topic model, but it is not strongly associated with either. In the vectorized model, “detective” is most strongly associated with “genre,” along with “conventions” and “reappraisal.”

Neither “illicit” nor “detective” appears in the topic model because neither word appears frequently enough in the corpus. The context-sensitive word2vec model, however, assigns them a greater significance because of how often they appear within a certain window of words like “genre” and “sex.” Humanists are often critical of the bag-of-words representation that topic modeling approaches use, and it’s an understandable reaction. The word2vec model, while contextualized, uses an almost completely opaque internal representation. (Distrust of the “black box” model runs high among humanists, for good reason.) I feel that these classifying systems are best used for exploratory purposes. They can create unexpected juxtapositions or help you find things that you might have otherwise overlooked, but I’m more skeptical of interpretations based on them.
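The notion of a context window can be illustrated with a simple count of neighboring words. This is only a sketch of the windowing idea: word2vec itself trains a small neural network on such windows rather than counting directly. The function name, the window size, and the sample sentence are assumptions made for the example.

```python
from collections import Counter


def window_neighbors(tokens, target, window=2):
    """Count the words that fall within `window` positions of each
    occurrence of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# An invented sentence, not text from the journal.
tokens = ("the detective genre has conventions and "
          "the detective genre rewards reappraisal").split()

neighbors = window_neighbors(tokens, "genre", window=2)
print(neighbors.most_common())
```

A word that is rare in the corpus overall can still stand out here if it reliably turns up next to the target word.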

Citation Analysis, Three Ways

There are a variety of measures of what is cited in a journal. The most-cited articles published in the journal measure what other readers have found important or central to their work. The idea of what it means to cite something varies across many disciplinary contexts, of course. Here’s how Bruno Latour visualized the rhetoric of citations:

Humanists cite fewer materials than scientists. Journal articles are less important for humanists’ work, generally speaking, than monographs. Furthermore, many well-cited journal articles often end up published in book form, diluting the citation pool.

Here are the most-cited articles published in Modernism/modernity according to Google Scholar:

Article | Citations
Miriam Hansen, “The Mass Production of the Senses: Classical Cinema as Vernacular Modernism.” 1999. | 297
Susan Stanford Friedman, “Definitional Excursions: The Meanings of Modern/Modernity/Modernism.” 2001. | 167
Susan Stanford Friedman, “Periodizing Modernism: Postcolonial Modernities and the Space/Time Borders of Modernist Studies.” | 110
George W. Stocking, “The Turn-of-the-Century Concept of Race.” 1994. | 98
Bill Brown, “The Secret Life of Things (Virginia Woolf and the Matter of Modernism).” 1999. | 83

The five most-cited articles have a number of things in common. Miriam Hansen’s article is interdisciplinary, drawing on both film and modernist studies. Articles that cross disciplines are often the most-cited. (Pierre Nora’s “Between Memory and History” is one of the most-cited articles in any humanities discipline, for example.) And Susan Stanford Friedman’s two articles engage with definition and periodization: two recurring topics in modernist studies.

And these are the most frequently cited texts in articles published in Modernism/modernity:

Source | Citations
Andreas Huyssen, After the Great Divide. 1986. | 19
James Joyce, Ulysses. 1986. | 17
Walter Benjamin, The Arcades Project. 1999. | 14
Walter Benjamin, Illuminations. 1968. | 13
Peter Burger, Theory of the Avant-Garde. 1984. | 13
Samuel Beckett, The Letters of Samuel Beckett. 2009. | 13
Paul Fussell, The Great War and Modern Memory. 1975. | 12
Lawrence Rainey, Institutions of Modernism. 1998. | 11
Siegfried Kracauer, The Mass Ornament. 1995. | 10
Michael North, Dialect of Modernism. 1994. | 10
Shari Benstock, Women of the Left Bank. 1986. | 9
James Clifford, The Predicament of Culture. 1988. | 9
James Knowlson, Damned to Fame. 1996. | 9
T. S. Eliot, The Waste Land. 1922. | 8
Rita Felski, The Gender of Modernity. 1995. | 8

This citation data comes from the Web of Science service. In my experience, the parsers used by Web of Science do not always track information in Chicago-style footnotes accurately, but the overall results seen here would likely not differ much if counted by hand. Huyssen’s book was one of the subjects of a 2015 MSA panel, “The Making of Modernist Studies” (W8), on returning to classics of the discipline.

In addition to counting the cited material within a journal and tracking its external citations, the network of citations can be visualized. The most useful way of doing this is through a co-citation network, in which sources that are cited together are the nodes rather than the citing article. See the following diagram taken from Scott Weingart’s useful explanatory post:

A co-citation network of only Modernism/modernity has not proven very fruitful, due to a combination of small sample size and noise in the Web of Science parser caused by misidentifying Chicago-style repeat citations as “Anonymous.” I could fix this by hand, but I prefer not to. In the meantime, here is a co-citation network of several journals in modernist studies, plus two dynamic network graphs that show changes in the network over time.
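The construction of a co-citation network can be sketched as follows. The bibliographies here are invented for illustration and are not real citation data from Web of Science; the function name and the article identifiers are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_edges(bibliographies):
    """Map each pair of sources to the number of articles citing both.

    `bibliographies` maps a citing article to the list of sources it cites.
    The cited sources become the nodes; the citing articles only supply
    the edge weights.
    """
    edges = Counter()
    for sources in bibliographies.values():
        # De-duplicate and sort so that each pair has one canonical order.
        for a, b in combinations(sorted(set(sources)), 2):
            edges[(a, b)] += 1
    return edges


# Invented bibliographies for three hypothetical articles.
bibliographies = {
    "article_1": ["Huyssen 1986", "Benjamin 1968", "Burger 1984"],
    "article_2": ["Huyssen 1986", "Burger 1984"],
    "article_3": ["Joyce 1986", "Benjamin 1968"],
}

edges = cocitation_edges(bibliographies)
for pair, weight in edges.most_common():
    print(pair, weight)
```

In this representation a parser error that collapses many different sources into a single label such as “Anonymous” would create one spuriously central node, which is the kind of noise described above.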

Conclusion

The inaugural post in this series mentioned the “anxious self-reflection native to the mode” of disciplinary history. At times, working with the tools I describe above can indeed be anxiety-inducing, or even rage-inducing, I’m sad to say. For example, the mechanism of “preferential attachment” or the “Matthew effect” evident in citation analysis can be rather depressing. (Likewise, my initial effort at quantifying Modernism/modernity also led to results that some found depressing.) It does not require much cynicism to believe that scholars read less than they claim and cite more to show allegiance and affiliation than to demonstrate intellectual engagement. Modernism/modernity is perhaps less prone to these issues than many journals I have studied, however. Its interdisciplinary and contextualist emphasis encourages a wide range of sources, which partially explains why no small set of texts dominates its range of reference.

A final thought: another easily countable thing in the journal would be the presses of the books that are reviewed in it. An early version of the topic model that included numbers showed me how standardized academic book prices were. Checking the prices listed in the book reviews of journals that are older than Modernism/modernity might be a relatively easy way to chart the rise in price of scholarly books over time. Even measures as simple as the length of the articles or number of references in them graphed over time may reveal tacit changes in editorial policy or methodology. I may work on these issues in the future, and I would be glad to hear from any readers of this post about matters specific to these models and visualizations or larger questions about quantifying disciplinary history.

Comments

This has me thinking about my intellectual “heritage.” As an early career scholar, am I more likely to cite my advisors and immediate community because that is what I know best? I would love to see this study expanded to include syllabus reading lists--how much diversity exists in our graduate classroom reading lists?