Big Data and Poetry

Angus MacCaull

The ongoing expansion of the digital humanities in literary criticism can be seen in the recent work of scholars like Franco Moretti, who identifies his methods as akin to data mining.1 An early proponent of this type of work, John Sinclair, says, “The language looks a lot different when you look at a lot of it at once.”2 By now most of us have at least dabbled in corpus stylistics, even if we aren’t aware of it. Its methods are behind those “which writer do I write like” quizzes, the Google Ngram Viewer, autocomplete in search boxes, and autocorrect in text messages.

While some of the techniques are new, it might be argued that all approaches to understanding literary style are data-driven and that in recent decades we’ve simply gotten better (much better, thanks to computers) at collecting and analyzing data. A corpus (“corpora” in the plural) is simply a collection of texts. The concept of a “canon” of literature is itself a kind of corpus. Research using corpora goes back centuries, with scholars manually compiling word counts from sources like the Bible or the works of Shakespeare or Dickinson. Basically, corpus stylistics compares bodies of text to uncover differences in measurable aspects of language.3

An important concept to remember when working with corpora is “reification,” or mistaking an abstraction for the material from which it was derived. This occurs in corpus-based criticism when facts about a text are treated as though they were the text itself. It’s akin to trying to eat a recipe, or to mistaking the map for the territory. This problem is particularly tricky when analyzing literary works, as opposed to other forms of writing, due to the nature of metaphors, puns, and symbols. Fitzgerald’s “great” in The Great Gatsby is different from Fielding’s “great” in Jonathan Wild, the Great.4 If we were to count the occurrences of this word in each novel and then draw conclusions based on this comparison alone, we would be guilty of reification. As Katie Wales puts it, “quantitative results themselves do not constitute an interpretation.”5 It’s important to take words and phrases in context and to study texts in their socio-cultural frameworks. So far corpus-based analysis has been more widely used in relation to the novel, but scholars are applying these methods to poetry as well, with results that indicate further possibilities in the field.

The first step for any serious corpus stylistics project is to do a needs assessment. Thinking about what a corpus will be used for will determine what kind of material, and how much of it, should be included. A corpus of poetry covering the 18th through 20th centuries will provide different statistical usage patterns than a corpus composed solely of 20th-century poetry. Some scholars argue that a corpus is not complete until it’s been “tagged,” generally with parts of speech, while others feel that an “untagged” corpus is preferable, since even tagging imposes a theoretical framework on the raw data.6 The basic process is to compile three things: a corpus of the poems to be researched, a monitor corpus of the language to establish working norms, and ideally a reference corpus of the same poet’s other writings for more direct comparisons.7 Concordance software then compares these to uncover rare word usages and syntactical constructions. The poet’s choices can then be interpreted.
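To make the comparison step concrete, here is a minimal sketch in Python of how a study corpus might be scored against a monitor corpus. It uses the log-likelihood statistic, which is one common choice for this kind of keyword comparison; the sources cited here do not say which statistic any particular program applies, so treat that choice as an assumption. The function names tokenize and keyness are hypothetical, and the two tiny "corpora" are invented purely for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())

def keyness(study_tokens, norm_tokens):
    """Rank words over-represented in the study corpus by log-likelihood."""
    study, norm = Counter(study_tokens), Counter(norm_tokens)
    n_study, n_norm = len(study_tokens), len(norm_tokens)
    scores = {}
    for word, a in study.items():
        b = norm[word]  # Counter returns 0 for words absent from the norm corpus
        # Expected counts if the word were equally frequent in both corpora.
        e1 = n_study * (a + b) / (n_study + n_norm)
        e2 = n_norm * (a + b) / (n_study + n_norm)
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        # Keep only words that are relatively more frequent in the study corpus.
        if a / n_study > b / n_norm:
            scores[word] = ll
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Invented toy data: a "poem" and a stand-in for a monitor corpus.
poem = tokenize("A grief ago, a grief ago, the sea")
norm = tokenize("A year ago the sea. A week ago the day, the sea")

ranked = keyness(poem, norm)
print(ranked[0][0])  # the most distinctive word in the toy poem: grief
```

With real corpora the ranked list is only a starting point. Each high-scoring word still has to be read in context before any claim is made about the poet's choices.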

The linguist Michael Hoey analyzes Dylan Thomas’s phrase “a grief ago” from the poem of that title. Hoey finds that the word “ago” is normally used in at least nine ways; in his terminology, the word is “primed” for certain typical usages. While Thomas’s phrase is immediately recognizable as “literary” language, Hoey’s analysis shows that it conforms to most of the standard usages. The poet has overridden only three primings, or strong probabilities, of the word. Thomas has not used “ago” with its collocates “years,” “weeks,” or “days” (1), nor used it in connection with a unit of time at all (2), nor contrasted the period of time in question with another period (3). However, Thomas has used the word as a measurement (4), in a statement (5), and as an adjunct (6) without cohesion (7) in paragraph-initial (8) and text-initial position (9), all of which are strong probabilities. From this Hoey suggests that “even when writers are straining at the limits of what a language is capable of expressing, they make use of more of their primings than they reject.”8

Kieran O’Halloran applies corpus stylistics techniques to support an intriguing interpretation of Frost’s poem “Putting in the Seed.” O’Halloran interprets the persona of the poem, who describes sowing seeds in a field, as having Obsessive-Compulsive Personality Disorder (OCPD). This view draws partly on the poem’s eleventh line, “On through the watching for that early birth.” The unusual phrase “the watching for” does not occur at all in O’Halloran’s monitor corpus. The most common collocates of “the watching” as a phrase refer to large numbers of people, including “audience,” “spectators,” “millions,” and “crowd,” as in “the watching crowd.” Frost, however, has used the phrase to refer to a single person in relation to his own actions. For O’Halloran, this may suggest OCPD because it shows “a sense of detachment on the part of the persona and thus a marked concentrative focus on the task.”9

In another study, O’Halloran shows how big data can enhance previous interpretations. He begins with Roger Fowler’s commentary on “Street Song” by the contemporary New Zealand poet Fleur Adcock. Both Fowler and O’Halloran comment on the series of “ing” verbs throughout the poem, as, for example, in the first two stanzas:

Pink Lane, Strawberry Lane, Pudding Chare:
someone is waiting, I don’t know where;
hiding among the nursery names,
he wants to play peculiar games.

In Leazes Terrace or Leazes Park
someone is loitering in the dark,
feeling the giggles rise in his throat
and fingering something under his coat.

O’Halloran uses corpus-based evidence to support Fowler’s claim that the “ing” verbs inform the poem’s overall disturbing effect. He does this by showing that “HUMAN SUBJECT+(is)+waiting” has an extremely strong association with intention, while “HUMAN SUBJECT+(is)+loitering” may or may not be associated with intention. Loitering is often just loitering. However, when loitering is statistically associated with intention, it is often criminal. In this way O’Halloran confirms and enhances Fowler’s reading by linking the poem’s “ing” structure with deeper ambiguities and tensions in relation to its theme of perversion.10
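Judging whether a verb like “loitering” carries intention means reading its occurrences in context, which is what a concordance display is for. The following Python sketch shows the general idea of a keyword-in-context (KWIC) listing. It is not O’Halloran’s procedure or the output of any particular program; the function name kwic and the sample sentence are invented for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Return keyword-in-context lines: up to `width` tokens on each side of a hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Bracket the keyword so that it stands out when hits are listed together.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Invented sample text standing in for a large monitor corpus.
sample = "someone is loitering in the dark while a man was loitering near the shop"
for line in kwic(sample.split(), "loitering", width=2):
    print(line)
# someone is [loitering] in the
# man was [loitering] near the
```

A critic would then sort and read through hundreds of such lines, noting how often the surrounding words point to a purpose and what kind of purpose it is.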

Bill Louw presents a corpus-based interpretation of the poem “Days” by Philip Larkin, which reads:

What are days for?
Days are where we live.
They come, they wake us
Time and time over.
They are to be happy in:
Where can we live but days?

Ah, solving that question
Brings the priest and the doctor
In their long coats
Running over the fields.

Louw’s analysis begins with the phrase “days are,” from the poem’s second line. In Louw’s monitor corpus, this phrase collocates with words like “past,” “over,” and “gone.” These same associations do not apply for phrases like “day is” or “weeks are.” This suggests to Louw that there is something particular about the word “days” that is nostalgic and perhaps even regretful. It possesses what Louw calls a “semantic prosody,” in this case a negative one. These semantic associations colour the irony of the poem, in which our apparently “happy” days ultimately lead to a priest or a doctor.11
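The collocation step behind a reading like this can be sketched in a few lines of Python. This version counts the words that fall within a fixed window of a node phrase. The raw counts are a simplification, since published studies usually rely on association measures computed over very large corpora, and the function name collocates and the sample sentence are invented for illustration.

```python
from collections import Counter

def collocates(tokens, node, window=3):
    """Count words within `window` tokens on either side of each occurrence of `node`."""
    node = list(node)
    size = len(node)
    counts = Counter()
    for i in range(len(tokens) - size + 1):
        if tokens[i:i + size] == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + size:i + size + window]
            counts.update(left + right)
    return counts

# Invented sample text standing in for a large monitor corpus.
tokens = "those days are gone and the days are over but the day is young".split()

plural = collocates(tokens, ["days", "are"], window=2)
singular = collocates(tokens, ["day", "is"], window=2)
print(plural["gone"], plural["over"])      # 1 1
print(singular["gone"], singular["over"])  # 0 0
```

Comparing the two counters mirrors the contrast Louw draws between “days are” and “day is,” although a real claim about semantic prosody would need far more data than this.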

As a final example, Louw interprets W.B. Yeats’s state of mind while writing his late poem “The Wild Swans at Coole.” Louw takes a single word as a clue to suggest that the poet was tearful. The first stanza reads:

The trees are in their autumn beauty,
The woodland paths are dry,
Under the October twilight the water
Mirrors a still sky;
Upon the brimming water among the stones
Are nine-and-fifty swans.

It is reasonable to sense poignant emotion as the poem proceeds to contrast the speaker’s age with the vitality he sees in the swans. But what exactly is the quality of the poet’s emotion—is there any way to tell? The clue is the word “brimming.” Looking at a larger corpus, one of the strong collocates of “brimming” is the word “tears.” For Louw, “a data-assisted reading would say that the collocates of brimming are an inextricable part of the poem’s meaning, even if, the readers’ intuition is not capable of recognizing them.”12

The value of a big data perspective for literary criticism is that it can provide previously unavailable starting points for exploration, as well as further evidence for existing claims. It’s also possible to do similar research with your own writing. Corpus stylistics techniques can provide any writer with insights into their own habitual language use, and while it takes some time to familiarize yourself with the more technical aspects, I’ve found that the results can be rewarding. There are concordance programs freely available, such as AntConc, and large monitor corpora like the American National Corpus are available too. A good primer on using corpora, edited by Martin Wynne, can also be found online. In applying corpus analysis to my own poetry I’ve discovered that I have an above-average tendency to use the preposition “through” with the verb “walk,” and to place qualifiers after the noun “water” (“water hot” rather than “hot water”). I’ve also noticed that I tend to stretch the semantic field of the words “tender” and “tenderly” to include an element of decay akin to a warm compost pile. I was unaware of these habits prior to doing this kind of analysis, and the insights have had a direct effect on my writing choices, much like getting feedback from a workshop. As these techniques become more widely known, we may begin to see sections in poetry handbooks along the lines of “Know your key words and constructions” or “Collocation and Colligation.” That may be a ways off yet, but these methods are becoming increasingly accessible for those poets and critics who wish to employ them.


1 Moretti, F. (2013) Distant Reading. London/New York: Verso.

2 Sinclair, J.M. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press. p. 100.

3 Mahlberg, M. (2007) “Corpus stylistics: bridging the gap between linguistic and literary studies” in Hoey, M., Mahlberg, M., Stubbs, M. and Teubert, W. (eds.) Text, Discourse and Corpora. London: Continuum. pp. 219–246.

4 van Peer, W. (1989) “Quantitative studies of style: a critique and an outlook.” Computers and the Humanities. 23, p. 308.

5 Wales, K. (2011, 3rd ed.) A Dictionary of Stylistics. London: Longman. p. 92.

6 Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press. pp. 79, 94.

7 Leech, G. (2008) Language in Literature: Style and Foregrounding. Harlow, England: Pearson Longman.

8 Hoey, M. (2005) Lexical Priming. London: Routledge. p. 177.

9 O’Halloran, K. (2012) “Performance stylistics: Deleuze and Guattari, poetry and (corpus) linguistics.” International Journal of English Studies, 12(2), pp. 171–199.

10 O’Halloran, K. (2007) “Corpus-assisted literary evaluation.” Corpora, 2(1), pp. 33–63.

11 Louw, W. (1993) “Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies” in Baker, M., Francis, G. and Tognini-Bonelli, E. (eds.) Text and Technology: In Honour of John Sinclair. Philadelphia/Amsterdam: John Benjamins. pp. 157–176.

12 Louw, W. (2010) “Automating the extraction of literary worlds and their subtexts from the poetry of William Butler Yeats” in Falces Sierra, M., Hidalgo Tenorio, E., Santana Lario, J. and Valera Hernandez, S. (eds.) Para, por y sobre Luis Quereda. Granada: Granada University Press. pp. 635–657.