This chapter explores how historians distinguish the characteristics distinctive to a given historical era and how quantitative historians investigate the uniqueness of particular periods. Traditionally, historians have reckoned these thematic changes through comprehensive reading about and around adjacent periods. This chapter introduces tf-idf, an algorithm familiar from library science, and shows how tf-idf can be used to index the most distinctive qualities of temporal periods at different scales. In a case study on Hansard’s parliamentary debates, an algorithm for highlighting the distinctive qualities of each era was applied to timescales ranging from the single day to the two-decade period. The results of this algorithmic process show how text mining can reveal how class differentials of access to power played out in terms of parliamentary attention, with the concerns of working-class people and colonized subjects receiving only a fraction of the time allotted to elite concerns over the entire century. The chapter performs this analysis on the names of the geographies and ethnicities that formed part of the British Empire, demonstrating the greater attention given to white subjects than to colonized people of color.
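To make the weighting concrete, the sketch below computes tf-idf in plain Python over period-level “documents.” The toy corpus and the periodization are invented stand-ins for illustration only, not the book’s actual Hansard data or pipeline.

```python
import math
from collections import Counter

# Toy stand-ins for period-level "documents": in the chapter's case study,
# each document would be all debate text from one era (a day, a year, a
# two-decade span, etc.).
periods = {
    "1800s-1810s": "corn corn enclosure poor rates war france".split(),
    "1830s-1840s": "reform reform corn chartist poor railway".split(),
    "1880s-1890s": "ireland home rule empire franchise labour".split(),
}

n_docs = len(periods)
doc_freq = Counter()                      # how many periods contain each word
for tokens in periods.values():
    doc_freq.update(set(tokens))

def tfidf(period):
    """Score each word in one period: frequent here, rare elsewhere."""
    counts = Counter(periods[period])
    total = sum(counts.values())
    return {
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

for name in periods:
    top = sorted(tfidf(name).items(), key=lambda kv: -kv[1])[:3]
    print(name, top)
```

The idf term drives the result: a word used in every period scores zero, while a word concentrated in one period rises to the top of that period’s list, which is the sense in which tf-idf indexes what is distinctive about an era.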
This chapter asks what it looks like when a discipline uses the power of big data to address its own biases. Drawing on the example of historical linguistics, it tells the story of a discipline that inherited Victorian metanarratives about the evolution of race, language, and history. Historical linguistics is also, as a modern discipline, a place where applied quantitative measures have aided the process of recognizing and addressing inherited biases about empire and race. The chapter offers the experience of historical linguistics as a pattern for other disciplines, showing how text mining can become a tool capable of nuanced work when used together with traditional forms of cultural analysis.
The introduction opens with the work of researchers such as the biologist Peter Turchin, who argues that he can distill the massive volume of textual data on human society, reduce human culture to variables, and create not only descriptive models of the human past but also a predictive machine. It contrasts such predictive fantasies with “hybrid” studies in which historians and mathematicians have joined forces to apply text mining – the literal counting of words from the past – as a tool for modeling how cultures changed. Reviewing recent work from historians and other specialists in text mining, it sets forth the broad aims of this book: to map out how researchers can take a digital, quantitative approach to illuminating history that eschews naïve interpretation and instead adds a robustly accurate, original, and profound dimension to this complex discipline.
But parliamentary attention and the use of fashionable language aren’t the same as introducing words with the power to endure. This chapter explores tools that might determine which ideas last longer than others and that can sort words signifying lasting shifts from words that flicker out after a moment. It investigates hidden dimensions of temporal experience and appeals to our critical thinking, asking how we might compose questions that lend themselves to algorithmic analysis and that expand our definition of a distinctive event to include towering figures who broadly alter conceptual thinking for generations, such as a Steinem or a Linnaeus. That investigation dives deeper into the nuts and bolts of algorithms, tweaking how we assess their parameters. This methodological chapter aims to deepen the reader’s understanding of “divergence measures,” which offer a robust toolkit for detecting the difference between any two sets of documents.
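As one concrete instance of the family, the sketch below implements Jensen-Shannon divergence, a common symmetric and bounded divergence measure; the two document sets here are invented toy examples, and the chapter’s own toolkit may weigh other measures as well.

```python
import math
from collections import Counter

def distribution(tokens, vocab):
    """Turn a token list into a probability distribution over vocab."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; skips zero entries of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric, bounded divergence between two distributions (0 = identical)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two invented sets of documents, flattened to token lists.
docs_a = "reform franchise reform labour".split()
docs_b = "empire india empire franchise".split()
vocab = sorted(set(docs_a) | set(docs_b))
print(jensen_shannon(distribution(docs_a, vocab), distribution(docs_b, vocab)))
```

Averaging each distribution with the midpoint m keeps the measure finite even when a word appears in only one of the two sets, which is one reason Jensen-Shannon is often preferred over raw Kullback-Leibler divergence for comparing corpora.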
This chapter explores three dangers a researcher commonly encounters in the digital humanities, each posed by imperfect and incomplete data. First, the data of the past may be occluded – historians are trained to look for silences and gaps in historical accounts, and a digital scholar must develop literacy in these matters lest their analysis be riddled with inaccuracies and distortions. A second danger is dirty data: not transmission errors, but cultural biases and conceptual distortions in the source material itself that, when left unrecognized, can result in distorted narratives. A third danger in text mining is fantasy – the gross misapprehension that models of the past can function as a prediction machine. Critically reviewing data-driven projects framed by misunderstandings of history, this chapter explores a crucial aspect of research in the humanities – the imperfection of archives, the skew of libraries, and the inherent bias of language – and shows how literacy in these matters means approaching even the best archives with caution.
This chapter explores critical thinking about data and algorithms and offers a formula called “critical search.” To develop a critical perspective on the past, the researcher must investigate the “fit” between data, algorithm, secondary sources, and analysis. Recursively iterating through each part of the research process, one develops not so much the one true portrait of the past as a portrait of the past in its multiple dimensions. We see that rigorous methodology produces not certainty but, on the contrary, layers of contingency. This chapter discusses exemplarity, cherry-picking, and common issues of sloppy analysis, as well as the imperfect grounds upon which historical narratives are built, coded as they are with ethnocentrism and other biases. The bulk of the chapter lays out a threefold method for addressing these issues with critical thinking and an energetically iterative approach. First, seeding: asking a wide range of essential questions about the data and about the methodological approach to the data, then applying these parameters, as a way to set up the most robust experiment possible. Second, broad winnowing: the next stage of the experiment, in which the scholar pores over the returns of the query to sort signal from noise and sturdy from flimsy, gathering up the promising results and discarding the less clear or less relevant information. Third, guided reading: the researcher turns to textual sources that can bring new knowledge to the known archives.
This chapter discusses the long history of lexical analysis, going back to the biblical concordances composed by medieval monks; it highlights the need for sensitivity to semantic shifts over time if one hopes to produce meaningful research using archival text; and it explores how we can apply algorithms to find lexical trends that precisely identify the books and years in which new attitudes appeared, and thus map change over time. Reviewing the errors commonly found in student work, this chapter discusses the dangers of automated approaches to textual analysis and identifies a series of typical errors they can engender. It then explores how we can rectify these issues with critical thinking and create the best models for insight. It discusses the use of controlled vocabularies, problematic aspects of keyword searches, and several other nuts-and-bolts aspects of this research. It also explores the theoretical side, showing how excellent research is possible if we develop a sensitivity to how language changes, to the multiple layers of meaning beneath words, and to the traditional interpretative questions associated with language.
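By way of illustration, the sketch below tracks a lexical trend as a relative frequency per year rather than a raw count, and groups variant forms under one concept as a crude stand-in for a controlled vocabulary; the corpus and the word list are invented for the example.

```python
from collections import Counter

# Toy year-by-year token lists; real work would draw on a dated corpus
# and, as the chapter cautions, attend to spelling and usage that shift
# over time.
corpus = {
    1840: "the engine the rail way opened".split(),
    1860: "the railway boom railway shares".split(),
    1880: "railway strikes and railway rates".split(),
}

# Grouping variant forms under one concept approximates a controlled
# vocabulary, avoiding the trap of searching one bare keyword.
concept = {"railway", "rail"}

for year, tokens in sorted(corpus.items()):
    counts = Counter(tokens)
    hits = sum(counts[w] for w in concept)
    print(year, hits / len(tokens))   # relative frequency, not raw count
```

Normalizing by the size of each year’s corpus matters because archives grow unevenly: a raw count that rises may reflect nothing more than a thicker volume of surviving text.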
This chapter shows how text mining can enable us to peer into the relationships between individuals, as opposed to studying moments in time. Revisiting a famous case study of the French debates in the era of the French Revolution, this chapter examines the proposal that text mining can discern the individuals whose speech was most influential on their generation. The chapter discusses the information theory that lies at the base of the French Revolution study and applies it to a case study on Britain’s parliament. A critical review of the results demonstrates how the careers of Isaac Butt, William Gladstone, and Arthur Balfour took shape against different relationships to past and future. The chapter reviews the algorithm’s abstractions carefully and critically, raising important questions about what each metric shows.
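The information-theoretic quantities at the base of the French Revolution study are usually described as novelty (divergence from preceding speeches) and transience (divergence from following ones), with their difference called resonance. The sketch below is a hedged reconstruction of that idea on invented, pre-smoothed topic distributions, not the study’s published code.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; assumes no zero entries."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

def novelty_transience(speeches, i, window):
    """Compare speech i with its neighbours: novelty is the average
    divergence from preceding speeches, transience the average divergence
    from following ones; resonance is their difference."""
    past = speeches[max(0, i - window):i]
    future = speeches[i + 1:i + 1 + window]
    novelty = sum(kl(speeches[i], p) for p in past) / len(past)
    transience = sum(kl(speeches[i], f) for f in future) / len(future)
    return novelty, transience, novelty - transience

# Toy topic distributions for five speeches (already smoothed: no zeros).
speeches = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],   # an abrupt shift in topical content
    [0.2, 0.2, 0.6],
    [0.2, 0.3, 0.5],
]
print(novelty_transience(speeches, i=2, window=2))
```

On this toy sequence the third speech is highly novel (unlike what preceded it) but not transient (what follows resembles it), so its resonance is positive: the profile of a speaker whose language the future took up.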
This chapter examines whether and how it is possible to reckon where modernity is going. While some researchers are tempted to think about the deep past in light of a shared and aspirational future, there are many reasons to be skeptical about doctrines of progress. Even to say with certainty what trends and dynamics animate a given period is never easy – to say when modernity began and what distinguishes it from the past requires a profound grasp of the multiple dimensions of time. Case studies in this chapter test common conceptions of modernity, such as “urbanization” and “empire,” for their rise and fall over a century, demonstrating how different ways of reckoning with the count of words are crucial to interpretation. It asks whether and how algorithms can constitute a tool for determining which forces are “trending” in a culture, and in a given period. Reviewing a case study on Hansard’s parliamentary debates, it explains how different measures of trending can be used to create new insight into how Britain changed over the course of the nineteenth century, highlighting references to a shared future where political rights dominated as one major finding.
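To see why the manner of reckoning matters, the sketch below contrasts two naive measures of “trending” on invented per-year frequencies: absolute growth, which favours words that are already common, and relative growth, which favours words rising from a low base.

```python
# Two deliberately contrasting measures of whether a word is "trending."
# Which measure you pick changes which forces look dominant in a period.
def trend_measures(freq_by_year):
    years = sorted(freq_by_year)
    half = len(years) // 2
    early = sum(freq_by_year[y] for y in years[:half]) / half
    late = sum(freq_by_year[y] for y in years[half:]) / (len(years) - half)
    absolute = late - early                          # growth in share of all words
    relative = late / early if early else float("inf")  # fold increase from the base
    return absolute, relative

# Invented per-year relative frequencies for two words.
urbanization = {1840: 0.001, 1860: 0.002, 1880: 0.006, 1900: 0.009}
empire       = {1840: 0.020, 1860: 0.022, 1880: 0.028, 1900: 0.030}
print("urbanization:", trend_measures(urbanization))
print("empire:", trend_measures(empire))
```

On these toy numbers “empire” leads by absolute growth while “urbanization” leads by relative growth, which is the sense in which the choice of measure shapes the interpretation of what was rising in a culture.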
This chapter revisits the fantastical conceit that we can make significant predictions about a society. Engaging recent critiques of predictive logic by Jill Lepore, it touches upon the misuses of partially accurate predictive machines – the marketing-research technology that has become part of the fabric of political and social life in our time – often with disastrous consequences. The chapter revisits Peter Turchin’s fantastic project to map out predictive laws of human society, weighing it against objections, from Karl Popper to William Sewell, to modeling the evolution of human interactions as law-like. It engages the theoretical work of Reinhart Koselleck, who argues that prediction or prophecy is untenable and that we have only prognosis. It looks at the conditions under which Koselleck imagines “prognosis” to be tenable, including the example of grievance theory, which refrains from predicting how or when a group might respond to historical oppression, describing only the dynamics and acknowledging when there are “grounds” for revolt. The chapter makes room for readers who have turned to this book out of an interest in modeling or prediction to enter a more robust engagement with history, presented here as the source of a rigorous engagement with problems of change over time.