Monday, October 6, Keynote: David Krakauer, Univ. Wisconsin:
“The Art of Memory: from chromosome to chronotope.”
A surprising bestseller from the fifteenth century, “The Phoenix, sive artificiosa memoria” by Peter of Ravenna, promised a universal mnemotechnics: methods for the construction of universal memory. Memory systems have always been of central interest to both humanists and scientists. From stone tablets to genomes and books to brains, there has been sustained interest in understanding the principles and the limits of memory. I shall discuss memory systems on which I have worked: the genome, the epigenome, prions, novels, and constitutions. In all of these diverse cases we seek to understand how information is preserved, transmitted, and combined, and what is, ultimately, irretrievably lost.
Tuesday, October 7, Opening Keynote: Julia Flanders, Northeastern:
What do art and data have in common that may help us towards a deeper understanding of the potential of digital tools? What can we learn by attending to the modeling systems that animate and undergird our digital representations? As we embrace large-scale tools and methods, the shape of data may be where beauty still resides.
Tuesday, October 7, Panel 1:
Alex Hanna, Univ. Wisconsin:
“Developing a System for the Automated Coding of Protest Event Data”
Scholars and policy makers recognize the need for better and timelier data about contentious collective action. News media provide the only consistent source of information and are thus the focus of all scholarly efforts to improve collective action data. However, human coding of news sources is time-consuming and thus can never be timely and is necessarily limited to a small number of sources, a small time interval, or a limited set of protest “issues” as captured by particular keywords. The goal of this paper is to outline the steps needed to build, test and validate an open-source system, the Machine-learning Protest Event Data System (or MPEDS) for coding protest events from any electronically available news source using advances from natural language processing and machine learning. Such a system should have the effect of increasing the speed and reducing the labor costs associated with identifying and coding collective actions in news sources, thus increasing the timeliness of protest data and reducing biases due to excessive reliance on too few news sources. The system will also be open, available for replication, and extendable by future social movement researchers, and social and computational scientists.
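The coding pipeline the abstract describes can be pictured, in miniature, as a supervised text classifier that separates protest coverage from other news. The sketch below is an illustration of that general technique under invented toy data, not the MPEDS codebase or its actual features.

```python
# Minimal illustrative sketch (not MPEDS itself): train a supervised
# classifier to flag news snippets as protest-related or not.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy training data standing in for labeled news sentences.
train_texts = [
    "Hundreds marched downtown to protest the new ordinance",
    "Demonstrators rallied outside the capitol demanding reform",
    "Workers staged a strike over wage cuts at the plant",
    "Activists picketed the company headquarters on Monday",
    "The city council approved the annual budget on Tuesday",
    "Local team wins championship after dramatic overtime finish",
    "New restaurant opens in the downtown shopping district",
    "Stock markets closed higher amid strong earnings reports",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = protest event, 0 = other news

# TF-IDF features feeding a logistic-regression classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["Thousands rallied to protest the court ruling"])[0])
```

A production system would of course need far more labeled data, event extraction rather than sentence labeling, and validation against hand-coded gold standards, which is the work the abstract outlines.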
Jonathan Schroeder, Univ. Chicago:
“Database Archaeologies: The Case of ‘Nostalgia’ ”
My talk argues that one consequence of the pressing question of how the humanities can make use of computational methods is to give method, and specifically the formalization of method, new importance. Because formalization is a prerequisite for computer programming, digital humanists occupy the unique position of being able to decide what place traditionally humanistic questions should occupy in the new research environment of the database. I will draw on my collaborative project with a linguist and a sociologist, which attempts to reconstruct the history of “nostalgia” from the HathiTrust and Google Books filesets, in order to ask how we should create historical evidence. I will also present the chief problems we have found in our attempts to produce a digital history: the messiness of large filesets, the uneasiness of humanists with uncertainty, and the arbitrariness involved in automating the collection of evidence.
Michael Evans, Dartmouth:
“Who or What is Antiscience? Distinction and Opposition in the American Public Sphere”
While a vibrant public sphere is good for democracy, a contentious and unproductive public sphere discourages citizens from participating in their own governance. This is an especially important concern for science and technology issues that affect many citizens. To explore why some debates over these issues become contentious while others do not, I use a combination of computational and qualitative methods to examine American newspaper articles published from 1980 to 2012 that contain science demarcation language. I examine the circumstances under which distinction language (e.g. “pseudoscience” or “unscientific”) and oppositional language (e.g. “antiscience”) are deployed in public life. Who or what counts as “antiscience”? And what does the use of contentious oppositional language to demarcate science tell us about conflict in the public sphere more broadly?
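The first computational step such a study implies is tagging articles by which family of demarcation language they contain. A hedged sketch of that kind of keyword tagging is below; the term lists are illustrative stand-ins, not Evans's actual dictionaries.

```python
# Illustrative keyword tagging (term lists invented for this sketch):
# classify a text by which families of science-demarcation language appear.
DISTINCTION = ["pseudoscience", "pseudo-science", "unscientific", "junk science"]
OPPOSITIONAL = ["antiscience", "anti-science"]

def demarcation_tags(text):
    """Return the set of demarcation-language families found in text."""
    lowered = text.lower()
    tags = set()
    if any(term in lowered for term in DISTINCTION):
        tags.add("distinction")
    if any(term in lowered for term in OPPOSITIONAL):
        tags.add("oppositional")
    return tags

article = "Critics called the claims unscientific, while others decried an antiscience mood."
print(sorted(demarcation_tags(article)))  # → ['distinction', 'oppositional']
```

In practice such dictionary matching would only select candidate articles; the qualitative analysis of how the terms are deployed is the substance of the talk.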
Jonathan Armoza, McGill:
“Plotting in Reverse: ‘Plotto: The Master Book of All Plots’ and ‘Bartleby, the Scrivener’”
Scale is one challenge for communication between the humanities and computer science. Where the humanities operate at the level of the individual sentence or text, “big data” promises definitive observations about texts at larger scales. How do those observations relate to that prior order of textual criticism? Last year I worked to identify plot structure via topic modeling using William Wallace Cook's Plotto and MALLET. I used Cook's system to describe Herman Melville’s “Bartleby” and modeled its plot components as one corpus. The output revealed potential relationships between components, but also methodological and conceptual concerns for topic modeling. What do we see when looking at a topic word list? Are the words merely statistically correlated? Do they also connect with linguistic/literary patterns? In this talk, I will discuss my work with Plotto and look to a new visualization that reveals topic words in their individual- and corpus-level contexts. By re-contextualizing topic words we can re-locate prior critical concerns and discover relations between human-derived textual structures and topics. To illustrate, I will use this visualization to look over an even larger collection of human-derived structures – Emily Dickinson’s fascicle booklets – and will show how we may learn new things about her “small” poems from a “big data” vantage.
Kirstyn Leuner, Dartmouth:
“Books, Catalogs, Libraries, Digital Archives: Histories of Nineteenth Century Data Collections”
Engaging with the conference question of how technological manipulation of large data sets reshapes the Humanities, this talk considers one history of this practice in the nineteenth century. Specifically, I show that an important and popular nineteenth-century practice of generating abundant data for numerous purposes originates in the field of bibliography and the fashions for collecting books, cataloging one’s collection, and studying catalogs of others’ collections and library holdings. This talk uses the Stainforth library manuscript catalog (1866) and its auction catalog (1867) as the primary texts under consideration. Catalogs and catalogs of catalogs were used, for example, as finding aids to locate a work in a given library; to document or advertise the richness and extent of a particular collection; to establish a collection as a living family of books usually connected to an estate and a particular author, property owner, or curator; and to create yet another catalog to auction off the collection. Contemporary access to digitized nineteenth-century catalogs and archives highlights the tensions between books’ identities as commodities, family-owned possessions, pieces of artwork for display in a curated library or drawing room with numerous related holdings, and educational resources. Furthermore, a study of Romantic- and Victorian-era compendia and their representations in contemporary digital editions illuminates certain trends in contemporary digital literary archives, such as the emergence of aggregates, or digital archives of digital archives, which have become important resources for Humanist work.
Allen Riddell, Dartmouth:
“Rethinking the Use of Data in the Humanities: The Novels Project”
John Sutherland, a towering figure in the history of the British novel, called nineteenth-century publishing history, or the lack thereof, a "hole" at the center of literary sociology. By this he meant that we lack the most basic information about the book trade. For example: How many writers pursued careers as novelists? How did publishers and sellers of literature operate? How did their ways of business influence the literature that was produced? How much of what was produced found a readership? These missing data motivate an ongoing project which seeks to assemble a random sample of British novels published between 1800 and 1836.
It is in the context of this project that I argue for a selective rapprochement between quantitative methods and humanistic inquiry. Indeed, in the wake of growing, justified suspicion of prevailing methods in the quantitative social sciences in the United States, there are opportunities for scholars in the humanities to adopt the use of statistics and probabilistic modeling on their own terms.
Anupam Basu, Washington Univ.:
“A Computational Approach to Early English Orthographic Variation”
While certain broad patterns of “standardization” in early modern print are evident and have been studied by linguists, the general assumption has been that early English orthography moves from confusion and chaos to the gradual emergence of a standardized English, a process distinctly noticeable, if not quite complete, by the Restoration. Even with the availability of large-scale corpora such as EEBO-TCP, this assumption has attracted little scholarly attention. Using a database of 22 million words extracted from EEBO-TCP, I shall argue that the emergence of standardized orthography in early modern England was the product of a long, deeply contested, and yet fundamentally structured set of transformations. Early modern writers and printers were not oblivious to the advantages of orthographic standardization, and being able to separate the various strands of orthographic regularization is not only of interest to linguists but also has important implications for a variety of fields – from better algorithmic regularization of early texts to the study of archaism in the poetry of the period.
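At its simplest, tracking standardization in a dated corpus means counting competing spellings of the same word per period. The sketch below illustrates that elementary variant counting on an invented four-document corpus and a single invented variant pair; it is a much-simplified illustration of the technique, not Basu's actual EEBO-TCP pipeline.

```python
# Illustrative variant counting over dated texts (toy data, invented pair).
from collections import Counter, defaultdict
import re

# (year, text) pairs standing in for dated EEBO-TCP documents.
corpus = [
    (1580, "the generall opinion of the people"),
    (1590, "a generall complaint and a general answer"),
    (1640, "the general sense of the nation"),
    (1660, "a general history of the kingdom"),
]

variants = {"generall": "general"}  # older spelling -> regularized form

# Count each spelling per decade.
counts = defaultdict(Counter)
for year, text in corpus:
    decade = (year // 10) * 10
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in variants or word in variants.values():
            counts[decade][word] += 1

for decade in sorted(counts):
    print(decade, dict(counts[decade]))
```

Scaled to 22 million words and thousands of variant pairs, curves of this kind are what make the long, contested shift toward a standardized orthography visible.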
Aaron Plasek & Rob Koehler, New York Univ.:
“Mediating Genres of Prestige, Credit, and Authority: The Epigraph and the Citation”
By way of two vignettes we discuss several of the challenges involved in using the techniques of corpus linguistics and network analysis to examine economies of prestige, credit, and authority in curated collections of texts. Our first vignette considers the epigraph as both a unique text and a commentary affixed to another text through a study of eighteenth- and nineteenth-century novels. Do epigraphs serve as a “metadata tag” for certain genres of literature? And what does the appearance of a particular epigraph tell us about a particular text? Our second vignette explores the degree to which particular disciplinary fields can be studied through a comparative analysis of patterns of citation in twentieth-century journal articles. We conclude our critical investigation into how the affordances and limitations of our media ecology yield new insights into historical and contemporary practices of textual interpretation by pointing to specific moments in the above cases in which algorithms inflect and problematize our critical and material understanding of textuality.
Benjamin Schmidt, Northeastern:
“Bookworm: Giving shape to massive libraries through metadata”
Decades of digitization programs have left humanists with dozens of major textual collections ranging in size from thousands to millions of documents. Each one of these represents a substantial archive in and of itself, deserving of extensive analysis. As a rule, most humanists are only able to access these texts through search engines, leaving their broad outlines and their biases relatively intractable to the practices of "distant reading."
This talk will outline a comprehensive strategy for data modelling of large full-text collections through their metadata as instantiated in the Bookworm platform, a project led jointly by the author and Erez Aiden of Rice University and currently deployed by a number of major text repositories, including the Medical Heritage Library, the Yale University libraries, and the HathiTrust.
Bookworm integrates full text with metadata on extremely large collections, exposing data for statistical analysis and quantitative research. Treating words and metadata as equivalent entities allows extremely fast access to descriptive statistics of large collections, and easy integration with a wide variety of outside tools, such as MALLET for incorporating detailed analysis of topic models and the Stanford Natural Language Tool Kit for named entity recognition.
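The idea of treating words and metadata as equivalent entities can be pictured as one long table of token-level rows that a single group-and-count query can slice along any column. This toy sketch (invented data and column names, not the Bookworm API or its schema) shows the shape of that design:

```python
# Toy illustration of the word/metadata-as-one-table idea (not Bookworm's API).
import pandas as pd

# Each row is one token occurrence with its metadata attached.
rows = [
    {"word": "cholera", "year": 1850, "collection": "medical"},
    {"word": "cholera", "year": 1850, "collection": "medical"},
    {"word": "cholera", "year": 1900, "collection": "medical"},
    {"word": "whale",   "year": 1850, "collection": "fiction"},
    {"word": "whale",   "year": 1900, "collection": "fiction"},
]
tokens = pd.DataFrame(rows)

# The same grouped count answers questions along any axis of the table.
by_year = tokens.groupby(["word", "year"]).size()
by_collection = tokens.groupby(["word", "collection"]).size()

print(by_year.loc[("cholera", 1850)])  # → 2
```

Because every descriptive statistic reduces to such a grouped count, the query layer can stay uniform whether the grouping key is a word, a year, or a library, which is what makes the fast, tool-agnostic access the paragraph describes possible.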
The data modelling strategy described by the Bookworm API enables an implementation of a flexible "grammar of graphics" that supports a wide variety of text visualizations, from temporal charts to multivariable models to networks. I will describe the possibilities this opens for researchers and libraries, with particular reference to the benefits to incorporating text mapping into a central system to explore issues such as the strength of regional identities.
Closing Keynote: Daniel Shore, Georgetown Univ.:
“Cyberformalism: Search and the History of Linguistic Forms”
Digital humanists have tended to critique or dismiss search as “ineffectual” (Jockers), unsophisticated (Liu), biased and “projective” (Underwood), or “narrowing” (Gitelman). My talk argues that we should instead see search as a prime locus of digital and humanistic innovation. When coupled with massive digital archives, advanced search tools allow us to enlarge both the empire of the sign and the domain of philological inquiry to include linguistic forms as well as fixed words and phrases. Linguistic forms are abstract, variable, and productive Saussurean signs – conventionally established pairings of signifier and signified, manifestation and meaning. Through a series of brief examples, I will suggest that linguistic forms have long and consequential but largely unexplored social, intellectual, and literary histories. Linguistic forms form us. Our ability to discover their histories depends on our learning to use, and in some cases misuse, the astonishing array of search tools that computer scientists and corpus linguists have already made for us. Yet humanists should not be content with current search capacities. In the interests of an increasingly promiscuous philology, they should take a hand in developing new search capacities. I end by sharing the probabilistic, natural language processing (NLP) search tool that I am building and training in collaboration with a computational linguist.