What's in a number? Understanding Measures of Distinctiveness
Comparing the prevalence of certain features in two groups of texts is one of the most fundamental sense-making strategies of established and digital literary studies alike. In the quantitative paradigm, this usually means comparing the frequencies of features in a target group of texts to their frequencies in a comparison group or reference corpus. Over the decades, numerous measures of distinctiveness or keyness have been proposed for this task, each with its own strengths and shortcomings.
This talk starts with an analysis of the desirable features of such a measure of distinctiveness in the context of literary studies. It then focuses on one measure that, rather than having been adapted from statistics or computer science, was developed by digital literary scholars: Zeta, first proposed by John Burrows and then adapted by Hugh Craig. Based on a new implementation of Zeta in Python, this talk explains the statistical properties of the measure, illustrates some possible applications, and discusses possible enhancements that could make Zeta even more useful.
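To make the idea behind Zeta concrete, here is a minimal sketch; it is not the implementation presented in the talk. It assumes the formulation commonly associated with Craig's adaptation: each text is split into segments, and a feature's score is the share of target segments containing it minus the share of comparison segments containing it. (Craig's version is often stated as the target share plus the share of comparison segments lacking the feature, which is the same quantity shifted by 1.) The function names and the toy segments are hypothetical and chosen for illustration only.

```python
def document_proportion(segments, feature):
    """Share of segments in which the feature occurs at least once."""
    if not segments:
        raise ValueError("need at least one segment")
    return sum(1 for seg in segments if feature in seg) / len(segments)


def zeta(target_segments, comparison_segments, feature):
    """Difference of document proportions; ranges from -1 to 1.

    Positive scores mark features preferred by the target group,
    negative scores mark features preferred by the comparison group.
    """
    return (document_proportion(target_segments, feature)
            - document_proportion(comparison_segments, feature))


# Toy data (invented): each segment is reduced to its set of word types,
# because only presence or absence in a segment matters here.
target = [
    {"love", "heart", "night"},
    {"love", "sea"},
    {"heart", "night"},
    {"love", "night"},
]
comparison = [
    {"law", "night"},
    {"court", "law"},
    {"love", "law"},
    {"night", "court"},
]

vocabulary = sorted(set().union(*target, *comparison))
scores = {word: zeta(target, comparison, word) for word in vocabulary}

# Print features from most target-distinctive to most comparison-distinctive.
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:6s} {score:+.2f}")
```

Because the score depends on how consistently a feature appears across segments rather than on its raw frequency, a word used heavily in only one segment scores lower than a word used moderately in most segments. Choices such as segment length are left out of this sketch.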