Invited talk in Workshop: Information-Theoretic Principles in Cognitive Systems
Information-Theoretic Methods in the Study of the Lexicon
Ryan Cotterell
Since Shannon proposed his mathematical theory of communication in the middle of the 20th century, information theory has been an important way of viewing and investigating problems at the interfaces between linguistics, cognitive science, and computation. With the upsurge in machine learning approaches to linguistic questions, information-theoretic methods are becoming an ever more important tool in the linguist's toolbox. This talk focuses on three concrete applications of information-theoretic techniques to the study of the lexicon. In the first part of the talk, I take a coding-theoretic view of the lexicon. Using a novel generative statistical model, I discuss how to estimate the compressibility of the lexicon under various linguistic constraints. In the second part of the talk, I discuss a longstanding debate in semiotics: how arbitrary is the relationship between a word's form and its meaning? Using mutual information, I give the first holistic quantification of form-meaning arbitrariness, and, in a 106-language study, we do indeed find a statistically significant relationship between a word's form and its meaning in many languages. Finally, in the third part of the talk, I focus on whether there exists a pressure for or against homophony in the lexicons of the world. On the one hand, Piantadosi et al. (2012) argue that homophony enables the reuse of efficient word forms and is thus beneficial for languages. On the other hand, Trott and Bergen (2020) posit that good word forms are more often homophonous simply because they are more phonotactically probable. I introduce a new information-theoretic quantification of a language's homophony: the sample Rényi entropy. I then discuss how to use this quantification to study homophony and argue that there is no evidence for a pressure either towards or against homophony, a much more nuanced result than either Piantadosi et al.'s or Trott and Bergen's findings.