<p class=cs-document-type>Spark</p>

# From Shakespeare to AI<span class=cs-invisible>:</span> <span class=cs-subtitle>How LLMs’ inner workings are rooted in comparative literature</span>

<p class=cs-byline>David Truog</p>

<p class=cs-dateline>4 Mar 2026</p>

<p class=cs-reading-time>6–8 min read</p>

<p class=cs-dek>The concept at the core of large language models (LLMs) wasn’t hatched in an AI lab. It was invented in the 19th century as a way of answering controversial questions of literary authorship.</p>

## Every author has a stylistic “fingerprint”

Debates raged in the mid-19th century about whether Shakespeare really authored the works attributed to him. The controversy then sparked a conjecture: perhaps *language patterns* in a sonnet or play could serve to identify its author. Thomas Corwin Mendenhall, a self-taught physicist and meteorologist, was the first to develop the idea and apply it, in *Science* in 1887.[^1]

> [!research-figure]
> ![[_fig-1-thomas-corwin-mendenhall.jpg|Portrait of Thomas Corwin Mendenhall]]
>
> Thomas Corwin Mendenhall [^2]

The approach Mendenhall proposed and tested was simple:

1. Select short excerpts from various works — he started with Dickens’ *Oliver Twist,* Thackeray’s *Vanity Fair,* and a few others.
2. In each excerpt, count how many words are one letter long, how many are two letters, three, and so on.
3. Plot the findings on graph paper, with word lengths on the x axis and counts on the y axis.
4. Connect the dots so that visually they form what he called a “characteristic curve.”

The result was a set of visual “fingerprints” — charts like this one, which shows that among the first 1,000 words of *Oliver Twist*, the number of five-letter words, for example, is 123:[^3]

![[_fig-2-characteristic-curve.png|Chart from Mendenhall’s paper showing that 123 words in one of the text samples he chose from Dickens’ *Oliver Twist* are five letters long.]]

Mendenhall found that charts for multiple Dickens samples resembled each other and usually differed from those of other authors, whose works exhibited their own distinctive patterns. Each of these fingerprints was essentially a *small* language model, a precursor to today’s *large* language models — small because it was limited to the language used by a single author in a short sample.

## Quantifying writing styles is stylometry

The idea of quantifying the unique stylistic fingerprints of individual authors (and genres, too) caught on, and the field became known as stylometry. Scholars in comparative literature and philology used stylometric techniques to identify the likely authors of various literary, political, and religious texts.

## Stylometry reversed can generate text

The next major milestone on the road to large language models arose from another question: what if stylometry could be turned on its head — generating text from a known fingerprint instead of distilling a fingerprint from known text? That’s the hunch that occurred to Claude Shannon, the mathematician known as the father of information theory, in the mid-20th century. He succeeded and published his findings in 1951.[^4]

> [!research-figure]
> ![[_fig-3-claude-shannon.jpg|Portrait of Claude Shannon]]
>
> Claude Shannon [^5]

The approach Shannon laid out was straightforward:

1. Select a few passages of English text.
2. In each one, count how often each letter, pair of letters, or longer sequence (up to eight letters long) occurs.
3. Collect the tallies in a table. (The table was essentially another *small* language model, like a Mendenhall characteristic curve.)
4. Generate new sequences of letters based on the table: choose the first letter at random, then choose each following letter by consulting the table to find which letter is most likely to follow the preceding one(s). If the table indicates a tie (equal likelihood for more than one possible next letter), break it by selecting one of the tied letters at random.

The resulting sequences turned out to be nearly readable. They were mostly nonsensical in meaning but consisted of a mix of real English words and pseudo-words that, while not in the dictionary, sounded English-like, a bit like Lewis Carroll’s *Jabberwocky*. Their sentence structure and rhythm seemed familiarly English, too. This demonstrated that statistical patterns alone could yield text resembling ordinary human language. Scholars in linguistics, computer science, and adjacent fields (Chomsky among them) then elaborated in the second half of the 20th century on Shannon’s insight, which itself drew on Andrey Markov’s early-20th-century statistical analysis of letter sequences.

## GenAI is reverse stylometry with more text and computing

Two developments enabled researchers in the early 21st century to take the next big step: generative AI (genAI).

First, massive amounts of text had become available digitally, especially on the web. And most of the web was publicly accessible (not password-protected). Several initiatives emerged for automatically archiving copies of web content on a regular basis and making these archived “snapshots” publicly available for research purposes. The archives came to include news articles, blog posts, discussion forums, social media posts, corporate and personal websites, and so on. Combined with other sources such as digitized copies of books in the public domain, the result was immensely more text than Mendenhall or Shannon ever had access to.
Second, computing power and storage had become vastly more plentiful and affordable. That made it possible to calculate, store, and look up probabilities on a much larger scale than Mendenhall or Shannon could.

Using that massive quantity of text, computing capacity, and storage space, researchers at universities and at companies like Google and OpenAI began creating *large* language models (LLMs). Compared with earlier approaches, LLMs capture subtler relationships between sequences of letters, from which higher-level language structures emerge corresponding to adjectives, nouns, and other parts of speech — as in this chart, which shows how strongly the model links different kinds of words:[^6]

![[_fig-4-llm-heatmap.png|A visualization of how a GPT-class language model tracks relationships among words. Each square’s color shows how strongly the model connects one kind of word (like a noun or verb) with others.]]

The sheer number of correlations that LLMs document is enormous compared with Mendenhall’s and Shannon’s techniques: billions — even trillions, in some cases — rather than the few that could fit on graph paper or in a single table.

The process of creating a large language model became known as “training” the model on a “corpus,” where:

- *Corpus* refers to the collection of text that the model is intended to mimic — equivalent to Mendenhall’s excerpts from *Oliver Twist* and Shannon’s selected English passages. (The primary meaning of *corpus* in English arose in the Renaissance, referring to collections of literary, philosophical, theological, or legal texts.)
- *Training* simply means calculating all those probabilities (and storing them) so that an algorithm can later look them up to generate new text similar to what’s in the corpus.
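The counting-and-lookup mechanics described above can be sketched in a few lines of Python. This is a toy illustration of the article's simplified recipe — tally which letter follows each short context, then extend a starting sequence by picking the most likely next letter and breaking ties at random — not how modern LLMs are actually implemented, and the function names are my own, not Shannon's:

```python
import random
from collections import Counter, defaultdict

def train(corpus, n=2):
    """Build Shannon-style frequency tallies of n-letter sequences."""
    table = defaultdict(Counter)
    for i in range(len(corpus) - n + 1):
        context = corpus[i : i + n - 1]          # the preceding n-1 letters
        table[context][corpus[i + n - 1]] += 1   # the letter that followed them
    return table

def complete(table, start, length=20, seed=0):
    """Extend `start` letter by letter, consulting the tallies."""
    rng = random.Random(seed)
    k = len(next(iter(table)))   # context length the table was built with
    text = start
    for _ in range(length):
        followers = table.get(text[-k:])
        if not followers:        # context never seen in the corpus: stop
            break
        top = max(followers.values())
        # Most likely next letter; ties are broken at random.
        text += rng.choice([c for c, cnt in followers.items() if cnt == top])
    return text

# Example: "train" on a tiny corpus, then generate from a starting sequence.
table = train("the cat sat on the mat")
generated = complete(table, "th", length=8)
```

Scaled up from one sentence to a web-sized corpus, and from pairs of letters to much longer contexts, this same tally-then-look-up loop is the conceptual core of what training and generation do.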
After the model has been trained, you can start with any sequence of letters you find or create, and use the model to answer the question: “If this sequence of letters had been present in the corpus, what letters would probably have followed it?” The sequence you provide as the input is known as a “prompt,” and the sequence that the model indicates would likely have followed is often called the model’s response or completion.

## ChatGPT wraps LLMs in a public-facing user interface

The moment that brought LLMs into the mainstream spotlight was OpenAI’s release of ChatGPT, in November 2022, which handed the reins from AI researchers to members of the public. ChatGPT uses whatever question or statement you type into its chat interface as a prompt and responds with whatever the underlying model generates.

ChatGPT also now includes filters. Some prevent it from answering certain questions. Others intercept questions to which LLMs tend to generate poor answers, such as any math question that isn’t trivial. (LLMs are large *language* models, after all, not large *number* models.) Still others route certain requests to specialized tools.

## Everything ChatGPT utters arises from human authorship

The text used for training ChatGPT’s LLM consists of the words of billions of humans writing social media posts, blogs, research papers, Wikipedia entries, books, news articles, and more. That means everything ChatGPT generates is human-authored, indirectly. Or more precisely: all its output is the fruit of collaboration between its users and nearly everyone who has ever written anything online — probably including you.

[^1]: [Thomas Corwin Mendenhall](https://en.wikipedia.org/wiki/Thomas_Corwin_Mendenhall), “[The Characteristic Curves of Composition](https://archive.org/details/jstor-1764604/page/n1/mode/2up)” (*Science,* 1887).
[^2]: Thomas Corwin Mendenhall image source: [*Popular Science Monthly,* Volume 37, via Wikimedia Commons](https://commons.wikimedia.org/wiki/File:PSM_V37_D594_Thomas_Corwin_Mendenhall.jpg).

[^3]: Chart image source: Mendenhall 1887. Annotations added.

[^4]: [Claude Shannon](https://en.wikipedia.org/wiki/Claude_Shannon), “[Prediction and Entropy of Printed English](https://archive.org/details/bstj30-1-50)” (*Bell System Technical Journal,* 1951).

[^5]: Claude Shannon image source: [Tekniska Museet, Sweden, via Wikimedia Commons](https://commons.wikimedia.org/wiki/File:C.E._Shannon._Tekniska_museet_43069.jpg).

[^6]: Chart image source: Jesse Vig and Yonatan Belinkov, “[Analyzing the Structure of Attention in a Transformer Language Model](https://aclanthology.org/W19-4808/)” (*Association for Computational Linguistics,* 2019). Adapted from Figure A.2 (cropped).