Vignette: Corpus Linguistics, 1999
in memory of István Kecskés
It’s been two years since Deep Blue beat Garry Kasparov; Ray Kurzweil’s book The Age of Intelligent Machines has been out for nine years. Machines with superhuman intelligence, it seems, are five, maybe ten, or at the outside twenty years away. The hype machine is on.
I’m in a linguistics class, probably Pragmatics, and the professor is very excited about something called “corpus linguistics.” The idea is that we can amass vast bodies (“corpora”) of text and parse them and annotate them and make julienne fries with them and notice important things about language and usage. Corpus linguistics is not a very new idea. Using corpora as starting points for studies of language use, for compiling dictionaries, and the like goes back a long time, and people have been doing it with computers since the 1960s. But as computers grow more powerful and the collections of text grow ever larger, we can find out more things.
For the computers, we need text, not, say, recordings of spoken language; recordings can be transcribed, of course, but the computers are not great at doing that themselves. Human intelligence and language start from speech, not writing, but it will not be so for the machines. Fortunately there are armies of graduate students to transcribe and annotate texts. Indeed, I will be one of those grad students, grinding away at tagging a corpus for parts of speech, although since the language I’m working on is an endangered American Indian language and the corpus is a body of stories collected by Gladys Reichard in the late 1920s, I won’t see myself as contributing to “corpus linguistics” but rather to a language preservation effort.
When we apply computational methods to large bodies of text, we can see that language speakers use a lot of recognizable patterns: set phrases, routines, clichés, words that cluster together in predictable ways. After some work on the corpus we can begin to estimate the likelihood that, for example, if I have typed “a lot,” the next word will be “of.” We use “a lot of” a lot, but not always. There’s a probability; the probability depends on context. Like overeducated bookies, we can compute the odds. We are often, after all, speaking without thinking about what we’re saying. No one who says “close proximity” is thinking about what those words mean individually; it’s just a phrase we say.
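Here is a minimal sketch of the kind of estimate I mean, with an invented four-sentence corpus standing in for the millions of words a real study would use:

```python
from collections import Counter

# A toy corpus; a real study would use millions of words.
corpus = (
    "we use a lot of set phrases . "
    "people say a lot of things without thinking . "
    "a lot depends on context . "
    "thanks a lot for the example ."
).split()

# Count how often each word follows the two-word context "a lot".
following = Counter()
total = 0
for i in range(len(corpus) - 2):
    if corpus[i] == "a" and corpus[i + 1] == "lot":
        following[corpus[i + 2]] += 1
        total += 1

# Estimate P(next word | "a lot") by relative frequency.
for word, count in following.most_common():
    print(f"P({word!r} | 'a lot') = {count}/{total} = {count / total:.2f}")
```

The estimate is just relative frequency: count what follows the context and divide. Real systems condition on longer contexts and smooth the counts, but the principle is the same.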
Anyway, the rise of corpus linguistics will be vital to early natural-language processing.¹ The development and analysis of large bodies of text in different languages will allow for systematic comparisons between languages, which will help bring about machine translation, a very cool and useful thing to do with a computer.
In 1999, though, I am thinking that this should humble us. We feel like gods, sometimes, with our highly complex linguistic abilities and the vastness of cultural memory enabled by literacy. We can use language in spontaneous and creative ways, to express anything we have in our minds, a kind of infinity we would once have associated only with the divine. And yet for all that, and for all our cleverness and intelligence, we often speak and write like automata. This is interesting to me, as a sign of our limits. It has me thinking differently about the relationship between language and intelligence. There must be more to “intelligence” than we can find in these patterns of rote phrases, I conclude, and I go looking for a broader perspective on human capacities.
But for many people these discoveries are exciting because creating computer programs as intelligent as we are has suddenly begun to seem possible. To these people, the fact that our language use is so highly statistically model-able does not imply that maybe we understand intelligence incorrectly. Instead, they come to believe that if we can make computers sound like us, with programs and data sets that are complex enough, then perhaps these programs will be like us, through something like the magic of evolution that led our brains to produce the things we call language, consciousness, and intelligence.
The fields of machine learning, natural language processing, and artificial intelligence spent much of the 1970s and 1980s in a kind of holding pattern. To be sure, programmers were working on some of the hard technical problems, but the massive bodies of text that the internet will eventually carry simply don’t exist yet. In the late 1990s, two things will begin to happen: some serious difficulties with making computers “remember context” as they process sequential data, such as text, will be overcome by new algorithms, and the internet will begin to produce quantities of textual data such as the world has never seen before. Finally, there will be enough text for the machines to work with.
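To make “remember context” concrete, here is a toy recurrent step in Python; the sizes and weights are invented for illustration, and this is a sketch of the general idea rather than any particular 1990s system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy recurrent step: the hidden state h is the model's running "memory"
# of everything it has read so far. The weights here are random for
# illustration; a real system would learn them, and the historical
# difficulty was getting h to retain information over long sequences.
W_h = rng.normal(size=(8, 8)) * 0.1  # how the old state feeds the new one
W_x = rng.normal(size=(8, 4)) * 0.1  # how the current word feeds in

def step(h, x):
    return np.tanh(W_h @ h + W_x @ x)

h = np.zeros(8)
for word_vector in rng.normal(size=(5, 4)):  # five "words", as toy vectors
    h = step(h, word_vector)  # context accumulates in h, word by word
print(h.round(3))
```

The point is that the hidden state h carries information from every earlier word into the processing of the current one; the hard part, historically, was training such models so that h actually retained information over long stretches of text.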
“Artificial intelligence” will thus ultimately be built, after many technological refinements (I don’t want to downplay the complexity of the programming behind LLMs), from the realization that we use language in highly predictable, statistically model-able ways, coupled with the fervent belief that what can be captured in text is coextensive with what we call intelligence. That is, we tend to believe that “intelligence” is more or less the same as facility with the written word (and numerals: mathematics is also fundamentally an activity of literacy). Everything we use to “measure intelligence” is based on this facility: school, IQ tests, the SAT and GRE, and so on. It’s an odd belief, given that no humans were literate for many millennia, and yet we must already have been displaying the intelligence that would allow us to invent writing, mathematics, and then computers. And yet we can hardly bring ourselves to believe otherwise.
¹ If a linguist is doing this kind of work primarily for linguistics research, it’s usually considered a branch of “computational linguistics.” If a programmer is doing it to build or improve some kind of software, it’s usually considered a branch of “natural language processing.” There are technical distinctions between the various subfields of these two larger categories, but they are not relevant to this story.