Kyndi works in the domain of modeling and understanding natural language. To explain why this is a worthy challenge, I’d like to share some thoughts on what makes language, or “unstructured data,” so hard to work with.
Every piece of writing is more than the sum of its words. Think of writing—or speaking—as a way of skipping stones over the surface of a body of complex ideas. The ideas a poem tries to convey may seem explicit to the poet, but the truth is any human communication is relatively low fidelity. Just as meaning changes in the children’s game “Telephone” when a word or phrase is passed from ear to ear, what a writer intends to communicate is always partly lost on the reader, who may invent new, unintended meanings for the words. This is just one of the challenges of understanding language—along with that of making sense of repeating patterns, interaction, metaphor, and more.
The meaning of every word depends on the words that surround it, on the writer and the reader, and on the implied intent of the whole document. Without fixed meaning, simply identifying what a word denotes, or enumerating its multiple senses, cannot help us understand the whole.
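To make the point concrete: even the crudest computational treatments of word sense lean entirely on surrounding words. Here is a toy sketch, in the spirit of the classic Lesk algorithm, that picks a sense of "bank" by counting overlaps between a hand-written sense gloss and the context. The sense inventory and glosses are invented for illustration; this is not Kyndi's method, just a minimal demonstration that neighbors carry the meaning.

```python
# Toy, Lesk-style word-sense disambiguation (illustrative only).
# The sense whose gloss shares the most words with the surrounding
# context wins -- meaning "flows in" from the neighboring words.

SENSES = {
    "bank": {
        "financial institution": {"money", "deposit", "loan", "account"},
        "river edge": {"river", "water", "shore", "fishing"},
    }
}

def disambiguate(word, context):
    """Return the sense of `word` whose gloss overlaps the context most."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss_words in SENSES[word].items():
        overlap = len(gloss_words & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "she opened a deposit account at the bank"))
# -> financial institution
print(disambiguate("bank", "they went fishing on the river bank"))
# -> river edge
```

Swap in a different context and the "same" word lands on a different sense, which is precisely why denotation alone is a dead end.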
Language flows like water
One way to grasp how meaning arises is to think of it flowing like water across and between words. The sense that emanates from three words strung together is like the ripples emerging from three stones thrown into a pond. Similarly, the meaning of a thousand words is like the ripples radiating from where a thousand stones are thrown into the same pond—sometimes over great distances. This may sound strangely poetic for a technology blog, but it reflects something real: meaning crystallizes in context, emerging from the interactions of parts of a whole.
To understand how significance emerges from a manuscript, looking at the meaning of individual words, or even individual sentences, is simply not good enough. And narrative flow is a far more elusive beast to track. These tasks are further complicated by the subtle variations between and within languages.
The map is not the territory
We generate maps of reality when we speak or write. And different languages draw different mappings—or concept boundaries—of the world. “One, like English, might distinguish running water (river) from still water (lake),” writes linguist Geoffrey K. Pullum. “Another, like Inuit, might distinguish falling snow (qanik) from fallen snow (aput).”
Consider, for example, the “Inuit snow words” meme—apocryphal tales of early 20th-century anthropologist Franz Boas’ discovery about the seemingly infinite number of subtly evocative Inuit words for snow—stories that Pullum calls a “constantly changing, self-regenerating myth.” Living and traveling among the Inuit of Baffin Island in the 1880s, Boas recorded a number of terms for ice and snow—words like aqilokoq for softly fallen snow, and mauja for soft snow on the ground. It’s true, say SIKU Project authors Igor Krupnik and Ludger Müller-Wille, that through “thoughtless recycling” of the stories, the number Boas originally recorded “snow-balled” into hundreds. Nonetheless, the terminology is intriguing. Contemporary lists include words like “apputtattuq, snow that accumulates on the newly formed ice and causes its thinning,” and “kiviniq, wet snow sinking into the sea ice.”
Compare these with the outdoorsy or meteorological connotations of English words like sleet, slush, corn, and powder, or with the magical realism of Sarah Addison Allen’s “snow globe world” where snow flurries swirl “around people’s legs like house cats.” And how different is the snow that falls from the sky from the snow that overwhelms (a “snow job”) or from the snow on an old TV screen?
The same effects occur when we move from one scientific domain to another, or between the disciplines of business and legal writing. Raymond Queneau’s Exercises in Style illustrates this nicely. Queneau’s book tells the story 99 times of a chance encounter between two men on a bus. He tells the story once in metaphor, once in the language of precision (“In a bus of the S-line, 10 metres long, 3 wide, 6 high, at 3 km. 600 m. from its starting point, loaded with 48 people…”), and once in the language of geometry (the bus where two “homoids” meet is now a “rectangular parallelepiped”). Queneau tells versions of the story in the past and present tense, in haiku and rhyming slang, and in the languages of abuse, reported speech, probability, cross-examination, and negativities (“neither the morning, nor the evening, but midday”). He tells the story interpolated with “you knows”: “Well, you know, the bus arrived, so, you know, I got on.”
Underlying all these different ways of telling the same story is our experience of being human, the core of our technologies, our societies, and our cultural institutions.
Language is at the core of the human experience
Language, and how we use it, underlies nearly everything in the human experience. Even our social structures rely on how we choose to speak and write. In his book Sapiens: A Brief History of Humankind, the historian Yuval Noah Harari wrote of the role language and common myth play in our experience. “Large numbers of strangers can cooperate successfully by believing in common myths,” says Harari. “Any large-scale human cooperation—whether a modern state, a medieval church, an ancient city or an archaic tribe—is rooted in common myths that exist only in people’s collective imagination.”
Given the scope of its implications, is it any wonder that our automated systems struggle to understand all that language can convey? The discipline of natural language processing (NLP) has come a long way, but to bring our systems to the first stages of understanding the underlying meaning of everyday language, we need new capabilities, new methods, and a new approach to AI.
In my next blog post, I will offer some thoughts on the business implications of a more capable natural language understanding (NLU) system.