Welcome to Doug Bryan, SVP Data Products at Merkle, as a guest blogger. Doug has decades of software and data science experience including practice lead at RichRelevance, VP of Analytics at iCrossing, product recommendations lead at Amazon.com, R&D manager at Accenture, and lecturer at Stanford University.
Doug provides background on the evolution of data structures and, specifically, the benefits of knowledge graphs. Knowledge graphs are a critical aspect of Kyndi’s approach to AI: they capture facts about people, actions, places and things, and how these entities relate.
Please let us know what you think.
From Numbers to Ratios to Vectors to Graphs
Bottom line up front: information retrieval models have evolved from counts to ratios to vectors, and graphs are next. Graphs give meaning to the relationships between concepts, enable inference, and are made practical by the relentless increase in available computational power and digital storage.
Counts: The first popular model was “bag of words,” where a document is represented by counts of how many times each word occurs. For example, “You’re a good man with a good heart. And it’s hard for a good man to be king,” has counts “a” = 3, “good” = 3, “man” = 2, etc. A weakness of this model is that it treats all words as equally important, which they aren’t: “heart” is better at differentiating documents than “a”. So information retrieval moved to ratios.
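The counting step can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function name `bag_of_words` is made up for this example, and it assumes words are simply runs of letters and straight apostrophes.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its count."""
    # Assumption: a "word" is a run of letters and straight apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


quote = ("You're a good man with a good heart. "
         "And it's hard for a good man to be king.")
counts = bag_of_words(quote)

print(counts["a"], counts["good"], counts["man"])  # 3 3 2
```

Note that the result keeps no record of word order, which is why the model is called a "bag."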
Ratios: “Term frequency * inverse document frequency,” or TFIDF, weights a word by how often it appears in a document, scaled down by how common it is across all documents. “A” appears in almost every document so it doesn’t tell us much about a specific document, while “heart” is far less common and thus better at distinguishing one document from another. TFIDF is widely used today, but as document collections grew, more information was needed to match queries to documents. Query expansion adds information to a query, such as adding “cheap laptop” to “low-cost computer,” or “horse” to “mustang sanctuary.” However, finding relationships such as that between “laptop” and “computer” is complicated, so models evolved to vectors.
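As a rough sketch of the idea, the snippet below scores each word as its term frequency times the logarithm of (number of documents / number of documents containing the word). This is one common TFIDF variant, chosen here for illustration; the three-document corpus and the function name `tfidf` are invented for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Score each word in each document as tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({
            word: (tf[word] / len(words)) * math.log(n_docs / df[word])
            for word in tf
        })
    return scores


# Toy corpus, made up for illustration.
docs = [
    "a good man with a good heart",
    "a hard day for a king",
    "a man and a horse",
]
scores = tfidf(docs)

print(scores[0]["a"])                         # 0.0: "a" is in every document
print(scores[0]["heart"] > scores[0]["man"])  # True: "heart" is rarer
```

Because “a” occurs in every document, its weight drops to zero, while “heart,” which occurs in only one, scores highest among the single-occurrence words.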
Vectors: A vector is a list of numbers, like (longitude, latitude, altitude). Vector encodings use machine learning to generate hundreds of numbers for each word based on the other words it appears near in documents, resulting in similar words, such as “laptop” and “computer,” having similar vectors. That simplifies query expansion, automates the discovery of similar words, and adds more information to concepts, but uses a lot more computation.
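To make "similar vectors" concrete, here is a small sketch that compares words by cosine similarity, a standard way to measure how closely two vectors point in the same direction. The three-number vectors below are invented by hand for illustration; real encodings such as word2vec are learned from text and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, hand-made vectors; not output from any trained model.
vectors = {
    "laptop": [0.9, 0.8, 0.1],
    "computer": [0.8, 0.9, 0.2],
    "horse": [0.1, 0.2, 0.9],
}


def most_similar(word):
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("laptop"))  # computer
```

A query expansion step could use a lookup like `most_similar` to add related terms to a query automatically.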
“Computation and storage is basically free.” — Tom Siebel, November 2017
The following chart illustrates how computation and digital storage per dollar have increased over the past 37 years, adjusting for inflation. Today, a dollar buys 570 million times more computation than it bought in 1980, and 97 million times more storage. So if computation is free, then what’s next after vectors? Graphs.
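For a sense of scale, those multiples can be turned into doubling times. The sketch below assumes smooth exponential growth over the 37 years, which is a simplification of the real year-to-year price changes; the function name is made up for the example.

```python
import math


def doubling_time_years(growth_factor, years):
    """Years per doubling, assuming steady exponential growth."""
    return years / math.log2(growth_factor)


compute_doubling = doubling_time_years(570e6, 37)  # computation per dollar
storage_doubling = doubling_time_years(97e6, 37)   # storage per dollar

print(f"{compute_doubling:.2f} years, {storage_doubling:.2f} years")
```

Under that assumption, computation per dollar doubled roughly every 1.3 years and storage per dollar roughly every 1.4 years.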
Vectors discover some relationships between concepts but don’t identify the relationships. They discover that “computer” and “laptop” are related, but not how. Are all computers laptops, or are all laptops computers, or is a laptop part of a computer? Graphs address how they’re related.
Below is an example graph for movies. A movie has a director, authors and an aggregate rating. Directors and authors are persons, persons have a birthplace, and places have addresses and geographic locations.
Here’s an instance of the graph for this year’s best-selling movie (so far). Black Panther was written and directed by Ryan Coogler, Coogler also directed Creed and Fruitvale Station, and he was born in Oakland, California. The graph enables inferences such as: the director of Black Panther was born in Oakland.
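The Black Panther example can be sketched as a list of (subject, relationship, object) triples. This is only an illustration of the idea, not how any particular graph database stores data: the relationship names follow the Schema.org properties `director`, `author` and `birthPlace`, and the helper functions `objects` and `follow` are made up for the example.

```python
# Facts from the example above, as (subject, relationship, object) triples.
triples = [
    ("Black Panther", "director", "Ryan Coogler"),
    ("Black Panther", "author", "Ryan Coogler"),
    ("Creed", "director", "Ryan Coogler"),
    ("Fruitvale Station", "director", "Ryan Coogler"),
    ("Ryan Coogler", "birthPlace", "Oakland, California"),
]


def objects(subject, relationship):
    """All objects linked to `subject` by `relationship`."""
    return [o for s, r, o in triples if s == subject and r == relationship]


def follow(start, path):
    """Follow a chain of relationships from `start`; return the end nodes."""
    frontier = [start]
    for relationship in path:
        frontier = [o for node in frontier for o in objects(node, relationship)]
    return frontier


# Inference: where was the director of Black Panther born?
print(follow("Black Panther", ["director", "birthPlace"]))
# ['Oakland, California']
```

No triple states the answer directly; it comes from chaining two named relationships, which is the kind of inference vectors alone don’t support.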
Here’s another example from pharmaceuticals. A drug has alternative drugs, interacts with other drugs, treats conditions and is used in therapies.
Some of those relationships for Acebutolol follow. From this we can infer, for example, that Clonidine has interactions with hypertension drugs.
- Bag of words: Gerard Salton et al. (1975) “A vector space model for automatic indexing,” Communications of the ACM 18(11):613–620. Cornelis Joost van Rijsbergen (1979) Information Retrieval, London: Butterworths http://openlib.org/home/krichel/courses/lis618/readings/rijsbergen79_infor_retriev.pdf. For a recent survey and bibliography, see Chris Potts (2013) “Distributional approaches to word meanings,” Stanford University, Ling 236/Psych 236c: Representations of meaning, May https://web.stanford.edu/class/linguist236/materials/ling236-handout-05-09-vsm.pdf
- Word encodings: Word2vec, Google code archive https://code.google.com/archive/p/word2vec/. Wikipedia https://en.wikipedia.org/wiki/Word2vec. Tomas Mikolov et al. (2013) “Distributed representations of words and phrases and their compositionality,” https://arxiv.org/abs/1310.4546v1
- “Movie,” Schema.org https://schema.org/Movie
“Drug,” Schema.org https://health-lifesci.schema.org/Drug
- Black Panther, Internet Movie Database. Retrieved September 1, 2018. https://www.imdb.com/title/tt1825683/
- Tom Siebel, AWS re:Invent 2017, November 2017 https://www.youtube.com/watch?v=2ozr788H0iU&t=4m58s
- Cost of compute power 1980 to 2010: “One dollar’s worth of computer power, 1980-2010,” The Hamilton Project, The Brookings Institution, Feb. 2015
2011 to 2014: Yoav Mor (2015) “Analyzing AWS EC2 price drops over the past 5 years,” Cloudyn.com https://www.cloudyn.com/blog/analyzing-aws-ec2-price-drops-over-the-past-5-years/.
Todd Meyer (2014) “Don’t let the ‘cloud cost war’ stop your next reserved instance purchase,” Cloudability.com https://blog.cloudability.com/cloud-cost-war-shouldnt-stop-buying-reserved-instances/
2014 to 2017: TSO Logic (2018) “Amazon Web Services delivers continual cost savings,” https://tsologic.com/amazon-web-services-delivers-continual-cost-savings-tso-logic-study-reveals/
- Cost of digital storage: John C. McCallum (2015) “Disk drive prices (1955–2015),” jcmit.com https://jcmit.net/diskprice.htm, https://web.archive.org/web/20150714062141/http://www.jcmit.com/diskprice.htm
Andy Klein (2017) “Hard drive cost per gigabyte,” Backblaze.com https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/
Lucas Mearian (2017) “CW@50: Data storage goes from $1M to 2 cents per gigabyte,” Computerworld, March 23 https://www.computerworld.com/article/3182207/data-storage/cw50-data-storage-goes-from-1m-to-2-cents-per-gigabyte.htm