In 2004, Kyndi’s Paul Tarau had an interesting idea. What if you could create an algorithm similar to Google’s PageRank algorithm that ranked the relative importance of words and sentences in text documents? The algorithm could be very useful in Natural Language Processing. It would extract the meaning from text documents and summarize it for researchers.
Google founders Larry Page and Sergey Brin created the PageRank algorithm in 1996 when they were students at Stanford University. PageRank revolutionized web searching and put Google on the map. The algorithm offered a more reliable way to measure the importance of web pages in search engine results. An algorithm similar to PageRank for ranking text in documents would use a connectivity graph to identify and rank the most important words and sentences, effectively summarizing the document and pointing the researcher to essential information.
In 2004, Tarau created this algorithm with his colleague Rada Mihalcea. They called it TextRank. The paper they wrote describing their algorithm, “TextRank: Bringing Order into Texts,” has become standard reading in the field of Natural Language Processing. The paper paved the way for a new research area in the field: graph-based methods of processing language. As of today, “TextRank: Bringing Order into Texts” has been cited 1,519 times in various scientific publications.
TextRank takes a holistic view of the meaning of text in a document. It identifies connections between the different text entities (words, sentences, paragraphs) using the concept of recommendation. For example, as part of identifying the important sentences in a text, TextRank notes when two sentences discuss a similar topic; these sentences are thought to recommend each other. A sentence that is highly recommended by other sentences is considered more informative, and is therefore given a higher score in the text ranking; the sentence’s information is given a more prominent place in the document summary.
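The recommendation idea can be sketched in a few lines of Python. This is a minimal illustration, not Kyndi's implementation: the function names (split_sentences, similarity, rank_sentences, summarize) are hypothetical, the sentence splitter is deliberately naive, and the similarity measure counts distinct shared words, which simplifies the word-overlap measure described in the TextRank paper. The scoring loop follows the weighted PageRank-style update with the customary damping factor of 0.85.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (an assumption;
    a real system would use a proper sentence tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared distinct words, normalized by the log of each sentence's size."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score each sentence by how strongly the other sentences recommend it."""
    n = len(sentences)
    # weights[j][i] is the strength with which sentence j recommends sentence i.
    weights = [
        [similarity(sentences[j], sentences[i]) if i != j else 0.0 for i in range(n)]
        for j in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling summarize(text, k=2) on a short passage returns the two sentences that share the most vocabulary with the rest of the text. A sentence that shares no words with any other receives only the baseline score of 1 - damping, so it is the first to be dropped from the summary.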
Because it maps and ranks the words and sentences of a document in a graph, TextRank can produce summaries ranging from a few words to several hundred words, according to the researcher’s wishes. TextRank requires no training corpora (in linguistics, corpora are large sets of texts used to validate linguistic rules and perform statistical analysis), so it can be adapted to other languages and domains. It can also be adapted to different kinds of documents, including news publications, social network posts, and even tweets.
Kyndi’s more comprehensive TextRank algorithm is part of the Kyndi solution, and Paul Tarau believes it will be invaluable, especially in science and technology research. “We are looking at an explosion of scientific documents,” says Tarau. “The number of new research papers keeps getting larger. Finding the relevant information in these papers very quickly and as much as possible automatically is essential. It is also one of the big promises that we can offer at Kyndi.”