Cupid, Draw Back Your Bow and Connect My Data

May 29, 2018

Forrester Research estimates that 25% of enterprises have implemented graph databases, and Gartner states that graph analysis is the “single most effective competitive differentiator.” Typically, though, content is stored in relational databases. Those databases house data and allow for retrieval, and that data may or may not be interconnected. Conversely, graph databases are all about connections between seemingly disparate data.

In the early days of relational databases, data collection was a large part of the puzzle. Pre-internet, the number of news sources was much smaller, and the metadata involved was much more segmented—names, dates, and places. The format and variety of Big Data have changed just about everything that we believed to be true regarding database contents and creation.

Data comes in myriad formats, including server log files, videos, still images, and sensor data created by the Internet of Things (IoT). Such varied data can be difficult to organize into a retrieval system. Because analysis is one of the most important aspects of Big Data initiatives, the data must be stored in a way that allows for easy retrieval.

Relational databases are typically used for queries on single topics. Different aspects of the query can be linked together, but this often requires knowledge of Boolean operators and nesting. In contrast, graph databases store the relationships between topics, along with metadata about each relationship, so that both the connections and the facts about them are easily discovered.
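To make the idea of a relationship that carries its own metadata concrete, here is a minimal sketch in Python. It is not how any particular graph database is implemented; the class names (Edge, TinyGraph), the relationship labels, and the sample people and company are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """One relationship between two topics, plus facts about that relationship."""
    source: str
    relation: str
    target: str
    properties: dict = field(default_factory=dict)


class TinyGraph:
    """Toy in-memory graph; a stand-in for illustration, not a real database."""

    def __init__(self):
        self.edges = []

    def connect(self, source, relation, target, **properties):
        # The keyword arguments become metadata stored on the relationship itself.
        self.edges.append(Edge(source, relation, target, properties))

    def relationships(self, node):
        # Every edge that touches the node, in either direction.
        return [e for e in self.edges if node in (e.source, e.target)]


# Hypothetical sample data.
graph = TinyGraph()
graph.connect("Alice", "WORKS_AT", "Acme Corp", since=2015)
graph.connect("Bob", "WORKS_AT", "Acme Corp", since=2017)
graph.connect("Alice", "FRIEND_OF", "Bob", met_at="conference")

for edge in graph.relationships("Alice"):
    print(edge.source, edge.relation, edge.target, edge.properties)
```

Asking for everything connected to "Alice" returns both her employer and her friend in one step, with the details of each relationship attached, and no join across separate tables is needed.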

A good example of the power of graph databases is their use by the online dating industry. According to Forbes, several worldwide online dating companies have harnessed the power of graph databases to recommend dates with people who are in users’ extended social networks (i.e., “friends of friends”). Statistics show that people are more likely to go out on a date with a known entity.

Online dating sites’ graph databases allow for queries such as “find all men who are connected within three friends of my women friends who like sailing but not bowling and who live within 30 miles of my ZIP code.” While relational databases are still superior when exacting results and accurate calculations are needed, dating sites are a natural fit for graph databases. They do not require exact matching; users are merely looking for suggestions.
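A query like the one quoted above amounts to a short walk across a friendship graph followed by a few filters. The sketch below shows one way that could look in plain Python, using a breadth-first search limited to three hops. It is an illustration under stated assumptions, not how any dating site actually works: the people, their attributes, the precomputed miles_away field, and the function names are all invented, and a real graph database would express this in its own query language.

```python
from collections import deque

# Hypothetical sample data: every name, attribute, and distance is invented.
PEOPLE = {
    "me":   {"gender": "F", "likes": {"sailing"}, "miles_away": 0},
    "ana":  {"gender": "F", "likes": {"hiking"}, "miles_away": 5},
    "bea":  {"gender": "F", "likes": {"bowling"}, "miles_away": 12},
    "carl": {"gender": "M", "likes": {"sailing"}, "miles_away": 8},
    "dev":  {"gender": "M", "likes": {"sailing", "bowling"}, "miles_away": 10},
    "eli":  {"gender": "M", "likes": {"sailing"}, "miles_away": 45},
    "finn": {"gender": "M", "likes": {"sailing", "chess"}, "miles_away": 20},
}

# Friendships are mutual, so each pair is stored once and expanded both ways.
FRIENDSHIPS = [
    ("me", "ana"), ("me", "bea"), ("ana", "carl"),
    ("ana", "dev"), ("bea", "eli"), ("carl", "finn"),
]


def build_graph(edges):
    """Turn a list of friendship pairs into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable person to their hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, ()):
            if friend not in hops:
                hops[friend] = hops[person] + 1
                queue.append(friend)
    return hops


def find_matches(graph, people, me, max_hops=3, max_miles=30):
    """Men near my women friends who like sailing, not bowling, and live close."""
    matches = set()
    for friend in graph.get(me, ()):
        if people[friend]["gender"] != "F":
            continue
        for person in within_hops(graph, friend, max_hops):
            info = people[person]
            if (person != me
                    and info["gender"] == "M"
                    and "sailing" in info["likes"]
                    and "bowling" not in info["likes"]
                    and info["miles_away"] <= max_miles):
                matches.add(person)
    return sorted(matches)


print(find_matches(build_graph(FRIENDSHIPS), PEOPLE, "me"))  # ['carl', 'finn']
```

In this toy data, dev is dropped because he likes bowling and eli because he lives 45 miles away, which leaves carl and finn as suggestions. The result is a list of candidates rather than an exact answer, which matches the point that dating sites only need plausible suggestions.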

Journalism is another vertical in which graph databases can be a game changer, since large-scale investigations often require combing through vast amounts of documents. The International Consortium of Investigative Journalists (ICIJ) won the 2017 Pulitzer Prize for Explanatory Reporting for its Panama Papers investigation, which uncovered offshore shell companies involved in financial fraud and tax dodging and linked them to more than 140 politicians in 50-plus countries. The ICIJ reviewed 11.5 million documents, using a graph database to search the documents and detail connections between people, places, and companies in each.

The query capability of this graph database was unique; it allowed lists of politicians’ names to be uploaded and searched for mentions. It also had the capacity for visual search. After keywords were entered into a Google-like search box, the results appeared visually as dots, which could be expanded to literally connect the dots. This experience ensured that the ICIJ did not need to reinvent the wheel when it analyzed the Paradise Papers in November 2017.

Ontotext uses the acronym SMART (speed, meaning, answers, relationships, and transformation) to characterize the five key drivers of graph databases. Fast querying and quick analysis of data save time and money. Analysis of relationships between data points and text creates information that is retrievable by its meaning, and more complex and intricate answers are returned. While some relationships are apparent, many are hidden, and graph databases reveal these connections and link them to even more relationships. These four aspects of graph databases lead to the fifth driver, transformation, by translating data and content into a multidimensional view of research challenges.
