Approaches to Discovery
Broadly defined, information discovery is knowing where enterprise information is and being able to access it without compromising data integrity and security. It is not, of course, a stand-alone function. It is tightly related to data access and data management practices. Indeed, information discovery is increasingly being subsumed into enterprise functionality, an acknowledgment of the key role search now plays in the core infrastructure layer.
Because the benefits of good information discovery are outnumbered only by the approaches to achieving it, enterprises should have a clear view of the desired outcome of their information discovery processes. Is it to minimize the number of false matches in a query? Is it to uncover hidden relationships between data sets? Is it to combine results of both structured and unstructured data sets, inside and outside the firewall? Andrew McKay, SVP of sales for Attivio, Inc., which provides solutions for information access, contends that “Fifty percent of customers are dissatisfied with their information discovery vendors, not because the product is bad but because they asked the wrong questions in choosing it.”
Some of the emerging technologies for information discovery are summarized below:
- Concept and context searching: Concept searching tools use proprietary methods to group common sets of words, phrases, and properties to create associated lists of concepts, topics, facets, and names. Vendors such as Recommind and Collexis each have a proprietary method for ranking and grouping related words and properties to extract the relationships between the vast numbers of items in the typical data corpus.
- Visualization: These tools provide a visual map of data content that can be used to organize and prioritize documents. Vendors such as Xeround and Advanced Visual Systems enable users to navigate through a visual representation of data sets and their relationships by zooming in on areas of interest, filtering out unwanted items, and drilling down for further details.
- Guided Navigation: This approach, used by Endeca, presents multiple categories drawn from the entire data set being analyzed, from data subsets, or from the results of search queries. Categories might include key phrases, monetary figures, people, etc. Within each category, a relevance-ranked list of matches is presented, enabling users to drill down into the collection as they determine the most relevant answer set for their query.
- Clustering: Clustering tools such as Vivisimo’s Clusty and Hot Neuron, LLC’s Clustify use proprietary algorithms to automatically relate documents and cluster them into groups for review. The underlying technology that performs this sorting can be based on concept clustering, statistical methods, linguistic analysis, or thesauri.
Other approaches include statistical analysis, classification, and hybrid models that combine multiple techniques.
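To make the guided navigation idea in the list above more concrete, here is a minimal Python sketch of category counts and drill-down over a toy document set. It is an illustration only and does not represent Endeca's product or any vendor's method: the documents, the field names (`people`, `topics`), and the helper functions `facet_counts` and `drill_down` are all hypothetical, and values are ordered by simple document counts rather than by the relevance ranking a commercial tool would apply.

```python
from collections import Counter

# Hypothetical toy corpus; ids, field names, and values are invented for illustration.
DOCS = [
    {"id": 1, "people": ["Lee", "Patel"], "topics": ["finance"]},
    {"id": 2, "people": ["Lee"], "topics": ["finance", "legal"]},
    {"id": 3, "people": ["Kim"], "topics": ["legal"]},
    {"id": 4, "people": ["Patel"], "topics": ["hr"]},
]


def facet_counts(docs, fields):
    """For each category (field), count how many documents carry each value."""
    counts = {field: Counter() for field in fields}
    for doc in docs:
        for field in fields:
            # set() ensures a value is counted once per document.
            counts[field].update(set(doc.get(field, [])))
    return counts


def drill_down(docs, field, value):
    """Keep only the documents that carry the chosen value in the chosen category."""
    return [doc for doc in docs if value in doc.get(field, [])]


# Categories over the whole collection.
overview = facet_counts(DOCS, ["people", "topics"])

# The user picks topics = "legal"; categories are recomputed on the subset.
legal_docs = drill_down(DOCS, "topics", "legal")
legal_overview = facet_counts(legal_docs, ["people", "topics"])

for field, counter in legal_overview.items():
    print(field, counter.most_common())
```

Each drill-down narrows the collection and recomputes the categories on what remains, which is what lets a user refine an answer set step by step instead of rewriting the query.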
Market Contraction Ahead
It’s fair to say that the market for information discovery has reached an inflection point as vendors seek to incorporate discovery directly into a variety of enterprise content management solutions. Autonomy’s January 2009 acquisition of content management player Interwoven, for instance, is a play to offer an integrated content creation and discovery suite of solutions to its customers. In a January 2009 research note called “Reduce the Cost and Risk of E-Discovery in 2009,” Gartner’s Logan says, “partnerships, mergers and acquisitions among software vendors will accelerate during 2009, resulting in suites of e-discovery functions, but vendors will still not deliver true integration or ‘one stop’ e-discovery platforms.”
Some people in the market believe that a movement toward vertical market solutions is also in the cards for 2009. Endeca’s Agarwal says, “There will be a continued movement toward tailored applications that push ROI. Instead of generalized solutions, customers will want specific business solutions based on the role of the user.” In expectation of this, Agarwal says Endeca is moving toward a configurable application stack model so that customers can choose a configuration metaphor and drastically reduce the implementation time frame of a custom solution.
Recommind’s Carpenter believes this desire for vertical solutions will come from within the enterprise as well. “The one-size-fits-all guys need to get bought quickly,” he contends. “Customers are looking for more specific applications of search directed at their roles like business intelligence or knowledge management.” Microsoft’s July 2008 acquisition of startup Powerset, which provides an infrastructure for semantic searching on specific sites (in this case, Wikipedia), lends credence to Carpenter’s theory.
Unfortunately, the current economic downturn is also likely to reduce the number of information discovery vendors. In a field like information discovery, where new methods and solutions are still being sought, it’s healthy to have a large number of innovators working on the problem. If they’re forced to close shop due to economic pressures, the quality of available solutions may suffer in the long term.