Dieselpoint announced a new product in late April, called OpenPipeline, which is open source middleware for crawling, parsing, analyzing, and routing documents. The product was announced at the Infonortics Search Engine Meeting in Boston, where Dieselpoint CEO Chris Cleveland provided an introduction to the product, its implementation, and its underlying code. He also conducted a live demonstration. The software provides a common architecture for connectors to data sources, file filters, text analyzers, and modules to distribute documents across a network.
Dieselpoint says that it is working with enterprise search companies as well as connector and text analysis companies in an effort to drive innovation in several key areas of enterprise search technology. OpenPipeline’s primary goal is to support massive scalability and to keep configuration, integration, and application design simple, straightforward, and elegant. The software comes fully functional with prebuilt components, but it is also capable of integrating third-party modules. As such, plug-ins for crawling content management systems, parsing special file formats, and performing text analytics can be used through OpenPipeline.
"Every vendor has its own form of the pipeline: built-in crawlers, document folders, analyzers, and whatnot," Cleveland says. "They are all proprietary and this then makes it difficult for third parties to interact with the software, so we decided to open source the pipeline so that more parties can access it."
Dieselpoint points to the flexibility of OpenPipeline’s source code, which makes it easier for more parties to use, contribute to, and enhance the product. Cleveland believes that organizations stand to gain from participating in this "little ecosystem" because most companies are obligated to build their connections to content management systems from the ground up, whereas OpenPipeline provides the opportunity for anyone to come in and use what has already been built.
"If we provide them with the framework, then they have the opportunity to distribute their own products," Cleveland says.
The Infonortics Search Engine Meeting, where OpenPipeline was officially announced, is an annual meeting in its 13th year of operation. It provides a forum and a point of reference for anyone interested in search and retrieval technologies. It aims to draw together professionals interested in search engines—such as designers and developers—as well as corporate clients who are interested in implementing innovative search technologies in their own companies.
Cleveland and Dieselpoint accept as a given that search engines need to process data before it is indexed in order to make it more "searchable." An effective search solution, such as one built on OpenPipeline, needs crawlers that can pull content from a content management system, the internet, or a website. OpenPipeline transforms content into a standard form, such as text for Microsoft Word files or HTML for media files. Once the content has been crawled and converted, it is tokenized, making it easy to search all of the available text and metadata.
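The crawl, convert, and tokenize sequence described above can be illustrated with a minimal sketch in Python. It is not OpenPipeline's code or API: the `Document` class, the stage functions, and `run_pipeline` are hypothetical names invented for illustration, and the "crawler" simply reads from an in-memory dictionary.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Document:
    """A crawled item moving through the pipeline (hypothetical shape)."""
    source: str
    raw: str
    text: str = ""
    tokens: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


# A stage takes a document and returns it after processing.
Stage = Callable[[Document], Document]


def crawl(items: Dict[str, str]) -> Iterator[Document]:
    """Stand-in crawler: yields one Document per (source, content) pair."""
    for source, raw in items.items():
        yield Document(source=source, raw=raw, metadata={"source": source})


def strip_markup(doc: Document) -> Document:
    """Filter stage: reduce HTML-like input to plain text (the standard form)."""
    no_tags = re.sub(r"<[^>]+>", " ", doc.raw)
    doc.text = " ".join(no_tags.split())
    return doc


def tokenize(doc: Document) -> Document:
    """Analyzer stage: lowercase the text and split it into word tokens."""
    doc.tokens = re.findall(r"\w+", doc.text.lower())
    return doc


def run_pipeline(docs: Iterable[Document], stages: List[Stage]) -> Iterator[Document]:
    """Pass every document through each stage in order."""
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
        yield doc


if __name__ == "__main__":
    items = {"intranet/page1.html": "<p>Open source <b>middleware</b></p>"}
    for doc in run_pipeline(crawl(items), [strip_markup, tokenize]):
        print(doc.source, doc.tokens)
```

Because each stage is an ordinary function with the same signature, a third-party filter or analyzer could be dropped into the list of stages without changing the rest of the sketch, which is the kind of plug-in arrangement the article describes.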
Enterprise search is not as exciting as web searching, Cleveland says, because it entails all of the nitty-gritty preparation for search—that is, it requires doing all of those things you need to do to get a document and standardize it before indexing. OpenPipeline, he says, aims to streamline the preparation process through its innovative document-processing capabilities.