Get a Grip on Grids: Can Distributed Computing Power Collaboration and Content Management?



BEST PRACTICES SERIES

"There's no doubt that computational demands in the content business will rise over time," says Mike Riley, chief scientist at the RR Donnelley Technology Center in Downers Grove, Illinois. "The real question is how quickly Moore's Law can keep up with these demands, especially as the world becomes more digitally enabled."

Riley's question arises from his experience helping Donnelley—founded in 1864—stay current in its portfolio of business communication services, used by a worldwide customer base that includes publishers, merchandisers, and companies in the telecommunications, financial services, and healthcare industries. But his point applies equally well in areas outside Donnelley's focus, ranging from entertainment sales and online gaming to the pure science of high-energy physics. While Moore observed, back in the mid-sixties, the rate at which processors increase in power, he didn't envision (or at least address) the rate at which demand for processing power and network throughput would grow. Today, those demands threaten to eat up the gains that chip makers have so far been able to deliver. And wherever demand for computing resources exceeds supply, IT systems—and the human activities that rely on them—will be constrained from achieving their full potential.

It's largely to transcend these constraints that a growing number of technology strategists have been working toward a redefinition of the classic paradigm of simply trying to outrun increased demand with faster processors and networks. Instead, the idea is that the availability of resources to meet tomorrow's computing needs is best assured not simply by boosting speed, but by fundamentally reorganizing the way that individual computing resources within an organization are allocated to accomplish common goals. The approach, gaining currency under the name "Grid computing," is almost a kind of communism for computers—"from each workstation according to its abilities, and to each according to its needs." But Grid advocates are neither utopians nor ideologues, just people with lots of work to do who need enough computing power to do it. The successful deployment of Grids will make possible the development of markets for providing that power when and where it's most in demand.

Essence of Grid
To the extent that the Grid revolution has a central command, it can be found at Argonne National Laboratory near Chicago, professional home to leading Grid advocates Ian Foster and Steve Tuecke. The two have been working on the Grid concept since at least the mid-nineties, and along with Carl Kesselman of the University of Southern California, they have penned definitive descriptions of Grid computing such as "The Physiology of the Grid" and "The Anatomy of the Grid" (both available at www.globus.org/research/papers.html). In a July 2002 article for GRIDtoday called "What is the Grid?" (www.gridtoday.com/02/0722/100136.html), Foster boils the core concept down to its essence. A Grid, he writes, is a system that "coordinates resources that are not subject to centralized control using standard, open, general-purpose protocols and interfaces to deliver nontrivial qualities of service."

Perhaps the most widely known example of Grid computing is the SETI@home project (hosted by the University of California at Berkeley; http://setiathome.ssl.berkeley.edu), in which the unused processing cycles of more than 1.6 million Internet-connected PCs have been harnessed over the years for non-real-time data processing tasks that aid the search for extraterrestrial intelligence. But Tuecke says that Grid computing goes far beyond the utilization of distributed idle processing power. "Cycle scavenging," he says, "is one basic form of Grid, but it is only a hint at the full promise of Grid, which will take a number of years and considerably more work to fully realize."
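
To make cycle scavenging concrete, here is a minimal sketch in Python of the pattern SETI@home-style clients follow: do heavy work only when the machine looks idle, and hand results back to a coordinator. The fetch_work_unit, process, and report functions, along with the load-average threshold, are illustrative stand-ins invented for this example, not part of any real Grid client.

    import os
    import time

    IDLE_LOAD_THRESHOLD = 0.5   # assumed cutoff: below this one-minute load average, treat the box as idle

    def fetch_work_unit():
        # Hypothetical stand-in for downloading a chunk of data from a project server.
        return list(range(100_000))

    def process(work_unit):
        # Placeholder computation; a real client would run signal analysis here.
        return sum(x * x for x in work_unit)

    def report(result):
        # Hypothetical stand-in for uploading the result to the coordinating server.
        print("reporting result:", result)

    def machine_is_idle():
        # os.getloadavg() is POSIX-only; treat any failure as "not idle" to stay out of the owner's way.
        try:
            one_minute_load, _, _ = os.getloadavg()
        except (AttributeError, OSError):
            return False
        return one_minute_load < IDLE_LOAD_THRESHOLD

    if __name__ == "__main__":
        for _ in range(5):   # bounded so the sketch terminates; a real scavenger loops indefinitely
            if machine_is_idle():
                report(process(fetch_work_unit()))
            else:
                time.sleep(60)   # back off while the owner is using the machine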

One future Grid application that Tuecke cites is "utility computing," in which an organization purchases resources on demand from an external provider. "In this case," he says, "the organization's resources are being coordinated with the provider's resources, to deliver enhanced capacity and/or capability." A "Grid Bank" accounting infrastructure for the brokering of such services has been proposed by Australian computer scientists Alexander Barmouta at the University of Western Australia and Rajkumar Buyya at the University of Melbourne (see www.cs.mu.oz.au/~raj/papers/gridbank.pdf). And IBM's new CEO, Sam Palmisano, announced in a major strategy address at the end of October that the company plans to invest $10 billion in developing and marketing this type of Grid-based "on-demand" computing.

Another area where Grid computing might contribute would be in online collaborative environments. "Such environments," Tuecke says, "are increasingly moving beyond basic audio, video, and shared-document conferences among geographically distributed collaborators toward integration with other mission-critical resources such as databases, analysis systems, and instruments."

What binds various Grid applications together, says Bret Greenstein, worldwide project executive for Grid Computing at IBM, is "the virtualization of heterogeneous computing and data resources for sharing and collaboration across an infrastructure." In other words, in a Grid, the resources an individual can bring to bear for a given task are not limited to local capabilities, but rather expanded to encompass all the other resources available in the "virtual organization" defined by that particular Grid, whether it be an enterprise, a multi-campus research effort, or simply a collection of otherwise independent entities such as the participants in SETI@home.

What sets current Grid computing efforts apart from other forms of network-enabled sharing and collaboration is the emphasis on formalizing open mechanisms to support the quality-of-service (QoS) part of the equation. "The manner in which multiple resources—CPU, storage, network, scientific instruments, etc.—are brought to bear on a problem," Tuecke says, "must meet user requirements for quality of service, such as reliability, performance, and security."

The stress on QoS drives Grid computing's focus on the problems of effective resource allocation, and it defines Grid computing's role in a picture that also includes Web services. Tuecke explains that Web services—implemented with Web Services Description Language (WSDL) and Simple Object Access Protocol (SOAP)—are targeted toward specifying and performing messaging between services. Business Process Execution Language (BPEL), meanwhile, provides a high-level mechanism for defining traffic management and workflow between systems. In theory, Grid computing completes the picture by addressing issues of resource allocation (computation, storage, and bandwidth) across a network. The intent is to ensure that a requested service can actually be delivered within timeframe and cost parameters that are acceptable to all parties involved, even without knowledge of or control over all the systems involved along the way.
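
One way to picture that division of labor is with a toy scheduler: the Grid layer decides whether a job can be run within the requester's time and cost limits, and the messaging layer wraps the resulting submission in a SOAP-style envelope. The Python sketch below is purely illustrative; the provider list, the QoS fields, and the submitJob operation are assumptions invented for the example, not any particular WSDL or toolkit.

    from dataclasses import dataclass

    @dataclass
    class Provider:
        name: str
        cpu_hours_available: float
        cost_per_cpu_hour: float
        turnaround_hours: float   # estimated time to finish this job on this provider

    # Hypothetical pool of resources a Grid scheduler might know about.
    PROVIDERS = [
        Provider("local-cluster", cpu_hours_available=200, cost_per_cpu_hour=0.00, turnaround_hours=48),
        Provider("utility-vendor", cpu_hours_available=5000, cost_per_cpu_hour=0.75, turnaround_hours=6),
    ]

    def allocate(cpu_hours_needed, deadline_hours, budget):
        """Grid-style resource allocation: find a provider that can meet the requested quality of service."""
        for p in PROVIDERS:
            cost = cpu_hours_needed * p.cost_per_cpu_hour
            if (p.cpu_hours_available >= cpu_hours_needed
                    and p.turnaround_hours <= deadline_hours
                    and cost <= budget):
                return p, cost
        return None, None

    def soap_submission(provider, cpu_hours):
        """Messaging layer: wrap the job submission in a minimal SOAP-style envelope (illustrative only)."""
        return (
            '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
            f'<soap:Body><submitJob provider="{provider.name}" cpuHours="{cpu_hours}"/></soap:Body>'
            '</soap:Envelope>'
        )

    if __name__ == "__main__":
        provider, cost = allocate(cpu_hours_needed=1000, deadline_hours=12, budget=1000.0)
        if provider is None:
            print("No provider can meet the requested quality of service.")
        else:
            print(f"Selected {provider.name} at ${cost:.2f}")
            print(soap_submission(provider, 1000))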

Content Grids
Suggesting how Grid computing might apply to content management, Tuecke says that WSDL and BPEL can define operations that involve gathering, analyzing, and integrating content from multiple sources, while the role of Grid computing would be to arrange for the required resources through automated negotiation and implementation of service level agreements that might define how soon processing can be completed and the charge (if any) to the party requesting the service. This contribution is particularly important when requested operations involve computationally intensive tasks such as pattern matching (looking for pictures of cats in a set of digital image archives, for example).
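
A back-of-the-envelope version of that negotiation might look like the Python sketch below: given an archive size, a deadline, and a budget, work out how many nodes the job needs and whether the resulting cost is acceptable. The per-image processing time and the node-hour price are invented figures for illustration, not benchmarks.

    import math

    # Illustrative assumptions, not measured numbers.
    SECONDS_PER_IMAGE = 2.0        # time to scan one image for the target pattern
    PRICE_PER_NODE_HOUR = 0.50     # what the resource provider charges

    def draft_sla(image_count, deadline_hours, max_cost):
        """Sketch of automated service-level negotiation for a pattern-matching job over an image archive."""
        total_cpu_hours = image_count * SECONDS_PER_IMAGE / 3600.0
        nodes_needed = math.ceil(total_cpu_hours / deadline_hours)
        cost = nodes_needed * deadline_hours * PRICE_PER_NODE_HOUR
        if cost > max_cost:
            return None   # the Grid cannot meet both the deadline and the budget
        return {"nodes": nodes_needed, "deadline_hours": deadline_hours, "cost": round(cost, 2)}

    if __name__ == "__main__":
        # e.g., scan 500,000 archived images within 4 hours for at most $150.
        print(draft_sla(image_count=500_000, deadline_hours=4, max_cost=150.0))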

"As content becomes more digital," Riley says, "locating and managing the most atomic elements of that content for potentially millions of simultaneous users will be an exponentially difficult challenge to address. Structured content such as XML-defined contexts combined with a strict adherence to the practices associated with many commercial content management systems will help alleviate some of this burden. But the sheer volume and diversity of digital content types—text, image, audio, and video, each with their own variety of file formats—will demand a more flexible solution with massive computational capacity. This is one of the problems that Grid computing can theoretically solve."

One example of such volume and diversity in content is the Library of Congress in Washington, DC, where efforts are underway to assess the value of Grid technologies in connection with the Library's American Memory project. "American Memory is the product of a five-year public/private endeavor to make special collections from the Library of Congress available to elementary and secondary education and the general public," says Martha Anderson of the Library's Office of Strategic Initiatives. "The 116 collections now available via our Web site represent approximately eight terabytes of data in almost eight million files. A characteristic of the data is that the formats are diverse, including moving images, still images, text, and audio, with both MARC-format and XML-encoded metadata records."

Anderson says that Grid technologies offer an advantage for storing and retrieving large datasets efficiently, but that "it remains a question whether the Grid is beneficial for storing and managing large numbers of files that vary greatly in size." To find out, the Library turned to Grid middleware developed by the San Diego Supercomputer Center (SDSC). Called Storage Resource Broker (SRB) (see www.npaci.edu/DICE/SRB/), the software provides a uniform API to connect to heterogeneous resources and access datasets.
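
The appeal of a uniform interface over heterogeneous storage is easy to illustrate in a few lines of Python. The sketch below shows the general pattern only and is not the SRB's actual API: clients ask a broker for a logical name, and the broker resolves it against whichever back end holds the data (here, two in-memory archives standing in for what would in practice be tape silos, file servers, or databases).

    from abc import ABC, abstractmethod

    class StorageResource(ABC):
        """One small interface that every back end presents, however it stores bytes internally."""
        @abstractmethod
        def contains(self, logical_name: str) -> bool: ...
        @abstractmethod
        def get(self, logical_name: str) -> bytes: ...

    class InMemoryArchive(StorageResource):
        # Stand-in back end; a real deployment would wrap a filesystem, a tape silo, or a database.
        def __init__(self, objects):
            self._objects = objects
        def contains(self, logical_name):
            return logical_name in self._objects
        def get(self, logical_name):
            return self._objects[logical_name]

    class Broker:
        """Clients ask by logical name; the broker finds which resource actually holds the data."""
        def __init__(self, resources):
            self._resources = resources
        def get(self, logical_name):
            for resource in self._resources:
                if resource.contains(logical_name):
                    return resource.get(logical_name)
            raise KeyError(logical_name)

    if __name__ == "__main__":
        stills = InMemoryArchive({"civil-war/plate-001.tif": b"image bytes"})
        audio = InMemoryArchive({"buckaroos/reel-12.wav": b"audio bytes"})
        broker = Broker([stills, audio])
        print(len(broker.get("buckaroos/reel-12.wav")), "bytes retrieved")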

"We are testing SRB," Anderson says, "by installing it within our computing environment and loading content and metadata from four representative collections: ‘Civil War Photographs,' a 19th century still image collection; ‘A Ballroom Companion,' a collection of dance instruction manuals from the 19th and 20th century; ‘The Alexander Graham Bell Papers,' a manuscript collection captured as images and encoded text; and ‘Buckaroos in Paradise,' a folklife collection of audio and video recordings. This is a small pilot test that is investigating the SRB's capabilities to understand how well they can support the storage and management of large, diverse datasets."

So far, Anderson says, the Library is "in the midst of technical jockeying to get the SRB to work in our environment. It is too early in the test period to fully understand the value of the SRB for managing content and metadata."

Tools and Standards
In a sense, Grid computing as a whole is still early in its "test period" because the concepts articulated by Grid advocates have yet to achieve widespread adoption. The development of tools to make Grids a practical reality has been carried on at companies such as Platform Computing (which introduced its MultiCluster resource-sharing package in 1996), DataSynapse, and United Devices. But the drive for open standards has been led by The Globus Project, a public/private research and development effort whose backers include IBM and Microsoft. Focused on "enabling the application of Grid concepts to scientific and engineering computing," Globus has developed a set of software tools, known collectively as the Globus Toolkit, to facilitate the building of computational grids and grid-based applications. A set of interoperable components, the Toolkit serves as a foundation for the development of additional software handling higher-level services and applications. Commercial versions of the Globus Toolkit, including service and support, are available from Platform (Platform Globus) and IBM (Grid Toolbox for Linux and AIX).

The Globus Toolkit is currently at version 2.2; the next upgrade (3.0)—to be rolled out over the course of 2003—will incorporate the Open Grid Services Architecture (OGSA), providing standardized discovery, management, and monitoring facilities for coordinating multiple Web services and provisioning their associated resources. "OGSA will marry the worlds of Web Services and Grid Computing," Bret Greenstein says, "by defining a set of specifications and standards designed to enable e-business applications. These specifications bring together standards such as XML, WSDL, and SOAP with Grid computing standards developed by The Globus Project."

While Globus Toolkit 3 marks a milestone, there is still a long way to go before access to distributed computation is as straightforward as access to the Web. University of Florida physics professor Paul Avery is director of two NSF-funded Grid projects, the Grid Physics Network (GriPhyN) and the International Virtual Data Grid Laboratory (iVDGL), both scientific endeavors on a massive scale that require not only the analysis of petabyte-scale datasets, but also the collaborative effort of thousands of scientists worldwide. According to Avery, the Toolkit "provides a very basic toolbox, but it's not sufficient by itself. For one thing, the Toolkit has only recently begun to address the dimensions of data movement and access, which are required for data-intensive computing." For projects like GriPhyN, higher-level end-to-end data management capabilities must be built on top of the Toolkit's fundamental component for data movement and location, Global Access to Secondary Storage (GASS). The resolution of these high-volume data management issues will clearly be crucial for content-oriented applications.
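
One example of the "higher-level" data management such projects need is replica selection: when the same large dataset is copied at several sites, pick the copy that can be delivered fastest. The Python sketch below is a hypothetical planner over an assumed replica catalog and bandwidth estimates; it does not use the Globus or GASS interfaces, and the file names, sites, and figures are invented for illustration.

    # Hypothetical replica catalog: logical file -> {site: estimated bandwidth to the requester, MB/s}.
    REPLICAS = {
        "collision-events-2003.dat": {"site-a": 2.5, "site-b": 40.0, "site-c": 12.0},
    }
    FILE_SIZE_MB = {"collision-events-2003.dat": 750_000}   # roughly 750GB, an assumed figure

    def best_replica(logical_name):
        """Pick the site that minimizes the estimated transfer time for one logical file."""
        sites = REPLICAS[logical_name]
        size_mb = FILE_SIZE_MB[logical_name]
        site = min(sites, key=lambda s: size_mb / sites[s])
        hours = size_mb / sites[site] / 3600.0
        return site, hours

    if __name__ == "__main__":
        site, hours = best_replica("collision-events-2003.dat")
        print(f"Fetch from {site}: estimated {hours:.1f} hours")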

As they gear up for deploying Grid technologies to handle the massive data yield of high-energy physics experiments, Avery and his colleagues in iVDGL find themselves exposing the challenges ahead for Grids in general. "Right now we are just trying to build test beds," he says, "and that can be a very painful process. Our tools—Globus and Condor—are research tools, and as we try to scale them up to the level we are deploying at, we are greatly stressing them and forcing them to improve. We are breaking some boundaries, but we are fixing things as we go, and it's actually been going quite well lately. We've been much more successful because we are starting to get more experience and understanding of these tools."

Despite the obstacles, Avery says his team will eventually make their Grid work on a large scale because "we feel it's got to work. If we look at the scale of the data in our upcoming high-energy physics experiments, and the number of CPUs and people involved, we can't see any way except to apply Grid technologies."

Is there a parallel need outside science and academia, making the eventual adoption of Grid architecture a must for organizations dealing with databases of digitized text, still-image, video, and audio content? It's too early to say, but to the extent that particle collisions and rich media both involve data to be found, processed, and delivered, the answer may well be "yes." "Reducing cycle time directly correlates to lower costs for us and our customers," Riley says, "and Grid computing could someday serve an important role in that, especially in our PreMedia content management business."

It's likely to be some time, however, before it's practical to put the benefits of Grid computing for content to the test. "It will take the efforts of a commercial vendor to create and market an interface that can do this seemingly magical stuff at a low cost of implementation," Riley says. "If a content management system vendor can construct a Grid and write the software that exposes the Grid's power, that's when things will get seriously interesting in the content-Grid space."

SIDEBAR: Grid Goals
The goal of Grid environments is to evolve from simple resource aggregation to data sharing and finally to collaboration.

resource aggregation:
allow corporate users to treat a company's entire IT infrastructure as one computer, grabbing unused resources as they are needed.

data sharing:
allow companies to access remote data. This is of particular interest in certain life sciences projects in which companies need to share human genome data with other companies.

collaboration via grids:
allow widely dispersed organizations to work together on projects, integrating business processes, and sharing everything from engineering blueprints to software applications.