Data-Intensive Supercomputing: The case for DISC
Randal E. Bryant May 10, 2007 CMU-CS-07-128
Question: How can university researchers demonstrate the credibility of their work without having comparable computing facilities available?
Describe a new form of high-performance computing facility (Data-Intensive Super Computer) that places emphasis on data, rather than raw computation, as the core focus of the system.
The author's inspiration for DISC comes from the server infrastructures that have been developed to support search over the worldwide web.
This paper outlines the case for DISC as an important direction for large-scale computing systems.
Example applications in which large-scale data plays the central role:
• Web search without language barriers (users can type the query in any language).
• Inferring biological function from genomic sequences.
• Predicting and modeling the effects of earthquakes.
• Discovering new astronomical phenomena from telescope imagery data.
• Synthesizing realistic graphic animations.
• Understanding the spatial and temporal patterns of brain behavior based on MRI data.
2 Data-Intensive Super Computing
Conventional (current) supercomputers:
are evaluated largely on the number of arithmetic operations they can supply each second to the application programs.
Advantage: well suited to applications in which highly structured data requires large amounts of computation. Drawbacks of this emphasis:
1. It creates misguided priorities in the way these machines are designed, programmed, and operated;
2. It disregards the importance of incorporating computation-proximate, fast-access data storage, while creating machines that are very difficult to program effectively;
3. It restricts the range of computational styles the system structure can support.
The key principles of DISC:
1. Intrinsic, rather than extrinsic, data: the data is stored within and managed by the system, rather than shipped in from outside for each computation.
2. High-level programming models for expressing computations over the data.
3. Interactive access.
4. Scalable mechanisms to ensure high reliability and availability. (error detection and handling)
3 Comparison to Other Large-Scale Computer Systems
3.1 Current Supercomputers
3.2 Transaction Processing Systems
3.3 Grid Systems
4 Google: A DISC Case Study
1. The Google system actively maintains cached copies of every document it can find on the Internet.
The system constructs complex index structures, summarizing information about the documents in forms that enable rapid identification of the documents most relevant to a particular query.
When a user submits a query, the front end servers direct the query to one of the clusters, where several hundred processors work together to determine the best matching documents based on the index structures. The system then retrieves the documents from their cached locations, creates brief summaries of the documents, orders them with the most relevant documents first, and determines which sponsored links should be placed on the page.
2. The Google hardware design is based on a philosophy of using components that emphasize low cost and low power over raw speed and reliability. Google keeps the hardware as simple as possible.
They make extensive use of redundancy and software-based reliability.
These failed components are removed and replaced without turning the system off.
Google has significantly lower operating costs in terms of power consumption and human labor than do other data centers.
3. MapReduce: a programming framework that supports powerful forms of computation performed in parallel over large amounts of data.
Two functions: a map function that generates values and associated keys from each document, and a reduction function that describes how all the data matching each possible key should be combined.
MapReduce can be used to compute statistics about documents, to create the index structures used by the search engine, and to implement their PageRank algorithm for quantifying the relative importance of different web documents.
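The map/reduce split described above can be sketched in a few lines. This is a toy single-process illustration of the programming model (word counting over documents), not Google's distributed implementation:

```python
from collections import defaultdict

def map_fn(doc):
    # Map: emit (key, value) pairs from one document.
    for word in doc.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # Reduce: combine all values emitted for one key.
    return sum(values)

def mapreduce(docs):
    groups = defaultdict(list)
    for doc in docs:                       # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)      # shuffle: group values by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}  # reduce phase

counts = mapreduce(["the cat", "the dog"])
# counts["the"] == 2
```

The same skeleton computes document statistics or builds index structures simply by swapping in different map and reduce functions.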
4. BigTable: a distributed data structure that provides capabilities similar to those seen in database systems.
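As a rough mental model (this is a toy sketch, not Google's actual API), BigTable behaves like a sorted map from (row key, column, timestamp) to value, with versioned cells:

```python
import time

class MiniTable:
    """Toy sketch of a BigTable-like map:
    (row key, column, timestamp) -> value, with versioned cells."""

    def __init__(self):
        self._cells = {}  # (row, column) -> list of (timestamp, value)

    def put(self, row, column, value, ts=None):
        ts = time.time() if ts is None else ts
        self._cells.setdefault((row, column), []).append((ts, value))

    def get(self, row, column):
        # A read returns the most recent version of the cell.
        versions = self._cells.get((row, column))
        return max(versions)[1] if versions else None

t = MiniTable()
t.put("com.example/index.html", "contents", "<html>v1</html>", ts=1)
t.put("com.example/index.html", "contents", "<html>v2</html>", ts=2)
t.get("com.example/index.html", "contents")  # most recent version: v2
```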
5 Possible Usage Model
The DISC operations could include user-specified functions in the style of Google’s MapReduce programming framework. As with databases, different users will be given different authority over what operations can be performed and what modifications can be made.
6 Constructing a General-Purpose DISC System
The open-source Hadoop project implements capabilities similar to the Google File System and support for MapReduce.
Key design challenges in constructing a general-purpose DISC system:
• Hardware Design.
There are a wide range of choices;
We need to understand the tradeoffs between the different hardware configurations and how well the system performs on different applications.
Google has made a compelling case for sticking with low-end nodes for web search applications, but the Google approach requires much more complex system software to overcome the limited performance and reliability of the components, and it might not be the most cost-effective solution for a smaller operation once personnel costs are considered.
• Programming Model.
1. One important software concept for scaling parallel computing beyond 100 or so processors is to incorporate error detection and recovery into the runtime system and to isolate programmers from both transient and permanent failures as much as possible.
Work on providing fault tolerance in a manner invisible to the application programmer started in the context of grid-style computing, but only with the advent of MapReduce and in recent work by Microsoft has it become recognized as an important capability for parallel systems.
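The core idea, runtime-level recovery that is invisible to the application programmer, can be sketched with a hypothetical helper (`run_with_retries` is illustrative, not from the paper): the runtime simply re-executes a failed task, so transient faults never reach application code.

```python
def run_with_retries(task, max_attempts=3):
    """Hypothetical runtime helper: re-execute a failed task so the
    application programmer never sees transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # treat as a permanent failure after the last attempt

# A task that fails twice (simulated transient faults), then succeeds.
attempts = []
def flaky_task():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("simulated transient failure")
    return "done"

result = run_with_retries(flaky_task)  # succeeds on the third attempt
```

Real systems additionally distinguish transient from permanent failures (e.g., by rescheduling the task on a different node), but the contract is the same: the caller sees only success or a final, unrecoverable error.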
2. We want programming models that dynamically adapt to the available resources and that perform well in a more asynchronous execution environment.
e.g.: Google’s implementation of MapReduce partitions a computation into a number of map and reduce tasks that are then scheduled dynamically onto a number of “worker” processors.
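A minimal illustration of such dynamic scheduling, using a thread pool rather than a cluster: each worker pulls the next task as soon as it finishes its current one, so the load adapts to however fast each worker happens to run.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(task_fn, inputs, workers=4):
    # Tasks are handed to whichever worker becomes free next; fast
    # workers pick up the slack for slow ones, and results come back
    # in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task_fn, inputs))

squares = run_tasks(lambda x: x * x, range(8))
# squares == [0, 1, 4, 9, 16, 25, 36, 49]
```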
• Resource Management.
Problem: how to manage the computing and storage resources of a DISC system.
We want it to be available in an interactive mode and yet able to handle very large-scale computing tasks.
• Supporting Program Development.
Developing parallel programs is difficult, both in terms of correctness and in terms of performance.
As a consequence, we must provide software development tools that allow correct programs to be written easily, while also enabling more detailed monitoring, analysis, and optimization of program performance.
• System Software.
System software is required for a variety of tasks, including fault diagnosis and isolation, system resource control, and data migration and replication.
Google and its competitors provide an existence proof that DISC systems can be implemented using available technology. Some additional topics include:
• How should the processors be designed for use in cluster machines?
• How can we effectively support different scientific communities in their data management and applications?
• Can we radically reduce the energy requirements for large-scale systems?
• How do we build large-scale computing systems with an appropriate balance of performance and cost?
• How can very large systems be constructed given the realities of component failures and repair times?
• Can we support a mix of computationally intensive jobs with ones requiring interactive response?
• How do we control access to the system while enabling sharing?
• Can we deal with bad or unavailable data in a systematic way?
• Can high-performance systems be built from heterogeneous components?
7 Turning Ideas into Reality
7.1 Developing a Prototype System
Operate two types of partitions: some for application development, focusing on gaining experience with the different programming techniques, and others for systems research, studying fundamental issues in system design.
For the program development partitions:
Use available software, such as the open source code from the Hadoop project, to implement the file system and support for application programming.
For the systems research partitions:
Create our own design, studying the different layers of hardware and system software required to get high performance and reliability (e.g., high-end hardware vs. low-cost components).
7.2 Jump Starting
Begin application development by renting much of the required computing infrastructure:
1. network-accessible storage: Amazon's Simple Storage Service (S3)
2. computing cycles: Amazon's Elastic Compute Cloud (EC2) service
(The current pricing for storage is $0.15 per gigabyte per month ($1,800 per terabyte per year), with additional costs for reading or writing the data. Computing cycles cost $0.10 per CPU hour ($877 per year) on a virtual Linux machine.)
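Checking the arithmetic on those then-current rates (storage at $0.15 per GB per month, compute at $0.10 per CPU hour):

```python
storage_per_gb_month = 0.15
storage_per_tb_year = storage_per_gb_month * 1000 * 12  # 1 TB = 1,000 GB
# roughly $1,800 per terabyte per year

cpu_per_hour = 0.10
cpu_per_year = cpu_per_hour * 24 * 365  # 8,760 hours in a year
# roughly $876, close to the quoted ~$877 per CPU per year
```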
Drawbacks of renting:
1. The performance of such a configuration is much less than that of a dedicated facility.
2. There is no way to ensure that the S3 data and the EC2 processors will be in close enough proximity to provide high speed access.
3. We would lose the opportunity to design, evaluate, and refine our own system.
7.3 Scaling Up
1. We believe that DISC systems could change the face of scientific research worldwide.
2. DISC will help realize the potential created by the combination of sensors and networks to collect data, inexpensive disks to store it, and the benefits derived from analyzing it.