DISC (Data Intensive Super Computing)


Data Intensive System (DIS)

System Challenges:

Data distributed over many disks

Compute using many processors

Connected by gigabit Ethernet (or equivalent)

System Requirements:

Lots of disks

Lots of processors

Located in close proximity

System Comparison:

(i) Data

Conventional Supercomputers:

Data stored in separate repository

No support for collection or management

Brought into system for computation

Time consuming

Limits interactivity

DISC:

System collects and maintains data

Shared, active data set

Computation colocated with storage

Faster access
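
The idea of computation colocated with storage can be illustrated with a toy model. This is only a sketch under stated assumptions: the "nodes" are plain Python lists, the record keys (`sensor-0`, `sensor-1`, ...) are made up, and the helper names `node_for`, `partition`, and `local_sum` are hypothetical, not taken from any DISC system. Each node computes over its own shard, and only the small partial results are combined.

```python
import zlib

def node_for(key, num_nodes):
    """Pick a node for a key with a stable hash (crc32 is deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition(records, num_nodes):
    """Distribute (key, value) records over num_nodes in-memory shards."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes

def local_sum(shard):
    """Work done on the node that already holds the shard; no bulk data leaves it."""
    return sum(value for _, value in shard)

# Hypothetical data set: 100 records with values 0..99.
records = [("sensor-%d" % i, i) for i in range(100)]
shards = partition(records, num_nodes=4)

# Only one number per node travels to the coordinator.
partials = [local_sum(shard) for shard in shards]
total = sum(partials)
```

In this model the coordinator receives four partial sums rather than one hundred records, which is the access pattern the DISC column describes.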

(ii) Programming Models

Conventional Supercomputers:

Programs described at very low level

Specify detailed control of processing & communications

Rely on small number of software packages

Written by specialists

Limits classes of problems & solution methods

DISC:

Application programs written in terms of high-level operations on data

Runtime system controls scheduling, load balancing, …
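
A word count written in the map/reduce style is a common illustration of "high-level operations on data". The sketch below is a single-process stand-in, not the API of any real framework: `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names, and `run_job` plays the part of the runtime that a real system would use for scheduling and load balancing.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """User code: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(key, values):
    """User code: combine all values seen for one key."""
    return key, sum(values)

def shuffle(pairs):
    """Runtime's job: group intermediate values by key between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_job(documents):
    """Runtime's job: drive the phases. A real runtime would distribute this work."""
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = run_job(["data near compute", "compute near data", "data"])
```

The application author writes only `map_phase` and `reduce_phase`; everything about where and when they run is left to the runtime.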

(iii) Interaction

Conventional Supercomputers:

Main Machine: Batch Access

Priority is to conserve machine resources

User submits job with specific resource requirements

Run in batch mode when resources available

Offline Visualization

Move results to separate facility for interactive use

DISC:

Interactive Access

Priority is to conserve human resources

User action can range from simple query to complex computation

System supports many simultaneous users

Requires flexible programming and runtime environment

(iv) Reliability

Conventional Supercomputers:

“Brittle” Systems

Main recovery mechanism is to recompute from most recent checkpoint

Must bring down system for diagnosis, repair, or upgrades

DISC:

Flexible Error Detection and Recovery

Runtime system detects and diagnoses errors

Selective use of redundancy and dynamic recomputation

Replace or upgrade components while system running

Requires flexible programming model & runtime environment
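
Dynamic recomputation can be contrasted with checkpoint restart in a small simulation. This is a sketch under assumptions: tasks are independent Python callables, a failure is modeled as a `RuntimeError`, and `run_with_recompute` and `make_flaky` are hypothetical helpers. Only the task that failed is re-executed; completed tasks keep their results.

```python
def run_with_recompute(tasks, max_attempts=3):
    """Run each task, re-executing only the ones that fail.

    Returns (results, attempts), both keyed by task id.
    """
    results = {}
    attempts = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = task()
                break  # success: move on without touching other tasks
            except RuntimeError:
                continue  # detected failure: recompute just this task
        else:
            raise RuntimeError("task %s failed %d times" % (task_id, max_attempts))
    return results, attempts

def make_flaky(value, failures):
    """Build a task that fails `failures` times before returning `value`."""
    state = {"left": failures}
    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated node failure")
        return value
    return task

tasks = {"a": make_flaky(1, 0), "b": make_flaky(2, 1), "c": make_flaky(3, 0)}
results, attempts = run_with_recompute(tasks)
```

Here task `b` runs twice while `a` and `c` run once, whereas a checkpoint-based restart would redo all work done since the last checkpoint.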

Comparing with Grid Computing:

Grid: Distribute Computing and Data

(i) Computation: Distribute problem across many machines

Generally only those with easy partitioning into independent subproblems

(ii) Data: Support shared access to large-scale data set

DISC: Centralize Computing and Data

(i) Enables more demanding computational tasks

(ii) Reduces time required to get data to machines

(iii) Enables more flexible resource management

A Commercial DISC

Netezza Performance Server (NPS)

Designed for “data warehouse” applications

Heavy duty analysis of database

Data distributed over up to 500 Snippet Processing Units

Disk storage, dedicated processor, FPGA controller

User “programs” expressed in SQL

Constructing DISC

Hardware: Rent from Amazon

Elastic Compute Cloud (EC2)

Generic Linux cycles for $0.10 / hour ($877 / yr)

Simple Storage Service (S3)

Network-accessible storage for $0.15 / GB / month ($1800/TB/yr)
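
The yearly figures in parentheses follow from the unit prices by simple arithmetic. The sketch below reproduces them using the 2008 prices quoted above; the 365.25-day year, the decimal terabyte (1000 GB), and the `yearly_cost` helper with its example cluster size are assumptions made for illustration.

```python
HOURS_PER_YEAR = 24 * 365.25   # assumption: average year length
GB_PER_TB = 1000               # assumption: decimal terabyte
MONTHS_PER_YEAR = 12

ec2_per_hour = 0.10            # USD per instance-hour, as quoted above
s3_per_gb_month = 0.15         # USD per GB per month, as quoted above

# One instance running all year: about $877.
ec2_per_year = ec2_per_hour * HOURS_PER_YEAR

# One terabyte stored all year: about $1800.
s3_per_tb_year = s3_per_gb_month * GB_PER_TB * MONTHS_PER_YEAR

def yearly_cost(instances, terabytes):
    """Hypothetical helper: combined compute and storage cost per year in USD."""
    return instances * ec2_per_year + terabytes * s3_per_tb_year

# Example: a 100-instance cluster holding 10 TB.
example = yearly_cost(instances=100, terabytes=10)
```

For the example cluster the estimate comes to roughly $105,660 per year at these prices, with compute dominating storage.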

Software: utilize open source

Hadoop Project

Open source project providing file system and MapReduce

Supported and used by Yahoo

Implementing System Software

Programming Support

Abstractions for computation & data representation

E.g., Google: MapReduce & BigTable

Usage models

Runtime Support

Allocating processing and storage

Scheduling multiple users

Implementing programming model

Error Handling

Detecting errors

Dynamic recovery

Identifying failed components

Posted on 2008-04-04 23:43 by sun. Category: DISC
