
Two-phase commit(http://en.wikipedia.org/wiki/Two-phase_commit_protocol)是分布式事务最基础的协议,Three-phase commit(http://en.wikipedia.org/wiki/Three-phase_commit_protocol)主要解决Two-phase commit中协调者宕机问题。

Two-phase commit的算法实现 (from <<Distributed Systems: Principles and Paradigms>>):

协调者(Coordinator):

    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }

参与者(Participants):

    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received;  /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log;
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log;
    } else {
        write VOTE_ABORT to local log;
        send VOTE_ABORT to coordinator;
    }

另外,每个参与者维护一个线程专门处理其它参与者的DECISION_REQUEST请求,处理线程流程如下:

    while true {
        wait until any incoming DECISION_REQUEST is received;
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip;  /* participant remains blocked */
    }

从上述的协调者与参与者的流程可以看出，如果所有参与者VOTE_COMMIT后协调者宕机，这个时候每个参与者都无法单独决定全局事务的最终结果(GLOBAL_COMMIT还是GLOBAL_ABORT)，也无法从其它参与者获取，整个事务一直阻塞到协调者恢复；如果协调者出现类似磁盘损坏这种永久性错误，该事务将成为被永久遗弃的孤儿。问题的解决有如下思路：

1. 协调者持久化数据定期备份。为了防止协调者出现永久性错误,这是一种代价最小的解决方法,不容易引入bug,但是事务被阻塞的时间可能特别长,比较适合银行这种正确性高于一切的系统。

2. Three-phase Commit。这是理论上的一种方法，实现起来复杂且效率低。思路如下：假设参与者机器不可能出现超过一半同时宕机的情况，如果协调者宕机，我们需要从活着的超过一半的参与者中得出事务的全局结果。由于不可能知道已经宕机的参与者的状态，所以引入一个新的参与者状态PRECOMMIT，参与者成功执行一个事务需要经过INIT、READY、PRECOMMIT，最后到COMMIT状态；如果至少有一个参与者处于PRECOMMIT或者COMMIT，事务成功；如果至少一个参与者处于INIT或者ABORT，事务失败；如果所有的参与者都处于READY（至少一半参与者活着），事务失败，即使原先宕机的参与者恢复后处于PRECOMMIT状态，也会因为有其它参与者处于ABORT状态而回滚。PRECOMMIT状态的引入给了宕机的参与者回滚机会，所以Three-phase commit在超过一半的参与者活着的时候是不阻塞的（该决策规则的代码示意见本列表之后）。不过，Three-phase Commit只能算是理论上的探索，效率低并且没有解决网络分区问题。

3. Paxos解决协调者单点问题。Jim Gray和Lamport合作了一篇论文讲这个方法,很适合互联网公司的超大规模集群,Google的Megastore事务就是这样实现的,不过问题在于Paxos和Two-phase Commit都不简单,需要有比较靠谱(代码质量高)的小团队设计和编码才行。后续的blog将详细阐述该方法。
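
为便于理解第2点的决策规则，下面给出一段示意代码（Python，状态集合与函数名均为笔者假设的简化模型，并非论文原文）：

    # 假设的参与者状态，仅用于演示上述决策规则
    INIT, READY, PRECOMMIT, COMMIT, ABORT = "INIT", "READY", "PRECOMMIT", "COMMIT", "ABORT"

    def recover_decision(alive_states, total):
        """alive_states: 存活参与者的状态列表；total: 参与者总数。
        前提：存活参与者超过一半，否则无法得出全局结果。"""
        assert 2 * len(alive_states) > total, "存活参与者不足一半，无法恢复"
        if any(s in (PRECOMMIT, COMMIT) for s in alive_states):
            return "GLOBAL_COMMIT"   # 至少一个参与者已进入PRECOMMIT/COMMIT，事务成功
        if any(s in (INIT, ABORT) for s in alive_states):
            return "GLOBAL_ABORT"    # 至少一个参与者尚未投票或已放弃，事务失败
        return "GLOBAL_ABORT"        # 全部处于READY：保守地回滚，宕机者恢复后同样回滚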

总之,分布式事务只能是系统开发者的乌托邦式理想,Two-phase commit的介入将导致涉及多台机器的事务之间完全串行,没有代价的分布式事务是不存在的。

 

posted @ 2009-12-22 23:01 Programmers

前面我的一篇文章http://hi.baidu.com/knuthocean/blog/item/12bb9f3dea0e400abba1673c.html引用了Google App Engine工程师关于Bigtable/Megastore replication的文章。当时留下了很多疑问，比如：为什么Google Bigtable是按照column family级别而不是按行执行replication的？今天重新思考了Bigtable replication问题，有如下体会：

1. Bigtable/GFS的设计属于分层设计，和文件系统/数据库分层设计原理一致，通过系统隔离解决工程上的问题。这种分层设计带来了两个问题，一个是性能问题，另外一个就是Replication问题。由于存储节点和服务节点可能不在一台机器，理论上总是存在性能问题，这就要求我们在加载/迁移Bigtable子表(Bigtable tablet)的时候考虑本地化因素；另外，GFS有自己的replication机制保证存储的可靠性，Bigtable通过分离服务节点和存储节点获得了很大的灵活性，且Bigtable的宕机恢复时间可以做到很短。对于很多对实时性要求不是特别高的应用，Bigtable的服务节点同时只有一个，既节约了资源，又可以依靠较短的宕机恢复时间弱化单点问题。然而，Bigtable tablet服务过于灵活，导致replication做起来极其困难。比如，tablet的分裂和合并机制导致多个tablet(一个只写，其它只读)服务同一段范围的数据变得几乎不可能。

2. Google replication分为两种机制,基于客户端和基于Tablet Server。分述如下:

2-1). 基于客户端的replication。这种机制比较简单，实现如下：客户端读/写操作均为异步操作，每个写操作都尝试写两个Bigtable集群，任何一个写成功就返回用户，客户端维护一个retry list，不断重试失败的写操作。读操作发到两个集群，任何一个集群读取成功均可（本节末尾给出一个简化的代码示意）。然而，这样做有两个问题：

    a. 客户端不可靠,可能因为各种问题,包括程序问题退出,retry list丢失导致两个集群的数据不一致;

    b. 多个客户端并发操作时无法保证顺序性。集群A收到的写操作可能是"DEL item; PUT item";集群B的可能是"PUT item; DEL item"。

2-2). 基于Tablet Server的replication。这种机制实现较为复杂,目的是为了保证读服务,写操作的延时仍然可能比较长。两个集群,一个为主集群,提供读/写服务;一个为slave集群,提供只读服务,两个集群维持最终一致性。对于一般的读操作,尽量读取主集群,如果主集群不可以访问则读取slave集群;对于写操作,首先将写操作提交到主集群的Tablet Server,主集群的Tablet Server维护slave集群的元数据信息,并维护一个后台线程不断地将积攒的用户表格写操作提交到slave集群进行日志回放(group commit)。对于一般的tablet迁移,操作逻辑和Bigtable论文中的完全一致;主集群如果发生了机器宕机,则除了回放commit log外,还需要完成宕机的Tablet Server遗留的后台备份任务。之所以要按照column family级别而不是按行复制,是为了提高压缩率从而提高备份效率。如果主集群写操作日志的压缩率大于备份数据的压缩率,则可能出现备份不及时,待备份数据越来越多的问题。

假设集群A为主集群,集群B是集群A的备份,集群切换时先停止集群A的写服务,将集群A余下的备份任务备份到集群B后切换到集群B;如果集群A不可访问的时间不可预知,可以选择直接切换到集群B,这样会带来一致性问题。且由于Bigtable是按列复制的,最后写入的一些行的事务性无法保证。不过由于写操作数据还是保存在集群A的,所以用户可以知道丢了哪些数据,很多应用可以通过重新执行A集群遗留的写操作进行灾难恢复。Google的App Engine也提供了这种查询及重做丢失的写操作的工具。
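
回到2-1)所说的基于客户端的replication，其核心逻辑大致如下（Python示意，cluster对象的put/get接口为假设，仅说明双写加retry list的思路）：

    class DualWriteClient:
        """假设cluster_a/cluster_b提供put(key, value)和get(key)接口。"""
        def __init__(self, cluster_a, cluster_b):
            self.clusters = [cluster_a, cluster_b]
            self.retry_list = []              # 失败的(集群, key, value)，由后台不断重试

        def put(self, key, value):
            ok = False
            for c in self.clusters:           # 每个写操作都尝试写两个集群
                try:
                    c.put(key, value)
                    ok = True                 # 任何一个集群写成功即可返回用户
                except Exception:
                    self.retry_list.append((c, key, value))
            return ok

        def get(self, key):
            for c in self.clusters:           # 读操作任何一个集群成功均可
                try:
                    return c.get(key)
                except Exception:
                    continue
            raise IOError("both clusters unavailable")

        def retry_failed_writes(self):
            remaining = []
            for c, key, value in self.retry_list:
                try:
                    c.put(key, value)
                except Exception:
                    remaining.append((c, key, value))
            self.retry_list = remaining

可以看出，retry list只存在于客户端内存中，客户端异常退出即丢失，多个客户端的写入顺序也无法保证，正好对应上面a、b两点描述的问题。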

想法不成熟,有问题联系:knuthocean@163.com

posted @ 2009-12-18 22:05 Programmers

负载平衡策略

Dynamo的负载平衡取决于如何给每台机器分配虚拟节点号。由于集群环境的异构性,每台物理机器包含多个虚拟节点。一般有如下两种分配节点号的方法:

1. 随机分配。每台物理节点加入时根据其配置情况随机分配S个Token(节点号)。这种方法的负载平衡效果还是不错的,因为自然界的数据大致是比较随机的,虽然可能出现某段范围的数据特别多的情况(如baidu, sina等域名下的网页特别多),但是只要切分足够细,即S足够大,负载还是比较均衡的。这个方法的问题是可控性较差,新节点加入/离开系统时,集群中的原有节点都需要扫描所有的数据从而找出属于新节点的数据,Merkle Tree也需要全部更新;另外,增量归档/备份变得几乎不可能。

2. 数据范围等分+随机分配。为了解决方法1的问题,首先将数据的Hash空间等分为Q = N * S份 (N=机器个数,S=每台机器的虚拟节点数),然后每台机器随机选择S个分割点作为Token。和方法1一样,这种方法的负载也比较均衡,且每台机器都可以对属于每个范围的数据维护一个逻辑上的Merkle Tree,新节点加入/离开时只需扫描部分数据进行同步,并更新这部分数据对应的逻辑Merkle Tree,增量归档也变得简单。该方法的一个问题是对机器规模需要做出比较合适的预估,随着业务量的增长,可能需要重新对数据进行划分。
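
方法2的Token分配过程可以用如下代码示意（Python，参数与划分方式为简化假设）：

    import random

    def assign_tokens(num_machines, tokens_per_machine, hash_space=2 ** 32):
        """将Hash空间等分为Q = N * S份，再为每台机器随机挑选S个分割点作为Token。"""
        q = num_machines * tokens_per_machine
        boundaries = [i * hash_space // q for i in range(q)]   # Q个等分点
        random.shuffle(boundaries)
        tokens = {}
        for m in range(num_machines):
            start = m * tokens_per_machine
            tokens[m] = sorted(boundaries[start:start + tokens_per_machine])
        return tokens

    # 例如4台机器、每台3个虚拟节点：assign_tokens(4, 3)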

不管采用哪种方法,Dynamo的负载平衡效果还是值得担心的。

客户端缓存及前后台任务资源分配

客户端缓存机器信息可以减少一次在DHT中定位目标机器的网络交互。由于客户端数量不可控,这里缓存采用客户端pull的方式更新,Dynamo中每隔10s或者读/写操作发现缓存信息不一致时客户端更新一次缓存信息。
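
这个pull式的缓存更新逻辑大致如下（Python示意，fetch函数等均为假设）：

    import time

    class MachineInfoCache:
        """客户端缓存集群机器信息：超过10秒或读/写时发现不一致则重新拉取。"""
        def __init__(self, fetch_func, ttl=10.0):
            self.fetch = fetch_func            # 假设的拉取函数，返回最新的机器信息
            self.ttl = ttl
            self.info = None
            self.last_update = 0.0

        def get(self, force=False):
            if force or self.info is None or time.time() - self.last_update > self.ttl:
                self.info = self.fetch()
                self.last_update = time.time()
            return self.info

        def on_inconsistency(self):
            return self.get(force=True)        # 读/写操作发现缓存不一致时强制刷新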

Dynamo中同步操作、写操作重试等后台任务较多,为了不影响正常的读写服务,需要对后台任务能够使用的资源做出限制。Dynamo中维护一个资源授权系统。该系统将整个机器的资源切分成多个片,监控60s内的磁盘读写响应时间,事务超时时间及锁冲突情况,根据监控信息算出机器负载从而动态调整分配给后台任务的资源片个数。

Dynamo的优点

1. 设计简单,组合利用P2P的各种成熟技术,模块划分好,代码复用程度高。

2. 分布式逻辑与单机存储引擎逻辑基本隔离。很多公司有自己的单机存储引擎,可以借鉴Dynamo的思想加入分布式功能。

3. NWR策略可以根据应用自由调整,这个思想已经被Google借鉴到其下一代存储基础设施中。

4. 设计上天然没有单点,且基本没有对系统时钟一致性的依赖。而在Google的单Master设计中,Master是单点,需要引入复杂的分布式锁机制来解决,且Lease机制需要对机器间时钟同步做出假设。

Dynamo的缺陷

1. 负载平衡相比单Master设计较不可控;负载平衡策略一般需要预估机器规模,不能无缝地适应业务动态增长。

2. 系统的扩展性较差。由于增加机器需要给机器分配DHT算法所需的编号,操作复杂度较高,且每台机器存储了整个集群的机器信息及数据文件的Merkle Tree信息,机器最大规模只能到几千台。

3. 数据一致性问题。多个客户端的写操作有顺序问题,而在GFS中可以通过只允许Append操作得到一个比较好的一致性模型。

4. 数据存储不是有序,无法执行Mapreduce;Mapreduce是目前允许机器故障,具有强扩展性的最好的并行计算模型,且有开源的Hadoop可以直接使用,Dynamo由于数据存储依赖Hash无法直接执行Mapreduce任务。

 

posted @ 2009-12-05 15:19 Programmers

异常处理

Dynamo中把异常分为两种类型，临时性的异常和永久性异常。服务器程序运行时一般通过类似supervise的监控daemon启动，出现core dump等异常情况时自动重启。这种异常是临时性的，其它异常如硬盘报修或机器报废等由于其持续时间太长，称之为永久性的。回顾Dynamo的设计，一份数据被写到N, N+1, ... N+K-1这K台机器上，如果机器N+i (0 <= i <= K-1)宕机，原本写入该机器的数据转移到机器N+K，机器N+K定时ping机器N+i，如果在指定的时间T内N+i重新提供服务，机器N+K将启动传输任务将暂存的数据发送给机器N+i；如果超过了时间T机器N+i还是处于宕机状态，这种异常被认为是永久性的，这时需要借助Merkle Tree机制进行数据同步。这里的问题在于时间T的选择，所以Dynamo的开发人员后来干脆把所有程序检测出来的异常认为是临时性的，并提供给管理员一个utility工具，用来显式指定一台机器永久性下线。由于数据被存储了K份，一台机器下线将导致后续的K台机器出现数据不一致的情况。这是因为原本属于机器N的数据由于机器下线可能被临时写入机器N+1, ... N+K。如果机器N出现永久性异常，后续的K台机器都需要服务它的部分数据，这时它们都需要选择冗余机器中较为空闲的一台进行同步。Merkle Tree同步的原理很简单，每个非叶子节点对应多个文件，为其所有子节点值组合以后的Hash值，叶子节点对应单个数据文件，为文件内容的Hash值。这样，任何一个数据文件不匹配都将导致从该文件对应的叶子节点到根节点的所有节点值不同。每台机器维护K棵Merkle Tree，机器同步时首先传输Merkle Tree信息，并且只需要同步从根到叶子的所有节点值均不相同的文件。
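
Merkle Tree的构造与比较可以用下面的示意代码说明（Python，按两两合并建树的简化版本，与Dynamo实际实现未必一致）：

    import hashlib

    def build_merkle_tree(leaf_hashes):
        """自底向上两两合并：levels[0]为叶子层（数据文件的Hash），levels[-1]只含根。"""
        levels = [list(leaf_hashes)]
        while len(levels[-1]) > 1:
            cur, nxt = levels[-1], []
            for i in range(0, len(cur), 2):
                pair = cur[i] + (cur[i + 1] if i + 1 < len(cur) else "")
                nxt.append(hashlib.sha1(pair.encode()).hexdigest())
            levels.append(nxt)
        return levels

    def diff_leaves(local_leaves, remote_leaves):
        """根Hash不同才需要逐层下探；这里直接给出两边不一致的叶子（数据文件）下标。"""
        return [i for i, (a, b) in enumerate(zip(local_leaves, remote_leaves)) if a != b]

同步时双方先交换上层节点的Hash，只传输从根到叶子路径上Hash均不相同的那些数据文件。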

读/写流程

客户端的读/写请求首先传输到缓存的一台机器，根据预先配置的K、W和R值，对于写请求，根据DHT算法计算出数据所属的节点后直接写入后续的K个节点，等到W个节点返回成功时返回客户端，如果写请求失败将加入retry_list不断重试。如果某台机器发生了临时性异常，将数据写入后续的备用机器并在备用机器中记录临时异常的机器信息。对于读请求，根据DHT算法计算出数据所属节点后根据负载策略选择R个节点，从中读取R份数据，如果数据一致，直接返回客户端；如果数据不一致，采用vector clock的方法解决冲突。Dynamo系统默认的策略是选择最新的数据，当然用户也可以自定义冲突处理方法。每个写入系统的<key, value>对都记录一个vector clock信息，vector clock就是一系列<机器节点号, 版本号/时间戳>对，记录每台机器对该数据的最新更新版本信息。


读取时进行冲突解决，如果一台机器读到的数据的vector clock记录的所有版本信息都小于另一台机器，直接返回vector clock较新的数据；如果二者是平行版本，根据时间戳选择最新的数据或者通过用户自定义策略解决冲突。读请求除了返回数据<key, value>值以外还返回vector clock信息，后续的写操作需要带上该信息。
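
vector clock的比较逻辑大致如下（Python示意，冲突的最终处理仍取决于应用策略）：

    def compare_vector_clocks(vc_a, vc_b):
        """vc_a/vc_b: {机器节点号: 版本号}字典。
        返回"a<=b"、"b<=a"或"conflict"（平行版本，需按时间戳或用户策略解决）。"""
        nodes = set(vc_a) | set(vc_b)
        a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
        b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
        if a_le_b:
            return "a<=b"        # b包含a的全部更新，直接返回b对应的数据
        if b_le_a:
            return "b<=a"
        return "conflict"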

问题1:垃圾数据如何回收?

Dynamo的垃圾回收机制主要依赖每个节点上的存储引擎，如Berkeley DB存储引擎、merge-dump存储引擎等。其它操作，如Merkle Tree同步产生的垃圾文件回收可以和底层存储引擎配合完成。

问题2:Dynamo有没有可能丢数据?

关键在于K, W, R的设置。假设一个读敏感应用设置K=3, W=3, R=1，待处理的数据原本属于节点A, B, C，节点B出现临时性故障的过程中由节点D代替。在节点B出现故障到节点B同步完成节点D暂存的修改这段时间内，如果读请求落入节点B或者D都将出现丢数据的问题。这里需要适当处理下，对于B节点下线的情况，由于其它机器要么缓存了B节点已下线信息，要么读取时将发现B节点处于下线状态，这时只需要将请求转发其它节点即可；对于B节点上线情况，可以等到B节点完全同步以后才开始提供读服务。对于设置W<K的应用，Dynamo读取时需要解决冲突，可能丢数据。总之，Dynamo中可以保证读取的机器都是有效的(处于正常服务状态)，但W != K时不保证所有的有效机器均同步了所有更新操作。

问题3:Dynamo的写入数据有没有顺序问题?

假设要写入两条数据"add item"和"delete item",如果写入的顺序不同,将导致完全不同的结果。如果设置W=K,对于同一个客户端,由于写入所有的机器以后才返回,可以保证顺序;而多个客户端的写操作可能被不同的节点处理,不能保证顺序性。如果设置W < K,Dynamo不保证顺序性。

问题4:冲突解决后是否需要将结果值更新存储节点?

读操作解决冲突后不需要将结果值更新存储节点。产生冲突的情况一般有机器下线或者多个客户端导致的顺序问题。机器下线时retry_list中的操作将丢失，某些节点不能获取所有的更新操作。对于机器暂时性或者永久性的异常，Dynamo内部都有同步机制进行处理，但是对于retry_list中的操作丢失或者多个客户端引发的顺序问题，Dynamo内部根本无法分辨数据是否正确。唯一的冲突解决时机在读操作，Dynamo可以设计成读操作将冲突解决结果值更新存储节点，但是这样会使读操作变得复杂和不高效。所以，比较好的做法是每个写操作都带上读操作返回的多个版本数据，写操作将冲突处理的结果更新存储节点。

posted @ 2009-12-04 23:05 Programmers
分布式系统或其它论文里面经常出现下面几个名词:
乐观锁:有时称作optimistic concurrency control, 指并发控制的时候“乐观”地认为冲突的概率很小,万一发生了冲突再重试。具体表现为事务执行过程中不锁住其它事务,等到事务提交的时候看一下是否发生了冲突,如果冲突则重试或回滚,否则提交事务。
悲观锁:并发控制的时候总是很悲观,事务执行过程中锁住其它事务,事务提交时不会有冲突。
从表面上看,悲观锁比较符合计算机基础课上灌输的思维,然而,在分布式系统环境下,异常是常有的事。假设分布式系统采用悲观锁设计,如果客户端发出事务(加锁)请求后异常退出,将导致系统被永久锁住。Web应用存储系统一般采用乐观锁设计,这和Web应用的读/写比例一般超过10相关。系统设计的时候面临这样一种CAS(Compare-And-Swap)需求:如果待操作项符合某个条件则修改。我们可以采用悲观锁锁住待操作项的所有修改,再加上锁的最大持有时间限制,但这样的API设计风险很大,乐观锁可以很好地解决该问题。
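
上述CAS式的乐观锁接口可以用如下代码示意（Python，接口与版本号机制为笔者假设）：

    class OptimisticStore:
        """每个key带一个版本号，提交时校验版本，冲突由调用方重试。"""
        def __init__(self):
            self.data = {}                      # key -> (version, value)

        def get(self, key):
            return self.data.get(key, (0, None))

        def compare_and_swap(self, key, expected_version, new_value):
            version, _ = self.data.get(key, (0, None))
            if version != expected_version:
                return False                    # 提交时发现冲突，由调用方重试或放弃
            self.data[key] = (version + 1, new_value)
            return True

    def increment(store, key):
        """调用方的典型用法：读出版本号，计算新值，CAS失败则重试。"""
        while True:
            version, value = store.get(key)
            if store.compare_and_swap(key, version, (value or 0) + 1):
                return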

coarse-grained vs fine-grained：粗粒度和细粒度。J2EE中常用来指API的粒度。比如，我有一个远程对象，它有很多属性和对应的getter和setter方法，如果我们远程调用对象每个属性的getter和setter方法，就会产生很多远程方法调用，这就是fine-grained，会造成性能问题。所以我们可以用一个setdata或getdata的方法把一些属性的访问封装起来，只用一个远程方法传输一个data transfer object来对该对象进行赋值和访问，这就是coarse-grained。Google Chubby中用来表示锁的粒度。coarse-grained指的是分布式锁的持有时间可以很长并不断延长锁的持有时间，这样的好处在于对锁服务器的压力较小，难点在于锁服务端宕机恢复需要恢复锁的状态；fine-grained指的是分布式锁的持有时间一般是秒级或者毫秒级，这样的好处在于锁服务器宕机恢复不必维持原有锁的状态，但这种简单的设计方法导致服务器压力很大，不容易扩展到大集群。Google的设计一开始就把集群的线性扩展放到了一个很重要的位置，所以Google Chubby里面使用了coarse-grained的设计。客户端可以简单地在coarse-grained锁的基础上扩展出一个fine-grained的锁，具体请看Chubby论文:scholar.google.cn/scholar
posted @ 2009-12-03 14:58 Programmers
Google在SIGMOD 2008上透露了Megastore部分实现细节,详情参考大牛James Hamilton的blog:perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx
大牛的文章固然不错,不过肯定不大好懂,下面我说一下我对文章的翻译+理解:
1. Google Bigtable只支持最简单的insert, update, del, ...等函数调用API,不支持SQL形式的API,这个转换工作放到了Megastore层次上来做。SQL对于异步Bigtable调用的支持需要仔细考虑。
2. 对于索引的支持，文章中已经说得很明显了：维护一个<索引, row_key>的索引表，更新时先更新数据表再更新索引表（见本列表后的代码示意），索引项越多，更新效率越低，但是对读基本没有影响，特别适合互联网这种读/写比例一般超过10倍的应用。
3. Megastore不提供通用的分布式事务支持,分布式事务仅仅限于同一个entity group。Bigtable支持单行的事务,而Megastore支持row key前缀相同的多行事务,如一个用户的blog, post, photo,可以将它们存在到Bigtable的一张表中,row key为user_id + blog_id + post_id + photo_id,这样同一个user的数据即为一个entity group。然而,这样就导致不能支持像百付宝、支付宝等电子商务转账事务,我暂时也还不清楚支持同一个entity group内部的事务意义有多大,即有多少web应用需要这种同一个entity group下的事务支持。
4. Megastore支持事务的方式当然还是传统的Two-phase commit协议,为了解决这个协议中协调者失效导致的问题,引入Paxos协议(Google Chubby)使协调者不再成为单点。具体做起来会非常复杂,这里提供超级大牛Jim Gray和Lamport的一篇论文供大家参考:scholar.google.com/scholar   个人认为Oracle的事务内部是一个基本的Two-phase commit协议,协调者宕机时由Oracle DBA手工介入,由于其复杂性,对DBA要求很高,所以Taobao一直网罗国内顶级DBA牛人。
5. Megastore具体事务实现时会借用Bigtable 原有的机制来实现commit log, replication等功能。可能的实现为:建一张专门的Entity group root表,加载Entity group root表的Tablet Server做为协调者角色进行分布式事务控制。然而问题在于加载Entity group root表的Tablet Server是一个单点,实现多个Tablet Server服务同一个Bigtable Tablet又是一件极其困难的事情。
6. Megastore不支持复杂的Join操作,这和互联网公司应用性质相关。Join操作一般不要求强一致性,可以通过多表冗余方式实现。
7. 事务的并发控制采用乐观并发控制（optimistic concurrency control）策略。简单来说，就是事务过程中不要锁住其它事务操作，提交的时候看一下是否与其它事务冲突，如果有冲突则重试。Megastore实现时没有rollback，失败了都是retry，宕机了就回放操作日志。
8. Megastore/Bigtable的实现者认为让用户自己指定entity group, locality group是合理的(和数据存储位置相关)。这样的效果是同一个entity group的数据经常存放在一台机器上,分布式事务的性能损耗较小,这也就说明在分布式系统中,没有代价的scalable是不存在的,要想获得scalable和性能,就必须牺牲关系数据库的一些特性及用户的易用性。
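
上面第2点提到的索引维护，写入顺序大致如下（Python示意，表对象的put/scan_prefix接口均为假设，并非Bigtable/Megastore的真实API）：

    def put_with_index(data_table, index_table, row_key, column, value):
        """先更新数据表，再更新<索引, row_key>索引表；索引项越多，写入放大越大。"""
        data_table.put(row_key, column, value)
        index_table.put(value + "#" + row_key, column, row_key)   # 索引行键 = 索引值 + 分隔符 + row_key

    def lookup_by_index(index_table, data_table, value):
        """读路径基本不受影响：先扫索引表定位row_key，再回数据表取数据。"""
        for index_row_key in index_table.scan_prefix(value + "#"):
            yield data_table.get(index_row_key.split("#", 1)[1])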

上述均为个人的粗浅看法,如何避免协调者的单点等很多问题还没有想清楚,Bigtable和Megastore的replication策略看起来也有些冲突,想清楚后将续写!
posted @ 2009-12-03 14:58 Programmers
前文说到,Dynamo DHT能够定位数据所属的节点,为了处理节点失效的情况(DHT环中删除节点),需要对节点的数据进行replication。思路如下:假设数据存储K份,DHT定位到的数据所属节点为N,则数据存储在节点N, N+1, ..., N+K-1上。如果第i (0 <= i <= K-1) 台机器宕机,则往后找一台机器N+K临时替代。临时替代的机器定时ping机器N+i,等到它重启后将这些临时数据重新写入N+i。机器N+i宕机的这段时间内,所有的读写均落入到机器[N, N+i-1]和[N+i+1, N+K]中,这段时间会出现数据一致性问题,需要引入专门的冲突解决协议,在Dynamo中是通过Lamport的vector clock实现的。如果机器N+i永久失效,机器N+K需要进行同步操作。一般来说,从机器N+i宕机开始到被认定为永久失效的时间不会太长,积累的写操作也不会太多,可以采用Merkle Tree对机器的数据文件进行快速同步。
为了在可用性和效率之间权衡,Dynamo的设计中允许用户指定读/写个数R和W值。R和W分别表示每个读/写操作需要操作的副本数。只要满足R+W > K,就可以保证在存在不超过一台机器故障的时候,至少能够读到一份有效的数据。如果应用重视读效率,可以设置W = K, R = 1;如果应用需要在读/写之间权衡,一般可设置W = 2, R = 2,K = 3。
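
数据的存放位置和R、W的约束可以用两个小函数直观说明（Python示意，非Dynamo实现）：

    def preference_list(node_ids, key_hash, k):
        """返回负责某个key的K台机器：DHT定位到的节点N及其后继N+1, ..., N+K-1（环形）。"""
        sorted_ids = sorted(node_ids)
        start = next((i for i, n in enumerate(sorted_ids) if n >= key_hash), 0)
        return [sorted_ids[(start + i) % len(sorted_ids)] for i in range(k)]

    def quorum_overlaps(k, w, r):
        """只要R + W > K，任意一次读的R个副本与此前写成功的W个副本必然有交集，
        因此至少能读到一份最新写入的数据。例如K=3, W=2, R=2。"""
        return r + w > k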

问题1:Dynamo中如何解决网络分区问题?
前面已经提到,DHT协议本身是无法处理网络分区的。在Dynamo中,引入种子节点,服务器定期向种子节点轮询整个机群的机器信息,种子节点的选择符合一定的策略使得网络分区问题出现概率降至工程可以接受的水平。

问题2:如何将数据复制到多个数据中心?
每份数据都被复制到N, N+1, ..., N+K-1这K台机器中，为了保证这些机器属于不同的数据中心，需要合理地设计获取数据节点号的Hash算法。当然，Dynamo通过直接手工配置每台机器的编号解决。看起来很山寨，不过很实用，呵呵。
文章来源:http://hi.baidu.com/knuthocean/blog/item/f085d72a06d4ee27d52af170.html
posted @ 2009-12-03 13:43 Programmers
DHT全称Distributed Hash Table (en.wikipedia.org/wiki/Distributed_hash_table),在P2P系统中经常用来定位数据所属机器。这就涉及到一致哈希(consistent hashing)思想,分布式系统中经常出现机器上下线,如果采用通常的Hash方法来查找数据所属机器,机器上下线将导致整个集群的数据分布被打乱。这是因为,机器上下线将导致机器序号及Hash函数的改变,一致哈希做了简单调整:每台机器存储哈希值和它最为接近的数据。在Chord系统中,顺时针到达的第一台机器即为最近的机器。
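
一致哈希的定位过程可以用如下代码示意（Python，机器编号取IP的Hash值，细节为简化假设）：

    import bisect
    import hashlib

    def hash32(s):
        """取sha1的低32位作为Hash空间中的位置（仅为示意）。"""
        return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

    class HashRing:
        def __init__(self, machine_ips):
            self.ring = sorted((hash32(ip), ip) for ip in machine_ips)
            self.ids = [nid for nid, _ in self.ring]

        def locate(self, key):
            """顺时针找到第一台编号不小于key哈希值的机器，到末尾则回绕到第一台。"""
            idx = bisect.bisect_left(self.ids, hash32(key))
            return self.ring[idx % len(self.ring)][1]

    # 机器上下线只影响相邻一段范围的数据，而不会打乱整个集群的数据分布。
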
外部的数据可能首先传输至集群中的任意一台机器,为了找到数据所属机器,要求每台机器维护一定的集群机器信息用于定位。最直观的想法当然是每台机器分别维护它的前一台及后一台机器的信息,机器的编号可以为机器IP的Hash值,定位机器最坏情况下复杂度为O(N)。可以采用平衡化思想来优化(如同平衡二叉树优化数组/链表),使每一台机器存储O(logN)的集群机器信息,定位机器最坏情况下复杂度为O(logN)。
首先考虑每台机器维护前一台及后一台机器信息的情况，这里的问题是机器上下线导致缓存信息的不一致，我们需要设计协议使得在一段确定且比较短的时间内能够纠正这种错误。对于新机器加入，首先进行一次查找操作找到该机器的下一台机器，并记录下一台机器的信息。集群内的每台机器都定时向它的后继发送心跳信息，如果后继记录的前一台机器编号在二者之间，说明有新机器加入，这时需要更新后一台机器编号为新加入编号；收到心跳信息的后继也需要检查，如果发送心跳的机器编号较为接近则更新为前一台机器。机器下线将导致机器循环链表断开，所以，每台机器都维护了R个(一般取R值为3)最近的后继信息，发现后继节点下线时将通知其它后继节点并加入新的替换节点。如果R个后继节点同时下线，需要操作人员手工介入修复循环链。
Chord中的每台机器维护O(logN)的机器信息是一种空间换时间的做法,实现时需要引入额外的消息交换协议。这种做法依赖于如下前提:每台机器维护的前一台机器及后一台机器除了短时间不一致外总是正确的。

问题1:机器缓存短时间不一致有什么影响?数据正确性依靠什么保证?
短时间可能出现缓存的机器信息不正确的情况。比如有A, C, D, E四台机器,再加入一台机器B,机器加入的过程中,原本属于B的数据可能写入到B或者C,这都是允许的。又如删除机器C,访问机器C的节点得不到数据。数据的可用性及一致性还需要通过额外的replication及冲突处理协议解决。
问题2:DHT能否处理网络分区问题?
DHT不能处理网络分区问题，理论上存在整个DHT被分成多个子集的情况。我想，这时候需要外部的机制介入，如维护一台外部机器保存所有机器列表等。
文章来源:http://hi.baidu.com/knuthocean/blog/item/cca1e711221dcfcca6ef3f1d.html
posted @ 2009-12-03 13:43 Programmers
Amazon Dynamo是组合使用P2P各种技术的经典论文,对单机key-value存储系统扩展成分布式系统有借鉴意义,值得仔细推敲。本人准备近期深入阅读该论文,并写下读书笔记自娱自乐。当然,如果有志同道合的同学非常欢迎交流。以下是阅读计划:
1. 一切从DHT开始。Dynamo首先要解决的就是给定关键字key找出服务节点的问题。Dynamo的思想与Chord有些类似,我们可以抛开replication问题,看看Chord和Dynamo是如何通过应用DHT解决服务节点定位问题的。这里面的难点当然是节点加入和删除,尤其是多个节点并发加入/删除。建议预先阅读Chord论文:scholar.google.com/scholar
2. Dynamo的replication。理解了DHT,我们需要结合replication理解服务节点定位及错误处理等问题。
3. Dynamo错误处理。这里包括两种类型的错误,一种是暂时性的,如由于程序bug core dump后重启,另外一种是永久性的,这里用到了Merkle Tree同步技术。
4. Dynamo读/写流程设计及冲突解决。这里主要涉及到一致性模型。Dynamo允许根据应用配置R和W值来权衡效率及Availability,并使用了Lamport的Vector Clock来解决冲突。
5. Dynamo优化。主要是Load rebalance的优化。
6. Dynamo实现。如果让我们自己来实现Dynamo,我们应该如何划分模块以及实现过程中有哪些关键问题?
后续将按照计划对每个问题做阅读笔记 :)
文章来源:http://hi.baidu.com/knuthocean/blog/item/8838ad34f9ae1dbdd0a2d3d7.html
posted @ 2009-12-03 13:43 Programmers
推荐两本分布式系统方面书籍:
1. <<Distributed Systems - Principles and Paradigms>> Andrew S. Tanenbaum www.china-pub.com/40777&ref=ps Tanenbaum出品，必属精品。本书条理清晰，涉及到分布式系统的方方面面，通俗易懂并附录了分布式系统各个经典问题的论文阅读资料，是分布式系统入门的不二选择。感觉和以前看过的<<Introduction to Algorithms>>一样，读起来让人心旷神怡，建议通读。
2. <<Introduction to Distributed Algorithms>> Gerard Tel www.china-pub.com/13102&ref=ps 我们老大推荐的书籍。虽然从名字看是入门型书籍，不过内容一点都不好懂，适合有一定基础的同学。另外，千万要注意，一定要买英文原版。
文章来源:http://hi.baidu.com/knuthocean/blog/item/8838ad34fbfb1fbdd1a2d364.html
posted @ 2009-12-03 13:43 Programmers
2004
OSDI ‘04

Best Paper:

Recovering Device Drivers
Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy, University of Washington
Best Paper:

Using Model Checking to Find Serious File System Errors
Junfeng Yang, Paul Twohey, and Dawson Engler, Stanford University; Madanlal Musuvathi, Microsoft Research

LISA ‘04

Best Paper:

Scalable Centralized Bayesian Spam Mitigation with Bogofilter
Jeremy Blosser and David Josephsen, VHA, Inc.

Security ‘04

Best Paper:

Understanding Data Lifetime via Whole System Simulation
Jim Chow, Ben Pfaff, Tal Garfinkel, Kevin Christopher, and Mendel Rosenblum, Stanford University
Best Student Paper:

Fairplay—A Secure Two-Party Computation System
Dahlia Malkhi and Noam Nisan, Hebrew University; Benny Pinkas, HP Labs; Yaron Sella, Hebrew University

2004 USENIX Annual Technical Conference

Best Paper:
Handling Churn in a DHT
Sean Rhea and Dennis Geels, University of California, Berkeley; Timothy Roscoe, Intel Research, Berkeley; John Kubiatowicz, University of California, Berkeley
Best Paper:

Energy Efficient Prefetching and Caching
Athanasios E. Papathanasiou and Michael L. Scott, University of Rochester

FREENIX Track

Best Paper:
Wayback: A User-level Versioning File System for Linux
Brian Cornell, Peter A. Dinda, and Fabián E. Bustamante, Northwestern University
Best Student Paper:

Design and Implementation of Netdude, a Framework for Packet Trace Manipulation
Christian Kreibich, University of Cambridge, UK

VM ‘04

Best Paper:

Semantic Remote Attestation—A Virtual Machine Directed Approach to Trusted Computing
Vivek Haldar, Deepak Chandra, and Michael Franz, University of California, Irvine

FAST ‘04

Best Paper:

Row-Diagonal Parity for Double Disk Failure Correction
Peter Corbett, Bob English, Atul Goel, Tomislav Grcanac, Steven Kleiman, James Leong, and Sunitha Sankar, Network Appliance, Inc.
Best Student Paper:

Improving Storage System Availability with D-GRAID
Muthian Sivathanu, Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison
Best Student Paper:

A Framework for Building Unobtrusive Disk Maintenance Applications

Eno Thereska, Jiri Schindler, John Bucy, Brandon Salmon, Christopher R. Lumb, and Gregory R. Ganger, Carnegie Mellon University

NSDI ‘04

Best Paper:

Trickle: A Self-Regulating Algorithm for Code Propagation and Maintenance in Wireless Sensor Networks Philip Levis, University of California, Berkeley, and Intel Research Berkeley; Neil Patel, University of California, Berkeley; David Culler, University of California, Berkeley, and Intel Research Berkeley; Scott Shenker, University of California, Berkeley, and ICSI
Best Student Paper:

Listen and Whisper: Security Mechanisms for BGP
Lakshminarayanan Subramanian, University of California, Berkeley; Volker Roth, Fraunhofer Institute, Germany; Ion Stoica, University of California, Berkeley; Scott Shenker, University of California, Berkeley, and ICSI; Randy H. Katz, University of California, Berkeley

2003 
LISA ‘03

Award Paper:

STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support Yi-Min Wang, Chad Verbowski, John Dunagan, Yu Chen, Helen J. Wang, Chun Yuan, and Zheng Zhang, Microsoft Research
Award Paper:

Distributed Tarpitting: Impeding Spam Across Multiple Servers
Tim Hunter, Paul Terry, and Alan Judge, eircom.net

BSDCon ‘03

Best Paper:

Cryptographic Device Support for FreeBSD
Samuel J. Leffler, Errno Consulting
Best Student Paper:

Running BSD Kernels as User Processes by Partial Emulation and Rewriting of Machine Instructions Hideki Eiraku and Yasushi Shinjo, University of Tsukuba

12th USENIX Security Symposium

Best Paper:

Remote Timing Attacks Are Practical
David Brumley and Dan Boneh, Stanford University
Best Student Paper:

Establishing the Genuinity of Remote Computer Systems
Rick Kennell and Leah H. Jamieson, Purdue University

2003 USENIX Annual Technical Conference

Award Paper:
Undo for Operators: Building an Undoable E-mail Store
Aaron B. Brown and David A. Patterson, University of California, Berkeley
Award Paper:

Operating System I/O Speculation: How Two Invocations Are Faster Than One
Keir Fraser, University of Cambridge Computer Laboratory; Fay Chang, Google Inc.

FREENIX Track
Best Paper:
StarFish: Highly Available Block Storage
Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, and Elizabeth Shriver, Lucent Technologies, Bell Labs
Best Student Paper:

Flexibility in ROM: A Stackable Open Source BIOS
Adam Agnew and Adam Sulmicki, University of Maryland at College Park; Ronald Minnich, Los Alamos National Labs; William Arbaugh, University of Maryland at College Park

First International Conference on Mobile Systems, Applications, and Services

Best Paper:

Energy Aware Lossless Data Compression
Kenneth Barr and Krste Asanovic, Massachusetts Institute of Technology

2nd USENIX Conference on File and Storage Technologies

Best Paper:

Using MEMS-Based Storage in Disk Arrays
Mustafa Uysal and Arif Merchant, Hewlett-Packard Labs; Guillermo A. Alvarez, IBM Almaden Research Center
Best Student Paper:

Pond: The OceanStore Prototype
Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz, University of California, Berkeley

4th USENIX Symposium on Internet Technologies and Systems

Best Paper:

SkipNet: A Scalable Overlay Network with Practical Locality Properties
Nicholas J. A. Harvey, Microsoft Research and University of Washington; Michael B. Jones, Microsoft Research; Stefan Saroiu, University of Washington; Marvin Theimer and Alec Wolman, Microsoft Research
Best Student Paper:

Scriptroute: A Public Internet Measurement Facility
Neil Spring, David Wetherall, and Tom Anderson, University of Washington

2002 
5th Symposium on Operating Systems Design and Implementation

Best Paper:

Memory Resource Management in VMware ESX Server
Carl A. Waldspurger, VMware, Inc.
Best Student Paper:

An Analysis of Internet Content Delivery Systems

Stefan Saroiu, Krishna P. Gummadi, Richard J. Dunn, Steven D. Gribble, and Henry M. Levy, University of Washington

LISA ‘02: 16th Systems Administration Conference

Best Paper:

RTG: A Scalable SNMP Statistics Architecture for Service Providers
Robert Beverly, MIT Laboratory for Computer Science
Best Paper:

Work-Augmented Laziness with the Los Task Request System
Thomas Stepleton, Swarthmore College Computer Society

11th USENIX Security Symposium

Best Paper:

Security in Plan 9
Russ Cox, MIT LCS; Eric Grosse and Rob Pike, Bell Labs; Dave Presotto, Avaya Labs and Bell Labs; Sean Quinlan, Bell Labs
Best Student Paper:

Infranet: Circumventing Web Censorship and Surveillance
Nick Feamster, Magdalena Balazinska, Greg Harfst, Hari Balakrishnan, and David Karger, MIT

2nd Java Virtual Machine Research and Technology Symposium

Best Paper:

An Empirical Study of Method In-lining for a Java Just-in-Time Compiler
Toshio Suganuma, Toshiaki Yasue, and Toshio Nakatani, IBM Tokyo Research Laboratory
Best Student Paper:

Supporting Binary Compatibility with Static Compilation
Dachuan Yu, Zhong Shao, and Valery Trifonov, Yale University

2002 USENIX Annual Technical Conference

Best Paper:

Structure and Performance of the Direct Access File System
Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey S. Chase, Andrew J. Gallatin, Richard Kisley, and Rajiv G. Wickremesinghe, Duke University; and Eran Gabber, Lucent Technologies
Best Student Paper:

EtE: Passive End-to-End Internet Service Performance Monitoring
Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, Hewlett-Packard Laboratories

FREENIX Track

Best FREENIX Paper:

CPCMS: A Configuration Management System Based on Cryptographic Names
Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University
Best FREENIX Student Paper:

SWILL: A Simple Embedded Web Server Library
Sotiria Lampoudi and David M. Beazley, University of Chicago

BSDCon ‘02

Best Paper:

Running “fsck” in the Background Marshall Kirk McKusick, Author and Consultant
Best Paper:

Design And Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD
Kostas Magoutis, Division of Engineering and Applied Sciences, Harvard University

Conference on File and Storage Technologies

Best Paper:

VENTI - A New Approach to Archival Data Storage
Sean Quinlan and Sean Dorward, Bell Labs, Lucent Technologies
Best Student Paper:

Track-aligned Extents: Matching Access Patterns to Disk Drive Characteristics
Jiri Schindler, John Linwood Griffin, Christopher R. Lumb, Gregory R. Ganger, Carnegie Mellon University

文章来源:http://hi.baidu.com/knuthocean/blog/item/8218034f4a01523caec3ab1c.html
posted @ 2009-12-03 13:43 Programmers
前几天有同学问分布式系统方向有哪些会议。网上查了一下，顶级的会议是OSDI(Operating Systems Design and Implementation)和SOSP(Symposium on Operating Systems Principles)。其它几个会议，如NSDI、FAST、VLDB也常常有让人眼前一亮的论文。值得庆幸的是，现在云计算太火了，Mapreduce/Bigtable等工程性文章都发表在最牛的OSDI上（GFS则发表在SOSP上），并且Google Bigtable和Microsoft的DryadLINQ还获得了最佳论文奖。下面列出了每个会议的历年最佳论文，希望我们可以站在一个制高点上。
USENIX ‘09
Best Paper:
Satori: Enlightened Page Sharing
Grzegorz Miłoś, Derek G. Murray, and Steven Hand, University of Cambridge Computer Laboratory; Michael A. Fetterman, NVIDIA Corporation

Best Paper:
Tolerating File-System Mistakes with EnvyFS
Lakshmi N. Bairavasundaram, NetApp., Inc.; Swaminathan Sundararaman, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

NSDI ‘09
Best Paper:
TrInc: Small Trusted Hardware for Large Distributed Systems
Dave Levin, University of Maryland; John R. Douceur, Jacob R. Lorch, and Thomas Moscibroda, Microsoft Research

Best Paper:
Sora: High Performance Software Radio Using General Purpose Multi-core Processors
Kun Tan and Jiansong Zhang, Microsoft Research Asia; Ji Fang, Beijing Jiaotong University; He Liu, Yusheng Ye, and Shen Wang, Tsinghua University; Yongguang Zhang, Haitao Wu, and Wei Wang, Microsoft Research Asia; Geoffrey M. Voelker, University of California, San Diego

FAST ‘09
Best Paper:
CA-NFS: A Congestion-Aware Network File System
Alexandros Batsakis, NetApp and Johns Hopkins University; Randal Burns, Johns Hopkins University; Arkady Kanevsky, James Lentini, and Thomas Talpey, NetApp

Best Paper:
Generating Realistic Impressions for File-System Benchmarking
Nitin Agrawal, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison
2008

OSDI ‘08
Jay Lepreau Best Paper:
Difference Engine: Harnessing Memory Redundancy in Virtual Machines
Diwaker Gupta, University of California, San Diego; Sangmin Lee, University of Texas at Austin; Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M. Voelker, and Amin Vahdat, University of California, San Diego

Jay Lepreau Best Paper:
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
Yuan Yu, Michael Isard, Dennis Fetterly, and Mihai Budiu, Microsoft Research Silicon Valley; Úlfar Erlingsson, Reykjavík University, Iceland, and Microsoft Research Silicon Valley; Pradeep Kumar Gunda and Jon Currey, Microsoft Research Silicon Valley

Jay Lepreau Best Paper:
KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs
Cristian Cadar, Daniel Dunbar, and Dawson Engler, Stanford University

LISA ‘08
Best Paper:
ENAVis: Enterprise Network Activities Visualization
Qi Liao, Andrew Blaich, Aaron Striegel, and Douglas Thain, University of Notre Dame

Best Student Paper:
Automatic Software Fault Diagnosis by Exploiting Application Signatures
Xiaoning Ding, The Ohio State University; Hai Huang, Yaoping Ruan, and Anees Shaikh, IBM T.J. Watson Research Center; Xiaodong Zhang, The Ohio State University

USENIX Security ‘08
Best Paper:
Highly Predictive Blacklisting
Jian Zhang and Phillip Porras, SRI International; Johannes Ullrich, SANS Institute

Best Student Paper:
Lest We Remember: Cold Boot Attacks on Encryption Keys
J. Alex Halderman, Princeton University; Seth D. Schoen, Electronic Frontier Foundation; Nadia Heninger and William Clarkson, Princeton University; William Paul, Wind River Systems; Joseph A. Calandrino and Ariel J. Feldman, Princeton University; Jacob Appelbaum; Edward W. Felten, Princeton University

USENIX ‘08
Best Paper:
Decoupling Dynamic Program Analysis from Execution in Virtual Environments
Jim Chow, Tal Garfinkel, and Peter M. Chen, VMware

Best Student Paper:
Vx32: Lightweight User-level Sandboxing on the x86
Bryan Ford and Russ Cox, Massachusetts Institute of Technology

NSDI ‘08
Best Paper:
Remus: High Availability via Asynchronous Virtual Machine Replication
Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, and Norm Hutchinson, University of British Columbia; Andrew Warfield, University of British Columbia and Citrix Systems, Inc.

Best Paper:
Consensus Routing: The Internet as a Distributed System
John P. John, Ethan Katz-Bassett, Arvind Krishnamurthy, and Thomas Anderson, University of Washington; Arun Venkataramani, University of Massachusetts Amherst

LEET ‘08
Best Paper:
Designing and Implementing Malicious Hardware (PDF) or read in HTML
Samuel T. King, Joseph Tucek, Anthony Cozzie, Chris Grier, Weihang Jiang, and Yuanyuan Zhou, University of Illinois at Urbana-Champaign

FAST ‘08
Best Paper:
Portably Solving File TOCTTOU Races with Hardness Amplification
Dan Tsafrir, IBM T.J. Watson Research Center; Tomer Hertz, Microsoft Research; David Wagner, University of California, Berkeley; Dilma Da Silva, IBM T.J. Watson Research Center

Best Student Paper:
An Analysis of Data Corruption in the Storage Stack
Lakshmi N. Bairavasundaram, University of Wisconsin, Madison; Garth Goodson, Network Appliance Inc.; Bianca Schroeder, University of Toronto; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison
2007

LISA ‘07
Best Paper:
Application Buffer-Cache Management for Performance: Running the World’s Largest MRTG
David Plonka, Archit Gupta, and Dale Carder, University of Wisconsin Madison

Best Paper:
PoDIM: A Language for High-Level Configuration Management
Thomas Delaet and Wouter Joosen, Katholieke Universiteit Leuven, Belgium

16th USENIX Security Symposium
Best Paper:
Towards Automatic Discovery of Deviations in Binary Implementations with Applications to Error Detection and Fingerprint Generation
David Brumley, Juan Caballero, Zhenkai Liang, James Newsome, and Dawn Song, Carnegie Mellon University

Best Student Paper:
Keep Your Enemies Close: Distance Bounding Against Smartcard Relay Attacks
Saar Drimer and Steven J. Murdoch, Computer Laboratory, University of Cambridge

USENIX ‘07
Best Paper:
Hyperion: High Volume Stream Archival for Retrospective Querying
Peter Desnoyers and Prashant Shenoy, University of Massachusetts Amherst

Best Paper:
SafeStore: A Durable and Practical Storage System
Ramakrishna Kotla, Lorenzo Alvisi, and Mike Dahlin, The University of Texas at Austin

NSDI ‘07
Best Paper:
Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code
Charles Killian, James W. Anderson, Ranjit Jhala, and Amin Vahdat, University of California, San Diego

Best Student Paper:
Do Incentives Build Robustness in BitTorrent?
Michael Piatek, Tomas Isdal, Thomas Anderson, and Arvind Krishnamurthy, University of Washington; Arun Venkataramani, University of Massachusetts Amherst

FAST ‘07
Best Paper:
Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?
Bianca Schroeder and Garth A. Gibson, Carnegie Mellon University

Best Paper:
TFS: A Transparent File System for Contributory Storage
James Cipar, Mark D. Corner, and Emery D. Berger, University of Massachusetts Amherst

2006

LISA ‘06
Best Paper:
A Platform for RFID Security and Privacy Administration
Melanie R. Rieback, Vrije Universiteit Amsterdam; Georgi N. Gaydadjiev, Delft University of Technology; Bruno Crispo, Rutger F.H. Hofman, and Andrew S. Tanenbaum, Vrije Universiteit Amsterdam

Honorable Mention:
A Forensic Analysis of a Distributed Two-Stage Web-Based Spam Attack
Daniel V. Klein, LoneWolf Systems

OSDI ‘06
Best Paper:
Rethink the Sync
Edmund B. Nightingale, Kaushik Veeraraghavan, Peter M. Chen, and Jason Flinn, University of Michigan

Best Paper:
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Google, Inc.

15th USENIX Security Symposium
Best Paper:
Evaluating SFI for a CISC Architecture
Stephen McCamant, Massachusetts Institute of Technology; Greg Morrisett, Harvard University

Best Student Paper:
Keyboards and Covert Channels
Gaurav Shah, Andres Molina, and Matt Blaze, University of Pennsylvania

2006 USENIX Annual Technical Conference
Best Paper:
Optimizing Network Virtualization in Xen
Aravind Menon, EPFL; Alan L. Cox, Rice University; Willy Zwaenepoel, EPFL

Best Paper:
Replay Debugging for Distributed Applications
Dennis Geels, Gautam Altekar, Scott Shenker, and Ion Stoica, University of California, Berkeley

NSDI ‘06
Best Paper:
Experience with an Object Reputation System for Peer-to-Peer Filesharing
Kevin Walsh and Emin Gün Sirer, Cornell University

Best Paper:
Availability of Multi-Object Operations
Haifeng Yu, Intel Research Pittsburgh and Carnegie Mellon University; Phillip B. Gibbons, Intel Research Pittsburgh; Suman Nath, Microsoft Research
2005

FAST ‘05
Best Paper:
Ursa Minor: Versatile Cluster-based Storage
Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie, Carnegie Mellon University

Best Paper:
On Multidimensional Data and Modern Disks
Steven W. Schlosser, Intel Research Pittsburgh; Jiri Schindler, EMC Corporation; Stratos Papadomanolakis, Minglong Shao, Anastassia Ailamaki, Christos Faloutsos, and Gregory R. Ganger, Carnegie Mellon University

LISA ‘05
Best Paper:
Toward a Cost Model for System Administration
Alva L. Couch, Ning Wu, and Hengky Susanto, Tufts University

Best Student Paper:
Toward an Automated Vulnerability Comparison of Open Source IMAP Servers
Chaos Golubitsky, Carnegie Mellon University

Best Student Paper:
Reducing Downtime Due to System Maintenance and Upgrades
Shaya Potter and Jason Nieh, Columbia University

IMC 2005
Best Student Paper:
Measurement-based Characterization of a Collection of On-line Games
Chris Chambers and Wu-chang Feng, Portland State University; Sambit Sahu and Debanjan Saha, IBM Research

Security ‘05
Best Paper:
Mapping Internet Sensors with Probe Response Attacks
John Bethencourt, Jason Franklin, and Mary Vernon University of Wisconsin, Madison

Best Student Paper:
Security Analysis of a Cryptographically-Enabled RFID Device
Steve Bono, Matthew Green, and Adam Stubblefield, Johns Hopkins University; Ari Juels, RSA Laboratories; Avi Rubin, Johns Hopkins University; Michael Szydlo, RSA Laboratories

MobiSys ‘05
Best Paper:
Reincarnating PCs with Portable SoulPads
Ramón Cáceres, Casey Carter, Chandra Narayanaswami, and Mandayam Raghunath, IBM T.J. Watson Research Center

NSDI ‘05
Best Paper:
Detecting BGP Configuration Faults with Static Analysis
Nick Feamster and Hari Balakrishnan, MIT Computer Science and Artificial Intelligence Laboratory

Best Student Paper:
Botz-4-Sale: Surviving Organized DDoS Attacks That Mimic Flash Crowds
Srikanth Kandula and Dina Katabi, Massachusetts Institute of Technology; Matthias Jacob, Princeton University; Arthur Berger, Massachusetts Institute of Technology/Akamai

2005 USENIX Annual Technical Conference
General Track
Best Paper:
Debugging Operating Systems with Time-Traveling Virtual Machines
Samuel T. King, George W. Dunlap, and Peter M. Chen, University of Michigan

Best Student Paper:
Itanium—A System Implementor’s Tale
Charles Gray, University of New South Wales; Matthew Chapman and Peter Chubb, University of New South Wales and National ICT Australia; David Mosberger-Tang, Hewlett-Packard Labs; Gernot Heiser, University of New South Wales and National ICT Australia

FREENIX Track
Best Paper:
USB/IP—A Peripheral Bus Extension for Device Sharing over IP Network
Takahiro Hirofuchi, Eiji Kawai, Kazutoshi Fujikawa, and Hideki Sunahara, Nara Institute of Science and Technology

文章来源:http://hi.baidu.com/knuthocean/blog/item/7f32925830ed16d49d82040f.html
posted @ 2009-12-03 13:43 Programmers
分布式系统设计开发过程中有几个比较有意思的现象:
1. CAP原理。CAP分别表示Consistency(一致性), Availability(可访问性), Partition-tolerance(网络分区容忍性)。Consistency指强一致性,符合ACID;Availability指每一个请求都能在确定的时间内返回结果;Partition-tolerance指系统能在网络被分成多个部分,即允许任意消息丢失的情况下正常工作。CAP原理指出,CAP三者最多取其二,没有完美的结果。因此,我们设计replication策略、一致性模型、分布式事务时都应该有所折衷。
2. 一致性的不可能性原理。该原理指出在允许失败的异步系统下,进程间是不可能达成一致的。典型的问题就是分布式选举问题,实际系统如Bigtable的tablet加载问题。所以,Google Chubby/Hadoop Zookeeper实现时都需要对服务器时钟误差做一个假设。当时钟出现不一致时,工作机只能下线以防止出现不正确的结果。
3. 错误必然出现原理。只要是理论上有问题的设计/实现,运行时一定会出现,不管概率有多低。如果没有出现问题,要么是稳定运行时间不够长,要么是压力不够大。
4. 错误的必然复现原则。实践表明,分布式系统测试中发现的错误等到数据规模增大以后必然会复现。分布式系统中出现的多机多线程问题有的非常难于排查,但是,没关系,根据现象推测原因并补调试日志吧,加大数据规模,错误肯定会复现的。
5. 两倍数据规模原则。实践表明,分布式系统最大数据规模翻番时,都会发现以前从来没有出现过的问题。这个原则当然不是准确的,不过可以指导我们做开发计划。不管我们的系统多么稳定,不要高兴太早,数据量翻番一定会出现很多意想不到的情况。不信就试试吧!
文章来源:http://hi.baidu.com/knuthocean/blog/item/d291ab64301ddbfaf73654bc.html
posted @ 2009-12-03 13:43 Programmers
Hypertable和Hbase二者同源,设计也有诸多相似之处,最主要的区别当然还是编程语言的选择。Hbase选择Java主要是因为Apache和Hadoop的公共库、历史项目基本都采用该语言,并且Java项目在设计模式和文档上一般都比C++项目好,非常适合开源项目。C++的优势当然还是在性能和内存使用上。Yahoo曾经给出了一个很好的Terasort结果(perspectives.mvdirona.com/2008/07/08/HadoopWinsTeraSort.aspx),它们认为对于大多数Mapreduce任务,比如分布式排序,性能瓶颈在于IO和网络,Java和C++在性能上基本没有区别。不过,使用Java的Mapreduce在每台服务器上明显使用了更多的CPU和内存,如果用于分布式排序的服务器还需要部署其它的CPU/内存密集型应用,Java的性能劣势将显现。对于Hypertable/HBase这样的表格系统,Java的选择将带来如下问题:
1.  Hypertable/HBase是内存和CPU密集型的。Hypertable/HBase采用Log-Structured Merge Tree设计，系统可以使用的内存直接决定了系统性能。内存中的memtable和表格系统内部的缓存都大量使用内存，可使用的内存减少将导致merge-dump频率加大，直接加重底层HDFS的压力。另外，读取和dump操作中大量的归并操作也可能使CPU成为一个瓶颈，再加上对数据的压缩/解压缩，特别是Bigtable中最经常使用的BM-diff算法在压缩/解压缩过程中完全跑满一个CPU核，很难想象Java实现的HBase能够与C++实现的Hypertable在性能上抗衡。
2. Java垃圾回收。目前Java虚拟机垃圾回收时将停止服务一段时间,这对Hypertable/HBase中大量使用的Lease机制是一个很大的考验。虽然Java垃圾回收可以改进,但是企图以通用的方式完全解决内存管理问题是不现实的。内存管理没有通用做法,需要根据应用的访问模式采取选择不同的策略。
当然，Hadoop由于采用了Java设计，导致开源合作变得更加容易，三大核心系统之上开发的辅助系统，如Hadoop的监控，Pig等都相当成功。所以，我的观点依然是：对于三驾马车的核心系统，采用C++相对合理；对于辅助模块，Java是一个不错的选择。
文章来源:http://hi.baidu.com/knuthocean/blog/item/ef201038f5d866f8b311c746.html
posted @ 2009-12-03 13:43 Programmers
对于Web应用来说,RDBMS在性能和扩展性上有着天生的缺陷,而key-value存储系统通过牺牲关系数据库的事务和范式等要求来换取性能和扩展性,成为了不错的替代品。key-value存储系统设计时一般需要关注扩展性,错误恢复,可靠性等,大致可以分类如下:
1. “山寨“流派:国产的很多系统属于这种类型。这种类型的系统一般不容易扩展,错误恢复和负载平衡等都需要人工介入。由于国内的人力成本较低,这类系统通过增加运维人员的数量来回避分布式系统设计最为复杂的几个问题,具有强烈的中国特色。这种系统的好处在于设计简单,适合几台到几十台服务器的互联网应用。比如,现在很多多机mysql应用通过人工分库来实现系统扩展,即每次系统将要到达服务上限时,增加机器重新分库。又如,很多系统将更新节点设计成单点,再通过简单的冗余方式来提高系统可靠性;又如,很多系统规定单个表格最大的数据量,并通过人工指定机器服务每个表格来实现负载平衡。在这样的设计下,应用规模增加一倍,服务器和运营各项成本增加远大于一倍,不能用来提供云计算服务。然而由于其简单可依赖,这类系统非常适合小型互联网公司或者大型互联网公司的一些规模较小的产品。
2. "P2P"流派:代表作为Amazon的Dynamo。Amazon作为提供云计算服务最为成功的公司,其商业模式和技术实力都异常强大。Amazon的系统典型特点是采用P2P技术,组合使用了多种流行的技术,如DHT,Vector Clock,Merkle Tree等,并且允许配置W和R值,在可靠性和一致性上求得一个平衡。Dynamo的负载平衡需要通过简单的人工配置机器来配合,它的很多技术点可以单独被其它系统借鉴。如,国内的“山寨”系统可以借鉴Dynamo的设计提高扩展性。
3. Google流派:代表作有Google的三驾马车:GFS+Mapreduce+Bigtable。这种系统属于贵族流派,模仿者众多,知名的有以Yahoo为代表的Hadoop, 与Hadoop同源的Hypertable以及国内外众多互联网公司。Google的设计从数据中心建设,服务器选购到系统设计,数据存储方式(数据压缩)到系统部署都有一套指导原则,自成体系。如Hadoop的HDFS设计时不支持多个客户端同时并发Append操作,导致后续的HBase及Hypertable实现极其困难。模仿者虽多,成功者少,HBase和Hypertable都在响应的延时及宕机恢复上有一系列的问题,期待后续发布的版本能有较大的突破。小型互联网公司可以使用Hadoop的HDFS和Mapreduce,至于类似Hbase/Hypertable的表格系统,推荐自己做一个“山寨”版后不断优化。
4. 学院派:这种类型的系统为研究人员主导,设计一般比较复杂,实现的时候以Demo为主。这类系统代表未来可能的方向,但实现的Demo可能有各种各样的问题,如有的系统不能长期稳定运行,又如,有的系统不支持异构的机器环境。本人对这类系统知之甚少,著名的类Mapreduce系统Microsoft Dryad看起来有这种味道。

【注:"山寨“如山寨手机，山寨开心网主要表示符合中国国情，非贬义】
文章来源:http://hi.baidu.com/knuthocean/blog/item/ae38ebf8891acb05d9f9fdb9.html
posted @ 2009-12-03 13:43 Programmers
   从Google App Engine中挖出的关于Megastore/Bigtable跨数据中心replication的文章,里面有提到一点点实现,希望对理解Bigtable及其衍生品的replication机制有用。我想指出几点:
1. Bigtable的跨机房replication是保证最终一致性的,Megastore是通过Paxos 将tablet变成可以被跨机房的tablet server服务的。Bigtable的问题在于机器断电会丢数据,Megastore可以做到不丢数据,但是实现起来极其复杂。Megastore的机制对性能还有一定影响,因为Google Chubby不适合访问量过大的环境,所以,Bigtable和Megastore这两个team正在合作寻找一个平衡点。
2. Bigtable内部的replication是后台进行的,按照列级别执行复制;Megastore是按照Entity group级别进行Paxos控制。为什么Bigtable按照列级别复制?难道和locality group有关?

Migration to a Better Datastore

At Google, we've learned through experience to treat everything with healthy skepticism. We expect that servers, racks, shared GFS cells, and even entire datacenters will occasionally go down, sometimes with little or no warning. This has led us to try as hard as possible to design our products to run on multiple servers, multiple cells, and even multiple datacenters simultaneously, so that they keep running even if any one (or more) redundant underlying parts go down. We call this multihoming. It's a term that usually applies narrowly, to networking alone, but we use it much more broadly in our internal language.

Multihoming is straightforward for read-only products like web search, but it's more difficult for products that allow users to read and write data in real time, like GMail, Google Calendar, and App Engine. I've personally spent a while thinking about how multihoming applies to the App Engine datastore. I even gave a talk about it at this year's Google I/O.

While I've got you captive, I'll describe how multihoming currently works in App Engine, and how we're going to improve it with a release next week. I'll wrap things up with more detail about App Engine's maintenance schedule.

Bigtable replication and planned datacenter moves

When we launched App Engine, the datastore served each application's data out of one datacenter at a time. Data was replicated to other datacenters in the background, using Bigtable's built-in replication facility. For the most part, this was a big win. It gave us mature, robust, real time replication for all datastore data and metadata.

For example, if the datastore was serving data for some apps from datacenter A, and we needed to switch to serving their data from datacenter B, we simply flipped the datastore to read only mode, waited for Bigtable replication to flush any remaining writes from A to B, then flipped the switch back and started serving in read/write mode from B. This generally works well, but it depends on the Bigtable cells in both A and B to be healthy. Of course, we wouldn't want to move to B if it was unhealthy, but we definitely would if B was healthy but A wasn't.


Planning for trouble

Google continuously monitors the overall health of App Engine's underlying services, like GFS and Bigtable, in all of our datacenters. However, unexpected problems can crop up from time to time. When that happens, having backup options available is crucial.

You may remember the unplanned outage we had a few months ago. We published a detailed postmortem; in a nutshell, the shared GFS cell we use went down hard, which took us down as well, and it took a while to get the GFS cell back up. The GFS cell is just one example of the extent to which we use shared infrastructure at Google. It's one of our greatest strengths, in my opinion, but it has its drawbacks. One of the most noticeable drawback is loss of isolation. When a piece of shared infrastructure has problems or goes down, it affects everything that uses it.

In the example above, if the Bigtable cell in A is unhealthy, we're in trouble. Bigtable replication is fast, but it runs in the background, so it's usually at least a little behind, which is why we wait for that final flush before switching to B. If A is unhealthy, some of its data may be unavailable for extended periods of time. We can't get to it, so we can't flush it, we can't switch to B, and we're stuck in A until its Bigtable cell recovers enough to let us finish the flush. In extreme cases like this, we might not know how soon the data in A will become available. Rather than waiting indefinitely for A to recover, we'd like to have the option to cut our losses and serve out of B instead of A, even if it means a small, bounded amount of disruption to application data. Following our example, that extreme recovery scenario would go something like this:

We give up on flushing the most recent writes in A that haven't replicated to B, and switch to serving the data that is in B. Thankfully, there isn't much data in A that hasn't replicated to B, because replication is usually quite fast. It depends on the nature of the failure, but the window of unreplicated data usually only includes a small fraction of apps, and is often as small as a few thousand recent puts, deletes, and transaction commits, across all affected apps.

Naturally, when A comes back online, we can recover that unreplicated data, but if we've already started serving from B, we can't automatically copy it over from A, since there may have been conflicting writes in B to the same entities. If your app had unreplicated writes, we can at least provide you with a full dump of those writes from A, so that your data isn't lost forever. We can also provide you with tools to relatively easily apply those unreplicated writes to your current datastore serving out of B.

Unfortunately, Bigtable replication on its own isn't quite enough for us to implement the extreme recovery scenario above. We use Bigtable single-row transactions, which let us do read/modify/write operations on multiple columns in a row, to make our datastore writes transactional and consistent. Unfortunately, Bigtable replication operates at the column value level, not the row level. This means that after a Bigtable transaction in A that updates two columns, one of the new column values could be replicated to B but not the other.

If this happened, and we switched to B without flushing the other column value, the datastore would be internally inconsistent and difficult to recover to a consistent state without the data in A. In our July 2nd outage, it was partly this expectation of internal inconsistency that prevented us from switching to datacenter B when A became unhealthy.


Megastore replication saves the day!

Thankfully, there's a solution to our consistency problem: Megastore replication. Megastore is an internal library on top of Bigtable that supports declarative schemas, multi-row transactions, secondary indices, and recently, consistent replication across datacenters. The App Engine datastore uses Megastore liberally. We don't need all of its features - declarative schemas, for example - but we've been following the consistent replication feature closely during its development.

Megastore replication is similar to Bigtable replication in that it replicates data across multiple datacenters, but it replicates at the level of entire entity group transactions, not individual Bigtable column values. Furthermore, transactions on a given entity group are always replicated in order. This means that if Bigtable in datacenter A becomes unhealthy, and we must take the extreme option to switch to B before all of the data in A has flushed, B will be consistent and usable. Some writes may be stuck in A and unavailable in B, but B will always be a consistent recent snapshot of the data in A. Some scattered entity groups may be stale, ie they may not reflect the most recent updates, but we'd at least be able to start serving from B immediately, as opposed waiting for A to recover.

To Paxos or not to Paxos

Megastore replication was originally intended to replicate across multiple datacenters synchronously and atomically, using Paxos. Unfortunately, as I described in my Google I/O talk, the latency of Paxos across datacenters is simply too high for a low-level, developer facing storage system like the App Engine datastore.

Due to that, we've been working with the Megastore team on an alternative: asynchronous, background replication similar to Bigtable's. This system maintains the write latency our developers expect, since it doesn't replicate synchronously (with Paxos or otherwise), but it's still consistent and fast enough that we can switch datacenters at a moment's notice with a minimum of unreplicated data.


Onward and upward

We've had a fully functional version of asynchronous Megastore replication for a while. We've been testing it heavily, working out the kinks, and stressing it to make sure it's robust as possible. We've also been using it in our internal version of App Engine for a couple months. I'm excited to announce that we'll be migrating the public App Engine datastore to use it in a couple weeks, on September 22nd.

This migration does require some datastore downtime. First, we'll switch the datastore to read only mode for a short period, probably around 20-30 minutes, while we do our normal data replication flush, and roll forward any transactions that have been committed but not fully applied. Then, since Megastore replication uses a new transaction log format, we need to take the entire datastore down while we drop and recreate our transaction log columns in Bigtable. We expect this to only take a few minutes. After that, we'll be back up and running on Megastore replication!

As described, Megastore replication will make App Engine much more resilient to hiccoughs and outages in individual datacenters and significantly reduce the likelihood of extended outages. It also opens the door to two new options which will give developers more control over how their data is read and written. First, we're exploring allowing reads from the non-primary datastore if the primary datastore is taking too long to respond, which could decrease the likelihood of timeouts on read operations. Second, we're exploring full Paxos for write operations on an opt-in basis, guaranteeing data is always synchronously replicated across datacenters, which would increase availability at the cost of additional write latency.

Both of these features are speculative right now, but we're looking forward to allowing developers to make the decisions that fit their applications best!

Planning for scheduled maintenance

Finally, a word about our maintenance schedule. App Engine's scheduled maintenance periods usually correspond to shifts in primary application serving between datacenters. Our maintenance periods usually last for about an hour, during which application serving is continuous, but access to the Datastore and memcache may be read-only or completely unavailable.

We've recently developed better visibility into when we expect to shift datacenters. This information isn't perfect, but we've heard from many developers that they'd like more advance notice from App Engine about when these maintenance periods will occur. Therefore, we're happy to announce below the preliminary maintenance schedule for the rest of 2009.

  • Tuesday, September 22nd, 5:00 PM Pacific Time (migration to Megastore)
  • Tuesday, November 3rd, 5:00 PM Pacific Time
  • Tuesday, December 1st, 5:00 PM Pacific Time

We don't expect this information to change, but if it does, we'll notify you (via the App Engine Downtime Notify Google Group) as soon as possible. The App Engine team members are personally dedicated to keeping your applications serving without interruption, and we realize that weekday maintenance periods aren't ideal for many. However, we've selected the day of the week and time of day for maintenance to balance disruption to App Engine developers with availability of the full engineering teams of the services App Engine relies upon, like GFS and Bigtable. In the coming months, we expect features like Megastore replication to help reduce the length of our maintenance periods.

文章来源:http://hi.baidu.com/knuthocean/blog/item/12bb9f3dea0e400abba1673c.html
posted @ 2009-12-03 13:43 Programmers