BlogJava - Big Data Road - Article category: Storm

Clojure DSL

徐红星 — Thu, 19 Jan 2012 07:38:00 GMT

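Storm comes with a Clojure DSL for defining spouts, bolts, and topologies. The Clojure DSL has access to everything the Java API exposes, so if you are a Clojure user you can write Storm topologies directly in Clojure without touching Java at all. The Clojure DSL is defined in the backtype.storm.clojure namespace.

This page outlines all the pieces of the Clojure DSL, including:

1. Defining topologies
2. Defining bolts
3. Defining spouts
4. Running topologies in local mode or on a cluster
5. Testing topologies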

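Defining topologies

To define a topology, use the topology function. topology takes two arguments: a map of "spout specs" and a map of "bolt specs". Each spout and bolt spec wires the code for the component into the topology by specifying things like inputs and parallelism.

Let's take a look at an example topology definition from the storm-starter project: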
(topology
{"1" (spout-spec sentence-spout)
"2" (spout-spec (sentence-spout-parameterized
                   ["the cat jumped over the door"
                    "greetings from a faraway land"])
                   :p 2)}
{"3" (bolt-spec {"1" :shuffle "2" :shuffle}
                 split-sentence
                 :p 5)
"4" (bolt-spec {"3" ["word"]}
                 word-count
                 :p 6)})
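The maps of spout and bolt specs are maps from the component id to the corresponding spec. The component ids must be unique across both maps. Just like when defining topologies in Java, component ids are used when declaring the inputs for bolts in the topology.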

spout-spec

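spout-spec takes as arguments the spout implementation and optional keyword arguments. The only option that currently exists is :p, which specifies the parallelism for the spout. If you omit :p, the spout will execute as a single task.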

bolt-spec

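bolt-spec takes as arguments the input declaration for the bolt, the bolt implementation, and optional keyword arguments. The input declaration is a map from stream ids to stream groupings. A stream id can have one of two forms:

[component-id stream-id]: subscribes to a specific stream on a component
component-id: subscribes to the default stream on a component

A stream grouping can be one of the following:

:shuffle: subscribes with a shuffle grouping
Vector of field names, like ["id" "name"]: subscribes with a fields grouping on the specified fields
:global: subscribes with a global grouping
:all: subscribes with an all grouping
:direct: subscribes with a direct grouping

See Concepts for more info on stream groupings. Here's an example input declaration showcasing the various ways to declare inputs: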

{["2" "1"] :shuffle  "3" ["field1" "field2"]  ["4" "2"] :global} 
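This input declaration subscribes to a total of three streams. It subscribes to stream "1" on component "2" with a shuffle grouping, subscribes to the default stream on component "3" with a fields grouping on the fields "field1" and "field2", and subscribes to stream "2" on component "4" with a global grouping.

Like spout-spec, the only keyword argument currently supported by bolt-spec is :p, which specifies the parallelism for the bolt.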
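
To make the grouping semantics a little more concrete, here is a minimal plain-Java sketch of how a fields grouping and a global grouping might choose a target task. It does not use any Storm classes, and the hash-modulo scheme and the choice of task 0 for the global grouping are assumptions made for illustration, not Storm's actual implementation.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a simplified model of how groupings pick a target task.
// This is not Storm's code; the hashing scheme is an assumption.
final class GroupingSketch {
    // Fields grouping: equal values in the grouping fields always map to the same task.
    static int fieldsGrouping(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // Global grouping: every tuple goes to one single task (modeled here as task 0).
    static int globalGrouping() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 6;
        int first = fieldsGrouping(Arrays.<Object>asList("storm"), numTasks);
        int second = fieldsGrouping(Arrays.<Object>asList("storm"), numTasks);
        System.out.println("same word, same task: " + (first == second));
        System.out.println("global target task: " + globalGrouping());
    }
}
```

The property that matters for a bolt such as word-count is the first one: because every occurrence of a word reaches the same task, that task can keep a complete count for the word.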

shell-bolt-spec

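shell-bolt-spec is used for defining bolts that are implemented in a non-JVM language. It takes as arguments the input declaration, the command-line program to run, the name of the file implementing the bolt, an output specification, and then the same keyword arguments that bolt-spec accepts.

Here's an example of a shell-bolt-spec: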

(shell-bolt-spec {"1" :shuffle "2" ["id"]}  "python"  "mybolt.py"  ["outfield1" "outfield2"]  :p 25)

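The syntax of output declarations is described in more detail in the defbolt section below. See Using non-JVM languages with Storm for more details on how multilang works within Storm.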

defbolt

defbolt is used for defining bolts in Clojure. Bolts have the constraint that they must be serializable, and this is why you can't just reify IRichBolt to implement a bolt (closures aren't serializable). defbolt works around this restriction and provides a nicer syntax for defining bolts than just implementing a Java interface.

At its fullest expressiveness, defbolt supports parameterized bolts and maintaining state in a closure around the bolt implementation. It also provides shortcuts for defining bolts that don't need this extra functionality. The signature for defbolt looks like the following:

(defbolt name output-declaration *option-map & impl)

Omitting the option map is equivalent to having an option map of {:prepare false}.

Simple bolts

Let's start with the simplest form of defbolt. Here's an example bolt that splits a tuple containing a sentence into a tuple for each word:

(defbolt split-sentence ["word"] [tuple collector]
  (let [words (.split (.getString tuple 0) " ")]
    (doseq [w words]
      (emit-bolt! collector [w] :anchor tuple))
    (ack! collector tuple)))

Since the option map is omitted, this is a non-prepared bolt. The DSL simply expects an implementation for the execute method of IRichBolt. The implementation takes two parameters, the tuple and the OutputCollector, and is followed by the body of the execute function. The DSL automatically type-hints the parameters for you so you don't need to worry about reflection if you use Java interop.

This implementation binds split-sentence to an actual IRichBolt object that you can use in topologies, like so:

(bolt-spec {"1" :shuffle}  split-sentence  :p 5)

Parameterized bolts

Many times you want to parameterize your bolts with other arguments. For example, let's say you wanted to have a bolt that appends a suffix to every input string it receives, and you want that suffix to be set at runtime. You do this with defbolt by including a :params option in the option map, like so:

(defbolt suffix-appender ["word"] {:params [suffix]}
  [tuple collector]
  (emit-bolt! collector [(str (.getString tuple 0) suffix)] :anchor tuple))

Unlike the previous example, suffix-appender will be bound to a function that returns an IRichBolt rather than be an IRichBolt object directly. This is caused by specifying :params in its option map. So to use suffix-appender in a topology, you would do something like:

(bolt-spec {"1" :shuffle}  (suffix-appender "-suffix")  :p 10)

Prepared bolts

To do more complex bolts, such as ones that do joins and streaming aggregations, the bolt needs to store state. You can do this by creating a prepared bolt which is specified by including {:prepare true} in the option map. Consider, for example, this bolt that implements word counting:

(defbolt word-count ["word" "count"] {:prepare true}
  [conf context collector]
  (let [counts (atom {})]
    (bolt
      (execute [tuple]
        (let [word (.getString tuple 0)]
          (swap! counts (partial merge-with +) {word 1})
          (emit-bolt! collector [word (@counts word)] :anchor tuple)
          (ack! collector tuple))))))

The implementation for a prepared bolt is a function that takes as input the topology config, TopologyContext, and OutputCollector, and returns an implementation of the IBolt interface. This design allows you to have a closure around the implementation of execute and cleanup.

In this example, the word counts are stored in the closure in a map called counts. The bolt macro is used to create the IBolt implementation. The bolt macro is a more concise way to implement the interface than reifying, and it automatically type-hints all of the method parameters. This bolt implements the execute method which updates the count in the map and emits the new word count.

Note that the execute method in prepared bolts only takes as input the tuple since the OutputCollector is already in the closure of the function (for simple bolts the collector is a second parameter to the execute function).

Prepared bolts can be parameterized just like simple bolts.
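
For readers more at home in Java, the state handling of the word-count bolt can be pictured with the plain-Java analogue below. This is only a sketch: it uses no Storm classes, the class name WordCountState is made up, and the instance field stands in for the counts atom captured by the closure.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java analogue of the word-count bolt's state handling (no Storm classes involved).
final class WordCountState {
    // Plays the role of the `counts` atom captured in the closure.
    private final Map<String, Integer> counts = new HashMap<>();

    // Mirrors the body of execute: bump the count and return the value that would be emitted.
    int execute(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        WordCountState state = new WordCountState();
        for (String w : "the cat and the dog".split(" ")) {
            System.out.println(w + " " + state.execute(w));
        }
    }
}
```

Each instance keeps its own map, which corresponds to the state created inside the prepared bolt's implementation function.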

Output declarations

The Clojure DSL has a concise syntax for declaring the outputs of a bolt. The most general way to declare the outputs is as a map from stream id to a stream spec. For example:

{"1" ["field1" "field2"]  "2" (direct-stream ["f1" "f2" "f3"])  "3" ["f1"]}

The stream id is a string, while the stream spec is either a vector of fields or a vector of fields wrapped by direct-stream. direct-stream marks the stream as a direct stream (see Concepts and Direct groupings for more details on direct streams).

If the bolt only has one output stream, you can define the default stream of the bolt by using a vector instead of a map for the output declaration. For example:

["word" "count"]

This declares the output of the bolt as the fields ["word" "count"] on the default stream id.

Emitting, acking, and failing

Rather than use the Java methods on OutputCollector directly, the DSL provides a nicer set of functions for using OutputCollector: emit-bolt!, emit-direct-bolt!, ack!, and fail!.

emit-bolt!: takes as parameters the OutputCollector, the values to emit (a Clojure sequence), and keyword arguments for :anchor and :stream. :anchor can be a single tuple or a list of tuples, and :stream is the id of the stream to emit to. Omitting the keyword arguments emits an unanchored tuple to the default stream.
emit-direct-bolt!: takes as parameters the OutputCollector, the task id to send the tuple to, the values to emit, and keyword arguments for :anchor and :stream. This function can only emit to streams declared as direct streams.
ack!: takes as parameters the OutputCollector and the tuple to ack.
fail!: takes as parameters the OutputCollector and the tuple to fail.

See Guaranteeing message processing for more info on acking and anchoring.

defspout

defspout is used for defining spouts in Clojure. Like bolts, spouts must be serializable so you can't just reify IRichSpout to do spout implementations in Clojure. defspout works around this restriction and provides a nicer syntax for defining spouts than just implementing a Java interface.

The signature for defspout looks like the following:

(defspout name output-declaration *option-map & impl)

If you leave out the option map, it defaults to {:prepare true}. The output declaration for defspout has the same syntax as defbolt.

Here's an example defspout implementation from storm-starter:

(defspout sentence-spout ["sentence"]
  [conf context collector]
  (let [sentences ["a little brown dog"
                   "the man petted the dog"
                   "four score and seven years ago"
                   "an apple a day keeps the doctor away"]]
    (spout
      (nextTuple []
        (Thread/sleep 100)
        (emit-spout! collector [(rand-nth sentences)]))
      (ack [id]
        ;; You only need to define this method for reliable spouts
        ;; (such as one that reads off of a queue like Kestrel)
        ;; This is an unreliable spout, so it does nothing here
        ))))

The implementation takes in as input the topology config, TopologyContext, and SpoutOutputCollector. The implementation returns an ISpout object. Here, the nextTuple function emits a random sentence from sentences.

This spout isn't reliable, so the ack and fail methods will never be called. A reliable spout will add a message id when emitting tuples, and then ack or fail will be called when the tuple is completed or failed respectively. See Guaranteeing message processing for more info on how reliability works within Storm.

emit-spout! takes in as parameters the SpoutOutputCollector and the new tuple to be emitted, and accepts as keyword arguments :stream and :id. :stream specifies the stream to emit to, and :id specifies a message id for the tuple (used in the ack and fail callbacks). Omitting these arguments emits an unanchored tuple to the default output stream.

There is also an emit-direct-spout! function that emits a tuple to a direct stream. It takes one additional argument, as its second parameter: the id of the task to send the tuple to.

Spouts can be parameterized just like bolts, in which case the symbol is bound to a function returning IRichSpout instead of the IRichSpout itself. You can also declare an unprepared spout which only defines the nextTuple method. Here is an example of an unprepared spout that emits random sentences parameterized at runtime:

(defspout sentence-spout-parameterized ["word"] {:params [sentences] :prepare false}
  [collector]
  (Thread/sleep 500)
  (emit-spout! collector [(rand-nth sentences)]))

The following example illustrates how to use this spout in a spout-spec:

(spout-spec (sentence-spout-parameterized
              ["the cat jumped over the door"
               "greetings from a faraway land"])
            :p 2)

Running topologies in local mode or on a cluster

That's all there is to the Clojure DSL. To submit topologies in remote mode or local mode, just use the StormSubmitter or LocalCluster classes just like you would from Java.

To create topology configs, it's easiest to use the backtype.storm.config namespace which defines constants for all of the possible configs. The constants are the same as the static constants in the Config class, except with dashes instead of underscores. For example, here's a topology config that sets the number of workers to 15 and configures the topology in debug mode:

{TOPOLOGY-DEBUG true  TOPOLOGY-WORKERS 15}
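
As a rough illustration of the naming rule and of what such a config amounts to, the Java sketch below converts a Java-style constant name to its Clojure-style counterpart and builds the same config as a plain map. The class and method names are hypothetical, and the string keys "topology.debug" and "topology.workers" are assumed to be the values behind Config.TOPOLOGY_DEBUG and Config.TOPOLOGY_WORKERS.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: shows the naming rule and an equivalent plain map.
final class TopologyConfigSketch {
    // Java-style constant name to the Clojure-style name (underscores become dashes).
    static String clojureName(String javaConstantName) {
        return javaConstantName.replace('_', '-');
    }

    // The example config as a plain map. The string keys are assumed to be the
    // values behind Config.TOPOLOGY_DEBUG and Config.TOPOLOGY_WORKERS.
    static Map<String, Object> exampleConfig() {
        Map<String, Object> conf = new LinkedHashMap<>();
        conf.put("topology.debug", true);
        conf.put("topology.workers", 15);
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(clojureName("TOPOLOGY_WORKERS"));
        System.out.println(exampleConfig());
    }
}
```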

Testing topologies

This blog post and its follow-up give a good overview of Storm's powerful built-in facilities for testing topologies in Clojure.


Storm Serialization

徐红星 — Thu, 19 Jan 2012 06:29:00 GMT

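This article is translated from the official Storm wiki. Reposting is welcome; please credit the source:

http://www.blogjava.net/xuhongxing016/articles/368731.html

This is my first translation. Readers comfortable with English can consult the original document: https://github.com/nathanmarz/storm/wiki/Serialization

This article describes the serialization system of Storm 0.6.0 and later. Earlier versions of Storm used a different serialization system.

Tuples can contain objects of any type. Since Storm is a distributed system, it needs to know how to serialize and deserialize objects when they are passed between tasks.

Storm uses Kryo for serialization. Kryo is a flexible and fast serialization library that produces small serializations. By default, Storm can serialize primitive types, strings, byte arrays, ArrayList, HashMap, HashSet, and the Clojure collection types. If you want to use another type in your tuples, you need to register a custom serializer.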

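Dynamic typing

There are no type declarations for the fields in a tuple. You put objects in fields and Storm figures out the serialization dynamically. Before we get to the interface for serialization, let's spend a moment understanding why Storm's tuples are dynamically typed.

Adding static typing to tuple fields would add a large amount of complexity to Storm's API. Hadoop, for example, statically types its keys and values but requires a huge amount of annotations on the part of the user. Hadoop's API is a burden to use, and the "type safety" isn't worth it. Dynamic typing is simply easier to use.

Further than that, it is hard to statically type Storm's tuples in any reasonable way. Suppose a bolt subscribes to multiple streams. The tuples from those streams may have different types across their fields. When the bolt receives a tuple in execute, that tuple could have come from any stream and so could have any combination of types. There might be some reflection magic that lets you declare a different method for every tuple stream a bolt subscribes to, but the simple, straightforward approach is dynamic typing.

Finally, another reason for using dynamic typing is so that Storm can be used in a simple, straightforward manner from dynamically typed languages like Clojure and JRuby.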
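Custom serialization

As mentioned, Storm uses Kryo for serialization. To implement custom serializers, you need to register new serializers with Kryo. It is highly recommended that you read over Kryo's home page to understand how it handles custom serialization.

Adding custom serializers is done through the "topology.kryo.register" property in your topology config. It takes a list of registrations, where each registration can take one of two forms:

1. The name of a class to register. In this case, Storm will use Kryo's FieldsSerializer to serialize the class. This may or may not be optimal for the class; see the Kryo docs for more details.
2. A map from the name of a class to register to an implementation of com.esotericsoftware.kryo.Serializer.

Let's look at an example.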

topology.kryo.register:  
 - com.mycompany.CustomType1  
 - com.mycompany.CustomType2: com.mycompany.serializer.CustomType2Serializer 
 - com.mycompany.CustomType3 
com.mycompany.CustomType1 and com.mycompany.CustomType3 will use the FieldsSerializer, whereas com.mycompany.CustomType2 will use com.mycompany.serializer.CustomType2Serializer for serialization.
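
To show how the two registration forms differ in shape, here is a plain-Java sketch of the same registration list built by hand as nested collections. It uses no Storm classes, and the class name KryoRegistrationSketch is made up; it is only meant to picture the data structure that the YAML above describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: the registration list from the YAML example as plain Java collections.
final class KryoRegistrationSketch {
    static Map<String, Object> buildConf() {
        List<Object> registrations = new ArrayList<>();
        // Form 1: just a class name.
        registrations.add("com.mycompany.CustomType1");
        // Form 2: a single-entry map from class name to serializer class name.
        registrations.add(Collections.singletonMap(
                "com.mycompany.CustomType2",
                "com.mycompany.serializer.CustomType2Serializer"));
        registrations.add("com.mycompany.CustomType3");

        Map<String, Object> conf = new HashMap<>();
        conf.put("topology.kryo.register", registrations);
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(buildConf());
    }
}
```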
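Storm provides helpers for registering serializers. The Config class has a method called registerSerialization that adds a registration to the config.

There is an advanced config called Config.TOPOLOGY_SKIP_MISSING_KRYO_REGISTRATIONS. If you set it to true, Storm will ignore any serializations that are registered but whose code is not available on the classpath. Otherwise, Storm throws an error when it cannot find a serialization. This is useful if you run many topologies on a cluster that each have different serializations, but you want to declare all the serializations across all topologies in the storm.yaml files.

Java serialization

If Storm encounters a type for which it has no serialization registered, it will use Java serialization if possible. If the object cannot be serialized with Java serialization, Storm throws an error.

Java serialization is extremely expensive, both in CPU cost and in the size of the serialized object. It is highly recommended that you register custom serializers before you put a topology into production. The Java serialization fallback exists so that it is easy to prototype new topologies.

You can turn off the fallback to Java serialization by setting the Config.TOPOLOGY_FALL_BACK_ON_JAVA_SERIALIZATION config to false.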
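
The size overhead of Java serialization is easy to observe outside of Storm. The sketch below serializes a small hypothetical Point class once with the built-in ObjectOutputStream and once with a hand-written encoding that writes only the two ints. It is a standard-library illustration of the cost, not a Kryo benchmark, and the Point type is made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationCost {
    // Hypothetical tuple field type used only for this comparison.
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Size of the object under built-in Java serialization (includes class metadata).
    static int javaSerializedSize(Point p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Size under a hand-written encoding that writes only the two ints.
    static int handWrittenSize(Point p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(p.x);
            out.writeInt(p.y);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Point p = new Point(3, 4);
        System.out.println("Java serialization: " + javaSerializedSize(p) + " bytes");
        System.out.println("hand-written:       " + handWrittenSize(p) + " bytes");
    }
}
```

The hand-written form needs 8 bytes for the two ints, while the Java-serialized form also carries a stream header and a class descriptor, which is the kind of overhead a registered serializer avoids.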


Twitter Storm Distributed RPC

徐红星 — Thu, 19 Jan 2012 06:17:00 GMT

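This article is reposted from

http://chenlx.blog.51cto.com/4096635/748348
I had originally planned to translate it myself, but a Google search showed that a translation was already available online, so I am reposting it.

Distributed RPC

The idea behind distributed RPC (DRPC) is to use Storm to parallelize the computation of really intense functions on the fly. The Storm topology takes as input a stream of function arguments, and it emits an output stream of the results for each of those function calls.

DRPC is not so much a feature of Storm as it is a pattern expressed from Storm's primitives of streams, spouts, bolts, and topologies. DRPC could have been packaged as a separate library, but it is so useful that it is bundled with Storm.

High level overview

Distributed RPC is coordinated by a "DRPC server". The DRPC server coordinates receiving an RPC request, sending the request to the Storm topology, receiving the results from the Storm topology, and sending the results back to the waiting client. From a client's perspective, a distributed RPC call looks just like a regular RPC call. For example, here's how a client would compute the results for the "reach" function with the argument "http://twitter.com":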

DRPCClient client = new DRPCClient("drpc-host", 3772); 
String result = client.execute("reach", "http://twitter.com");

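The distributed RPC workflow looks like this:

A client sends the DRPC server the name of the function to execute and the arguments to that function. The topology implementing that function uses a DRPCSpout to receive a function invocation stream from the DRPC server. Each function invocation is tagged with a unique id by the DRPC server. The topology then computes the result, and at the end of the topology a bolt called "ReturnResults" connects to the DRPC server and gives it the result for the function invocation id. The DRPC server then uses the id to find the waiting client, unblocks that client, and sends it the result.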
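
The bookkeeping in this workflow (tag a call with an id, park the caller, complete it when a result for that id comes back) can be sketched in a few lines of plain Java. The class below is an in-process model only: DrpcFlowSketch and its methods are hypothetical names, the "topology" is just a callback, and nothing here reflects how Storm's DRPC server is actually implemented.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

// In-process model of the request/response bookkeeping described above.
// Not Storm's DRPC server: class and method names are hypothetical.
final class DrpcFlowSketch {
    private final AtomicLong nextId = new AtomicLong();
    private final Map<String, CompletableFuture<String>> waiting = new ConcurrentHashMap<>();

    // Client side: tag the call with a unique id, hand it to the "topology", block for the result.
    String execute(String arg, BiConsumer<String, String> topology) {
        String id = Long.toString(nextId.incrementAndGet());
        CompletableFuture<String> pending = new CompletableFuture<>();
        waiting.put(id, pending);
        topology.accept(id, arg);   // stands in for the spout emitting the invocation
        return pending.join();      // returns once result(...) has been called for this id
    }

    // Topology side: what a ReturnResults-style bolt would do at the end.
    void result(String id, String value) {
        CompletableFuture<String> pending = waiting.remove(id);
        if (pending != null) {
            pending.complete(value);
        }
    }

    public static void main(String[] args) {
        DrpcFlowSketch server = new DrpcFlowSketch();
        // The "topology" computes on another thread and reports back by request id.
        String out = server.execute("hello",
                (id, arg) -> new Thread(() -> server.result(id, arg + "!")).start());
        System.out.println(out);
    }
}
```

Keying the pending calls by id is what lets many clients wait on the same server at once while each still receives only its own result.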

LinearDRPCTopologyBuilder

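Storm comes with a topology builder called LinearDRPCTopologyBuilder that automates almost all the steps involved in doing DRPC. These include:

1. Setting up the spout

2. Returning the results to the DRPC server

3. Providing functionality to bolts for doing finite aggregations over groups of tuples

Let's look at a simple example. Here's the implementation of a DRPC topology that returns its input argument with a "!" appended: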

public static class ExclaimBolt implements IBasicBolt { 
    public void prepare(Map conf, TopologyContext context) { 
    } 
 
    public void execute(Tuple tuple, BasicOutputCollector collector) { 
        String input = tuple.getString(1); 
        collector.emit(new Values(tuple.getValue(0), input + "!")); 
    } 
 
    public void cleanup() { 
    } 
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) { 
        declarer.declare(new Fields("id", "result")); 
    } 
 
} 
 
public static void main(String[] args) throws Exception { 
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation"); 
    builder.addBolt(new ExclaimBolt(), 3); 
    // ... 
}

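As you can see, there is very little code. When creating the LinearDRPCTopologyBuilder, you tell it the name of the DRPC function for the topology. A single DRPC server can coordinate many functions, and the function name distinguishes them from one another. The first bolt you declare takes as input 2-tuples, where the first field is the request id and the second field is the argument for that request. LinearDRPCTopologyBuilder expects the last bolt to emit an output stream containing 2-tuples of the form [id, result]. Finally, all intermediate tuples produced in the topology must contain the request id as their first field.

In this example, ExclaimBolt simply appends a "!" to the second field of the tuple. LinearDRPCTopologyBuilder handles the rest of the coordination, including connecting to the DRPC server and sending the final results back.

Local mode DRPC

DRPC can be run in local mode. Here's how to run the above example in local mode: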

LocalDRPC drpc = new LocalDRPC(); 
LocalCluster cluster = new LocalCluster(); 
 
cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc)); 
 
System.out.println("Results for 'hello':" + drpc.execute("exclamation", "hello")); 
 
cluster.shutdown(); 
drpc.shutdown();

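First you create a LocalDRPC object. This object simulates a DRPC server in process, just as LocalCluster simulates a Storm cluster in process. Then you create the LocalCluster to run the topology in local mode. LinearDRPCTopologyBuilder has separate methods for creating local topologies and remote topologies. In local mode the LocalDRPC object does not bind to any ports, so the topology needs to know which object to communicate with. This is why createLocalTopology takes the LocalDRPC object as input.

After launching the topology, you can make DRPC invocations using the execute method on LocalDRPC.

Remote mode DRPC

Using DRPC on an actual cluster is also straightforward. There are three steps:

1. Launch the DRPC server(s)

2. Configure the locations of the DRPC servers

3. Submit DRPC topologies to the Storm cluster

Launching a DRPC server is done with the storm script, just like launching Nimbus or the UI: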

bin/storm drpc

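Next, configure your Storm cluster so that it knows the locations of the DRPC servers. This is how DRPCSpout knows where to read function invocations from. The locations can be set through the storm.yaml file or through the topology configuration. Configuring them through storm.yaml looks like this: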

drpc.servers: 
  - "drpc1.foo.com" 
  - "drpc2.foo.com"

Finally, you launch DRPC topologies using StormSubmitter, just like any other topology. To run the above example in remote mode, the code looks like this:

StormSubmitter.submitTopology("exclamation-drpc", conf, builder.createRemoteTopology());

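createRemoteTopology is used to create topologies suitable for Storm clusters.

A more complex example

The exclamation DRPC example was a toy for illustrating the concepts of DRPC. Let's look at a more complex example: a DRPC function that really needs the parallelism a Storm cluster provides. The example we'll look at is computing the reach of a URL on Twitter.

The reach of a URL is the number of unique people exposed to that URL on Twitter. To compute it, you need to:

1. Get all the people who tweeted the URL

2. Get all the followers of all those people

3. Unique the set of followers

4. Count the unique set of followers

A single reach computation can involve thousands of database calls and tens of millions of follower records. It is a really intense computation. As you are about to see, implementing this function on top of Storm is very simple. On a single machine, reach can take minutes to compute; on a Storm cluster, you can compute reach for even the hardest URLs in a couple of seconds.

A sample reach topology is defined in the storm-starter project. Here's how the reach topology is defined: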

LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach"); 
builder.addBolt(new GetTweeters(), 3); 
builder.addBolt(new GetFollowers(), 12) 
        .shuffleGrouping(); 
builder.addBolt(new PartialUniquer(), 6) 
        .fieldsGrouping(new Fields("id", "follower")); 
builder.addBolt(new CountAggregator(), 2) 
        .fieldsGrouping(new Fields("id"));

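The topology executes as four steps:

1. GetTweeters gets the users who tweeted the URL. It transforms an input stream of [id, url] into an output stream of [id, tweeter]. Each url tuple maps to many tweeter tuples.

2. GetFollowers gets the followers of those tweeters. It transforms an input stream of [id, tweeter] into an output stream of [id, follower]. Across all the tasks, there may be duplicate follower tuples when someone follows multiple people who tweeted the same URL.

3. PartialUniquer groups the followers stream by the follower id. The same follower always goes to the same task, so each PartialUniquer task receives a set of followers that is disjoint from the sets of the other tasks. Once PartialUniquer has received all the follower tuples directed at it for a request id, it emits the unique count of its subset of followers.

4. Finally, CountAggregator receives the partial counts from each of the PartialUniquer tasks and sums them up.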
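
Why summing the partial counts gives the right answer can be seen in a single-process model of these four steps. The Java sketch below uses made-up data and a hypothetical ReachSketch class, partitions followers by hash into a fixed number of "tasks", and groups on the follower alone because it models only one request at a time. It is an illustration of the idea, not the storm-starter code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Single-process model of the four reach steps. The data and names are made up.
final class ReachSketch {
    static int reach(List<String> tweeters, Map<String, List<String>> followersOf, int numTasks) {
        // Steps 1-2: expand tweeters into a follower stream (may contain duplicates).
        List<String> followerStream = new ArrayList<>();
        for (String tweeter : tweeters) {
            followerStream.addAll(followersOf.getOrDefault(tweeter, new ArrayList<>()));
        }
        // Step 3: model of a fields grouping on follower; the same follower
        // always lands in the same partial set.
        List<Set<String>> partials = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            partials.add(new HashSet<>());
        }
        for (String follower : followerStream) {
            partials.get(Math.floorMod(follower.hashCode(), numTasks)).add(follower);
        }
        // Step 4: the partial sets are disjoint, so their sizes can simply be summed.
        int total = 0;
        for (Set<String> partial : partials) {
            total += partial.size();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, List<String>> followersOf = new HashMap<>();
        followersOf.put("alice", Arrays.asList("carol", "dave", "erin"));
        followersOf.put("bob", Arrays.asList("dave", "erin", "frank"));
        // carol, dave, erin, frank are the unique followers, so this prints 4.
        System.out.println(reach(Arrays.asList("alice", "bob"), followersOf, 3));
    }
}
```

The result does not depend on the number of tasks, because a given follower can only ever be counted in one partial set.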

Let's take a look at the PartialUniquer bolt:

public static class PartialUniquer implements IRichBolt, FinishedCallback { 
    OutputCollector _collector; 
    Map<Object, Set<String>> _sets = new HashMap<Object, Set<String>>();
     
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) { 
        _collector = collector; 
    } 
 
    public void execute(Tuple tuple) { 
        Object id = tuple.getValue(0); 
        Set<String> curr = _sets.get(id);
        if(curr==null) {
            curr = new HashSet<String>();
            _sets.put(id, curr); 
        } 
        curr.add(tuple.getString(1)); 
        _collector.ack(tuple); 
    } 
 
    public void cleanup() { 
    } 
 
    public void finishedId(Object id) { 
        Set<String> curr = _sets.remove(id);
        int count; 
        if(curr!=null) { 
            count = curr.size(); 
        } else { 
            count = 0; 
        } 
        _collector.emit(new Values(id, count)); 
    } 
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) { 
        declarer.declare(new Fields("id", "partial-count")); 
    } 
}

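When PartialUniquer receives a follower tuple in the execute method, it adds it to the set corresponding to the request id in an internal HashMap.

PartialUniquer also implements the FinishedCallback interface, which tells LinearDRPCTopologyBuilder that it wants to be notified when it has received all of the tuples directed at it for any given request id. This callback is the finishedId method. In the callback, PartialUniquer emits a single tuple containing the unique count for its subset of followers.

Under the hood, CoordinatedBolt is used to detect when a given bolt has received all of the tuples for a request id. CoordinatedBolt uses direct streams to manage this coordination.

The rest of the topology should be self-explanatory. As you can see, every single step of the reach computation is done in parallel, and defining the DRPC topology was very simple.

Non-linear DRPC topologies

LinearDRPCTopologyBuilder only handles "linear" DRPC topologies, where the computation is expressed as a sequence of steps (like reach). It is not hard to imagine functions that require a more complicated topology with branching and merging of bolts. For now, to do this you need to use CoordinatedBolt directly. Be sure to talk about your use case for non-linear DRPC topologies on the mailing list, to inform the construction of more general abstractions for DRPC topologies.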

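How LinearDRPCTopologyBuilder works

DRPCSpout emits [args, return-info]. return-info is the host and port of the DRPC server, as well as the id generated by the DRPC server.

The topology is made up of:

DRPCSpout
PrepareRequest (generates a request id and creates a stream for the return info and a stream for the args)
CoordinatedBolt wrappers and direct groupings
JoinResult (joins the result with the return info)
ReturnResult (connects to the DRPC server and returns the result)

LinearDRPCTopologyBuilder is a good example of a higher-level abstraction built on top of Storm's primitives.

Advanced

KeyedFairBolt for weaving the processing of multiple requests at the same time
How to use CoordinatedBolt directly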


Introduction to Storm

徐红星 — Sat, 14 Jan 2012 09:08:00 GMT

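Storm is a distributed, fault-tolerant real-time computation system that makes it easy to write and scale complex real-time computations on a cluster of computers. In the big-data field, Storm is used for real-time data processing and Hadoop for batch processing; the two make a formidable pair. Storm guarantees that every message will be processed, and it is fast: on a small cluster it can process millions of messages per second.

Advantages of Storm

1. Simple programming model. Just as MapReduce lowers the complexity of parallel batch processing, Storm lowers the complexity of real-time processing.
2. Service-oriented. Storm is a service framework that supports hot deployment, so applications can be brought online or taken offline on the fly.
3. Use of any programming language. You can use a variety of programming languages on top of Storm. Clojure, Java, Ruby, and Python are supported by default. To add support for another language, you only need to implement a simple Storm communication protocol.
4. Fault tolerance. Storm manages failures of worker processes and nodes.
5. Horizontal scalability. Computation runs in parallel across multiple threads, processes, and servers.
6. Reliable message processing. Storm guarantees that each message is fully processed at least once. When a task fails, Storm replays the message from the message source.
7. Fast. The system is designed so that messages are processed quickly, and it uses ZeroMQ as its underlying message queue.
8. Local mode. Storm has a "local mode" that fully simulates a Storm cluster in process. This lets you develop and unit test quickly.

Where there are advantages there are also drawbacks, although in my view these are relatively minor:

1. The current open-source release has only a single Nimbus node (in production we can set up a dual-Nimbus layout ourselves).
2. Clojure is a dynamic functional programming language that runs on the JVM and is well suited to flow-style computation. Parts of Storm's core are written in Clojure, which improves performance considerably but also raises the maintenance cost. That said, learning Clojure is worthwhile if you commit to it.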

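Storm architecture

A Storm cluster consists of one master node and multiple worker nodes, as in most distributed architectures. The master node runs a daemon called "Nimbus", which distributes code, assigns tasks, and detects failures. Each worker node runs a daemon called "Supervisor", which listens for work and starts and stops worker processes. Nimbus and Supervisor are both fail-fast and stateless, which makes them very robust. Coordination between the two is done through ZooKeeper, which manages the different components in the cluster. ZeroMQ is the internal messaging system, and JZMQ is the Java binding for ZeroMQ.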
