BlogJava - paulwong - Post category: STORM

Application areas of the frameworks in the Hadoop ecosystem

paulwong — Sun, 04 Jan 2015 04:57:00 GMT

***** Data Analytics : Technology Area *****
1. Real Time Analytics : Apache Storm
2. In-memory Analytics : Apache Spark
3. Search Analytics : Elasticsearch, Apache Solr
4. Log Analytics : ELK Stack (Elasticsearch, Logstash, Kibana), ESK Stack (Elasticsearch, Spark Streaming, Kibana)
5. Batch Analytics : Apache MapReduce

***** NO SQL DB *****
1. MongoDB
2. HBase
3. Cassandra

***** SOA *****
1. Oracle SOA
2. JBoss SOA
3. TIBCO SOA
4. SOAP, RESTful web services

paulwong 2015-01-04 12:57 Post a comment

Storm, the hottest open-source streaming system, vs. the rising star Samza

paulwong — Tue, 02 Dec 2014 07:03:00 GMT

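Distributed computing frameworks fall into two kinds according to the characteristics of the data sets they handle: data-flow and streaming. Data-flow frameworks take blocks of data as their source; representatives include MapReduce (MR) and Spark, and I call them "big data". Streaming frameworks process the data received within each unit of time and care more about real-time behavior; they include Storm, JStorm, and Samza, and I call them "fast data".

In this article I mainly discuss the streaming frameworks.

The first is Storm, a real-time computation system. It assumes the data source is dynamic and processes data the way water flows.

Its characteristics are low latency, high performance, distributed operation, scalability, and fault tolerance.

The architecture is shown in the figure below.

Storm's concepts are not covered in detail here; see http://blog.csdn.net/hljlzc2007/article/details/12976211

Storm is currently the most stable open-source stream processing framework, but I personally think it has two problems.

1. Although Storm lets you write spouts and bolts in several languages, its core is implemented in Clojure. That is a real inconvenience for people who work with big data and open source, because the languages most of us know are mainstream ones such as Java and C++. As a result Storm becomes hard to control: it is difficult to understand in depth or to modify in detail.

2. Storm can run on YARN (Hadoop 2.0) and share a Hadoop cluster's resources with other open-source frameworks, but its performance there is poor. This is something Storm still needs to improve.

Whatever its flaws, Storm is still the king of open-source stream processing frameworks today.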

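The second one I want to mention is JStorm, built by Alibaba. It is essentially another implementation of Storm, written in Java.

Features:

1. The client API is basically the same as Storm's, so migrating from Storm requires no changes to bolt and spout code

2. JStorm is more stable and faster than Storm

3. It provides some new features

If you are interested, give it a try. The project is at https://github.com/alibaba/jstorm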

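The third is Samza.

Samza is a technology open-sourced by LinkedIn. It is an open-source distributed stream processing system, very similar to Storm. The difference is that it runs on Hadoop and uses Kafka, the distributed messaging system that LinkedIn developed itself.

It is a small and beautiful project from LinkedIn. What makes it beautiful?

1. With only a few thousand lines of code, its functionality can rival Storm's, although it still has many shortcomings for now

2. It integrates tightly with Kafka, which makes processing data more convenient

3. It runs on YARN

A project I worked on earlier used Kafka + Storm + ElasticSearch. In the future Storm could be replaced entirely by Samza, which would also let us use the Hadoop cluster's resources for storage and offline analysis. With both real-time processing and offline analysis running on Hadoop, I have to say Samza is a great project: it limits the growth of project complexity and makes maintenance easier. As the saying goes, small and beautiful things tend to be more popular.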

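Architecture:

Samza consists mainly of three layers:

1. Streaming layer --> Kafka

2. Execution layer --> YARN

3. Processing layer --> Samza API

Samza's streaming layer and execution layer are both pluggable; developers can substitute other frameworks and are not limited to the two technologies above.

Samza provides a YARN ApplicationMaster and a YARN job runner out of the box. In the figure below, different colors represent different hosts.

The Samza client tells the YARN ResourceManager (RM) that it wants to start a Samza job. The RM tells a YARN NodeManager (NM) to allocate space for the YARN ApplicationMaster. Once the NM has allocated the space, YARN containers run the Samza TaskRunner.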

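Samza state management

Managing state is hard in stream processing. Because the data is in motion and carries no state of its own, the application has to rely on historical data to record its context. Samza provides a built-in key-value store for this, based on LevelDB and running outside the JVM heap, and uses it to hold that historical data. The benefits are:

1. Less JVM overhead

2. Using local storage greatly improves throughput

3. Fewer concurrent operations

Samza processing flow

The figure below is an example from the official Samza documentation: it groups page views by Member ID and counts them. Input messages come from Machine 1 and Machine 2, and the output goes to Machine 3. One way to read it: the messages are spread across different messaging systems (Kafka); Samza reads the topics from the different Kafka instances, processes them, and sends the results to Machine 3. I won't break it down further here; see the official documentation for details.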
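
The counting step of that example can be sketched in plain Java. This is only an illustration and does not use the Samza API: the class `PageViewCounter`, its methods, and the in-memory `HashMap` standing in for the LevelDB-backed store are all hypothetical names chosen for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a stream task that keeps per-key state.
// A HashMap plays the role of the local key-value store; Samza is not used.
class PageViewCounter {
    private final Map<String, Long> store = new HashMap<>();

    // Called once per incoming page-view message; returns the updated count.
    long process(String memberId) {
        return store.merge(memberId, 1L, Long::sum);
    }

    // Current number of page views recorded for a member (0 if never seen).
    long countFor(String memberId) {
        return store.getOrDefault(memberId, 0L);
    }

    public static void main(String[] args) {
        PageViewCounter counter = new PageViewCounter();
        for (String memberId : new String[] {"m1", "m2", "m1", "m1"}) {
            counter.process(memberId);
        }
        System.out.println("m1=" + counter.countFor("m1") + " m2=" + counter.countFor("m2"));
    }
}
```

The sketch assumes that all messages for one Member ID reach the same counter instance, so the count can live in local state without coordination between machines.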

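Project: https://github.com/apache/incubator-samza

Official documentation: http://samza.incubator.apache.org/

All of this leaves plenty of room for imagination: will Storm keep its lead, or can Samza take its place? Either way, as a developer I can't wait to read those few thousand lines of code.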

paulwong 2014-12-02 15:03 Post a comment

Auto rebalance Storm

paulwong — Fri, 09 May 2014 15:48:00 GMT

http://stackoverflow.com/questions/15010420/storm-topology-rebalance-using-java-code

Retrieving Storm information from Nimbus
http://www.andys-sundaypink.com/i/retrieve-storm-cluster-statistic-from-nimbus-java-mode/

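// Query Nimbus over Thrift and print per-executor statistics for one topology.
// Assumed imports: java.util.List, backtype.storm.generated.* (package name in Storm 0.8/0.9),
// and the Thrift classes bundled with Storm (TSocket, TFramedTransport, TBinaryProtocol,
// TException, TTransportException); their package depends on the Storm release.
TSocket tsocket = new TSocket("localhost", 6627); // Nimbus host and Thrift port (6627 by default)
TFramedTransport tTransport = new TFramedTransport(tsocket);
TBinaryProtocol tBinaryProtocol = new TBinaryProtocol(tTransport);
Nimbus.Client client = new Nimbus.Client(tBinaryProtocol);
String topologyId = "test-1-234232567";

try {
    tTransport.open();
    ClusterSummary clusterSummary = client.getClusterInfo();
    StormTopology stormTopology = client.getTopology(topologyId);
    TopologyInfo topologyInfo = client.getTopologyInfo(topologyId);
    // Typed lists are required for the enhanced for loop below to compile.
    List<ExecutorSummary> executorSummaries = topologyInfo.get_executors();
    List<TopologySummary> topologies = clusterSummary.get_topologies();

    for (ExecutorSummary executorSummary : executorSummaries) {
        String id = executorSummary.get_component_id();
        ExecutorInfo executorInfo = executorSummary.get_executor_info();
        ExecutorStats executorStats = executorSummary.get_stats();
        System.out.println("executorSummary :: " + id + " emit size :: " + executorStats.get_emitted_size());
    }
} catch (TTransportException e) {
    e.printStackTrace();
} catch (TException e) {
    e.printStackTrace();
} catch (NotAliveException e) {
    e.printStackTrace();
} finally {
    tTransport.close(); // release the connection whether or not the calls succeeded
}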

paulwong 2014-05-09 23:48 Post a comment

A brief explanation of Storm

paulwong — Fri, 09 May 2014 14:56:00 GMT

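Storm is a message processing engine. It handles a continuous stream of incoming messages, and the processing can be organized into steps.

The processing can be customized in several ways:

You can define the steps that process a message
You can define how many processes handle each type of message
Within each step, messages are processed by threads inside one of those processes
You can define the number of threads for each step
You can add and remove the types of message to be processed

So what do you do when you need to process a certain kind of message?

Define the data source component (spout)
Define the processing steps (bolts)
Combine them into a message processing pipeline, the topology
Define the number of processes that handle messages and the number of threads each step may use concurrently
Deploy the topology

When a topology is deployed to Storm, Storm looks up the number of workers in the configuration object and starts that many JVMs. It then creates threads according to the NUMTASKS configured for each step and instantiates the configured number of objects for each step. Next it starts a thread that repeatedly calls the spout's nextTuple() method; if that method emits a result, another thread is started and, in that thread, the result is passed as the argument to the execute method of the next object.

If another bolt step then needs to run, it also takes a new thread to execute the bolt's method. The number of threads started never exceeds NUMTASKS.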
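
The hand-off described above can be imitated with plain JDK threads. The following is a rough sketch, not Storm code: `MiniTopology`, `run`, `results`, and `boltThreadNames` are hypothetical names, the loop stands in for repeated nextTuple() calls, and a fixed-size thread pool stands in for the bolt's thread limit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Plain-JDK imitation of the spout -> bolt hand-off; Storm itself is not used.
class MiniTopology {
    final Queue<String> results = new ConcurrentLinkedQueue<>();
    final Set<String> boltThreadNames = ConcurrentHashMap.newKeySet();

    // Feeds every input tuple through the "bolt" using at most boltThreads threads.
    void run(List<String> input, int boltThreads) {
        ExecutorService boltPool = Executors.newFixedThreadPool(boltThreads);
        // "Spout" loop: each iteration plays the role of one nextTuple() call.
        for (String tuple : input) {
            // The submitted task plays the role of the bolt's execute(tuple).
            boltPool.execute(() -> {
                boltThreadNames.add(Thread.currentThread().getName());
                results.add(tuple.toUpperCase());
            });
        }
        boltPool.shutdown();
        try {
            if (!boltPool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("bolt threads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        MiniTopology topology = new MiniTopology();
        topology.run(Arrays.asList("a", "b", "c", "d", "e"), 2);
        System.out.println(topology.results.size() + " tuples processed by "
                + topology.boltThreadNames.size() + " bolt thread(s)");
    }
}
```

However many tuples arrive, the pool never uses more threads than it was configured with, which is the point the paragraph above makes about NUMTASKS.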

paulwong 2014-05-09 22:56 Post a comment

Storm performance

paulwong — Thu, 08 May 2014 01:19:00 GMT

The configuration is used to tune various aspects of the running topology. The two configurations specified here are very common:

TOPOLOGY_WORKERS (set with setNumWorkers) specifies how many processes you want allocated around the cluster to execute the topology. Each component in the topology executes as a number of threads. The number of threads allocated to a given component is configured through the setBolt and setSpout methods. Those threads exist within worker processes. Each worker process contains within it some number of threads for some number of components. For instance, you may have 300 threads specified across all your components and 50 worker processes specified in your config. Each worker process will execute 6 threads, each of which could belong to a different component. You tune the performance of Storm topologies by tweaking the parallelism for each component and the number of worker processes those threads should run within.
TOPOLOGY_DEBUG (set with setDebug), when set to true, tells Storm to log every message emitted by a component. This is useful in local mode when testing topologies, but you probably want to keep this turned off when running topologies on the cluster.

There are many other configurations you can set for the topology. The various configurations are detailed in the Javadoc for Config.
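
The thread-per-worker arithmetic can be checked with a few lines of Java. This is a simplified sketch that assumes threads are spread over workers as evenly as possible in round-robin order; it does not model Storm's actual scheduler, and `ParallelismMath` and `spread` are hypothetical names.

```java
// Simplified model of spreading component threads across worker processes.
// Assumes an even round-robin assignment; Storm's real scheduler is not modeled.
class ParallelismMath {
    // Returns an array of length `workers`; entry i is the thread count of worker i.
    static int[] spread(int totalThreads, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        int[] perWorker = new int[workers];
        for (int i = 0; i < totalThreads; i++) {
            perWorker[i % workers]++;
        }
        return perWorker;
    }

    public static void main(String[] args) {
        int[] perWorker = spread(300, 50);
        System.out.println(perWorker[0]); // prints 6
    }
}
```

With 300 threads and 50 workers every entry is 6. When the numbers do not divide evenly, for example 7 threads on 3 workers, the first worker gets one thread more than the others.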

Common configurations

There are a variety of configurations you can set per topology. A list of all the configurations you can set can be found here. The ones prefixed with "TOPOLOGY" can be overridden on a topology-specific basis (the other ones are cluster configurations and cannot be overridden). Here are some common ones that are set for a topology:

Config.TOPOLOGY_WORKERS: This sets the number of worker processes to use to execute the topology. For example, if you set this to 25, there will be 25 Java processes across the cluster executing all the tasks. If you had a combined 150 parallelism across all components in the topology, each worker process will have 6 tasks running within it as threads.
Config.TOPOLOGY_ACKERS: This sets the number of tasks that will track tuple trees and detect when a spout tuple has been fully processed. Ackers are an integral part of Storm's reliability model and you can read more about them in Guaranteeing message processing.
Config.TOPOLOGY_MAX_SPOUT_PENDING: This sets the maximum number of spout tuples that can be pending on a single spout task at once (pending means the tuple has not been acked or failed yet). It is highly recommended you set this config to prevent queue explosion.
Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS: This is the maximum amount of time a spout tuple has to be fully completed before it is considered failed. This value defaults to 30 seconds, which is sufficient for most topologies. See Guaranteeing message processing for more information on how Storm's reliability model works.
Config.TOPOLOGY_SERIALIZATIONS: You can register more serializers to Storm using this config so that you can use custom types within tuples.
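
To make the interplay of the pending limit and the message timeout concrete, here is a small plain-Java model. It is only a sketch of the idea, not how Storm implements it: `PendingTracker` and all of its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical tracker for in-flight spout tuples; not part of Storm.
class PendingTracker {
    private final int maxPending;      // plays the role of TOPOLOGY_MAX_SPOUT_PENDING
    private final long timeoutMillis;  // plays the role of TOPOLOGY_MESSAGE_TIMEOUT_SECS
    private final Map<Long, Long> emittedAt = new HashMap<>(); // tuple id -> emit time

    PendingTracker(int maxPending, long timeoutMillis) {
        this.maxPending = maxPending;
        this.timeoutMillis = timeoutMillis;
    }

    // Records an emitted tuple; returns false if the pending limit is already reached.
    boolean emit(long tupleId, long nowMillis) {
        if (emittedAt.size() >= maxPending) {
            return false;
        }
        emittedAt.put(tupleId, nowMillis);
        return true;
    }

    // Marks a tuple as fully processed, freeing one pending slot.
    void ack(long tupleId) {
        emittedAt.remove(tupleId);
    }

    // Fails every tuple that has been pending for at least the timeout; returns how many.
    int expire(long nowMillis) {
        int failed = 0;
        Iterator<Map.Entry<Long, Long>> it = emittedAt.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= timeoutMillis) {
                it.remove();
                failed++;
            }
        }
        return failed;
    }

    int pendingCount() {
        return emittedAt.size();
    }

    public static void main(String[] args) {
        PendingTracker tracker = new PendingTracker(2, 30_000L);
        System.out.println(tracker.emit(1L, 0L));    // true
        System.out.println(tracker.emit(2L, 1_000L)); // true
        System.out.println(tracker.emit(3L, 2_000L)); // false: two tuples already pending
    }
}
```

Without the limit, a spout that emits faster than the bolts can ack would let the pending map, and the queues behind it, grow without bound.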

Reference:
http://storm.incubator.apache.org/documentation/Running-topologies-on-a-production-cluster.html

Adjusting topology parallelism with the storm rebalance command, and an analysis of the problems involved

http://blog.csdn.net/jmppok/article/details/17243857

flume+kafka+storm+mysql data flow
http://blog.csdn.net/jmppok/article/details/17259145

http://storm.incubator.apache.org/documentation/Tutorial.html

paulwong 2014-05-08 09:19 Post a comment

Installing Storm

paulwong — Sun, 04 May 2014 10:01:00 GMT

Install ZeroMQ

wget http://download.zeromq.org/historic/zeromq-2.1.7.tar.gz
tar -xzf zeromq-2.1.7.tar.gz
cd zeromq-2.1.7
./configure
# configure may report missing packages; if so, install them: sudo apt-get install g++ uuid-dev
make
sudo make install
Install JZMQ

git clone https://github.com/nathanmarz/jzmq.git
cd jzmq
./autogen.sh
./configure
make
sudo make install
Download and unpack Storm
Edit conf/storm.yaml

storm.zookeeper.servers:
- "1.2.3.5"
- "1.2.3.6"
- "1.2.3.7"
storm.local.dir: "/opt/folder"
nimbus.host: "54.72.4.92"
supervisor.slots.ports:
- 6700
- 6701
- 6702
Edit /etc/profile

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export STORM_HOME=/home/ubuntu/java/storm-0.8.1
export KAFKA_HOME=/home/ubuntu/java/kafka_2.9.2-0.8.1.1
export ZOOKEEPER_HOME=/home/ubuntu/java/zookeeper-3.4.6

export PATH=$JAVA_HOME/bin:$STORM_HOME/bin:$KAFKA_HOME/bin:$ZOOKEEPER_HOME/bin:$PATH

Create a startup script: start-storm.sh
storm nimbus &
storm supervisor &
storm ui &

If you run into problems during installation, see:
http://my.oschina.net/mingdongcheng/blog/43009

paulwong 2014-05-04 18:01 Post a comment

Starting Storm and deploying a topology

paulwong — Wed, 11 Sep 2013 03:00:00 GMT

Start ZooKeeper
zkServer.sh start
Start Nimbus
storm nimbus &
Start the supervisor
storm supervisor &
Start the UI
storm ui &
Deploy a topology
storm jar /opt/hadoop/loganalyst/storm-dependend/data/teststorm-1.0.jar teststorm.TopologyMain /opt/hadoop/loganalyst/storm-dependend/data/words.txt
Kill a topology
storm kill {toponame}
Activate a topology
storm activate {toponame}
Deactivate a topology
storm deactivate {toponame}
List all topologies
storm list

paulwong 2013-09-11 11:00 Post a comment

Storm resources

paulwong — Sun, 08 Sep 2013 11:59:00 GMT

Install Storm
http://www.jansipke.nl/installing-a-storm-cluster-on-centos-hosts/
http://www.cnblogs.com/kemaswill/archive/2012/10/24/2737833.html
http://abentotoro.blog.sohu.com/197023262.html
http://www.cnblogs.com/panfeng412/archive/2012/11/30/how-to-install-and-deploy-storm-cluster.html

Processing real-time big data with Twitter Storm
http://www.ibm.com/developerworks/cn/opensource/os-twitterstorm/

Analysis and discussion of Storm's stream model
http://www.cnblogs.com/panfeng412/archive/2012/07/29/storm-stream-model-analysis-and-discussion.html
http://www.cnblogs.com/panfeng412/tag/Storm/

storm-kafka
https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka

Real-time big data analysis with Storm!
http://www.csdn.net/article/2012-12-24/2813117-storm-realtime-big-data-analysis

storm-deploy-aws
https://github.com/nathanmarz/storm-deploy/wiki

!!! Twitter Storm on Zhihu
http://www.zhihu.com/topic/19673110

storm-elastic-search
https://github.com/hmsonline/storm-elastic-search

storm-examples
https://github.com/stormprocessor/storm-examples

kafka-aws
https://github.com/nathanmarz/kafka-deploy

Next Gen Real-time Streaming with Storm-Kafka Integration
http://blog.infochimps.com/2012/10/30/next-gen-real-time-streaming-storm-kafka-integration/

flume+kafka+storm+mysql data flow
http://blog.csdn.net/baiyangfu/article/details/8096088
http://blog.csdn.net/baiyangfu/article/category/1244640

Kafka study notes
http://blog.csdn.net/baiyangfu/article/details/8096084

STORM+KAFKA
https://github.com/buildlackey/cep

STORM+KETTLE
https://github.com/buildlackey/kettle-storm

paulwong 2013-09-08 19:59 Post a comment

Storm compared with Hadoop

paulwong — Sun, 08 Sep 2013 11:49:00 GMT

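Given a pile of data that keeps growing, what approaches can you take to compute statistics over it?

Wait until the data has grown to a certain size, then run a statistics program over it. This suits scenarios where real-time requirements are low.
For example, load the data into HDFS and then run a MapReduce job.
If real-time requirements are high, that approach no longer works, which brings us to the second approach.
Run the statistics job each time a new record arrives, and put the result into a DB or a search engine index.
This is exactly the kind of work Storm does.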
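
The two approaches can be contrasted in a few lines of plain Java. This is an illustration only, with neither Hadoop nor Storm involved; `CountingStyles`, `batchCount`, `onRecord`, and `snapshot` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Contrasts counting over an accumulated batch with counting record by record.
class CountingStyles {
    // Batch style: scan the whole accumulated data set in one job.
    static Map<String, Integer> batchCount(List<String> allRecords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String record : allRecords) {
            counts.merge(record, 1, Integer::sum);
        }
        return counts;
    }

    // Streaming style: running totals, updated as each record arrives.
    private final Map<String, Integer> running = new HashMap<>();

    void onRecord(String record) {
        running.merge(record, 1, Integer::sum);
    }

    // Copy of the totals so far; available at any moment, not only at the end.
    Map<String, Integer> snapshot() {
        return new HashMap<>(running);
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("login", "click", "login");
        CountingStyles streaming = new CountingStyles();
        for (String record : records) {
            streaming.onRecord(record);
        }
        // Both styles reach the same totals; they differ in when the result is available.
        System.out.println(batchCount(records).equals(streaming.snapshot())); // prints true
    }
}
```

The totals are identical; what changes is latency. The batch result exists only after the whole scan, while the running totals are current after every record.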

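Hadoop vs. Storm

Data source: Hadoop works on data, possibly terabytes of it, under some directory in HDFS; Storm works on a single newly arrived record in real time.
Processing: Hadoop runs a map phase followed by a reduce phase; in Storm the user defines the processing flow, which can contain multiple steps, each of them either a data source (spout) or processing logic (bolt).
Termination: a Hadoop job eventually ends; Storm has no end state. After the last step it simply waits, and when new data arrives it starts again from the beginning.
Speed: Hadoop is built to process large amounts of data on HDFS and is slow; Storm only has to process one new record at a time, so it can be very fast.
Use cases: Hadoop is used when a batch of data has to be processed and timeliness does not matter, and you submit a job whenever you need processing; Storm is used when each newly arrived record has to be processed promptly.
Compared with MQ: Hadoop has nothing comparable; Storm can be viewed as N steps, where each step, once done, sends a message to the next MQ and a consumer listening on that MQ continues the processing.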
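
The MQ analogy can be sketched with JDK blocking queues. This is a toy model under stated assumptions, not Storm and not a real message broker: `QueuePipeline`, `run`, and the `END` sentinel are hypothetical, and each step is a single consumer thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Two processing steps connected by queues, each step consuming from its own queue.
class QueuePipeline {
    private static final String END = "__END__"; // sentinel that tells a step to stop

    // Sends every input message through both steps and returns the final output in order.
    static List<String> run(List<String> input) {
        BlockingQueue<String> firstQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> secondQueue = new LinkedBlockingQueue<>();
        List<String> output = new ArrayList<>();

        // Step 1: consume from firstQueue, upper-case, publish to secondQueue.
        Thread step1 = new Thread(() -> {
            try {
                for (String m = firstQueue.take(); !m.equals(END); m = firstQueue.take()) {
                    secondQueue.put(m.toUpperCase());
                }
                secondQueue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Step 2: consume from secondQueue, append a marker, collect the result.
        Thread step2 = new Thread(() -> {
            try {
                for (String m = secondQueue.take(); !m.equals(END); m = secondQueue.take()) {
                    output.add(m + "!");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        step1.start();
        step2.start();
        firstQueue.addAll(input); // the "data source" publishes to the first queue
        firstQueue.add(END);
        try {
            step1.join();
            step2.join(); // join makes step 2's writes to output visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b"))); // prints [A!, B!]
    }
}
```

Each step only knows the queue it reads from and the queue it writes to, which is what lets steps be added or scaled independently in the analogy.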

paulwong 2013-09-08 19:49 Post a comment