﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>BlogJava-paulwong-Post Category-HADOOP</title><link>http://www.blogjava.net/paulwong/category/50835.html</link><description /><language>en</language><lastBuildDate>Wed, 22 Apr 2015 03:37:36 GMT</lastBuildDate><pubDate>Wed, 22 Apr 2015 03:37:36 GMT</pubDate><ttl>60</ttl><item><title>Application Areas of the HADOOP-Ecosystem Frameworks</title><link>http://www.blogjava.net/paulwong/archive/2015/01/04/422020.html</link><dc:creator>paulwong</dc:creator><pubDate>Sun, 04 Jan 2015 04:57:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2015/01/04/422020.html</guid><description><![CDATA[***** Data Analytics : Technology Area *****<br />1. Real-Time Analytics : Apache Storm<br />2. In-Memory Analytics : Apache Spark<br />3. Search Analytics : Elasticsearch, Apache Solr<br />4. Log Analytics : ELK / ESK stack (Elasticsearch, Logstash, Spark Streaming, Kibana)<br />5. Batch Analytics : Apache MapReduce<br /><br />***** NoSQL Databases *****<br />1. MongoDB<br />2. HBase<br />3. Cassandra<br /><br />***** SOA *****<br />1. Oracle SOA<br />2. JBoss SOA<br />3. TIBCO SOA<br />4. SOAP and RESTful web services]]></description></item><item><title>Building the HADOOP Source Code</title><link>http://www.blogjava.net/paulwong/archive/2014/12/16/421437.html</link><dc:creator>paulwong</dc:creator><pubDate>Mon, 15 Dec 2014 17:41:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2014/12/16/421437.html</guid><description><![CDATA[<a href="https://github.com/apache/hadoop/blob/trunk/BUILDING.txt" target="_blank">https://github.com/apache/hadoop/blob/trunk/BUILDING.txt</a><br />
<br />
Configuring Eclipse to build and develop the Hadoop (MapReduce) source code<br />
<a href="http://blog.csdn.net/basicthinker/article/details/6174442" target="_blank">http://blog.csdn.net/basicthinker/article/details/6174442</a><br />
<br />
Building the Hadoop 2.2.0 source code<br />
<a href="http://my.oschina.net/cloudcoder/blog/192224" target="_blank">http://my.oschina.net/cloudcoder/blog/192224</a><br />
<br />
Setting up a build environment for the Apache Hadoop source code<br />
<a href="http://qq85609655.iteye.com/blog/1986991" target="_blank">http://qq85609655.iteye.com/blog/1986991</a><br />
<br />
<br />
<br />
<ol>
     <li>Download the source from <a href="https://codeload.github.com/apache/hadoop/zip/trunk" target="_blank">https://codeload.github.com/apache/hadoop/zip/trunk</a> and unzip it; this produces a folder named hadoop-trunk.<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">wget https://codeload.github.com/apache/hadoop/zip/trunk
unzip trunk</div>
     </li>
     <li>Install the native build dependencies, then build and install protobuf from source<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all"># Ubuntu
sudo apt-get -y install maven build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev

# CentOS
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool openssl-devel cmake

# Build protobuf; the zip can be fetched from http://f.dataguru.cn/thread-459689-1-1.html
./configure
make
make check
make install</div>
     </li>
     <li>Add protoc to the environment via /etc/profile ($ vi /etc/profile)<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">export PROTOC_HOME=/root/java/hadoop-source/protobuf-2.5.0
export PATH=$PATH:$PROTOC_HOME/src</div></li>
     <li>cd into hadoop-trunk and run<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">mvn compile -Pnative</div>
     </li>
     <li>cd into hadoop-maven-plugins and run<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">mvn install</div>
     </li>
     <li>cd back into hadoop-trunk and run<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">mvn install -DskipTests</div>
     </li>
     <li>Still in the hadoop-trunk folder, generate the Eclipse project files<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">mvn eclipse:eclipse -DskipTests</div>
     </li>
     <li>Import the Maven project into Eclipse. (The whole sequence is consolidated in the script sketch below.)</li>
</ol>
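The steps above can be chained into a single script. This is a sketch only, not part of the original post: it assumes the protobuf 2.5.0 location from the /etc/profile step and that the trunk zip unpacks into hadoop-trunk; adjust the paths for your machine.<br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">#!/bin/bash
set -e   # stop at the first failing step

# Make the locally built protoc visible (path assumed from the step above)
export PROTOC_HOME=/root/java/hadoop-source/protobuf-2.5.0
export PATH=$PATH:$PROTOC_HOME/src

# Fetch and unpack the source
wget https://codeload.github.com/apache/hadoop/zip/trunk
unzip trunk
cd hadoop-trunk

# Compile with native libraries, build the Maven plugins first,
# install all modules, then generate the Eclipse project files
mvn compile -Pnative
(cd hadoop-maven-plugins && mvn install)
mvn install -DskipTests
mvn eclipse:eclipse -DskipTests</div>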
]]></description></item><item><title>Simplehbase</title><link>http://www.blogjava.net/paulwong/archive/2014/07/15/415803.html</link><dc:creator>paulwong</dc:creator><pubDate>Tue, 15 Jul 2014 00:35:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2014/07/15/415803.html</guid><description><![CDATA[<a href="https://github.com/zhang-xzhi/simplehbase/" target="_blank">https://github.com/zhang-xzhi/simplehbase/</a><br /><a href="https://github.com/zhang-xzhi/simplehbase/wiki" target="_blank">https://github.com/zhang-xzhi/simplehbase/wiki</a><br /><br /><br />## About simplehbase<br />simplehbase is a lightweight middleware layer between Java and HBase.<br />Its main features:<br />* Data type mapping: conversion between Java types and HBase bytes.<br />* Simple operation wrappers: HBase put, get, scan, etc. are wrapped as plain Java calls.<br />* HBase query wrapper: wraps HBase filters so HBase can be queried in an SQL-like way.<br />* Dynamic queries: similar to MyBatis, dynamic query statements can be configured in XML.<br />* insert/update support: built on top of HBase checkAndPut.<br />* HBase multi-version support: interfaces for querying and mapping multi-versioned HBase data.<br />* Access to the native HBase API.<br /><br /><br />### v0.9<br />New:<br /><br />HTable can now be flushed on a timer. Main scenario:<br />batch writes whose flush runs at a configurable interval,<br />so batch throughput is not reduced while a degree of real-time visibility is still guaranteed.<br /><br />A user-defined htablePoolService is supported,<br />so multiple HTables can share one thread pool.<br /><br />intelligentScanSize: the scan caching size can be derived from the query's limit value.<br /><br /><br />### v0.8<br />New batch-operation interfaces:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">public <T> void putObjectList(List<PutRequest<T>> putRequestList);
public void deleteObjectList(List<RowKey> rowKeyList, Class<?> type);
public <T> void putObjectListMV(List<PutRequest<T>> putRequests, long timestamp);
public <T> void putObjectListMV(List<PutRequest<T>> putRequests, Date timestamp);
public <T> void putObjectListMV(List<PutRequest<T>> putRequestList);
public void deleteObjectMV(RowKey rowKey, Class<?> type, long timeStamp);
public void deleteObjectMV(RowKey rowKey, Class<?> type, Date timeStamp);
public void deleteObjectListMV(List<RowKey> rowKeyList, Class<?> type, long timeStamp);
public void deleteObjectListMV(List<RowKey> rowKeyList, Class<?> type, Date timeStamp);
public void deleteObjectListMV(List<DeleteRequest> deleteRequestList, Class<?> type);</div><br /><br />New Util method (used for prefix queries):<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">public static RowKey getEndRowKeyOfPrefix(RowKey prefixRowKey)</div><br />Performance improvement:<br />the get implementation was switched back from scan to a direct get.<br /><br />### v0.7<br />New: queries can now return the main record together with its associated RowKey.
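<br /><br />A minimal usage sketch of the v0.8 batch interfaces above. Only the method signatures and the PutRequest/RowKey/DeleteRequest type names come from this post; the SimpleHbaseClient entry point follows the project wiki, and createClient(), MyRecord, StringRowKey and the constructor shapes are assumptions for illustration.<br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">// Sketch only: batch insert and batch delete via the v0.8 interfaces.
// Requires simplehbase on the classpath; names marked "assumed" may
// differ in your version of the library.
import java.util.ArrayList;
import java.util.List;

public class BatchWriteSketch {
    public static void main(String[] args) {
        SimpleHbaseClient client = createClient(); // assumed setup helper, see wiki

        // Batch insert: one PutRequest per record (constructor shape assumed).
        List<PutRequest<MyRecord>> puts = new ArrayList<PutRequest<MyRecord>>();
        puts.add(new PutRequest<MyRecord>(new MyRecord("key1", 42)));
        puts.add(new PutRequest<MyRecord>(new MyRecord("key2", 43)));
        client.putObjectList(puts);

        // Batch delete by row key (StringRowKey implementation assumed).
        List<RowKey> keys = new ArrayList<RowKey>();
        keys.add(new StringRowKey("key1"));
        client.deleteObjectList(keys, MyRecord.class);
    }
}</div>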
#0000FF; ">void</span>&nbsp;deleteObjectMV(RowKey&nbsp;rowKey,&nbsp;Class&lt;?&gt;&nbsp;type,&nbsp;<span style="color: #0000FF; ">long</span>&nbsp;timeStamp)&nbsp;<br /><span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;deleteObjectMV(RowKey&nbsp;rowKey,&nbsp;Class&lt;?&gt;&nbsp;type,&nbsp;Date&nbsp;timeStamp)&nbsp;<br /><span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;deleteObjectListMV(List&lt;RowKey&gt;&nbsp;rowKeyList,&nbsp;Class&lt;?&gt;&nbsp;type,<span style="color: #0000FF; ">long</span>&nbsp;timeStamp)&nbsp;<br /><span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;deleteObjectListMV(List&lt;RowKey&gt;&nbsp;rowKeyList,&nbsp;Class&lt;?&gt;&nbsp;type,Date&nbsp;timeStamp)&nbsp;<br /><span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;deleteObjectListMV(List&lt;DeleteRequest&gt;&nbsp;deleteRequestList,Class&lt;?&gt;&nbsp;type);&nbsp;</div><br /><br />Util新增（前缀查询使用） <br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;RowKey&nbsp;getEndRowKeyOfPrefix(RowKey&nbsp;prefixRowKey)&nbsp;</div><br />性能改进 <br />把get的实现从scan调回get。 <br /><br />### v0.7新增功能： <br />支持查询时主记录和关联的RowKey同时返回。&nbsp;<img src ="http://www.blogjava.net/paulwong/aggbug/415803.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2014-07-15 08:35 <a href="http://www.blogjava.net/paulwong/archive/2014/07/15/415803.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>安装CLOUDERA</title><link>http://www.blogjava.net/paulwong/archive/2014/05/23/414035.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 23 May 2014 10:16:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2014/05/23/414035.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/414035.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2014/05/23/414035.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/414035.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/414035.html</trackback:ping><description><![CDATA[<a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_4_4.html" target="_blank">http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_4_4.html</a><br />
<br />
<br />
<a href="http://www.cnblogs.com/xuesong/p/3604080.html" target="_blank">http://www.cnblogs.com/xuesong/p/3604080.html</a><br />
<br />
<br />
<a href="http://www.linuxidc.com/Linux/2013-12/94180.htm" target="_blank">http://www.linuxidc.com/Linux/2013-12/94180.htm</a><br />
<br />
Uninstalling<br />
<a href="http://www.cnblogs.com/shudonghe/articles/3133290.html" target="_blank">http://www.cnblogs.com/shudonghe/articles/3133290.html</a><br />
<br />
Installation files:<br />
<a href="http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=4ZFrtT9ZQN" target="_blank">http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=4ZFrtT9ZQN</a><br />
<br />
<br />
<ol>
     <li>Enable passwordless sudo for the install user<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">sudo chmod +w /etc/sudoers
sudo vi /etc/sudoers
# add the following line:
ufuser ALL=(ALL) NOPASSWD: ALL
sudo chmod -w /etc/sudoers</div>
     </li>
     <li>Disable SELinux<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">sudo vi /etc/selinux/config
# set:
SELINUX=disabled
sudo reboot</div>
     </li>
     <li>Add the cluster hosts to /etc/hosts<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">sudo vi /etc/hosts

10.0.0.4 ufhdp001.cloudapp.net ufhdp001
10.0.0.5 ufhdp002.cloudapp.net ufhdp002</div>
     </li>
     <li>Download the installer binary<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin</div>
     </li>
     <li>Run the installer (the preparation steps above are consolidated in the script sketch below)<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">chmod 755 cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin</div>
     </li>
</ol>
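Steps 1-4 above, gathered into a single preparation script. A sketch only, not from the original post: it assumes the ufuser account and the ufhdp hostnames used in the steps, and that the sudoers and hosts entries are not already present.<br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px 5px 4px 4px;width:98%;word-break:break-all">#!/bin/bash
set -e

# Step 1: passwordless sudo for the install user (ufuser, assumed from step 1)
sudo chmod +w /etc/sudoers
echo 'ufuser ALL=(ALL) NOPASSWD: ALL' | sudo tee -a /etc/sudoers
sudo chmod -w /etc/sudoers

# Step 2: disable SELinux (takes effect after a reboot)
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Step 3: cluster hostnames (from the post's example cluster)
echo '10.0.0.4 ufhdp001.cloudapp.net ufhdp001' | sudo tee -a /etc/hosts
echo '10.0.0.5 ufhdp002.cloudapp.net ufhdp002' | sudo tee -a /etc/hosts

# Step 4: fetch the installer; reboot, then run step 5 manually
wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
chmod 755 cloudera-manager-installer.bin</div>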
]]></description></item><item><title>Ten Hadoop Big-Data Startups Worth Watching in 2014</title><link>http://www.blogjava.net/paulwong/archive/2014/05/23/414019.html</link><dc:creator>paulwong</dc:creator><pubDate>Fri, 23 May 2014 04:15:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2014/05/23/414019.html</guid><description><![CDATA[<p>The open-source big-data framework Apache Hadoop has become the de facto standard for big-data processing, and very nearly a synonym for big data itself, even if that is something of an overgeneralization.</p>
<p>Gartner estimates the current Hadoop-ecosystem market at roughly US$77 million, and expects it to grow rapidly to US$813 million by 2016.</p>
<p>Swimming in this fast-expanding blue ocean is not easy, however. Building big-data infrastructure products is hard, and selling them is harder still, above all for infrastructure tools such as Hadoop, NoSQL databases and stream-processing systems. Customers need a great deal of training and education, and paying users expect extensive support and fast-moving product development, while dealing with enterprise customers is rarely a startup team's strength. On top of that, big-data infrastructure startups usually require large amounts of venture capital.</p>
<p>Despite all these difficulties, Hadoop startups keep springing up. Beyond established names such as Cloudera, Datameer, DataStax and MapR, CIO magazine recently picked the ten Hadoop startups most worth watching in 2014. Their products and business models are valuable reference points for big-data entrepreneurs and users alike:</p>
<p><strong>1. <a href="http://www.platfora.com/">Platfora</a></strong></p>
<p><img title="platfora.png" src="http://static.oschina.net/uploads/img/201404/24071338_Dg4Y.png" alt="platfora" /></p>
<p>Business: big-data analytics that turns raw data in Hadoop into interactive, in-memory business-intelligence services.</p>
<p>Profile: founded in 2011; US$65 million raised to date.</p>
<p>Why it made the list: Platfora aims to simplify the complex, hard-to-use Hadoop and drive its adoption in the enterprise market. It does so by streamlining data collection and analysis, automatically turning raw Hadoop data into interactive business-intelligence services with no ETL or data warehouse required. (Related reading: Hadoop is just the poor man's ETL.)</p>
<p><strong>2. <a href="http://alpinenow.com/">Alpine Data Labs</a></strong></p>
<p><img title="Alpine Data.png" src="http://static.oschina.net/uploads/img/201404/24071339_qHWV.png" alt="alpine data" /></p>
<p>Business: a Hadoop-based data-analytics platform.</p>
<p>Profile: founded in 2010; US$23.5 million raised to date.</p>
<p>Why: complex advanced-analytics and machine-learning applications usually require expert scripters and coders, which further <a href="http://www.ctocio.com/hotnews/15099.html">raises the technical bar for data scientists</a>. In practice, executives and IT managers have neither the time nor the appetite to learn programming or to master Hadoop's complexity. By delivering predictive analytics as a SaaS service, Alpine Data lowers that barrier dramatically.</p>
<p><strong>3. <a href="http://www.altiscale.com/">Altiscale</a></strong></p>
<p><img title="altiscale.png" src="http://static.oschina.net/uploads/img/201404/24071339_c03z.png" alt="altiscale" /></p>
<p>Business: Hadoop as a Service (HaaS).</p>
<p>Profile: founded in March 2012; US$12 million raised to date.</p>
<p>Why: big data has a talent shortage, and offering Hadoop through the cloud is a shortcut to wider adoption. TechNavio estimates the HaaS market will reach US$19 billion by 2016, a large pie. But competition is already fierce: Amazon EMR, Microsoft's Hadoop on Azure and Rackspace's Hortonworks cloud service are heavyweight players, and Altiscale also competes head-on with Hortonworks, Cloudera, Mortar Data, Qubole and Xplenty.</p>
<p><strong>4. <a href="http://www.trifacta.com/">Trifacta</a></strong></p>
<p><img title="trifacta.png" src="http://static.oschina.net/uploads/img/201404/24071339_aHSi.png" alt="trifacta" /></p>
<p>Business: a platform that turns complex raw data into clean, structured formats ready for analysis.</p>
<p>Profile: founded in 2012; US$16.3 million raised to date.</p>
<p>Why: there is a serious bottleneck between big-data platforms and analysis tools: data scientists spend a great deal of time and effort transforming data, and business analysts often lack the skills to do that transformation on their own. Trifacta's answer is its "predictive interaction" technology, which makes data manipulation visual; its machine-learning algorithms observe both the user and the data's attributes, predict the user's intent, and offer suggestions automatically. Trifacta competes with Paxata, Informatica and CirroHow.</p>
<p><strong>5. <a href="http://www.splicemachine.com/">Splice Machine</a></strong></p>
<p><img title="Splice Machine.png" src="http://static.oschina.net/uploads/img/201404/24071339_GqsP.png" alt="splice machine" /></p>
<p>Business: a SQL-compatible database on Hadoop for big-data applications.</p>
<p>Profile: founded in 2012; US$19 million raised to date.</p>
<p>Why: new data technologies let prized features of traditional relational databases, such as ACID compliance, transactional consistency and standard SQL, carry over to cheap, scalable Hadoop. Splice Machine keeps the strengths of NoSQL databases, such as auto-sharding, fault tolerance and scalability, while retaining SQL.</p>
<p><strong>6. <a href="http://www.datatorrent.com/">DataTorrent</a></strong></p>
<p><img title="datatorrent.png" src="http://static.oschina.net/uploads/img/201404/24071339_DqP6.png" alt="datatorrent" /></p>
<p>Business: a real-time stream-processing platform on Hadoop.</p>
<p>Profile: founded in 2012; raised a US$8 million Series A in June 2013.</p>
<p>Why: the future of big data is fast data, and fast data is exactly the problem DataTorrent is out to solve.</p>
<p><strong>7. <a href="http://www.qubole.com/">Qubole</a></strong></p>
<p><img title="qubole.png" src="http://static.oschina.net/uploads/img/201404/24071339_Vbbd.png" alt="qubole" /></p>
<p>Business: big-data Data-as-a-Service built on "truly auto-scaling Hadoop clusters".</p>
<p>Profile: founded in 2011; US$7 million raised to date.</p>
<p>Why: with big-data talent so hard to find, consuming Hadoop the way one consumes a SaaS application is a realistic choice for most enterprises.</p>
<p><strong>8. <a href="http://www.continuuity.com/">Continuuity</a></strong></p>
<p><img title="COntinuuity.png" src="http://static.oschina.net/uploads/img/201404/24071340_2Duv.png" alt="continuuity" /></p>
<p>Business: a Hadoop-based hosting platform for big-data applications.</p>
<p>Profile: founded in 2011; US$12.5 million raised. Founder and CEO Todd Papaioannou was formerly VP of cloud architecture at Yahoo; after he left last summer, co-founder and CTO Jonathan Gray took over as CEO.</p>
<p>Why: Continuuity's business model is clever and unusual: it sidesteps the scarce Hadoop experts and offers an application-development platform directly to Java developers. Its flagship product, Reactor, is a Hadoop-based integrated data and application framework in Java; Continuuity abstracts the underlying infrastructure away behind simple Java and REST APIs, hiding most of Hadoop's complexity. Its latest offering, Loom, is a cluster-management solution: clusters built with Loom can use templates for any hardware and software stack, from a single LAMP server or a traditional application server such as JBoss up to Hadoop clusters of thousands of nodes; they can be deployed across multiple cloud providers (Rackspace, Joyent, OpenStack and so on) and managed with common SCM tools.</p>
<p><strong>9. <a href="http://www.xplenty.com/">Xplenty</a></strong></p>
<p><img title="Xplenty.png" src="http://static.oschina.net/uploads/img/201404/24071340_ERx8.png" alt="xplenty" /></p>
<p>Business: HaaS.</p>
<p>Profile: founded in 2012; an undisclosed amount raised from Magma venture capital.</p>
<p>Why: although Hadoop has become the de facto industry standard for big data, developing, deploying and maintaining it still demands highly skilled staff. Xplenty's technology delivers Hadoop processing through a development environment that requires no coding, so enterprises can tap big data quickly without investing in hardware, software or specialists.</p>
<p><strong>10. <a href="http://www.nuevora.com/">Nuevora</a></strong></p>
<p><img title="Nuevora.png" src="http://static.oschina.net/uploads/img/201404/24071340_8vNy.png" alt="nuevora" /></p>
<p>Business: big-data analytics applications.</p>
<p>Profile: founded in 2011; US$3 million in early-stage funding.</p>
<p>Why: Nuevora targets the two areas where big data took hold first: marketing and customer engagement. Its nBAAP (Big Data Analytics and Apps) platform offers customized analytics applications built on best-time prediction algorithms, resting on three key big-data technologies: Hadoop (processing), R (predictive analytics) and Tableau (visualization).</p>]]></description></item><item><title>KMEANS PAGERANK ON HADOOP</title><link>http://www.blogjava.net/paulwong/archive/2014/05/07/413384.html</link><dc:creator>paulwong</dc:creator><pubDate>Wed, 07 May 2014 15:57:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2014/05/07/413384.html</guid><description><![CDATA[<a href="https://github.com/keokilee/kmeans-hadoop" target="_blank">https://github.com/keokilee/kmeans-hadoop</a><br /><br /><a href="https://github.com/rorlig/hadoop-pagerank-java" target="_blank">https://github.com/rorlig/hadoop-pagerank-java</a><br /><br /><a href="http://wuyanzan60688.blog.163.com/blog/static/12777616320131011426159/" target="_blank">http://wuyanzan60688.blog.163.com/blog/static/12777616320131011426159/</a><br /><br /><a href="http://codecloud.net/hadoop-k-means-591.html" target="_blank">http://codecloud.net/hadoop-k-means-591.html</a><br /><br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #0000FF; ">import</span>&nbsp;java.io.*;<br />
<span style="color: #0000FF; ">import</span>&nbsp;java.net.URI;<br />
<span style="color: #0000FF; ">import</span>&nbsp;java.util.Iterator;<br />
<span style="color: #0000FF; ">import</span>&nbsp;java.util.Random;<br />
<span style="color: #0000FF; ">import</span>&nbsp;java.util.Vector;<br />
<br />
<span style="color: #0000FF; ">import</span>&nbsp;org.apache.hadoop.filecache.DistributedCache;<br />
<span style="color: #0000FF; ">import</span>&nbsp;org.apache.hadoop.fs.FileSystem;<br />
<span style="color: #0000FF; ">import</span>&nbsp;org.apache.hadoop.fs.Path;<br />
<span style="color: #0000FF; ">import</span>&nbsp;org.apache.hadoop.io.*;<br />
<span style="color: #0000FF; ">import</span>&nbsp;org.apache.hadoop.mapred.*;<br />
<span style="color: #0000FF; ">import</span>&nbsp;org.apache.hadoop.util.GenericOptionsParser;<br />
<br />
<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">class</span>&nbsp;KMeans&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;<span style="color: #0000FF; ">enum</span>&nbsp;Counter&nbsp;{&nbsp;CENTERS,&nbsp;CHANGE,&nbsp;ITERATIONS&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;<span style="color: #0000FF; ">class</span>&nbsp;Point&nbsp;<span style="color: #0000FF; ">implements</span>&nbsp;WritableComparable&lt;Point&gt;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Longs&nbsp;because&nbsp;this&nbsp;will&nbsp;store&nbsp;sum&nbsp;of&nbsp;many&nbsp;ints</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;LongWritable&nbsp;x;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;LongWritable&nbsp;y;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;IntWritable&nbsp;num;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;For&nbsp;summation&nbsp;points</span><span style="color: #008000; "><br />
</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;Point()&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.x&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;LongWritable(0);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.y&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;LongWritable(0);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.num&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable(0);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;Point(<span style="color: #0000FF; ">int</span>&nbsp;x,&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;y)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.x&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;LongWritable(x);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.y&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;LongWritable(y);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.num&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable(1);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;Point(IntWritable&nbsp;x,&nbsp;IntWritable&nbsp;y)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.x&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;LongWritable(x.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.y&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;LongWritable(y.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">this</span>.num&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable(1);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;add(Point&nbsp;that)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;x.set(x.get()&nbsp;+&nbsp;that.x.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y.set(y.get()&nbsp;+&nbsp;that.y.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;num.set(num.get()&nbsp;+&nbsp;that.num.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;norm()&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;x.set(x.get()&nbsp;/&nbsp;num.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y.set(y.get()&nbsp;/&nbsp;num.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;num.set(1);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;write(DataOutput&nbsp;out)&nbsp;<span style="color: #0000FF; ">throws</span>&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;x.write(out);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y.write(out);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;num.write(out);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;readFields(DataInput&nbsp;in)&nbsp;<span style="color: #0000FF; ">throws</span>&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;x.readFields(in);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y.readFields(in);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;num.readFields(in);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">long</span>&nbsp;distance(Point&nbsp;that)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">long</span>&nbsp;dx&nbsp;=&nbsp;that.x.get()&nbsp;-&nbsp;x.get();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">long</span>&nbsp;dy&nbsp;=&nbsp;that.y.get()&nbsp;-&nbsp;y.get();<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;dx&nbsp;*&nbsp;dx&nbsp;+&nbsp;dy&nbsp;*&nbsp;dy;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;String&nbsp;toString()&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;ret&nbsp;=&nbsp;x.toString()&nbsp;+&nbsp;'\t'&nbsp;+&nbsp;y.toString();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(num.get()&nbsp;!=&nbsp;1)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ret&nbsp;+=&nbsp;'\t'&nbsp;+&nbsp;num.toString();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;ret;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;compareTo(Point&nbsp;that)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;ret&nbsp;=&nbsp;x.compareTo(that.x);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(ret&nbsp;==&nbsp;0)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ret&nbsp;=&nbsp;y.compareTo(that.y);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(ret&nbsp;==&nbsp;0)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ret&nbsp;=&nbsp;num.compareTo(that.num);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;ret;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;<span style="color: #0000FF; ">class</span>&nbsp;Map<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">extends</span>&nbsp;MapReduceBase<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">implements</span>&nbsp;Mapper&lt;Text,&nbsp;Text,&nbsp;Point,&nbsp;Point&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">private</span>&nbsp;Vector&lt;Point&gt;&nbsp;centers;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">private</span>&nbsp;IOException&nbsp;error;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;configure(JobConf&nbsp;conf)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">try</span>&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Path&nbsp;paths[]&nbsp;=&nbsp;DistributedCache.getLocalCacheFiles(conf);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(paths.length&nbsp;!=&nbsp;1)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">throw</span>&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IOException("Need&nbsp;exactly&nbsp;1&nbsp;centers&nbsp;file");<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FileSystem&nbsp;fs&nbsp;=&nbsp;FileSystem.getLocal(conf);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SequenceFile.Reader&nbsp;in&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;SequenceFile.Reader(fs,&nbsp;paths[0],&nbsp;conf);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;centers&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Vector&lt;Point&gt;();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IntWritable&nbsp;x&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IntWritable&nbsp;y&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">while</span>(in.next(x,&nbsp;y))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;centers.add(<span style="color: #0000FF; ">new</span>&nbsp;Point(x,&nbsp;y));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;in.close();<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Generate&nbsp;new&nbsp;points&nbsp;if&nbsp;we&nbsp;don't&nbsp;have&nbsp;enough.</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;k&nbsp;=&nbsp;conf.getInt("k",&nbsp;0);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Random&nbsp;rand&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Random();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">final</span>&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;MAX&nbsp;=&nbsp;1024*1024;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>&nbsp;(<span style="color: #0000FF; ">int</span>&nbsp;i&nbsp;=&nbsp;centers.size();&nbsp;i&nbsp;&lt;&nbsp;k;&nbsp;i++)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;x.set(rand.nextInt(MAX));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y.set(rand.nextInt(MAX));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;centers.add(<span style="color: #0000FF; ">new</span>&nbsp;Point(x,&nbsp;y));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;<span style="color: #0000FF; ">catch</span>&nbsp;(IOException&nbsp;e)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;error&nbsp;=&nbsp;e;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;map(Text&nbsp;xt,&nbsp;Text&nbsp;yt,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;OutputCollector&lt;Point,&nbsp;Point&gt;&nbsp;output,&nbsp;Reporter&nbsp;reporter)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">throws</span>&nbsp;IOException<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(error&nbsp;!=&nbsp;<span style="color: #0000FF; ">null</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">throw</span>&nbsp;error;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;x&nbsp;=&nbsp;Integer.valueOf(xt.toString());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;y&nbsp;=&nbsp;Integer.valueOf(yt.toString());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Point&nbsp;p&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Point(x,&nbsp;y);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Point&nbsp;center&nbsp;=&nbsp;<span style="color: #0000FF; ">null</span>;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">long</span>&nbsp;distance&nbsp;=&nbsp;Long.MAX_VALUE;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>&nbsp;(Point&nbsp;c&nbsp;:&nbsp;centers)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">long</span>&nbsp;d&nbsp;=&nbsp;c.distance(p);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(d&nbsp;&lt;=&nbsp;distance)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;distance&nbsp;=&nbsp;d;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;center&nbsp;=&nbsp;c;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.collect(center,&nbsp;p);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;<span style="color: #0000FF; ">class</span>&nbsp;Combine<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">extends</span>&nbsp;MapReduceBase<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">implements</span>&nbsp;Reducer&lt;Point,&nbsp;Point,&nbsp;Point,&nbsp;Point&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;reduce(Point&nbsp;center,&nbsp;Iterator&lt;Point&gt;&nbsp;points,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;OutputCollector&lt;Point,&nbsp;Point&gt;&nbsp;output,&nbsp;Reporter&nbsp;reporter)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">throws</span>&nbsp;IOException<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Point&nbsp;sum&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Point();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">while</span>(points.hasNext())&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sum.add(points.next());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.collect(center,&nbsp;sum);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
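&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//&nbsp;Reducer:&nbsp;normalizes&nbsp;the&nbsp;summed&nbsp;points&nbsp;into&nbsp;the&nbsp;new&nbsp;center&nbsp;and&nbsp;reports&nbsp;its&nbsp;movement&nbsp;through&nbsp;the&nbsp;CHANGE/CENTERS&nbsp;counters</span><br />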
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;<span style="color: #0000FF; ">class</span>&nbsp;Reduce<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">extends</span>&nbsp;MapReduceBase<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">implements</span>&nbsp;Reducer&lt;Point,&nbsp;Point,&nbsp;IntWritable,&nbsp;IntWritable&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;reduce(Point&nbsp;center,&nbsp;Iterator&lt;Point&gt;&nbsp;points,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;OutputCollector&lt;IntWritable,&nbsp;IntWritable&gt;&nbsp;output,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Reporter&nbsp;reporter)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">throws</span>&nbsp;IOException<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Point&nbsp;sum&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Point();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">while</span>&nbsp;(points.hasNext())&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sum.add(points.next());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sum.norm();<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IntWritable&nbsp;x&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable((<span style="color: #0000FF; ">int</span>)&nbsp;sum.x.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IntWritable&nbsp;y&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable((<span style="color: #0000FF; ">int</span>)&nbsp;sum.y.get());<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.collect(x,&nbsp;y);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;reporter.incrCounter(Counter.CHANGE,&nbsp;sum.distance(center));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;reporter.incrCounter(Counter.CENTERS,&nbsp;1);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;error(String&nbsp;msg)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.err.println(msg);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.exit(1);<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
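&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//&nbsp;seed&nbsp;the&nbsp;k&nbsp;initial&nbsp;centers&nbsp;from&nbsp;the&nbsp;first&nbsp;k&nbsp;input&nbsp;points&nbsp;and&nbsp;write&nbsp;them&nbsp;to&nbsp;a&nbsp;SequenceFile</span><br />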
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;initialCenters(<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;k,&nbsp;JobConf&nbsp;conf,&nbsp;FileSystem&nbsp;fs,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Path&nbsp;in,&nbsp;Path&nbsp;out)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">throws</span>&nbsp;IOException<br />
&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;BufferedReader&nbsp;input&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;BufferedReader(<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;InputStreamReader(fs.open(in)));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SequenceFile.Writer&nbsp;output&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;SequenceFile.Writer(<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fs,&nbsp;conf,&nbsp;out,&nbsp;IntWritable.<span style="color: #0000FF; ">class</span>,&nbsp;IntWritable.<span style="color: #0000FF; ">class</span>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IntWritable&nbsp;x&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IntWritable&nbsp;y&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IntWritable();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>&nbsp;(<span style="color: #0000FF; ">int</span>&nbsp;i&nbsp;=&nbsp;0;&nbsp;i&nbsp;&lt;&nbsp;k;&nbsp;i++)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;line&nbsp;=&nbsp;input.readLine();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(line&nbsp;==&nbsp;<span style="color: #0000FF; ">null</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;error("Not&nbsp;enough&nbsp;points&nbsp;for&nbsp;number&nbsp;of&nbsp;means");<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;parts[]&nbsp;=&nbsp;line.split("\t");<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(parts.length&nbsp;!=&nbsp;2)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">throw</span>&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;IOException("Found&nbsp;a&nbsp;point&nbsp;without&nbsp;two&nbsp;parts");<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;x.set(Integer.valueOf(parts[0]));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y.set(Integer.valueOf(parts[1]));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.append(x,&nbsp;y);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.close();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;input.close();<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;main(String&nbsp;args[])&nbsp;<span style="color: #0000FF; ">throws</span>&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;JobConf&nbsp;conf&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;JobConf(KMeans.<span style="color: #0000FF; ">class</span>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;GenericOptionsParser&nbsp;opts&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;GenericOptionsParser(conf,&nbsp;args);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;paths[]&nbsp;=&nbsp;opts.getRemainingArgs();<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FileSystem&nbsp;fs&nbsp;=&nbsp;FileSystem.get(conf);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(paths.length&nbsp;&lt;&nbsp;3)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;error("Usage:\n"<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+&nbsp;"\tKMeans&nbsp;&lt;file&nbsp;to&nbsp;display&gt;\n"<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+&nbsp;"\tKMeans&nbsp;&lt;output&gt;&nbsp;&lt;k&gt;&nbsp;&lt;input&nbsp;file&gt;&#8230;"<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Path&nbsp;outdir&nbsp;&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Path(paths[0]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;k&nbsp;=&nbsp;Integer.valueOf(paths[1]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Path&nbsp;firstin&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Path(paths[2]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(k&nbsp;&lt;&nbsp;1&nbsp;||&nbsp;k&nbsp;&gt;&nbsp;20)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;error("Strange&nbsp;number&nbsp;of&nbsp;means:&nbsp;"&nbsp;+&nbsp;paths[1]);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(fs.exists(outdir))&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(!fs.getFileStatus(outdir).isDir())<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;error("Output&nbsp;directory&nbsp;\""&nbsp;+&nbsp;outdir.toString()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+&nbsp;"\"&nbsp;exists&nbsp;and&nbsp;is&nbsp;not&nbsp;a&nbsp;directory.");<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;<span style="color: #0000FF; ">else</span>&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fs.mkdirs(outdir);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Input:&nbsp;text&nbsp;file,&nbsp;each&nbsp;line&nbsp;"x\ty"</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setInputFormat(KeyValueTextInputFormat.<span style="color: #0000FF; ">class</span>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>&nbsp;(<span style="color: #0000FF; ">int</span>&nbsp;i&nbsp;=&nbsp;2;&nbsp;i&nbsp;&lt;&nbsp;paths.length;&nbsp;i++)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FileInputFormat.addInputPath(conf,&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Path(paths[i]));<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setInt("k",&nbsp;k);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Map:&nbsp;(x,y)&nbsp;-&gt;&nbsp;(centroid,&nbsp;point)</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setMapperClass(Map.<span style="color: #0000FF; ">class</span>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setMapOutputKeyClass(Point.<span style="color: #0000FF; ">class</span>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setMapOutputValueClass(Point.<span style="color: #0000FF; ">class</span>);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Combine:&nbsp;(centroid,&nbsp;points)&nbsp;-&gt;&nbsp;(centroid,&nbsp;weighted&nbsp;point)</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setCombinerClass(Combine.<span style="color: #0000FF; ">class</span>);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Reduce:&nbsp;(centroid,&nbsp;weighted&nbsp;points)&nbsp;-&gt;&nbsp;(x,&nbsp;y)&nbsp;new&nbsp;centroid</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setReducerClass(Reduce.<span style="color: #0000FF; ">class</span>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputKeyClass(IntWritable.<span style="color: #0000FF; ">class</span>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputValueClass(IntWritable.<span style="color: #0000FF; ">class</span>);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Output</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputFormat(SequenceFileOutputFormat.<span style="color: #0000FF; ">class</span>);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Chose&nbsp;initial&nbsp;centers</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Path&nbsp;centers&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Path(outdir,&nbsp;"initial.seq");<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;initialCenters(k,&nbsp;conf,&nbsp;fs,&nbsp;firstin,&nbsp;centers);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;Iterate</span><span style="color: #008000; "><br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">long</span>&nbsp;change&nbsp;&nbsp;=&nbsp;Long.MAX_VALUE;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;URI&nbsp;cache[]&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;URI[1];<br />
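&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//&nbsp;run&nbsp;at&nbsp;most&nbsp;1000&nbsp;iterations,&nbsp;stopping&nbsp;early&nbsp;once&nbsp;the&nbsp;total&nbsp;center&nbsp;movement&nbsp;drops&nbsp;to&nbsp;100&nbsp;*&nbsp;k&nbsp;or&nbsp;less</span><br />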
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>&nbsp;(<span style="color: #0000FF; ">int</span>&nbsp;iter&nbsp;=&nbsp;1;&nbsp;iter&nbsp;&lt;=&nbsp;1000&nbsp;&amp;&amp;&nbsp;change&nbsp;&gt;&nbsp;100&nbsp;*&nbsp;k;&nbsp;iter++)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Path&nbsp;jobdir&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Path(outdir,&nbsp;Integer.toString(iter));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FileOutputFormat.setOutputPath(conf,&nbsp;jobdir);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setJobName("k-Means&nbsp;"&nbsp;+&nbsp;iter);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setJarByClass(KMeans.<span style="color: #0000FF; ">class</span>);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cache[0]&nbsp;=&nbsp;centers.toUri();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DistributedCache.setCacheFiles(&nbsp;cache,&nbsp;conf&nbsp;);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;RunningJob&nbsp;result&nbsp;=&nbsp;JobClient.runJob(conf);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println("Iteration:&nbsp;"&nbsp;+&nbsp;iter);<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;change&nbsp;&nbsp;&nbsp;=&nbsp;result.getCounters().getCounter(Counter.CHANGE);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;centers&nbsp;&nbsp;=&nbsp;<span style="color: #0000FF; ">new</span>&nbsp;Path(jobdir,&nbsp;"part-00000");<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
}<br />
<br />
</div>
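<br />After the last iteration the newest centers live in the SequenceFile part-00000 under the final job directory. As a quick sanity check, a small helper along the following lines could print them. This is a sketch of mine, not part of the original post, and it assumes the IntWritable/IntWritable key-value layout the reducer writes:<br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">//&nbsp;hypothetical&nbsp;helper,&nbsp;not&nbsp;from&nbsp;the&nbsp;original&nbsp;post<br />
public&nbsp;static&nbsp;void&nbsp;printCenters(FileSystem&nbsp;fs,&nbsp;JobConf&nbsp;conf,&nbsp;Path&nbsp;centers)&nbsp;throws&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;SequenceFile.Reader&nbsp;reader&nbsp;=&nbsp;new&nbsp;SequenceFile.Reader(fs,&nbsp;centers,&nbsp;conf);<br />
&nbsp;&nbsp;&nbsp;&nbsp;IntWritable&nbsp;x&nbsp;=&nbsp;new&nbsp;IntWritable();<br />
&nbsp;&nbsp;&nbsp;&nbsp;IntWritable&nbsp;y&nbsp;=&nbsp;new&nbsp;IntWritable();<br />
&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;each&nbsp;record&nbsp;is&nbsp;one&nbsp;center:&nbsp;x&nbsp;as&nbsp;key,&nbsp;y&nbsp;as&nbsp;value<br />
&nbsp;&nbsp;&nbsp;&nbsp;while&nbsp;(reader.next(x,&nbsp;y))&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(x.get()&nbsp;+&nbsp;"\t"&nbsp;+&nbsp;y.get());<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;reader.close();<br />
}</div>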
<img src ="http://www.blogjava.net/paulwong/aggbug/413384.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2014-05-07 23:57 <a href="http://www.blogjava.net/paulwong/archive/2014/05/07/413384.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Packt celebrates International Day Against DRM, May 6th 2014</title><link>http://www.blogjava.net/paulwong/archive/2014/05/06/413334.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Tue, 06 May 2014 12:05:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2014/05/06/413334.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/413334.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2014/05/06/413334.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/413334.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/413334.html</trackback:ping><description><![CDATA[<p align="center" style="text-align:center"><a name="_GoBack"></a><strong><span style="font-size:16.0pt;background:white;">Packt celebrates International Day Against DRM, May 6<sup>th</sup> 2014</span></strong></p>
<p align="center" style="text-align:center"><strong>&nbsp;</strong></p>
<p><strong>&nbsp;</strong></p>
<p align="center" style="text-align:center"><strong><span style="font-size:12.0pt;background:white;"><a href="http://bit.ly/1q6bpha"> <img src="http://www.blogjava.net/images/blogjava_net/paulwong/drd.png" width="600" height="230" alt="" /></a></span></strong></p>
<p align="center" style="text-align: center;"><strong>&nbsp;</strong></p>
<p style="background:white">&nbsp;</p>
<p style="background:white">&nbsp;</p>
<p style="text-align:justify;text-justify:inter-ideograph; line-height:150%;background:white"><span style="font-size:12.0pt; line-height:150%;">According to the definition of DRM on Wikipedia, <strong>Digital Rights Management (DRM)</strong> is a class of technologies that are used by hardware manufacturers, publishers, copyright holders, and individuals with the intent to control the use of digital content and devices after sale.</span></p>
<p style="text-align:justify;text-justify:inter-ideograph; line-height:150%;background:white">&nbsp;</p>
<p style="text-align:justify;text-justify:inter-ideograph; line-height:150%;background:white"><span style="font-size:12.0pt; line-height:150%;">However, Packt Publishing firmly believes that you should be able to read and interact with your content when you want, where you want, and how you want &#8211; to that end they have been advocates of DRM-free content since their very first eBook was published back in 2004. </span></p>
<p style="line-height:150%;background:white">&nbsp;</p>
<span style="font-size:12.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;">To show their continuing support for </span><span style="font-size:12.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;Times New Roman&quot;;"><a href="https://www.defectivebydesign.org/">Day Against DRM</a></span><span style="font-size:12.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;">, Packt Publishing is offering all its DRM-free content at $10 for 24 hours only on May 6<sup>th</sup> &#8211; that&#8217;s all 2000+ eBooks and Videos. Check it out at: </span><span style="font-size:12.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;Times New Roman&quot;;">http://bit.ly/1q6bpha</span><span style="font-size:12.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;">.</span><img src ="http://www.blogjava.net/paulwong/aggbug/413334.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2014-05-06 20:05 <a href="http://www.blogjava.net/paulwong/archive/2014/05/06/413334.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>A book: Web Crawling and Data Mining with Apache Nutch</title><link>http://www.blogjava.net/paulwong/archive/2014/02/03/409510.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Mon, 03 Feb 2014 05:14:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2014/02/03/409510.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/409510.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2014/02/03/409510.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/409510.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/409510.html</trackback:ping><description><![CDATA[<div>Recently I am reading a book &lt;Web Crawling and Data Mining with Apache Nutch&gt;, <a href="http://www.packtpub.com/web-crawling-and-data-mining-with-apache-nutch/book" target="_blank">http://www.packtpub.com/web-crawling-and-data-mining-with-apache-nutch/book</a>, it is really a great book. And I get help in my project.</div>
<div><br />
</div>
<div>In my project I need to crawl web content and do data analysis on it. From the book I learned how to use and integrate the Nutch and Solr frameworks to implement this.</div>
<div><br />
</div>
<div>If you have a similar case, I recommend reading this book.</div>
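<div>For a taste of what the "integrate Nutch and Solr" part looks like in code: once Nutch has pushed its crawl into a Solr index, a few SolrJ lines are enough to query it. The sketch below is mine, not from the book; the core URL and the url/title/content field names are the common Nutch defaults and may differ in your setup:</div>
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">import&nbsp;org.apache.solr.client.solrj.SolrQuery;<br />
import&nbsp;org.apache.solr.client.solrj.SolrServerException;<br />
import&nbsp;org.apache.solr.client.solrj.impl.HttpSolrServer;<br />
import&nbsp;org.apache.solr.client.solrj.response.QueryResponse;<br />
import&nbsp;org.apache.solr.common.SolrDocument;<br />
<br />
public&nbsp;class&nbsp;CrawlSearch&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;static&nbsp;void&nbsp;main(String[]&nbsp;args)&nbsp;throws&nbsp;SolrServerException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;Solr&nbsp;core&nbsp;that&nbsp;Nutch's&nbsp;solrindex&nbsp;step&nbsp;writes&nbsp;to&nbsp;(assumed&nbsp;URL)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;HttpSolrServer&nbsp;solr&nbsp;=&nbsp;new&nbsp;HttpSolrServer("http://localhost:8983/solr");<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;full-text&nbsp;search&nbsp;on&nbsp;the&nbsp;crawled&nbsp;body&nbsp;text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SolrQuery&nbsp;query&nbsp;=&nbsp;new&nbsp;SolrQuery("content:hadoop");<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;query.setRows(10);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;QueryResponse&nbsp;response&nbsp;=&nbsp;solr.query(query);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for&nbsp;(SolrDocument&nbsp;doc&nbsp;:&nbsp;response.getResults())&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(doc.getFieldValue("url")&nbsp;+&nbsp;"&nbsp;:&nbsp;"&nbsp;+&nbsp;doc.getFieldValue("title"));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
}</div>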
<img src ="http://www.blogjava.net/paulwong/aggbug/409510.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2014-02-03 13:14 <a href="http://www.blogjava.net/paulwong/archive/2014/02/03/409510.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>【转载】经典漫画讲解HDFS原理 </title><link>http://www.blogjava.net/paulwong/archive/2013/10/26/405663.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sat, 26 Oct 2013 01:15:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/10/26/405663.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/405663.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/10/26/405663.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/405663.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/405663.html</trackback:ping><description><![CDATA[<span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">分布式文件系统比较出名的有HDFS&nbsp;&nbsp;和 GFS，其中HDFS比较简单一点。本文是一篇描述非常简洁易懂的漫画形式讲解HDFS的原理。比一般PPT要通俗易懂很多。不难得的学习资料。<br /></span><br /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">1、三个部分: 客户端、nameserver（可理解为主控和文件索引,类似linux的inode）、datanode（存放实际数据）</span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="225" src="http://my.csdn.net/uploads/201208/11/1344691496_8076.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">在这里，client的形式我所了解的有两种，通过hadoop提供的api所编写的程序可以和hdfs进行交互，另外一种就是安装了hadoop的datanode其也可以通过命令行与hdfs系统进行交互，如在datanode上上传则使用如下命令行：bin/hadoop fs -put example1 user/chunk/<br /><br /></span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">2、如何写数据过程</span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="476" src="http://my.csdn.net/uploads/201208/11/1344691715_8066.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="480" src="http://my.csdn.net/uploads/201208/11/1344692755_3243.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="217" src="http://my.csdn.net/uploads/201208/12/1344702703_2919.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br /><br /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">3、读取数据过程</span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="439" 
src="http://my.csdn.net/uploads/201208/11/1344693039_4501.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br /><br /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">4、容错：第一部分：故障类型及其检测方法（nodeserver 故障，和网络故障，和脏数据问题）</span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="471" src="http://my.csdn.net/uploads/201208/11/1344693728_5407.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="442" src="http://my.csdn.net/uploads/201208/11/1344693685_4529.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br /><br /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">5、容错第二部分：读写容错</span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="429" src="http://my.csdn.net/uploads/201208/11/1344693811_7697.png" border="0" alt="" id="img_0.5895301518030465" initialized="true" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br /><br /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">6、容错第三部分：dataNode 失效</span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="421" src="http://my.csdn.net/uploads/201208/11/1344694035_2660.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br /><br /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">7、备份规则</span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="450" src="http://my.csdn.net/uploads/201208/11/1344694119_7534.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><br /><br /><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><span style="font-family: 微软雅黑, 宋体; background-color: #dfdfdf;">8、结束语</span><br style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf;" /><img width="600" height="235" src="http://my.csdn.net/uploads/201208/11/1344694185_4387.png" border="0" alt="" style="word-wrap: break-word; font-family: 微软雅黑, 宋体; background-color: #dfdfdf; cursor: pointer;" /><img src ="http://www.blogjava.net/paulwong/aggbug/405663.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-10-26 09:15 <a href="http://www.blogjava.net/paulwong/archive/2013/10/26/405663.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Install Hadoop in the AWS 
cloud</title><link>http://www.blogjava.net/paulwong/archive/2013/09/08/403816.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sun, 08 Sep 2013 05:45:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/09/08/403816.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/403816.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/09/08/403816.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/403816.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/403816.html</trackback:ping><description><![CDATA[<ol>
     <li>get the Whirr tar file<br />
     <div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->wget&nbsp;http://www.eu.apache.org/dist/whirr/stable/whirr-0.8.2.tar.gz</div>
     </li><li>untar the Whirr tar file<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->tar&nbsp;-vxf&nbsp;whirr-0.8.2.tar.gz</div></li><li>create credentials file<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->mkdir&nbsp;~/.whirr<br />cp&nbsp;conf/credentials.sample&nbsp;~/.whirr/credentials</div></li><li>add the following content to credentials file<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->#&nbsp;Set&nbsp;cloud&nbsp;provider&nbsp;connection&nbsp;details<br />PROVIDER=aws-ec2<br />IDENTITY=&lt;AWS&nbsp;Access&nbsp;Key&nbsp;ID&gt;<br />CREDENTIAL=&lt;AWS&nbsp;Secret&nbsp;Access&nbsp;Key&gt;</div></li><li><div>generate a rsa key pair<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->ssh-keygen&nbsp;-t&nbsp;rsa&nbsp;-P&nbsp;''</div></div></li><li>create a hadoop.properties file and add the following content<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->whirr.cluster-name=whirrhadoopcluster<br />whirr.instance-templates=1&nbsp;hadoop-jobtracker+hadoop-namenode,2&nbsp;hadoop-datanode+hadoop-tasktracker<br />whirr.provider=aws-ec2<br />whirr.private-key-file=${sys:user.home}/.ssh/id_rsa<br />whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub<br />whirr.hadoop.version=1.0.2<br />whirr.aws-ec2-spot-price=0.08</div></li><li>launch hadoop<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->bin/whirr&nbsp;launch-cluster&nbsp;--config&nbsp;hadoop.properties</div></li><li>launch proxy<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->cd&nbsp;~/.whirr/whirrhadoopcluster/<br />./hadoop-proxy.sh</div></li><li>add a rule to iptables<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid 
#CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->0.0.0.0/0 50030<br />0.0.0.0/0 50070</div></li><li>check the web ui in the browser<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->http://&lt;aws-public-dns&gt;:50030</div></li><li>add to /etc/profile<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->export&nbsp;HADOOP_CONF_DIR=~/.whirr/whirrhadoopcluster/</div></li><li>check if the hadoop works<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->hadoop&nbsp;fs&nbsp;-ls&nbsp;/</div></li></ol><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><ol>
</ol><img src ="http://www.blogjava.net/paulwong/aggbug/403816.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-09-08 13:45 <a href="http://www.blogjava.net/paulwong/archive/2013/09/08/403816.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Install hadoop+hbase+nutch+elasticsearch</title><link>http://www.blogjava.net/paulwong/archive/2013/08/31/403513.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 30 Aug 2013 17:17:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/08/31/403513.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/403513.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/08/31/403513.html#Feedback</comments><slash:comments>3</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/403513.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/403513.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: This document is for Anyela Chavarro.Only these version of each framework work togetherCode highlighting produced by Actipro CodeHighlighter (freeware)http://www.CodeHighlighter.com/-->H...&nbsp;&nbsp;<a href='http://www.blogjava.net/paulwong/archive/2013/08/31/403513.html'>阅读全文</a><img src ="http://www.blogjava.net/paulwong/aggbug/403513.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-08-31 01:17 <a href="http://www.blogjava.net/paulwong/archive/2013/08/31/403513.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Implementation for CombineFileInputFormat Hadoop 0.20.205</title><link>http://www.blogjava.net/paulwong/archive/2013/08/29/403442.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Thu, 29 Aug 2013 08:08:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/08/29/403442.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/403442.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/08/29/403442.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/403442.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/403442.html</trackback:ping><description><![CDATA[运行MAPREDUCE JOB时，如果输入的文件比较小而多时，默认情况下会生成很多的MAP JOB，即一个文件一个MAP JOB，因此需要优化，使多个文件能合成一个MAP JOB的输入。<br /><br />具体的原理是下述三步: <br /><br />1.根据输入目录下的每个文件,如果其长度超过mapred.max.split.size,以block为单位分成多个split(一个split是一个map的输入),每个split的长度都大于mapred.max.split.size, 因为以block为单位, 因此也会大于blockSize, 此文件剩下的长度如果大于mapred.min.split.size.per.node, 则生成一个split, 否则先暂时保留.<br /><br />2. 现在剩下的都是一些长度效短的碎片,把每个rack下碎片合并, 只要长度超过mapred.max.split.size就合并成一个split, 最后如果剩下的碎片比mapred.min.split.size.per.rack大, 就合并成一个split, 否则暂时保留.<br /><br />3. 把不同rack下的碎片合并, 只要长度超过mapred.max.split.size就合并成一个split, 剩下的碎片无论长度, 合并成一个split.<br />举例: mapred.max.split.size=1000<br />      mapred.min.split.size.per.node=300<br />      mapred.min.split.size.per.rack=100<br />输入目录下五个文件,rack1下三个文件,长度为2050,1499,10, rack2下两个文件,长度为1010,80. 另外blockSize为500.<br />经过第一步, 生成五个split: 1000,1000,1000,499,1000. 
The remaining fragments are 50 and 10 under rack1, and 10 and 80 under rack2.<br />
Since the fragments under each rack sum to no more than 100, step 2 changes neither the splits nor the fragments.<br />
Step 3 merges the four fragments into one split of length 150.<br /><br />
To reduce the number of map tasks, increase mapred.max.split.size; to get more map tasks, decrease it.<br /><br />
The characteristics are: a block feeds at most one map task; a file may consist of several blocks, so a single file may feed several different map tasks; and one map task may process several blocks and therefore several files.<br /><br />
Note: CombineFileInputFormat is an abstract class, so you need to write a subclass of it.<br /><br /><br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">import&nbsp;java.io.IOException;<br />
<br />
import&nbsp;org.apache.hadoop.conf.Configuration;<br />
import&nbsp;org.apache.hadoop.io.LongWritable;<br />
import&nbsp;org.apache.hadoop.io.Text;<br />
import&nbsp;org.apache.hadoop.mapred.FileSplit;<br />
import&nbsp;org.apache.hadoop.mapred.InputSplit;<br />
import&nbsp;org.apache.hadoop.mapred.JobConf;<br />
import&nbsp;org.apache.hadoop.mapred.LineRecordReader;<br />
import&nbsp;org.apache.hadoop.mapred.RecordReader;<br />
import&nbsp;org.apache.hadoop.mapred.Reporter;<br />
import&nbsp;org.apache.hadoop.mapred.lib.CombineFileInputFormat;<br />
import&nbsp;org.apache.hadoop.mapred.lib.CombineFileRecordReader;<br />
import&nbsp;org.apache.hadoop.mapred.lib.CombineFileSplit;<br />
<br />
@SuppressWarnings("deprecation")<br />
public&nbsp;class&nbsp;CombinedInputFormat&nbsp;extends&nbsp;CombineFileInputFormat&lt;LongWritable,&nbsp;Text&gt;&nbsp;{<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;@SuppressWarnings({&nbsp;"unchecked",&nbsp;"rawtypes"&nbsp;})<br />
&nbsp;&nbsp;&nbsp;&nbsp;@Override<br />
&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;RecordReader&lt;LongWritable,&nbsp;Text&gt;&nbsp;getRecordReader(InputSplit&nbsp;split,&nbsp;JobConf&nbsp;conf,&nbsp;Reporter&nbsp;reporter)&nbsp;throws&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;hand&nbsp;each&nbsp;file&nbsp;chunk&nbsp;of&nbsp;a&nbsp;combined&nbsp;split&nbsp;to&nbsp;a&nbsp;line-oriented&nbsp;reader<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;new&nbsp;CombineFileRecordReader(conf,&nbsp;(CombineFileSplit)&nbsp;split,&nbsp;reporter,&nbsp;(Class)&nbsp;myCombineFileRecordReader.class);<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;static&nbsp;class&nbsp;myCombineFileRecordReader&nbsp;implements&nbsp;RecordReader&lt;LongWritable,&nbsp;Text&gt;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;private&nbsp;final&nbsp;LineRecordReader&nbsp;linerecord;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;myCombineFileRecordReader(CombineFileSplit&nbsp;split,&nbsp;Configuration&nbsp;conf,&nbsp;Reporter&nbsp;reporter,&nbsp;Integer&nbsp;index)&nbsp;throws&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;carve&nbsp;the&nbsp;index-th&nbsp;file&nbsp;chunk&nbsp;out&nbsp;of&nbsp;the&nbsp;combined&nbsp;split<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FileSplit&nbsp;filesplit&nbsp;=&nbsp;new&nbsp;FileSplit(split.getPath(index),&nbsp;split.getOffset(index),&nbsp;split.getLength(index),&nbsp;split.getLocations());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;linerecord&nbsp;=&nbsp;new&nbsp;LineRecordReader(conf,&nbsp;filesplit);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;the&nbsp;remaining&nbsp;methods&nbsp;simply&nbsp;delegate&nbsp;to&nbsp;the&nbsp;wrapped&nbsp;LineRecordReader<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;@Override<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;void&nbsp;close()&nbsp;throws&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;linerecord.close();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;@Override<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;LongWritable&nbsp;createKey()&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;linerecord.createKey();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;@Override<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;Text&nbsp;createValue()&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;linerecord.createValue();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;@Override<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;long&nbsp;getPos()&nbsp;throws&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;linerecord.getPos();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;@Override<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;float&nbsp;getProgress()&nbsp;throws&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;linerecord.getProgress();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;@Override<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;public&nbsp;boolean&nbsp;next(LongWritable&nbsp;key,&nbsp;Text&nbsp;value)&nbsp;throws&nbsp;IOException&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;linerecord.next(key,&nbsp;value);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
}</div><br /><br />
At run time, configure the job like this:<br /><br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">if&nbsp;(argument&nbsp;!=&nbsp;null)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.set("mapred.max.split.size",&nbsp;argument);<br />
}&nbsp;else&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.set("mapred.max.split.size",&nbsp;"134217728");&nbsp;//&nbsp;128&nbsp;MB<br />
}<br />
//&nbsp;&#8230;<br />
<br />
conf.setInputFormat(CombinedInputFormat.class);</div><br /><br />
<img src ="http://www.blogjava.net/paulwong/aggbug/403442.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-08-29 16:08 <a href="http://www.blogjava.net/paulwong/archive/2013/08/29/403442.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Big data platform architecture design resources</title><link>http://www.blogjava.net/paulwong/archive/2013/08/18/403001.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sun, 18 Aug 2013 10:27:00
GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/08/18/403001.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/403001.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/08/18/403001.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/403001.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/403001.html</trackback:ping><description><![CDATA[!!! Notes on implementing a Hadoop-based big data platform: overall architecture design<br />
<a href="http://blog.csdn.net/jacktan/article/details/9200979" target="_blank">http://blog.csdn.net/jacktan/article/details/9200979</a><br /><br />
<img src ="http://www.blogjava.net/paulwong/aggbug/403001.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-08-18 18:27 <a href="http://www.blogjava.net/paulwong/archive/2013/08/18/403001.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>How to install Hadoop cluster(2 node cluster) and Hbase on Vmware Workstation. It also includes installing Pig and Hive in the appendix</title><link>http://www.blogjava.net/paulwong/archive/2013/08/17/402982.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sat, 17 Aug 2013 14:23:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/08/17/402982.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/402982.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/08/17/402982.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/402982.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/402982.html</trackback:ping><description><![CDATA[By Tzu-Cheng Chuang, 1-28-2011<br /><br />
Requires: Ubuntu 10.04, Hadoop 0.20.2, ZooKeeper 3.3.2, HBase 0.90.0<br /><br />
1. Download Ubuntu 10.04 desktop 32 bit from the Ubuntu website.<br /><br />
2. Install Ubuntu 10.04 with username: hadoop, password: password, disk size: 20GB, memory: 2048MB, 1 processor, 2 cores.<br /><br />
3. Install build-essential (for the GNU C/C++ compilers)<br />
&nbsp;&nbsp;&nbsp; $ sudo apt-get install build-essential<br /><br />
4. Install sun-java6-jdk<br />
&nbsp;&nbsp;&nbsp; (1) Add the Canonical Partner Repository to your apt repositories<br />
&nbsp;&nbsp;&nbsp; $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"<br />
&nbsp;&nbsp;&nbsp; (2) Update the source list<br />
&nbsp;&nbsp;&nbsp; $ sudo apt-get update<br />
&nbsp;&nbsp;&nbsp; (3) Install sun-java6-jdk and make sure Sun&#8217;s Java is the default JVM<br />
&nbsp;&nbsp;&nbsp; $ sudo apt-get install sun-java6-jdk<br />
&nbsp;&nbsp;&nbsp; (4) Set environment variables by adding the following two lines to the end of the ~/.bashrc file<br />
&nbsp;&nbsp;&nbsp; export JAVA_HOME=/usr/lib/jvm/java-6-sun<br />
&nbsp;&nbsp;&nbsp; export PATH=$PATH:$JAVA_HOME/bin<br /><br />
5.
Configure the SSH server so that ssh to localhost doesn&#8217;t need a passphrase<br />
&nbsp;&nbsp;&nbsp; (1) Install the OpenSSH server<br />
&nbsp;&nbsp;&nbsp; $ sudo apt-get install openssh-server<br />
&nbsp;&nbsp;&nbsp; (2) Generate an RSA key pair<br />
&nbsp;&nbsp;&nbsp; $ ssh-keygen -t rsa -P ""<br />
&nbsp;&nbsp;&nbsp; (3) Enable SSH access to the local machine<br />
&nbsp;&nbsp;&nbsp; $ cat ~/.ssh/id_rsa.pub &gt;&gt; ~/.ssh/authorized_keys<br /><br />
6. Disable IPv6 by modifying the /etc/sysctl.conf file; put the following lines at the end of the file<br />
# disable ipv6<br />
net.ipv6.conf.all.disable_ipv6 = 1<br />
net.ipv6.conf.default.disable_ipv6 = 1<br />
net.ipv6.conf.lo.disable_ipv6 = 1<br /><br />
7. Install hadoop<br />
&nbsp;&nbsp;&nbsp; (1) Download hadoop-0.20.2.tar.gz (the stable release on 1/25/2011) from the Apache Hadoop website<br />
&nbsp;&nbsp;&nbsp; (2) Extract the hadoop archive file to /usr/local/<br />
&nbsp;&nbsp;&nbsp; (3) Make a symbolic link<br />
&nbsp;&nbsp;&nbsp; (4) Modify /usr/local/hadoop/conf/hadoop-env.sh, changing<br />
&nbsp;&nbsp;&nbsp; # The java implementation to use. Required.<br />
&nbsp;&nbsp;&nbsp; # export JAVA_HOME=/usr/lib/j2sdk1.5-sun<br />
&nbsp;&nbsp;&nbsp; to<br />
&nbsp;&nbsp;&nbsp; # The java implementation to use. Required.<br />
&nbsp;&nbsp;&nbsp; export JAVA_HOME=/usr/lib/jvm/java-6-sun<br />
&nbsp;&nbsp;&nbsp; (5) Create the /usr/local/hadoop-datastore folder<br />
&nbsp;&nbsp;&nbsp; $ sudo mkdir /usr/local/hadoop-datastore<br />
&nbsp;&nbsp;&nbsp; $ sudo chown hadoop:hadoop /usr/local/hadoop-datastore<br />
&nbsp;&nbsp;&nbsp; $ sudo chmod 750 /usr/local/hadoop-datastore<br />
&nbsp;&nbsp;&nbsp; (6) Put the following code in /usr/local/hadoop/conf/core-site.xml<br />
&nbsp;&nbsp;&nbsp; &lt;property&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;name&gt;hadoop.tmp.dir&lt;/name&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;value&gt;/usr/local/hadoop/tmp/dir/hadoop-${user.name}&lt;/value&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;description&gt;A base for other temporary directories.&lt;/description&gt;<br />
&nbsp;&nbsp;&nbsp; &lt;/property&gt;<br />
&nbsp;&nbsp;&nbsp; &lt;property&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;name&gt;fs.default.name&lt;/name&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;value&gt;hdfs://master:54310&lt;/value&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;description&gt;The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.&lt;/description&gt;<br />
&nbsp;&nbsp;&nbsp; &lt;/property&gt;<br />
&nbsp;&nbsp;&nbsp; (7) Put the following code in /usr/local/hadoop/conf/mapred-site.xml<br />
&nbsp;&nbsp;&nbsp; &lt;property&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;name&gt;mapred.job.tracker&lt;/name&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;value&gt;master:54311&lt;/value&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;description&gt;The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.&lt;/description&gt;<br />
&nbsp;&nbsp;&nbsp; &lt;/property&gt;<br />
&nbsp;&nbsp;&nbsp; (8) Put the following code in /usr/local/hadoop/conf/hdfs-site.xml<br />
&nbsp;&nbsp;&nbsp; &lt;property&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;name&gt;dfs.replication&lt;/name&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;value&gt;1&lt;/value&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;description&gt;Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.&lt;/description&gt;<br />
&nbsp;&nbsp;&nbsp; &lt;/property&gt;<br />
&nbsp;&nbsp;&nbsp; (9) Add hadoop to the environment variables by modifying ~/.bashrc<br />
&nbsp;&nbsp;&nbsp; export HADOOP_HOME=/usr/local/hadoop<br />
&nbsp;&nbsp;&nbsp; export PATH=$HADOOP_HOME/bin:$PATH<br /><br />
8. Restart Ubuntu Linux<br /><br />
9. Copy this virtual machine to another folder, so that we have at least 2 copies of the Ubuntu Linux image<br /><br />
10. Modify /etc/hosts on both Linux virtual machines by adding the following lines. The IP addresses depend on each machine; use ifconfig to find them out.<br />
# /etc/hosts (for master AND slave)<br />
192.168.0.1 master<br />
192.168.0.2 slave<br />
Also modify the following line, because it might cause HBase to pick up the wrong IP:<br />
192.168.0.1 ubuntu<br /><br />
11.
Check hadoop user access on both machines.<br />The hadoop user on the master (aka hadoop@master) must be able to connect a) to its own user account on the master &#8211; i.e. ssh master in this context and not necessarily ssh localhost &#8211; and b) to the hadoop user account on the slave (aka hadoop@slave)&nbsp; via a password-less SSH login. On both machines, make sure each one can connect to master, slave without typing passwords.<br /><br />12. Cluster configuration <br />&nbsp;&nbsp;&nbsp; (1) Modify /usr/local/hadoop/conf/masters<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; only on master machine&nbsp;&nbsp;&nbsp; master<br />&nbsp;&nbsp;&nbsp;&nbsp; (2) Modify /usr/local/hadoop/conf/slaves<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; only on master machine&nbsp;&nbsp;&nbsp; master slave<br />&nbsp;&nbsp;&nbsp;&nbsp; (3) Change &#8220;localhost&#8221; to &#8220;master&#8221; in /usr/local/conf/hadoop/conf/core-site.xml and /usr/local/hadoop/conf/mapred-site.xml<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; only on master machine&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; (4) Change dfs.replication to &#8220;1&#8221; in /usr/local/conf/hadoop/conf/hdfs-site.xml<br />&nbsp;&nbsp;&nbsp; only on master machine&nbsp;&nbsp;&nbsp; <br /><br />13. Format the namenode only once and only on master machine <br />$ /usr/local/hadoop/bin/hadoop namenode &#8211;format <br /><br />14. Later on, start the multi-node cluster by typing following code only on master. So far, please don&#8217;t start hadoop yet. <br />$ /usr/local/hadoop/bin/start-dfs.sh $ /usr/local/hadoop/bin/start-mapred.sh <br /><br />15. Install zookeeper only on master node <br />&nbsp;&nbsp;&nbsp; (1) download zookeeper-3.3.2.tar.gz from Apache hadoop website&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; (2) Extract&nbsp; zookeeper-3.3.2.tar.gz&nbsp;&nbsp;&nbsp; $ tar &#8211;xzf zookeeper-3-3.2.tar.gz<br />&nbsp;&nbsp;&nbsp;&nbsp; (3) Move folder zookeeper-3.3.2 to /home/hadoop/ and create a symbloink link<br />&nbsp;&nbsp;&nbsp; $ mv zookeeper-3.3.2 /home/hadoop/ ; ln &#8211;s /home/hadoop/zookeeper-3.3.2 /home/hadoop/zookeeper<br />&nbsp;&nbsp;&nbsp;&nbsp; (4) copy conf/zoo_sample.cfg to conf/zoo.cfg<br />&nbsp;&nbsp;&nbsp; $ cp conf/zoo_sample.cfg confg/zoo.cfg<br />&nbsp;&nbsp;&nbsp;&nbsp; (5) Modify conf/zoo.cfg&nbsp;&nbsp;&nbsp; dataDir=/home/hadoop/zookeeper/snapshot <br /><br />16. Install Hbase on both master and slave nodes, configure it as fully-distributed <br />&nbsp;&nbsp;&nbsp; (1) Download hbase-0.90.0.tar.gz from Apache hadoop website&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; (2) Extract&nbsp; hbase-0.90.0.tar.gz&nbsp;&nbsp;&nbsp; $ tar &#8211;xzf hbase-0.90.0.tar.gz<br />&nbsp;&nbsp;&nbsp;&nbsp; (3) Move folder hbase-0.90.0 to /home/hadoop/ and create a symbloink link&nbsp;&nbsp;&nbsp; $ mv hbase-0.90.0 /home/hadoop/ ; ln &#8211;s /home/hadoop/hbase-0.90.0 /home/hadoop/hbase<br />&nbsp;&nbsp;&nbsp;&nbsp; (4) Edit /home/hadoop/hbase/conf/hbase-site.xml, put the following in between and hbase.rootdirhdfs://master:54310/hbase The directory shared by region servers. Should be fully-qualified to include the filesystem to use. E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR hbase.cluster.distributedtrueThe mode the cluster will be in. Possible values are false: standalone and pseudo-distributed setups with managed Zookeeper true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh) hbase.zookeeper.quorummasterComma separated list of servers in the ZooKeeper Quorum. 
16. Install Hbase on both master and slave nodes, and configure it as fully-distributed<br />&nbsp;&nbsp;&nbsp; (1) Download hbase-0.90.0.tar.gz from the Apache hadoop website<br />&nbsp;&nbsp;&nbsp; (2) Extract hbase-0.90.0.tar.gz<br />&nbsp;&nbsp;&nbsp; $ tar -xzf hbase-0.90.0.tar.gz<br />&nbsp;&nbsp;&nbsp; (3) Move the folder hbase-0.90.0 to /home/hadoop/ and create a symbolic link<br />&nbsp;&nbsp;&nbsp; $ mv hbase-0.90.0 /home/hadoop/ ; ln -s /home/hadoop/hbase-0.90.0 /home/hadoop/hbase<br />&nbsp;&nbsp;&nbsp; (4) Edit /home/hadoop/hbase/conf/hbase-site.xml and put the following in between &lt;configuration&gt; and &lt;/configuration&gt;<br />&lt;property&gt;<br />&nbsp;&nbsp;&lt;name&gt;hbase.rootdir&lt;/name&gt;<br />&nbsp;&nbsp;&lt;value&gt;hdfs://master:54310/hbase&lt;/value&gt;<br />&nbsp;&nbsp;&lt;description&gt;The directory shared by region servers. Should be fully-qualified to include the filesystem to use. E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR&lt;/description&gt;<br />&lt;/property&gt;<br />&lt;property&gt;<br />&nbsp;&nbsp;&lt;name&gt;hbase.cluster.distributed&lt;/name&gt;<br />&nbsp;&nbsp;&lt;value&gt;true&lt;/value&gt;<br />&nbsp;&nbsp;&lt;description&gt;The mode the cluster will be in. Possible values are false: standalone and pseudo-distributed setups with managed Zookeeper; true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)&lt;/description&gt;<br />&lt;/property&gt;<br />&lt;property&gt;<br />&nbsp;&nbsp;&lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;<br />&nbsp;&nbsp;&lt;value&gt;master&lt;/value&gt;<br />&nbsp;&nbsp;&lt;description&gt;Comma separated list of servers in the ZooKeeper Quorum. If HBASE_MANAGES_ZK is set in hbase-env.sh this is the list of servers which we will start/stop ZooKeeper on.&lt;/description&gt;<br />&lt;/property&gt;<br />&nbsp;&nbsp;&nbsp; (5) Modify the environment variables in /home/hadoop/hbase/conf/hbase-env.sh<br />export JAVA_HOME=/usr/lib/jvm/java-6-sun/<br />export HBASE_IDENT_STRING=$HOSTNAME<br />export HBASE_MANAGES_ZK=false<br />&nbsp;&nbsp;&nbsp; (6) Overwrite /home/hadoop/hbase/conf/regionservers on both machines:<br />master<br />slave<br />&nbsp;&nbsp;&nbsp; (7) Copy /usr/local/hadoop-0.20.2/hadoop-0.20.2-core.jar to /home/hadoop/hbase/lib/ on both machines.<br />&nbsp;&nbsp;&nbsp; This is very important to fix the version-difference issue. Pay attention to its ownership and mode (755).<br /><br />17. Start zookeeper. It seems the zookeeper bundled with Hbase is not set up correctly.<br />$ /home/hadoop/zookeeper/bin/zkServer.sh start<br />(Optional) We can test whether zookeeper is running correctly by typing<br />$ /home/hadoop/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181<br /><br />18. Start the hadoop cluster<br />$ /usr/local/hadoop/bin/start-dfs.sh<br />$ /usr/local/hadoop/bin/start-mapred.sh<br /><br />19. Start Hbase<br />$ /home/hadoop/hbase/bin/start-hbase.sh<br /><br />20. Use the Hbase shell<br />$ /home/hadoop/hbase/bin/hbase shell<br />Check whether hbase is running smoothly: open your browser and type in the following.<br />http://localhost:60010<br /><br />21. Later on, stop the multi-node cluster by typing the following, only on the master<br />&nbsp;&nbsp;&nbsp; (1) Stop Hbase<br />$ /home/hadoop/hbase/bin/stop-hbase.sh<br />&nbsp;&nbsp;&nbsp; (2) Stop hadoop (MapReduce, then HDFS)<br />$ /usr/local/hadoop/bin/stop-mapred.sh<br />$ /usr/local/hadoop/bin/stop-dfs.sh<br />&nbsp;&nbsp;&nbsp; (3) Stop zookeeper<br />$ /home/hadoop/zookeeper/bin/zkServer.sh stop<br /><br />Reference<br />http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/<br />http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/<br />http://wiki.apache.org/hadoop/Hbase/10Minutes<br />http://hbase.apache.org/book/quickstart.html<br />http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/<br /><br />Author<br />Tzu-Cheng Chuang<br /><br /><br />Appendix - Install Pig and Hive<br />1. Install Pig 0.8.0 on this cluster<br />&nbsp;&nbsp;&nbsp; (1) Download pig-0.8.0.tar.gz from the Apache Pig project website, then extract the file and move it to /home/hadoop/<br />$ tar -xzf pig-0.8.0.tar.gz ; mv pig-0.8.0 /home/hadoop/<br />&nbsp;&nbsp;&nbsp; (2) Make symbolic links under pig-0.8.0/conf/<br />$ ln -s /usr/local/hadoop/conf/core-site.xml /home/hadoop/pig-0.8.0/conf/core-site.xml<br />$ ln -s /usr/local/hadoop/conf/mapred-site.xml /home/hadoop/pig-0.8.0/conf/mapred-site.xml<br />$ ln -s /usr/local/hadoop/conf/hdfs-site.xml /home/hadoop/pig-0.8.0/conf/hdfs-site.xml<br />&nbsp;&nbsp;&nbsp; (3) Start pig in map-reduce mode: $ /home/hadoop/pig-0.8.0/bin/pig<br />&nbsp;&nbsp;&nbsp; (4) Exit pig from grunt&gt;: quit<br /><br />
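(Optional) A quick smoke test of the Pig install. This sketch is not part of the original write-up; the sample file and the one-liner are made up for illustration only:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"># create a tiny input file in HDFS and count its lines with Pig in map-reduce mode (expect (2))<br />printf 'a\nb\n' &gt; /tmp/pig-smoke.txt<br />hadoop fs -put /tmp/pig-smoke.txt /tmp/pig-smoke.txt<br />/home/hadoop/pig-0.8.0/bin/pig -e "A = LOAD '/tmp/pig-smoke.txt'; B = GROUP A ALL; C = FOREACH B GENERATE COUNT(A); DUMP C;"</div><br />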
2. Install Hive on this cluster<br />&nbsp;&nbsp;&nbsp; (1) Download hive-0.6.0.tar.gz from the Apache Hive project website, then extract the file and move it to /home/hadoop/<br />$ tar -xzf hive-0.6.0.tar.gz ; mv hive-0.6.0 ~/<br />&nbsp;&nbsp;&nbsp; (2) Modify the Java heap size in hive-0.6.0/bin/ext/execHiveCmd.sh: change 4096 to 1024<br />&nbsp;&nbsp;&nbsp; (3) Create /tmp and /user/hive/warehouse in HDFS and set them chmod g+w before a table can be created in Hive<br />$ hadoop fs -mkdir /tmp<br />$ hadoop fs -mkdir /user/hive/warehouse<br />$ hadoop fs -chmod g+w /tmp<br />$ hadoop fs -chmod g+w /user/hive/warehouse<br />&nbsp;&nbsp;&nbsp; (4) Start Hive<br />$ /home/hadoop/hive-0.6.0/bin/hive<br /><br />3. (Optional) Load data by using Hive<br />&nbsp;&nbsp;&nbsp; Create a file /home/hadoop/customer.txt (comma-delimited, one customer per line):<br />1,Kevin<br />2,David<br />3,Brian<br />4,Jane<br />5,Alice<br />&nbsp;&nbsp;&nbsp; After the hive shell has started, type in the following (note LOCAL in the LOAD statement, since customer.txt lives on the local filesystem, not in HDFS):<br />&gt; CREATE TABLE IF NOT EXISTS customer(id INT, name STRING)<br />&gt; ROW FORMAT delimited fields terminated by ','<br />&gt; STORED AS TEXTFILE;<br />&gt; LOAD DATA LOCAL INPATH '/home/hadoop/customer.txt' OVERWRITE INTO TABLE customer;<br />&gt; SELECT customer.id, customer.name FROM customer;<br />
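The same check can be scripted without the interactive shell. A small sketch, not from the original tutorial, using hive's -e option:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"># count the loaded rows non-interactively (should print 5)<br />/home/hadoop/hive-0.6.0/bin/hive -e "SELECT COUNT(1) FROM customer;"</div><br />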
style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>使用Sqoop实现HDFS与Mysql互转</title><link>http://www.blogjava.net/paulwong/archive/2013/05/11/399153.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sat, 11 May 2013 13:27:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/05/11/399153.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/399153.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/05/11/399153.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/399153.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/399153.html</trackback:ping><description><![CDATA[<br />
Introduction<br />
Sqoop is a tool for moving data between Hadoop and relational databases: it can import data from a relational database (e.g. MySQL, Oracle, Postgres) into Hadoop's HDFS, and it can also export data from HDFS back into a relational database.<br />
<br />
http://sqoop.apache.org/<br />
<br />
Environment<br />
An IncompatibleClassChangeError during debugging is almost always a version-compatibility problem.<br />
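A quick way to confirm which builds are actually on the PATH before suspecting anything else (a sketch; both subcommands are standard, though the output format varies by release):<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"># print the Hadoop and Sqoop versions in use<br />hadoop version<br />sqoop version</div>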
<br />
To guarantee compatible hadoop and sqoop versions, use Cloudera's builds.<br />
<br />
About Cloudera:<br />
<br />
Cloudera standardizes Hadoop configuration and helps enterprises install, configure and run hadoop for large-scale enterprise data processing and analysis.<br />
<br />
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDHTarballs/3.25.2013/CDH4-Downloadable-Tarballs/CDH4-Downloadable-Tarballs.html<br />
<br />
Download and install hadoop-0.20.2-cdh3u6 and sqoop-1.3.0-cdh3u6.<br />
<br />
Installation<br />
Installation is straightforward: just extract the tarballs.<br />
<br />
The only extra step is to copy the MySQL JDBC driver, mysql-connector-java-5.0.7-bin.jar, into $SQOOP_HOME/lib.<br />
<br />
Then configure the environment variables in /etc/profile:<br />
<br />
export SQOOP_HOME=/home/hadoop/sqoop-1.3.0-cdh3u6/<br />
<br />
export PATH=$SQOOP_HOME/bin:$PATH<br />
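With the driver in place, a one-line connectivity test avoids debugging a whole import job later. A sketch that reuses the connection settings from the examples below (the host, database and credentials are this article's, not defaults):<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"># should list the tables in the recsys database if the JDBC driver and network are fine<br />sqoop list-tables --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root</div>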
<br />
MySQL to HDFS - example<br />
./sqoop import --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root --table shop -m 1 --target-dir /user/recsys/input/shop/$today<br />
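After the import finishes, the rows land as text part files under the target directory. A quick check (a sketch; it assumes the same $today value used in the command above):<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"># list the generated files and peek at the first rows<br />hadoop fs -ls /user/recsys/input/shop/$today<br />hadoop fs -cat /user/recsys/input/shop/$today/part-m-00000 | head</div>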
<br />
<br />
HDFS to MySQL - example<br />
./sqoop export --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root --table shopassoc  --fields-terminated-by ',' --export-dir /user/recsys/output/shop/$today<br />
<br />
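Neither example defines $today; it is presumably set by the calling script. A minimal sketch of how the two commands might be wired together (the date format is an assumption, chosen only for illustration):<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">#!/bin/bash<br /># day-stamped directory shared by the import and the export<br />today=$(date +%Y-%m-%d)<br />./sqoop import --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root --table shop -m 1 --target-dir /user/recsys/input/shop/$today<br />./sqoop export --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root --table shopassoc --fields-terminated-by ',' --export-dir /user/recsys/output/shop/$today</div><br />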
Parameter notes for the examples<br />
(I did not use the other parameters, so I won't explain them; having not used them, I have no standing to comment. See the command help for details.)<br />
<br />
<br />
<table border="1" cellpadding="4" cellspacing="0"><tbody>
<tr><td><strong>Parameter scope</strong></td><td><strong>Parameter name</strong></td><td><strong>Meaning</strong></td></tr>
<tr><td>common</td><td>connect</td><td>JDBC URL</td></tr>
<tr><td>common</td><td>username</td><td>---</td></tr>
<tr><td>common</td><td>password</td><td>---</td></tr>
<tr><td>common</td><td>table</td><td>table name</td></tr>
<tr><td>import</td><td>target-dir</td><td>specifies the output directory in HDFS; the default is /user/$loginName/</td></tr>
<tr><td>export</td><td>fields-terminated-by</td><td>field delimiter in the HDFS files; the default is &#8220;\t&#8221;</td></tr>
<tr><td>export</td><td>export-dir</td><td>path of the HDFS files</td></tr>
</tbody></table>
<img src ="http://www.blogjava.net/paulwong/aggbug/399153.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-05-11 21:27 <a href="http://www.blogjava.net/paulwong/archive/2013/05/11/399153.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>hadoop集群监控工具ambari安装</title><link>http://www.blogjava.net/paulwong/archive/2013/05/03/398731.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 03 May 2013 05:55:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/05/03/398731.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/398731.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/05/03/398731.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/398731.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/398731.html</trackback:ping><description><![CDATA[<p>Apache Ambari is an open-source project for monitoring, managing and lifecycle-managing Hadoop. It is also the project chosen to manage components for the Hortonworks Data Platform. Ambari serves Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog and Zookeeper. When I recently prepared to install ambari, I searched the web for a long time without finding a systematic installation guide, so I installed it following the official site. Below is the correct, fairly complete installation procedure I recommend; I hope it helps.</p><h1>I. Preparation</h1><p>1. System: mine is CentOS 6.2, x86_64; this cluster uses two nodes. Management node: 192.168.10.121; client node: 192.168.10.122.</p><p>2. Ideally the machines can reach the Internet, which makes the later steps easier; otherwise you have to set up a yum repository yourself, which is more troublesome.</p><p>3. Passwordless login is configured from ambari-server (the management node) to the client nodes.</p><p>4. The cluster clocks are synchronized.</p><p>5. SELinux and iptables are both disabled.</p><p>6. ambari version: 1.2.0</p><h1>II. Installation steps</h1><p>A. Set up the cluster environment</p><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">
############ passwordless login #################<br />[root@ccloud121 ~]# ssh-keygen -t dsa<br />[root@ccloud121 ~]# cat /root/.ssh/id_dsa.pub &gt;&gt; /root/.ssh/authorized_keys<br />[root@ccloud121 ~]# scp /root/.ssh/id_dsa.pub 192.168.10.122:/root/<br />[root@ccloud121 ~]# ssh 192.168.10.122<br />[root@ccloud122 ~]# cat /root/id_dsa.pub &gt;&gt; /root/.ssh/authorized_keys<br /><br />############# NTP time sync #################<br />[root@ccloud121 ~]# ntpdate time.windows.com<br />[root@ccloud121 ~]# ssh ccloud122 ntpdate time.windows.com<br /><br />########### disable SELinux &amp; iptables ###########<br />[root@ccloud121 ~]# setenforce 0<br />
[root@ccloud121 ~]# ssh ccloud122 setenforce 0<br />[root@ccloud121 ~]# chkconfig iptables off<br />[root@ccloud121 ~]# service iptables stop<br />[root@ccloud121 ~]# ssh ccloud122 chkconfig iptables off<br />[root@ccloud121 ~]# ssh ccloud122 service iptables stop</div><p>B. Install ambari-server on the management node</p><p>1. Download the repo file</p><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">[root@ccloud121 ~]# wget http://public-repo-1.hortonworks.com/AMBARI-1.x/repos/centos6/ambari.repo<br />
">#&nbsp;cp&nbsp;ambari.repo&nbsp;/etc/yum.repos.d</span></div></pre></div><p style="margin-top: 10px; margin-bottom: 10px; color: #aaaaaa; font-family: georgia, Verdana, Helvetica, Arial; font-size: 13px; line-height: 19px;"><span style="color: #000000;">　　　　这样，ambari-server的yum仓库就做好了。</span></p><p style="margin-top: 10px; margin-bottom: 10px; color: #aaaaaa; font-family: georgia, Verdana, Helvetica, Arial; font-size: 13px; line-height: 19px;"><span style="color: #000000;">　　　　2、安装epel仓库</span></p><div style="font-size: 12px; margin: 5px 0px; color: #aaaaaa; line-height: 19px;"><pre style="margin-top: 0px; margin-bottom: 0px; white-space: pre-wrap; word-wrap: break-word; font-family: 'Courier New';"><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><span style="color: #800000; font-weight: bold; ">[</span><span style="color: #800000; ">root@ccloud121&nbsp;~</span><span style="color: #800000; font-weight: bold; ">]</span><span style="color: #000000; ">#&nbsp;yum&nbsp;install&nbsp;epel-release&nbsp;&nbsp;&nbsp;#&nbsp;查看仓库列表，应该有HDP，EPEL&nbsp;</span><span style="color: #800000; font-weight: bold; ">[</span><span style="color: #800000; ">root@ccloud121&nbsp;~</span><span style="color: #800000; font-weight: bold; ">]</span><span style="color: #000000; ">#&nbsp;yum&nbsp;repolist</span></div></pre></div><p style="margin-top: 10px; margin-bottom: 10px; color: #aaaaaa; font-family: georgia, Verdana, Helvetica, Arial; font-size: 13px; line-height: 19px;"><span style="color: #000000;">　　　　3、通过yum安装amabari bits，这同时也会安装PostgreSQL</span></p><div style="font-size: 12px; margin: 5px 0px; color: #aaaaaa; line-height: 19px;"><pre style="margin-top: 0px; margin-bottom: 0px; white-space: pre-wrap; word-wrap: break-word; font-family: 'Courier New';"><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #800000; font-weight: bold; ">[</span><span style="color: #800000; ">root@ccloud121&nbsp;~</span><span style="color: #800000; font-weight: bold; ">]</span><span style="color: #000000; ">#&nbsp;yum&nbsp;install&nbsp;ambari-server</span></div></pre></div><p style="margin-top: 10px; margin-bottom: 10px; color: #aaaaaa; font-family: georgia, Verdana, Helvetica, Arial; font-size: 13px; line-height: 19px;"><span style="color: #000000;">　　　　　这个步骤要等一会，它需要上网下载，大概39M左右。</span></p><p style="margin-top: 10px; margin-bottom: 10px; color: #aaaaaa; font-family: georgia, Verdana, Helvetica, Arial; font-size: 13px; line-height: 19px;"><span style="color: #000000;">　　　　4、运行ambari-server setup，安装ambari-server，它会自动安装配置PostgreSQL，同时要求输入用户名和密码，如果按n，它用默认的用户名/密码值：ambari-server/bigdata。接着就开始下载安装JDK。安装完成后，ambari-server就可以启动了。</span></p><h1><span style="color: #000000;">　　三、集群启动</span></h1><p style="margin-top: 10px; margin-bottom: 10px; color: #aaaaaa; font-family: georgia, Verdana, Helvetica, Arial; font-size: 13px; line-height: 19px;"><span style="color: #ff6600;">　　　　</span></p><p style="margin-top: 10px; margin-bottom: 10px; color: #aaaaaa; font-family: georgia, Verdana, Helvetica, Arial; font-size: 13px; line-height: 19px;"><span style="color: #000000;">　　　　1、直接接通过ambari-server start和amabari-server 
<h1>III. Starting the cluster</h1><p>1. Start and stop ambari-server simply with ambari-server start and ambari-server stop.</p><p>2. After a successful start, enter http://192.168.10.121:8080 in a browser.</p><p>The page looks like the screenshot below:</p><p><img src="http://images.cnitblog.com/blog/469064/201303/05180124-b078e46324e140a68bd4d1f2ba37d7bd.jpg" alt="" width="635" height="626" /></p><p>The login name and password are both admin.</p><p>You can then log in to the management console.</p><p><img src="http://images.cnitblog.com/blog/469064/201303/05180243-7f5e07bea9ef4cc2a9ca5b80963c03fe.png" alt="" width="672" height="414" /></p><img src ="http://www.blogjava.net/paulwong/aggbug/398731.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-05-03 13:55 <a href="http://www.blogjava.net/paulwong/archive/2013/05/03/398731.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>一网打尽13款开源Java大数据工具</title><link>http://www.blogjava.net/paulwong/archive/2013/05/03/398700.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 03 May 2013 01:05:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/05/03/398700.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/398700.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/05/03/398700.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/398700.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/398700.html</trackback:ping><description><![CDATA[<p><strong>Below are the mainstream open-source tools that support Java in the big data space:</strong></p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce391277b5.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce391277b5.jpg" border="0" alt="" /></a></p><p><strong>1. HDFS</strong></p><p>HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode (master node), which manages the metadata of the whole file system, and DataNodes (data nodes; there can be many), which store the actual data. HDFS is designed for massive data, so where traditional file systems optimize for large numbers of small files, HDFS optimizes the storage and access of small numbers of large files.</p>
<p><a href="http://cms.csdnimg.cn/article/201304/28/517ce3c49ded6.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce3c49ded6.jpg" border="0" alt="" /></a></p><p><strong>2. MapReduce</strong></p><p>Hadoop MapReduce is a software framework for easily writing parallel applications that process massive (TB-scale) data sets, connecting tens of thousands of nodes (commodity hardware) in large clusters in a reliable, fault-tolerant way.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce3ee64519.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce3ee64519.jpg" border="0" alt="" /></a></p><p><strong>3.
HBase</strong></p><p>Apache HBase is the Hadoop database, a distributed, scalable big data store. It provides random, real-time read/write access to big data sets and is optimized for large tables on commodity server clusters: tens of billions of rows by tens of millions of columns. At its core it is an open-source implementation of Google's Bigtable paper, a distributed column-oriented store; just as Bigtable builds on the distributed data storage provided by GFS (Google File System), HBase is a Bigtable-like layer that Apache Hadoop provides on top of HDFS.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce413366c7.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce413366c7.jpg" border="0" alt="" /></a></p><p><strong>4. Cassandra</strong></p><p>Apache Cassandra is a high-performance, linearly scalable, highly available database that can run on commodity hardware or cloud infrastructure as a mission-critical data platform. Cassandra's replication across data centers is best-in-class, giving users lower latency and more reliable disaster recovery. With log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching, Cassandra's data model offers convenient secondary (column) indexes.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce4611885c.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce4611885c.jpg" border="0" alt="" /></a></p><p><strong>5. Hive</strong></p><p>Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible systems. Hive provides a complete SQL-like query language, HiveQL; and when expressing some logic in that language would be inefficient or cumbersome, HiveQL also lets traditional Map/Reduce programmers plug in their own custom mappers and reducers.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce470085ed.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce470085ed.jpg" border="0" alt="" /></a></p><p><strong>6.
Pig</strong></p><p>Apache Pig is a platform for analyzing large data sets. It comprises a high-level language for writing data-analysis applications and the infrastructure to evaluate them. The standout property of Pig applications is that their structure admits heavy parallelization, which lets them handle very large data sets. Pig's infrastructure layer contains a compiler that produces Map-Reduce jobs. Its language layer currently consists of a native language, Pig Latin, designed for ease of programming and scalability.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce47b8e077.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce47b8e077.jpg" border="0" alt="" width="99" height="99" /></a></p><p><strong>7. Chukwa</strong></p><p>Apache Chukwa is an open-source data collection system for monitoring large distributed systems. Built on HDFS and the Map/Reduce framework, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible, powerful toolkit for displaying, monitoring and analyzing results, to make the best use of the collected data.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce4870b072.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce4870b072.jpg" border="0" alt="" /></a></p><p><strong>8. Ambari</strong></p><p>Apache Ambari is a web-based tool for configuring, managing and monitoring Apache Hadoop clusters, supporting Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a cluster-health dashboard, with features such as heatmaps and the ability to view MapReduce, Pig and Hive applications and diagnose their performance characteristics through a friendly user interface.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce49282930.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce49282930.jpg" border="0" alt="" /></a></p><p><strong>9.
ZooKeeper</strong></p><p>Apache ZooKeeper is a reliable coordination system for large distributed systems, providing configuration maintenance, naming services, distributed synchronization, group services and more. ZooKeeper's goal is to encapsulate these complex, error-prone key services and hand users a simple, easy-to-use interface backed by an efficient, stable system.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce49e31e19.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce49e31e19.jpg" border="0" alt="" /></a></p><p><strong>10. Sqoop</strong></p><p>Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database into Hadoop's HDFS and export data from HDFS into a relational database.</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce4b0d3c61.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce4b0d3c61.jpg" border="0" alt="" /></a></p><p><strong>11. Oozie</strong></p><p>Apache Oozie is a scalable, reliable and extensible workflow scheduler system for managing Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurring Oozie Workflow jobs, triggered generally by time (frequency) and data availability. Integrated with the rest of the Hadoop stack, Oozie supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system jobs (such as Java programs and shell scripts).</p><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce4bdedb23.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce4bdedb23.jpg" border="0" alt="" width="100" height="100" /></a></p><p><strong>12.
Mahout</strong></p><p>Apache Mahout is a scalable machine learning and data mining library. Mahout currently supports four main use cases:</p><ul><li>Recommendation mining: collects user actions and uses them to recommend things the user might like.</li><li>Clustering: collects documents and groups related ones together.</li><li>Classification: learns from existing categorized documents what documents of each category look like, and assigns unlabeled documents to the correct category.</li><li>Frequent itemset mining: takes a set of item groups and identifies which individual items usually appear together.</li></ul><p><a href="http://cms.csdnimg.cn/article/201304/28/517ce4cf93346.jpg" target="_blank"><img src="http://cms.csdnimg.cn/article/201304/28/517ce4cf93346.jpg" border="0" alt="" /></a></p><p><strong>13.
HCatalog</strong></p><p>Apache HCatalog is a table and storage management service for data created with Hadoop. It provides:</p><ul><li>a shared schema and data type mechanism;</li><li>a table abstraction, so users need not care about how or where their data is stored;</li><li>interoperability for data processing tools such as Pig, MapReduce and Hive.</li></ul><img src ="http://www.blogjava.net/paulwong/aggbug/398700.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-05-03 09:05 <a href="http://www.blogjava.net/paulwong/archive/2013/05/03/398700.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>HADOOP服务器</title><link>http://www.blogjava.net/paulwong/archive/2013/05/01/398605.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Tue, 30 Apr 2013 16:02:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/05/01/398605.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/398605.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/05/01/398605.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/398605.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/398605.html</trackback:ping><description><![CDATA[CentOS cluster servers, public IPs<br />Server addresses<br />master: mypetsbj.xicp.net:13283<br />slave1: mypetsbj.xicp.net:13282<br />slave2: mypetsbj.xicp.net:13286<br /><br /><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://mypetsbj.xicp.net:13296/" target="_blank">http://mypetsbj.xicp.net:13296</a></td></tr><tr><td><a href="http://mypetsbj.xicp.net:13304/" target="_blank">http://mypetsbj.xicp.net:13304</a></td></tr><tr><td><a href="http://mypetsbj.xicp.net:14113/" target="_blank">http://mypetsbj.xicp.net:14113</a></td></tr><tr><td><a href="http://mypetsbj.xicp.net:11103/" target="_blank">http://mypetsbj.xicp.net:11103</a></td></tr></tbody></table><br />Servers are powered on from<br />08:00 to 23:59<br /><br />opt/hadoop<br /><br /><span style="color: #ffffff;">username/password</span><br /><span style="color: #ffffff;">hadoop/wzp </span>
="http://www.blogjava.net/paulwong/aggbug/398605.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-05-01 00:02 <a href="http://www.blogjava.net/paulwong/archive/2013/05/01/398605.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>一个PIG脚本例子分析</title><link>http://www.blogjava.net/paulwong/archive/2013/04/13/397791.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sat, 13 Apr 2013 07:21:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/13/397791.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397791.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/13/397791.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397791.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397791.html</trackback:ping><description><![CDATA[执行脚本：<br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
-->PIGGYBANK_PATH=$PIG_HOME/contrib/piggybank/java/piggybank.jar<br />
INPUT=pig/input/test-pig-full.txt<br />
OUTPUT=pig/output/test-pig-output-$(date&nbsp;&nbsp;+%Y%m%d%H%M%S)<br />
PIGSCRIPT=analyst_status_logs.pig<br />
<br />
<span style="color: #008000; ">#</span><span style="color: #008000; ">analyst_500_404_month.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_500_404_day.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_404_percentage.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_500_percentage.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_unique_path.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_user_logs.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_status_logs.pig</span><span style="color: #008000; "><br />
</span><br />
<br />
pig -p PIGGYBANK_PATH=$PIGGYBANK_PATH -p INPUT=$INPUT -p OUTPUT=$OUTPUT $PIGSCRIPT</div><br /><br />The data source to analyze, a LOG file:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">46.20.45.18 - - [25/Dec/2012:23:00:25 +0100] "GET / HTTP/1.0" 302 - "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 46.20.45.18 "" 11011AEC9542DB0983093A100E8733F8 0<br />46.20.45.18 - - [25/Dec/2012:23:00:25 +0100] "GET /sign-in.jspx HTTP/1.0" 200 3926 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 46.20.45.18 "" 11011AEC9542DB0983093A100E8733F8 0<br />69.59.28.19 - - [25/Dec/2012:23:01:25 +0100] "GET / HTTP/1.0" 302 - "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 69.59.28.19 "" 36D80DE7FE52A2D89A8F53A012307B0A 15</div><br /><br />The PIG script:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">-- register the piggybank jar, because DateExtractor is needed<br />register '$PIGGYBANK_PATH';<br /><br />-- declare short function aliases<br />DEFINE DATE_EXTRACT_MM<br />org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM');<br /><br />DEFINE DATE_EXTRACT_DD<br />org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');<br /><br />-- pig/input/test-pig-full.txt<br />-- load the data from the file the variable points to and name the columns; each record is now a tuple (a,b,c)<br />raw_logs = load '$INPUT' USING org.apache.pig.piggybank.storage.MyRegExLoader('^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(\\S+) (\\S+) (HTTP[^"]+)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" "(\\S+)" "(\\S+)" (\\S+) "(.*)" (\\S+) (\\S+)')<br />
">'</span><span style="color: #800000; ">$PIGGYBANK_PATH</span><span style="color: #800000; ">'</span>;<br /><br />--声明一个短函数名<br />DEFINE&nbsp;DATE_EXTRACT_MM&nbsp;<br />org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor(<span style="color: #800000; ">'</span><span style="color: #800000; ">yyyy-MM</span><span style="color: #800000; ">'</span>);<br /><br />DEFINE&nbsp;DATE_EXTRACT_DD&nbsp;<br />org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor(<span style="color: #800000; ">'</span><span style="color: #800000; ">yyyy-MM-dd</span><span style="color: #800000; ">'</span>);<br /><br />--&nbsp;pig/input/test-pig-full.txt<br />--把数据从变量所指的文件加载到PIG中，并定义数据列名，此时的数据集为数组(a,b,c)<br />raw_logs&nbsp;=&nbsp;load&nbsp;<span style="color: #800000; ">'</span><span style="color: #800000; ">$INPUT</span><span style="color: #800000; ">'</span>&nbsp;USING&nbsp;org.apache.pig.piggybank.storage.MyRegExLoader(<span style="color: #800000; ">'</span><span style="color: #800000; ">^(\\S+)&nbsp;(\\S+)&nbsp;(\\S+)&nbsp;\\[([\\w:/]+\\s[+\\-]\\d{4})\\]&nbsp;"(\\S+)&nbsp;(\\S+)&nbsp;(HTTP[^"]+)"&nbsp;(\\S+)&nbsp;(\\S+)&nbsp;"([^"]*)"&nbsp;"([^"]*)"&nbsp;"(\\S+)"&nbsp;"(\\S+)"&nbsp;(\\S+)&nbsp;"(.*)"&nbsp;(\\S+)&nbsp;(\\S+)</span><span style="color: #800000; ">'</span>)<br />as&nbsp;(remoteAddr:&nbsp;chararray,&nbsp;<br />n2:&nbsp;chararray,&nbsp;<br />n3:&nbsp;chararray,&nbsp;<br />time:&nbsp;chararray,&nbsp;<br />method:&nbsp;chararray,<br />path:chararray,<br />protocol:chararray,<br />status:&nbsp;int,&nbsp;<br />bytes_string:&nbsp;chararray,&nbsp;<br />referrer:&nbsp;chararray,&nbsp;<br />browser:&nbsp;chararray,&nbsp;<br />n10:chararray,<br />remoteLogname:&nbsp;chararray,&nbsp;<br />remoteAddr12:&nbsp;chararray,&nbsp;<br />path2:&nbsp;chararray,&nbsp;<br />sessionid:&nbsp;chararray,&nbsp;<br />n15:&nbsp;chararray<br />);<br /><br />--过滤数据<br />filter_logs&nbsp;=&nbsp;FILTER&nbsp;raw_logs&nbsp;BY&nbsp;<span style="color: #0000FF; ">not</span>&nbsp;(browser&nbsp;matches&nbsp;<span style="color: #800000; ">'</span><span style="color: #800000; ">.*pingdom.*</span><span style="color: #800000; ">'</span>);<br />--item_logs&nbsp;=&nbsp;FOREACH&nbsp;raw_logs&nbsp;GENERATE&nbsp;browser;<br /><br />--percent&nbsp;500&nbsp;logs<br />--重定义数据项，数据集只取2项status,month<br />reitem_percent_500_logs&nbsp;=&nbsp;FOREACH&nbsp;filter_logs&nbsp;GENERATE&nbsp;status,DATE_EXTRACT_MM(time)&nbsp;as&nbsp;month;<br />--分组数据集，此时的数据结构为MAP(a{(aa,bb,cc),(dd,ee,ff)},b{(bb,cc,dd),(ff,gg,hh)})<br />group_month_percent_500_logs&nbsp;=&nbsp;GROUP&nbsp;reitem_percent_500_logs&nbsp;BY&nbsp;(month);<br />--重定义分组数据集数据项，进行分组统计，此时要联合分组数据集和原数据集统计<br />final_month_500_logs&nbsp;=&nbsp;FOREACH&nbsp;group_month_percent_500_logs&nbsp;<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;--对原数据集做count，因为是在foreachj里做count的，即使是对原数据集，也会自动会加month==group的条件<br />&nbsp;&nbsp;&nbsp;&nbsp;--从这里可以看出对于group里的数据集，完全没用到<br />&nbsp;&nbsp;&nbsp;&nbsp;--这时是以每一行为单位的，统计MAP中的KEY-a对应的数组在原数据集中的个数<br />&nbsp;&nbsp;&nbsp;&nbsp;total&nbsp;=&nbsp;COUNT(reitem_percent_500_logs);<br />&nbsp;&nbsp;&nbsp;&nbsp;--对原数据集做filter，因为是在foreachj里做count的，即使是对原数据集，也会自动会加month==group的条件<br />&nbsp;&nbsp;&nbsp;&nbsp;--重新过滤一下原数据集，得到status==500,month==group的数据集<br />&nbsp;&nbsp;&nbsp;&nbsp;t&nbsp;=&nbsp;filter&nbsp;reitem_percent_500_logs&nbsp;by&nbsp;status==&nbsp;500;&nbsp;--create&nbsp;a&nbsp;bag&nbsp;which&nbsp;contains&nbsp;only&nbsp;T&nbsp;values<br />&nbsp;&nbsp;&nbsp;&nbsp;--重定义数据项，取group，统计结果<br 
&nbsp;&nbsp;&nbsp;&nbsp;generate flatten(group) as col1, 100*(double)COUNT(t)/(double)total;<br />}<br />STORE final_month_500_logs into '$OUTPUT' using PigStorage(',');</div><br /><img src ="http://www.blogjava.net/paulwong/aggbug/397791.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-13 15:21 <a href="http://www.blogjava.net/paulwong/archive/2013/04/13/397791.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>把命令行中的值传进PIG中</title><link>http://www.blogjava.net/paulwong/archive/2013/04/10/397645.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Wed, 10 Apr 2013 07:32:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/10/397645.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397645.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/10/397645.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397645.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397645.html</trackback:ping><description><![CDATA[<a href="http://wiki.apache.org/pig/ParameterSubstitution" target="_blank">http://wiki.apache.org/pig/ParameterSubstitution<br />
<br />
<br />
</a>
<div>
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
-->%pig&nbsp;-param&nbsp;input=/user/paul/sample.txt&nbsp;-param&nbsp;output=/user/paul/output/</div>
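<br />A slightly fuller sketch of how the two pieces fit together (the script file name and its contents here are illustrative, not from the wiki page):<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"># contents of the hypothetical load_only.pig -- $input/$output are filled in from the command line:<br />#&nbsp;&nbsp;&nbsp;records = LOAD '$input';<br />#&nbsp;&nbsp;&nbsp;STORE records INTO '$output';<br />pig -param input=/user/paul/sample.txt -param output=/user/paul/output/ load_only.pig</div>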
</div><br /><br />Retrieving them inside the Pig script:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">records = LOAD '$input';</div><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-10 15:32 <a href="http://www.blogjava.net/paulwong/archive/2013/04/10/397645.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Computing per-group percentages in Pig</title><link>http://www.blogjava.net/paulwong/archive/2013/04/10/397642.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Wed, 10 Apr 2013 06:13:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/10/397642.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397642.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/10/397642.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397642.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397642.html</trackback:ping><description><![CDATA[<a href="http://stackoverflow.com/questions/15318785/pig-calculating-percentage-of-total-for-a-field" target="_blank">http://stackoverflow.com/questions/15318785/pig-calculating-percentage-of-total-for-a-field<br /><br /></a><a href="http://stackoverflow.com/questions/13476642/calculating-percentage-in-a-pig-query" target="_blank">http://stackoverflow.com/questions/13476642/calculating-percentage-in-a-pig-query</a><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-10 14:13 <a href="http://www.blogjava.net/paulwong/archive/2013/04/10/397642.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Pig in brief</title><link>http://www.blogjava.net/paulwong/archive/2013/04/05/397411.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 05 Apr 2013 13:33:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/05/397411.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397411.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/05/397411.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397411.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397411.html</trackback:ping><description><![CDATA[<div><strong>What is Pig</strong></div><div>A dataflow language: you design how the data should flow, and an engine turns that design into MapReduce jobs that run on Hadoop.</div><div></div><div></div><div><strong>Pig vs. SQL</strong></div><div>The two have things in common: you execute one or more statements, and results come out.</div><div>The difference is that SQL needs the data loaded into tables before it can run, and it does not care how the work is done in between: you send a SQL statement over, and results come back.</div><div>Pig needs no loading into tables, but you must design the intermediate process, step by step, that leads to the result.</div><br><br><div align=right><a style="text-decoration:none;"
href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-05 21:33 <a href="http://www.blogjava.net/paulwong/archive/2013/04/05/397411.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>PIG资源</title><link>http://www.blogjava.net/paulwong/archive/2013/04/05/397406.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 05 Apr 2013 10:19:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/05/397406.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397406.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/05/397406.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397406.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397406.html</trackback:ping><description><![CDATA[Hadoop Pig学习笔记(一) 各种SQL在PIG中实现<br />
<a href="http://guoyunsky.iteye.com/blog/1317084" target="_blank">http://guoyunsky.iteye.com/blog/1317084<br />
<br />
</a><a href="http://guoyunsky.iteye.com/category/196632" target="_blank">http://guoyunsky.iteye.com/category/196632<br />
<br />
</a>Hadoop study notes (9): an introduction to Pig<br />
<a href="http://www.distream.org/?p=385" target="_blank">http://www.distream.org/?p=385</a><br />
<br />
<br />
[Hadoop series] Installing Pig, with simple examples<br />
<a href="http://blog.csdn.net/inkfish/article/details/5205999" target="_blank">http://blog.csdn.net/inkfish/article/details/5205999</a><br />
<br />
<br />
Hadoop and Pig for Large-Scale Web Log Analysis<br />
<a href="http://www.devx.com/Java/Article/48063" target="_blank">http://www.devx.com/Java/Article/48063</a>
<br />
<br />
<br />
Pig in practice<br />
<a href="http://www.cnblogs.com/xuqiang/archive/2011/06/06/2073601.html" target="_blank">http://www.cnblogs.com/xuqiang/archive/2011/06/06/2073601.html</a><br />
<br />
<br />
[Original] Apache Pig tutorial in Chinese (advanced)<br />
<a href="http://www.codelast.com/?p=4249" target="_blank">http://www.codelast.com/?p=4249</a><br />
<br />
<br />
Analyzing an Apache log system with the Pig language on the Hadoop platform<br />
<a href="http://goodluck-wgw.iteye.com/blog/1107503" target="_blank">http://goodluck-wgw.iteye.com/blog/1107503</a><br />
<br />
<br />
!!The Pig language<br />
<a href="http://hi.baidu.com/cpuramdisk/item/a2980b78caacfa3d71442318" target="_blank">http://hi.baidu.com/cpuramdisk/item/a2980b78caacfa3d71442318</a><br />
<br />
<br />
Embedding Pig In Java Programs<br />
<a href="http://wiki.apache.org/pig/EmbeddedPig" target="_blank">http://wiki.apache.org/pig/EmbeddedPig</a><br />
<br />
<br />
A Pig example (REGEX_EXTRACT_ALL, DBStorage, storing the results into a database)<br />
<a href="http://www.myexception.cn/database/1256233.html" target="_blank">http://www.myexception.cn/database/1256233.html</a><br />
<br />
<br />
Programming Pig<br />
<a href="http://ofps.oreilly.com/titles/9781449302641/index.html" target="_blank">http://ofps.oreilly.com/titles/9781449302641/index.html</a><br />
<br />
<br />
[Original] A summary of some basic Apache Pig concepts and usage (1)<br />
<a href="http://www.codelast.com/?p=3621" target="_blank">http://www.codelast.com/?p=3621<br />
<br /></a><br />
!The Pig manual<br /><a href="http://pig.apache.org/docs/r0.11.1/func.html#built-in-functions" target="_blank">http://pig.apache.org/docs/r0.11.1/func.html#built-in-functions</a><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-05 18:19 <a href="http://www.blogjava.net/paulwong/archive/2013/04/05/397406.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Steps for adding a node to a Hadoop cluster</title><link>http://www.blogjava.net/paulwong/archive/2013/03/16/396544.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sat, 16 Mar 2013 15:04:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/03/16/396544.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/396544.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/03/16/396544.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/396544.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/396544.html</trackback:ping><description><![CDATA[Install Hadoop on the new node<br /><br /><br />Copy the relevant configuration files from the namenode to the new node<br /><br /><br />Edit the masters and slaves files to add the new node<br /><br /><br />Set up passwordless SSH in and out of the new node<br /><br /><br />Start the datanode and tasktracker on that node individually (hadoop-daemon.sh start datanode/tasktracker)<br /><br /><br />Run start-balancer.sh to rebalance the data<br /><br /><br />Rebalancing: when a node fails or new nodes are added, the distribution of data blocks can become uneven; the balancer redistributes the blocks evenly across the datanodes<br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-03-16 23:04 <a href="http://www.blogjava.net/paulwong/archive/2013/03/16/396544.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>HBase reading notes: basic features</title><link>http://www.blogjava.net/paulwong/archive/2013/02/06/395168.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Wed, 06 Feb 2013 01:53:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/02/06/395168.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/395168.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/02/06/395168.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/395168.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/395168.html</trackback:ping><description><![CDATA[<ol>
     <li>Using the HBase shell commands<br />
     <br />
     </li>
     <li>Using the HBase Java client<br /><br />Use PUT to insert and to update records.<br /><br />How a PUT executes:<br />The value is first added to a MemStore in memory; if the table has N column families, there are N MemStores, and values belonging to different column families are kept in different MemStores. MemStore values are not flushed to a file immediately: a MemStore is flushed only when it is full, and the flush never writes into an existing HFile; a new HFile is created to hold the data instead. A write-ahead log is also written, because a newly added record is not written to an HFile right away: if the machine goes down in between, HBase replays this log on restart to recover the data.<br /><br />Use DELETE to remove records.<br /><br />A delete does not remove the content from the HFiles; it only writes a marker, and queries can then skip the marked records.<br /><br />Use GET to read a single record.<br /><br />Reads store the record in a cache; again, if the table has N column families there are N caches, and values belonging to different column families are kept in different caches. The next time a client reads the record, the cache and the MemStore are combined to produce the answer.<br /><br />Use HBaseAdmin (HADMIN) to create tables.<br /><br />Use SCAN with FILTER to query multiple records.<br />
     <br />
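<br />A minimal sketch of these operations against the 0.94-era Java client API; the table name "demo", family "cf", qualifier "col" and the row keys are placeholders:<br />
<pre style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px;width:98%;overflow:auto">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo");

        // PUT: insert or update; goes to the MemStore (and the WAL) first
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value1"));
        table.put(put);

        // GET: read a single row, answered from cache/MemStore/HFiles combined
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

        // SCAN: read multiple rows (attach a Filter to the Scan if needed)
        ResultScanner scanner = table.getScanner(new Scan());
        try {
            for (Result r : scanner) {
                System.out.println(r);
            }
        } finally {
            scanner.close();
        }

        // DELETE: only writes a tombstone marker; HFiles are not rewritten
        table.delete(new Delete(Bytes.toBytes("row1")));

        table.close();
    }
}
</pre>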
     </li>
     <li>Distributed computation with HBase<br /><br />Why distributed computation exists<br />The APIs above are for online use, i.e. low-latency access, roughly OLTP. For large data volumes those APIs no longer fit. To analyze a whole table you would use SCAN, which pulls the entire table back to the local machine; with 100 GB of data that takes hours. To save time you can bring in multiple threads, which requires a new algorithm: split the full table into N segments, process each segment in its own thread, then merge the partial results and run the analysis.<br /><br />At 200 GB or more the time doubles again and multithreading no longer keeps up, so you move to multiple processes, i.e. the computation is spread across different physical machines. Now you must handle what happens when any of those machines goes down, and so on; Hadoop MapReduce is exactly such a distributed-computation framework. The application writer only supplies the scatter and gather algorithms and need not worry about the rest.<br /><br />HBase MapReduce<br />Use TableMapper and TableReducer, as sketched below.<br /><br />HBase deployment architecture and components<br />Built on top of Hadoop and ZooKeeper.<br /><br />How HBase queries and saves records<br />See the previous post.<br /><br />Using HBase as a data source, a data sink, and a shared data store<br />Essentially the JOIN algorithms from databases: reduce-side join and map-side join.<br /></li>
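A skeleton of such a job, as a minimal sketch against the 0.94-era org.apache.hadoop.hbase.mapreduce API; the table names "source"/"target" and the column family/qualifier are placeholders:<br />
<pre style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px;width:98%;overflow:auto">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class HBaseMapReduceSkeleton {

    // One map task per region; map() is called once per row the scan returns
    static class RowMapper extends TableMapper&lt;Text, IntWritable&gt; {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(new Text(Bytes.toString(row.get())), ONE);
        }
    }

    // The reducer writes Puts back; each Put lands in whichever region owns its key
    static class SumReducer extends TableReducer&lt;Text, IntWritable, ImmutableBytesWritable&gt; {
        @Override
        protected void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            Put put = new Put(Bytes.toBytes(key.toString()));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(sum));
            ctx.write(null, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-table-count");
        job.setJarByClass(HBaseMapReduceSkeleton.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner caching for MR throughput
        scan.setCacheBlocks(false);  // keep MR scans out of the block cache

        TableMapReduceUtil.initTableMapperJob("source", scan,
                RowMapper.class, Text.class, IntWritable.class, job);
        TableMapReduceUtil.initTableReducerJob("target", SumReducer.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
</pre>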
</ol><img src ="http://www.blogjava.net/paulwong/aggbug/395168.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-02-06 09:53 <a href="http://www.blogjava.net/paulwong/archive/2013/02/06/395168.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>监控HBASE</title><link>http://www.blogjava.net/paulwong/archive/2013/02/04/395107.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Mon, 04 Feb 2013 07:08:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/02/04/395107.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/395107.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/02/04/395107.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/395107.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/395107.html</trackback:ping><description><![CDATA[@import url(http://www.blogjava.net/CuteSoft_Client/CuteEditor/Load.ashx?type=style&file=SyntaxHighlighter.css);@import url(/css/cuteeditor.css);
<div>Hadoop/HBase is an open-source counterpart of Google's Bigtable, GFS and MapReduce. As the internet grows, big-data processing becomes ever more important, and Hadoop/HBase finds ever wider use. To run a Hadoop/HBase system well you need a complete monitoring system that shows the real-time state of the cluster, so that everything stays under control. Hadoop/HBase ships a very complete metrics framework with statistics along all kinds of dimensions, and the framework is also well designed: users can easily add custom metrics. Even more important is how the metrics are exposed; currently three ways are supported: writing to local files, reporting to a Ganglia system, and exposing them over JMX. This post mainly covers how to report Hadoop/HBase metrics to Ganglia and view them in a browser.<br />
<br />
Before going further it is worth briefly introducing the Ganglia system. Ganglia is an open-source system-monitoring system made up of three parts: gmond, gmetad and webfrontend, which divide the work as follows:<br />
<br />
gmond: a daemon that runs on every node to be monitored; it collects monitoring statistics and sends and receives them on a shared multicast or unicast channel<br />
gmetad: a daemon that periodically polls the gmonds, pulls their data, and stores the metrics in an RRD storage engine<br />
webfrontend: installed on the machine where gmetad runs so it can read the RRD files; it provides the front-end display<br />
<br />
In short: gmond collects the metrics on each node, gmetad aggregates what the gmonds collected, and webfrontend displays gmetad's aggregated data. By default Ganglia monitors system metrics such as cpu/memory/net, but Hadoop/HBase has built-in Ganglia support, and a simple configuration change also feeds the Hadoop/HBase metrics into Ganglia for monitoring.<br />
<br />
Next, how to hook Hadoop/HBase into Ganglia. The Hadoop/HBase version used here is 0.94.2; earlier versions may differ in places, so mind the gaps. HBase was originally a subproject of Hadoop and therefore used the same Hadoop metrics framework. Hadoop later gained the improved metrics2 (metrics version 2) framework, and the Hadoop projects have moved to it; but HBase, now a top-level Apache project parallel to Hadoop, has not caught up yet and still uses the original metrics. The Hadoop and HBase sides therefore have to be covered separately.<br />
<br />
Hooking Hadoop into Ganglia:<br />
<br />
1. The configuration file for Hadoop metrics2 is hadoop-metrics2.properties<br />
2. Hadoop metrics2 introduces the concepts of source and sink: a source collects data, and a sink consumes what the sources collect (writing to local files, reporting to Ganglia, exposing over JMX, and so on)<br />
3. Configuring metrics2 to support Ganglia:</div>
<div>
<div style="background-color: #eeeeee; font-size: 13px; border-left-color: #cccccc; padding: 4px 5px 4px 4px; width: 98%; word-break: break-all; "><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
-->#*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink30<br />
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31<br />
&nbsp;<br />
*.sink.ganglia.period=10<br />
*.sink.ganglia.supportsparse=true<br />
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both<br />
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40<br />
&nbsp;<br />
#uncomment&nbsp;as&nbsp;needed<br />
namenode.sink.ganglia.servers=10.235.6.156:8649<br />
#datanode.sink.ganglia.servers=10.235.6.156:8649<br />
#jobtracker.sink.ganglia.servers=10.0.3.99:8649<br />
#tasktracker.sink.ganglia.servers=10.0.3.99:8649<br />
#maptask.sink.ganglia.servers=10.0.3.99:8649<br />
#reducetask.sink.ganglia.servers=10.0.3.99:8649</div>
</div>
<br />
<div><br />
</div>
<div>Points to note here:<br />
<br />
(1) Ganglia 3.1 is not compatible with 3.0, so pick GangliaSink30 or GangliaSink31 according to your Ganglia version<br />
(2) period sets the reporting interval, in seconds (s)<br />
(3) namenode.sink.ganglia.servers specifies the host:port of the Ganglia gmetad to report to<br />
(4) if several Hadoop processes (namenode/datanode, etc.) are started on the same physical machine, configure sink.ganglia.servers for each of those processes as needed<br />
Hooking HBase into Ganglia:<br />
<br />
1. The configuration file for the hadoop metrics framework used by HBase is hadoop-metrics.properties<br />
2. The core concept in hadoop metrics is the Context: TimeStampingFileContext writes to files, while GangliaContext/GangliaContext31 report to Ganglia<br />
3. Configuring hadoop metrics to support Ganglia:</div>
<div>
<div style="background-color: #eeeeee; font-size: 13px; border-left-color: #cccccc; padding: 4px 5px 4px 4px; width: 98%; word-break: break-all; "><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
-->#&nbsp;Configuration&nbsp;of&nbsp;the&nbsp;"hbase"&nbsp;context&nbsp;for&nbsp;ganglia<br />
#&nbsp;Pick&nbsp;one:&nbsp;Ganglia&nbsp;3.0&nbsp;(former)&nbsp;or&nbsp;Ganglia&nbsp;3.1&nbsp;(latter)<br />
#&nbsp;hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext<br />
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31<br />
hbase.period=10<br />
hbase.servers=10.235.6.156:8649</div>
</div>
<div><br />
</div>
<div>Points to note here:<br />
<br />
(1) Ganglia 3.1 and 3.0 are incompatible: for versions before 3.1 use GangliaContext, and for Ganglia 3.1 use GangliaContext31<br />
(2) period is in seconds (s); it configures how often data is reported to Ganglia<br />
(3) servers specifies the host:port of the Ganglia gmetad to which the data is reported<br />
(4) the rpc- and jvm-related metrics can all be configured in the same way</div>
</div><img src ="http://www.blogjava.net/paulwong/aggbug/395107.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-02-04 15:08 <a href="http://www.blogjava.net/paulwong/archive/2013/02/04/395107.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>HBASE部署要点</title><link>http://www.blogjava.net/paulwong/archive/2013/02/04/395101.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Mon, 04 Feb 2013 04:10:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/02/04/395101.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/395101.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/02/04/395101.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/395101.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/395101.html</trackback:ping><description><![CDATA[<div>REGIONS SERVER和TASK TRACKER SERVER不要在同一台机器上，最好如果有MAPREDUCE JOB运行的话，应该分开两个CLUSTER，即两群不同的服务器上，这样MAPREDUCE 的线下负载不会影响到SCANER这些线上负载。</div>
<div><br />
</div>
<div>If the workload is mostly MapReduce jobs, placing region servers and task trackers together is acceptable.</div>
<div><br />
</div>
<div><br />
</div>
<div><span style="background-color: yellow; color: red; ">原始集群模式</span></div>
<div><br />
</div>
Ten nodes or fewer, no MapReduce jobs, mainly for low-latency access. Per-node configuration: 4-6 core CPU, 24-32 GB RAM, 4 SATA disks. The Hadoop NameNode, JobTracker, HBase Master and ZooKeeper all run on the same node.
<div><br />
</div>
<div><br />
</div>
<div>
<div><span style="background-color: yellow; color: red; ">小型集群模式（10-20台服务器）</span></div>
<div><br />
</div>
Put the HBase Master on a machine of its own, which lets you use a lower-spec machine. ZooKeeper also goes on its own machine, while the NameNode and JobTracker share one machine.</div>
<div><br />
</div>
<div>
<div><span style="background-color: yellow; color: red; ">中型集群模式（20-50台服务器）</span></div>
<div><br />
</div>
Since cost no longer has to be pinched, HBase Master and ZooKeeper can share machines; run three instances each of ZooKeeper and HBase Master. The NameNode and JobTracker share one machine.</div>
<div><br />
</div>
<div>
<div><span style="background-color: yellow; color: red; ">大型集群模式（&gt;50台服务器）</span></div>
<div><br />
</div>
Similar to the medium layout, but run five instances each of ZooKeeper and HBase Master. The NameNode and Secondary NameNode need sufficiently large memory.</div>
<div><br />
</div>
<div>
<div><span style="background-color: yellow; color: red; ">HADOOP MASTER节点</span></div>
<div><br />
</div>
Server requirements for the NameNode and Secondary NameNode: (small cluster) 8-core CPU, 16 GB RAM, 1 GbE NIC and SATA disks; for a medium cluster add another 16 GB of RAM, and for a large one add another 32 GB.</div>
<div><br />
</div>
<div>
<div><span style="background-color: yellow; color: red; ">HBASE MASTER节点</span></div>
<div><br />
</div>
Server requirements: 4-core CPU, 8-16 GB RAM, 1 GbE NIC and 2 SATA disks: one for the operating system, the other for the HBase Master logs.</div>
<div><br />
</div>
<div>
<div><span style="background-color: yellow; color: red; ">HADOOP DATA NODES和HBASE REGION SERVER节点</span></div>
<div><br />
</div>
Data nodes and region servers should share a server, and should not be co-located with task trackers. Server requirements: 8-12 core CPU, 24-32 GB RAM, 1 GbE NIC and 12 x 1 TB SATA disks: one for the operating system, the rest for data.</div>
<div><br />
</div>
<div>
<div><span style="background-color: yellow; color: red; ">ZOOPKEEPERS节点</span></div>
<div><br />
</div>
Server configuration is similar to the HBase Master's; ZooKeeper can even share the HBase Master's machine, but then add an extra disk dedicated to ZooKeeper.</div>
<div><br />
</div>
<div>
<div><span style="background-color: yellow; color: red; ">安装各节点</span></div>
<div><br />
</div>
JVM configuration:</div>
-Xmx8g: caps the heap at 8 GB; setting it as high as 15 GB is not recommended.<br />
-Xms8g: sets the minimum heap to 8 GB.<br />
-Xmn128m: sets the young generation to 128 MB; the default is too small.<br />
-XX:+UseParNewGC: selects the collector for the young generation. This type stops the Java process while it collects, but since the young generation is small the pause usually lasts only a few milliseconds, which is acceptable.<br />
-XX:+UseConcMarkSweepGC: selects the collector for the old generation. A young-generation-style collector would be unsuitable there, because it would stop the Java process for too long; this one does not stop the process but collects concurrently while the Java process keeps running.<br />
-XX:CMSInitiatingOccupancyFraction: sets the old-generation occupancy percentage at which the CMS collector starts running.<br />
</div><img src ="http://www.blogjava.net/paulwong/aggbug/395101.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-02-04 12:10 <a href="http://www.blogjava.net/paulwong/archive/2013/02/04/395101.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>HBASE读书笔记</title><link>http://www.blogjava.net/paulwong/archive/2013/02/01/395020.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 01 Feb 2013 05:55:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/02/01/395020.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/395020.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/02/01/395020.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/395020.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/395020.html</trackback:ping><description><![CDATA[<div>GET、PUT是ONLINE的操作，MAPREDUCE是OFFLINE的操作</div>
<div></div><br/><br/>
<div><span style="color: #0000ff; background-color: yellow;">HDFS写流程</span></div>
<div>When the client receives a request to save a file, it splits the file into blocks of 64 MB each, forming a list of blocks, and tells the NameNode: "I want to store this". The NameNode computes a placement list saying which block should be written to which DataNodes. The client sends the first block to the first node, DataNode A, telling it to store the block and also to tell DataNode D and DataNode B to each keep a copy. DataNode D stores its copy and in turn tells DataNode B to store one; when DataNode B finishes, the client is told that the block has been stored. The client then goes back to the NameNode for the location of the next block, and the sequence repeats until all the blocks are stored.</div>
<div></div><br/>
<div><span style="color: #0000ff; background-color: yellow;">HDFS读流程</span></div>
<div>The client asks the NameNode to read a file. The NameNode returns, for all the blocks making up the file, the DataNode IPs and block IDs. The client then sends requests to the DataNodes in parallel, asking for the block with a given block ID; the DataNodes send the requested blocks back, and once the client has collected all the blocks and assembled them into the complete file, the flow ends.<br />
<br />
<br />
</div>
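<div>Both flows are driven through the same client API: the NameNode lookups, the DataNode pipeline and the block gathering described above all happen inside these calls. A minimal sketch, assuming the Hadoop FileSystem API; the path is a placeholder:</div>
<pre style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px;width:98%;overflow:auto">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/paul/demo.txt");  // placeholder path

        // Write: the output stream cuts the data into blocks and drives
        // the DataNode replication pipeline behind the scenes
        FSDataOutputStream out = fs.create(path);
        out.write("hello hdfs".getBytes("UTF-8"));
        out.close();

        // Read: the input stream asks the NameNode for block locations
        // and pulls the blocks from the DataNodes
        FSDataInputStream in = fs.open(path);
        byte[] buf = new byte[10];
        in.readFully(0, buf);
        in.close();
        System.out.println(new String(buf, "UTF-8"));

        fs.close();
    }
}
</pre>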
<div></div>
<div><span style="color: #0000ff; background-color: yellow;">MAPREDUCE流程</span></div>
<div>Input data -- picked up not by multiple threads but by multiple processes, i.e. the input is split into chunks and each process handles one chunk -- grouping -- multi-process aggregation of the data -- output</div>
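<div>A minimal word-count sketch of that flow against the Hadoop MapReduce API (job name and paths are placeholders): map is the per-chunk selection, the shuffle does the grouping, and reduce is the multi-process aggregation.</div>
<pre style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px;width:98%;overflow:auto">
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    static class TokenMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);  // emit (word, 1) for the grouping step
            }
        }
    }

    static class SumReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
        @Override
        protected void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));  // aggregated output
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word-count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/paul/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/paul/output")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
</pre>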
<div><br />
<span style="color: #0000ff; background-color: yellow;">HBASE表结构</span></div>
<div>HBase splits one large table into smaller tables; each small table is called a REGION, the server that holds regions is called a REGIONSERVER, and one region server can hold several regions. A region server usually sits on the same machine as a data node to reduce network IO.</div>
<div></div>
<div>The -ROOT- table lives on the master server and records how many region servers there are in total; every region server carries a .META. table recording which regions of which tables that region server holds. To learn how many regions a table has in total, you must consult the .META. tables on all the region servers and aggregate the answers.</div>
<div></div>
<div>To look up ROW009, the client first asks ZooKeeper where the -ROOT- table is, then asks -ROOT- which .META. table knows about this row, then asks that .META. table which region holds the row, and finally asks that region for ROW009; the region returns the information.<br />
</div>
<br />
<br />
<div><span style="color: #0000ff; background-color: yellow;">HBASE MAPREDUCE</span></div>
<div>One region means one map task; how many times the task's map method runs is decided by how many records the query returns: that many records, that many calls.</div>
<div>Reduce tasks are responsible for writing data back to the regions; which region a record is written to is decided by which region owns that key, so a reduce task may end up interacting with every region server.<br />
</div>
<br />
<br />
<div><span style="color: #0000ff; background-color: yellow;">在HBASE的MAPREDUCE JOB中使用JOIN</span></div>
<div>REDUCE-SIDE JOIN<br />
Uses the existing shuffle grouping mechanism and performs the join in the reduce phase; but since the map phase carries the full data volume, there can be performance problems.</div>
<div>MAP-SIDE JOIN</div>
<div>Read the smaller of the two tables into a shared file, then loop over the other table's records inside the map method and look the needed rows up in that shared file. This reduces the shuffle and sort time and needs no reduce task at all; a sketch follows below.</div>
<div></div>
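<div>A minimal map-side join sketch using the old (Hadoop 1.x era) DistributedCache as the shared file mechanism; the cached file is assumed to hold "key&lt;TAB&gt;value" lines, and all paths are placeholders:</div>
<pre style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding:4px;width:98%;overflow:auto">
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the smaller table is shipped to every mapper via the
// DistributedCache and loaded into a HashMap in setup(); no reduce task needed.
public class MapSideJoinMapper extends Mapper&lt;LongWritable, Text, Text, Text&gt; {

    private final Map&lt;String, String&gt; smallTable = new HashMap&lt;String, String&gt;();

    @Override
    protected void setup(Context ctx) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] kv = line.split("\t", 2);
            smallTable.put(kv[0], kv[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        // Join each record of the big table against the in-memory small table
        String[] kv = value.toString().split("\t", 2);
        String joined = smallTable.get(kv[0]);
        if (joined != null) {
            ctx.write(new Text(kv[0]), new Text(kv[1] + "\t" + joined));
        }
    }
}
// Driver side (placeholder path):
// DistributedCache.addCacheFile(new java.net.URI("/user/paul/small.txt"), job.getConfiguration());
</pre>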
<img src ="http://www.blogjava.net/paulwong/aggbug/395020.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-02-01 13:55 <a href="http://www.blogjava.net/paulwong/archive/2013/02/01/395020.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Hadoop的几种Join方法</title><link>http://www.blogjava.net/paulwong/archive/2013/01/31/395000.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Thu, 31 Jan 2013 10:24:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/01/31/395000.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/395000.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/01/31/395000.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/395000.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/395000.html</trackback:ping><description><![CDATA[1)      在Reduce阶段进行Join,这样运算量比较小.(这个适合被Join的数据比较小的情况下.)<br />2)      压缩字段,对数据预处理,过滤不需要的字段.<br />3)      最后一步就是在Mapper阶段过滤,这个就是Bloom Filter的用武之地了.也就是需要详细说明的地方.<br /><br /> <br />下面就拿一个我们大家都熟悉的场景来说明这个问题: 找出上个月动感地带的客户资费的使用情况,包括接入和拨出.<br /><br />(这个只是我臆想出来的例子,根据实际的DB数据存储结构,在这个场景下肯定有更好的解决方案,大家不要太较真哦)<br /><br />这个时候的两个个数据集都是比较大的,这两个数据集分别是:上个月的通话记录,动感地带的手机号码列表.<br /><br /><br />比较直接的处理方法有2种:<br /><br /><strong>1)在 Reduce 阶段,通过动感地带号码来过滤.</strong><br /><br />                优点:这样需要处理的数据相对比较少,这个也是比较常用的方法.<br /><br />                缺点:很多数据在Mapper阶段花了老鼻子力气汇总了,还通过网络Shuffle到Reduce节点,结果到这个阶段给过滤了.<br /><br /> <br /><br /><strong>2)在 Mapper 阶段时,通过动感地带号码来过滤数据.</strong><br /><br />                优点:这样可以过滤很多不是动感地带的数据,比如神州行,全球通.这些过滤的数据就可以节省很多网络带宽了.<br /><br />                缺点:就是动感地带的号码不是小数目,如果这样处理就需要把这个大块头复制到所有的Mapper节点,甚至是Distributed Cache.(Bloom Filter就是用来解决这个问题的)<br /><br /><br />Bloom Filter就是用来解决上面方法2的缺点的.<br /><br />方法2的缺点就是大量的数据需要在多个节点复制.Bloom Filter通过多个Hash算法, 把这个号码列表压缩到了一个Bitmap里面. 通过允许一定的错误率来换空间, 这个和我们平时经常提到的时间和空间的互换类似.详细情况可以参考:<br /><br />http://blog.csdn.net/jiaomeng/article/details/1495500<br /><br />但是这个算法也是有缺陷的,就是会把很多神州行,全球通之类的号码当成动感地带.但在这个场景中,这根本不是问题.因为这个算法只是过滤一些号码,漏网之鱼会在Reduce阶段进行精确匹配时顾虑掉.<br /><br />这个方法改进之后基本上完全回避了方法2的缺点:<br /><br />1)      没有大量的动感地带号码发送到所有的Mapper节点.<br />2)      很多非动感地带号码在Mapper阶段就过滤了(虽然不是100%),避免了网络带宽的开销及延时.<br /><br /><br />继续需要学习的地方:Bitmap的大小, Hash函数的多少, 以及存储的数据的多少. 这3个变量如何取值才能才能在存储空间与错误率之间取得一个平衡.<img src ="http://www.blogjava.net/paulwong/aggbug/395000.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-01-31 18:24 <a href="http://www.blogjava.net/paulwong/archive/2013/01/31/395000.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>