﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-Skynet-随笔分类-数据挖掘</title><link>http://www.blogjava.net/Skynet/category/41197.html</link><description /><language>zh-cn</language><lastBuildDate>Fri, 11 Dec 2009 14:39:25 GMT</lastBuildDate><pubDate>Fri, 11 Dec 2009 14:39:25 GMT</pubDate><ttl>60</ttl><item><title>和 业务讨论的 推荐</title><link>http://www.blogjava.net/Skynet/archive/2009/12/11/305591.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Fri, 11 Dec 2009 08:20:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/12/11/305591.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/305591.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/12/11/305591.html#Feedback</comments><slash:comments>2</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/305591.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/305591.html</trackback:ping><description><![CDATA[<br />
<br />
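Definitions: <br />
Grey sheep&nbsp;&nbsp; (users with no firm preferences of their own)<br />
Black sheep&nbsp;&nbsp; (users who know exactly what they need; we usually call them expert users)<br />
<br />
<br />
1. Separate the grey sheep (no firm preferences) from the black sheep.<br />
<br />
2.<br />
user-session association # maintain association relationships by the user's session ID (a user in a different mood should really be classified differently in the data)<br />
user recommendation&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # the products actually recommended are still tied to the user's unique ID<br />
# the recommendation needs to describe the user from multiple angles<br />
<br />
3. <br />
Brute-force recommendation (all data, i.e. data after only the initial cleaning pass) suits product-to-product association&nbsp; <br />
Data from the later cleaning stages (including multi-dimensional user descriptions) suits user-to-user association <br />
<br />
<br />
4.<br />
Expert-following recommendation<br />
Description:<br />
&nbsp; Classify users to find the black sheep &nbsp;<br />
&nbsp; Find the association between a group of grey sheep and one black sheep<br />
&nbsp; Let that group of grey sheep see what the black sheep does<br />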
<br />
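The expert-following idea in item 4 can be sketched roughly as follows. This is only an illustration under my own assumptions: the names follow_experts, history, and min_overlap are hypothetical, experts (black sheep) are assumed to be labelled already, and "association" is reduced to counting shared items.<br />

```python
def follow_experts(history, experts, min_overlap=1):
    """Attach each non-expert user to the expert sharing the most items,
    then recommend the expert's items the user has not touched yet."""
    recs = {}
    for user, items in history.items():
        if user in experts:
            continue
        best, best_overlap = None, 0
        for expert in sorted(experts):          # sorted for a stable tie-break
            overlap = len(items & history[expert])
            if overlap > best_overlap:
                best, best_overlap = expert, overlap
        if best is not None and best_overlap >= min_overlap:
            recs[user] = (best, sorted(history[best] - items))
    return recs


# Hypothetical toy data: user id mapped to the set of products acted on.
history = {
    "expert_a": {"p1", "p2", "p3", "p4"},
    "expert_b": {"p7", "p8", "p9"},
    "u1": {"p1", "p2"},
    "u2": {"p8"},
    "u3": {"p5"},
}
recs = follow_experts(history, experts={"expert_a", "expert_b"})
print(recs)
```

A user with no overlap with any expert (u3 here) simply gets no expert to follow; a real system would need a fallback for that case.<br />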
<br />
<br />
<img src ="http://www.blogjava.net/Skynet/aggbug/305591.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-12-11 16:20 <a href="http://www.blogjava.net/Skynet/archive/2009/12/11/305591.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>File storage: data structures (py)</title><link>http://www.blogjava.net/Skynet/archive/2009/11/04/301072.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Wed, 04 Nov 2009 07:16:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/11/04/301072.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/301072.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/11/04/301072.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/301072.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/301072.html</trackback:ping><description><![CDATA[Databases such as MySQL, Oracle, Berkeley DB, and sqlite3 are already very good. <br />
&nbsp;But after a first, rough look at data mining, I found that a relational database falls far short for storing and querying post-ETL data. <br />
<br />
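For example: I want to sort raw log data by one field. Sounds simple, right? <br />
&nbsp; Some would say: load the data into a database (load into table ...), then select ... order by.<br />
&nbsp; Others would say: use Linux sort -n ... <br />
<br />
Fine. Now let's run this simple operation on 1 TB of data, and suddenly we are stuck!<br />
&nbsp;&nbsp; On mining: in under half a year of studying data mining I have already run into TB-scale data 3 or 4 times.<br />
<br />
Solution:<br />
For this problem, what I want is one large linked list <strong>(too large to fit in memory)</strong>,<br />
&nbsp;<strong> where each struct in the list holds </strong>:<br />
&nbsp;&nbsp; &gt;&gt; the file that the record with the sort key belongs to<br />
&nbsp;&nbsp; &gt;&gt; the start and end offsets of the whole record in that file<br />
&nbsp;&nbsp; &gt;&gt; its place in the sort order (as a linked list: each node stores only the list position of the next larger key)<br />
<br />
<br />
For example: <br />
&nbsp; 1. Contents of file 1 =&gt; <br />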
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #000000;">Legend:<br />
full record&nbsp;:&nbsp;<strong>start and end byte offsets of the record in the file (obtained by the program, of course; written out here for convenience)</strong><br />
..c<img src="http://www.blogjava.net/Images/dot.gif" alt="" />.&nbsp;&nbsp;0&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">22</span><span style="color: #000000;"><br />
..a<img src="http://www.blogjava.net/Images/dot.gif" alt="" />.&nbsp;&nbsp;</span><span style="color: #000000;">23</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">55</span><span style="color: #000000;"><br />
..b<img src="http://www.blogjava.net/Images/dot.gif" alt="" />.&nbsp;&nbsp;</span><span style="color: #000000;">56</span><span style="color: #000000;">-</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">76</span><span style="color: #000000;"><br />
..d<img src="http://www.blogjava.net/Images/dot.gif" alt="" />.&nbsp;&nbsp;</span><span style="color: #000000;">77</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">130</span><span style="color: #000000;"><br />
..f<img src="http://www.blogjava.net/Images/dot.gif" alt="" />.&nbsp;&nbsp;</span><span style="color: #000000;">131</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">220</span><span style="color: #000000;"><br />
..e<img src="http://www.blogjava.net/Images/dot.gif" alt="" />.&nbsp;&nbsp;</span><span style="color: #000000;">221</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">243</span></div>
<br />
&nbsp; 2. Each list node is pre-allocated 100 bytes.<br />
&nbsp; 3. What is stored in the index file: # I won't explain how to sort a linked list; it is the most basic data-structure skill. You just update each node's pointer to the next larger key. <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Here is the result directly:<br />
{ /tmp/file1, 0-22 ,&nbsp; 300 }&nbsp;&nbsp; # c : at list position 0 <br />
{ /tmp/file1, 23-55 , 200 }&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # a : 100 <br />
{ /tmp/file1, 56-76 , 0 }&nbsp;&nbsp;&nbsp;&nbsp; # b : 200 <br />
{ /tmp/file1, 77-130 , 500 }&nbsp; # d : 300 <br />
{ /tmp/file1, 131-220 ,&nbsp; } # f : 400 (last node, no next pointer) <br />
{ /tmp/file1, 221-243 , 400 } # e : 500 <br />
<br />
4. Output in sorted order, smallest to largest <br />
&nbsp;&nbsp;&nbsp;&nbsp; Assume the stored head (the smallest key) is list position 100.<br />
&nbsp;&nbsp;&nbsp;&nbsp;
To read it, open /tmp/file1&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and seek the file cursor to 23-55 to read&nbsp; ..a... <br />
&nbsp; &nbsp; &nbsp;&nbsp; Then follow that node's pointer to list position 200, seek to 56-76, and read ..b...<br />
&nbsp; &nbsp; &nbsp;&nbsp; And so on. <br />
<br />
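Of course, <br />
&nbsp; for the data structure above you could instead use a doubly linked list, a B-tree, a red-black tree, a Fibonacci heap... (data structures finally feel useful; sitting that software certification exam was not wasted!)<br />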
<br />
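The on-disk linked list described above could look roughly like this in Python. Treat it as a sketch under my own assumptions: the 100-byte node layout "=76sqqq" (file name, start, end, next), the END marker, and the helper names build_index and read_sorted are all hypothetical, and for brevity the keys are sorted in memory, which a real TB-scale version would have to replace with an external or incremental insert.<br />

```python
import os
import struct
import tempfile

# Hypothetical node layout: file name, record start, record end, offset of next node.
NODE = struct.Struct("=76sqqq")    # 76 + 3 * 8 = 100 bytes per node
END = -1                           # next-pointer value of the last node


def build_index(data_path, index_path):
    """Write one node per line of data_path; return the offset of the head node."""
    records = []                   # (sort key, start offset, end offset)
    pos = 0
    with open(data_path, "rb") as data:
        for line in data:
            records.append((line.rstrip(b"\n"), pos, pos + len(line)))
            pos += len(line)
    # Only keys and offsets are sorted here; the records themselves never move.
    order = sorted(range(len(records)), key=lambda i: records[i][0])
    next_of = [END] * len(records)
    for cur, nxt in zip(order, order[1:]):
        next_of[cur] = nxt * NODE.size          # point at the next larger key
    name = os.path.basename(data_path).encode()
    with open(index_path, "wb") as index:
        for i, (_, start, end) in enumerate(records):
            index.write(NODE.pack(name, start, end, next_of[i]))
    return order[0] * NODE.size


def read_sorted(index_path, head, base_dir):
    """Follow the next pointers from head, seeking into the data file per record."""
    out = []
    with open(index_path, "rb") as index:
        node = head
        while node != END:
            index.seek(node)
            name, start, end, node = NODE.unpack(index.read(NODE.size))
            path = os.path.join(base_dir, name.rstrip(b"\x00").decode())
            with open(path, "rb") as data:
                data.seek(start)
                out.append(data.read(end - start).rstrip(b"\n"))
    return out


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "file1")
    with open(data_path, "wb") as f:
        f.write(b"c\na\nb\nd\nf\ne\n")          # same key order as the example
    index_path = os.path.join(tmp, "file1.idx")
    head = build_index(data_path, index_path)
    result = read_sorted(index_path, head, tmp)
print(head, result)
```

With the keys in the order c, a, b, d, f, e, the head lands on the node at offset 100 (the node for a), matching the walk-through above.<br />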
<br />
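With that explained, here are some technical details (in Python) that you may need. Corrections are welcome!<br />
<br />
<strong>1. Structured writes and in-place updates to a binary file</strong><br />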
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #000000;"><strong># Overwrite the record at byte offset 190 (record 10 of 19-byte records)</strong><br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;os<br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;struct&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;pack<br />
<br />
fd&nbsp;=&nbsp;os.open(</span><span style="color: #800000;">"pack1.txt"</span><span style="color: #000000;">,&nbsp;os.O_RDWR&nbsp;|&nbsp;os.O_CREAT)<br />
<br />
# two ints plus an 11-byte string field; a bytes literal so this also runs on Python 3<br />
ss&nbsp;=&nbsp;pack(</span><span style="color: #800000;">'ii11s'</span><span style="color: #000000;">,&nbsp;3,&nbsp;4,&nbsp;b</span><span style="color: #800000;">'google'</span><span style="color: #000000;">)<br />
os.lseek(fd,&nbsp;len(ss)&nbsp;*&nbsp;10,&nbsp;os.SEEK_SET)<br />
os.write(fd,&nbsp;ss)<br />
os.fsync(fd)<br />
os.close(fd)<br />
</span></div>
<br />
<br />
<br />
<strong>2. Seek to a given position and read a structured record</strong><br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #000000;"><br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;struct&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;calcsize,&nbsp;unpack<br />
<br />
RECORD&nbsp;=&nbsp;</span><span style="color: #800000;">'ii11s'</span><span style="color: #000000;"><br />
file_object&nbsp;=&nbsp;open(</span><span style="color: #800000;">'pack1.txt'</span><span style="color: #000000;">,&nbsp;</span><span style="color: #800000;">'rb'</span><span style="color: #000000;">)<br />
<br />
</span><span style="color: #0000ff;">def</span><span style="color: #000000;">&nbsp;ts(si,&nbsp;ss=calcsize(RECORD)):<br />
&nbsp;&nbsp;&nbsp;&nbsp;# jump to record number si and decode one fixed-size record<br />
&nbsp;&nbsp;&nbsp;&nbsp;file_object.seek(si&nbsp;*&nbsp;ss)<br />
&nbsp;&nbsp;&nbsp;&nbsp;chunk&nbsp;=&nbsp;file_object.read(ss)<br />
&nbsp;&nbsp;&nbsp;&nbsp;a,&nbsp;b,&nbsp;c&nbsp;=&nbsp;unpack(RECORD,&nbsp;chunk)<br />
&nbsp;&nbsp;&nbsp;&nbsp;# the string field is padded with NUL bytes up to 11 bytes<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">print</span><span style="color: #000000;">(a,&nbsp;b,&nbsp;c.rstrip(b</span><span style="color: #800000;">'\x00'</span><span style="color: #000000;">).decode())<br />
<br />
ts(10)<br />
# output:&nbsp; </span>3 4 google<br />
<br />
</div>
<br />
<br />
<br />
<br />
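<strong>3. Using the data from other languages</strong><br />
The record layout is defined in Python with the struct module, so data serialized to a file this way can also be read from other languages. <br />
&nbsp;Reference: http://www.pythonid.com/bbs/archiver/?tid-285.html<br />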
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><span style="color: #000000;">pack1.py<br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;struct&nbsp;</span><span style="color: #0000ff;">import</span>&nbsp;<span style="color: #000000;">*</span><span style="color: #000000;"><br />
<br />
<strong># "i" is a 4-byte int; "11s" is a fixed 11-byte string field<br />
# each record is therefore 19 bytes (4 + 4 + 11)</strong><br />
ss&nbsp;=&nbsp;pack('ii11s',&nbsp;1,&nbsp;2,&nbsp;b'hello&nbsp;world')<br />
<br />
f&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;open(</span><span style="color: #800000;">"</span><span style="color: #800000;">pack1.txt</span><span style="color: #800000;">"</span><span style="color: #000000;">,&nbsp;</span><span style="color: #800000;">"</span><span style="color: #800000;">wb</span><span style="color: #800000;">"</span><span style="color: #000000;">)<br />
f.write(ss)<br />
f.close()<br />
<br />
<br />
The code above writes data laid out like a C struct: two ints followed by a string.<br />
pack1.c<br />
</span><span style="color: #008000;">#</span><span style="color: #008000;">include&nbsp;&lt;stdio.h&gt;</span><span style="color: #008000;"><br />
#</span><span style="color: #008000;">include&nbsp;&lt;string.h&gt;</span><span style="color: #008000;"><br />
</span><span style="color: #000000;"><br />
struct&nbsp;AA<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;int&nbsp;a;<br />
&nbsp;&nbsp;&nbsp;&nbsp;int&nbsp;b;<br />
&nbsp;&nbsp;&nbsp;&nbsp;char&nbsp;&nbsp;&nbsp;&nbsp;c[</span><span style="color: #000000;">64</span><span style="color: #000000;">];<br />
};<br />
<br />
int&nbsp;main()<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;struct&nbsp;AA&nbsp;&nbsp;&nbsp;aa;<br />
&nbsp;&nbsp;&nbsp;&nbsp;FILE&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;">fp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;int&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;size,&nbsp;readsize;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;memset(</span><span style="color: #000000;">&amp;</span><span style="color: #000000;">aa,&nbsp;0,&nbsp;sizeof(struct&nbsp;AA));<br />
&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;fp&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;fopen(</span><span style="color: #800000;">"</span><span style="color: #800000;">pack1.txt</span><span style="color: #800000;">"</span><span style="color: #000000;">,&nbsp;</span><span style="color: #800000;">"</span><span style="color: #800000;">rb</span><span style="color: #800000;">"</span><span style="color: #000000;">);<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;(NULL&nbsp;</span><span style="color: #000000;">==</span><span style="color: #000000;">&nbsp;fp)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #800000;">"</span><span style="color: #800000;">open&nbsp;file&nbsp;error!\n</span><span style="color: #800000;">"</span><span style="color: #000000;">);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">return</span><span style="color: #000000;">&nbsp;0;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;readsize&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;sizeof(struct&nbsp;AA);<br />
&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #800000;">"</span><span style="color: #800000;">readsize:&nbsp;%d\n</span><span style="color: #800000;">"</span><span style="color: #000000;">,&nbsp;readsize);<br />
&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;size&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;fread(</span><span style="color: #000000;">&amp;</span><span style="color: #000000;">aa,&nbsp;</span><span style="color: #000000;">1</span><span style="color: #000000;">,&nbsp;readsize,&nbsp;fp);&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #800000;">"</span><span style="color: #800000;">read:&nbsp;%d\n</span><span style="color: #800000;">"</span><span style="color: #000000;">,&nbsp;size);<br />
&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #800000;">"</span><span style="color: #800000;">a=%d,&nbsp;b=%d,&nbsp;c=%s\n</span><span style="color: #800000;">"</span><span style="color: #000000;">,&nbsp;aa.a,&nbsp;aa.b,&nbsp;aa.c);<br />
&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;fclose(fp);<br />
&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">return</span><span style="color: #000000;">&nbsp;0;<br />
}<br />
<br />
<strong>Output:</strong><br />
</span>C:\Documents and Settings\lky\桌面\dataStructure&gt;a<br />
readsize: 72<br />
read: 57<br />
a=1, b=2, c=hello word<strong><br />
</strong></div>
<strong><br />
<br />
</strong><br />
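One detail worth spelling out: the C program asks fread for sizeof(struct AA) = 72 bytes, while a record packed with 'ii11s' is only 19 bytes long. A small sketch of how the Python side could write records that match the C struct exactly, assuming a platform where int is 4 bytes; the format 'ii64s' is my own choice to mirror char c[64] and is not from the original code.<br />

```python
import struct

# 'ii11s' is the 19-byte record used above;
# 'ii64s' mirrors struct AA { int a; int b; char c[64]; } on a 4-byte-int platform.
short_record = struct.pack("ii11s", 1, 2, b"hello world")
full_record = struct.pack("ii64s", 1, 2, b"hello world")   # NUL-padded to 64 bytes

print(len(short_record), len(full_record))

# Reading it back: strip the NUL padding to recover the text.
a, b, c = struct.unpack("ii64s", full_record)
text = c.rstrip(b"\x00").decode()
print(a, b, text)
```

Because 's' fields are padded with NUL bytes, the C side gets a properly terminated string as long as the text is shorter than 64 bytes.<br />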
&nbsp; &nbsp; <br />
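<strong>A final word:</strong><br />
&nbsp; Once you can use your own data structures, a lot can be customized to your own logic and storage becomes convenient. You are no longer constrained by relational databases, key-value stores, or MapReduce. <br />
&nbsp;&nbsp; <br />
References:<br />
http://docs.python.org/library/struct.html#module-struct&nbsp;&nbsp;&nbsp; # official documentation of the struct module<br />
http://blog.csdn.net/JGood/archive/2009/06/22/4290158.aspx&nbsp; # notes left by someone who used struct before me<br />
http://www.tutorialspoint.com/python/os_lseek.htm # a small demo <br />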
<a id="ctl04_TitleUrl" class="postTitle2" href="http://www.cnblogs.com/coderzh/archive/2008/05/10/1191410.html">Python天天美味(17) - open读写文件</a><br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<img src ="http://www.blogjava.net/Skynet/aggbug/301072.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-11-04 15:16 <a href="http://www.blogjava.net/Skynet/archive/2009/11/04/301072.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A brief introduction to the data mining process</title><link>http://www.blogjava.net/Skynet/archive/2009/11/03/300946.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Tue, 03 Nov 2009 09:44:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/11/03/300946.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/300946.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/11/03/300946.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/300946.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/300946.html</trackback:ping><description><![CDATA[I will use churn analysis, the most common data mining task in business, as the running example:<br />
<br />
The data mining process:<br />
1. Define the topic: <strong>Good grief, what am I actually doing!</strong> (Most of this step is subjective judgment, with a little objective verification.)<br />
&nbsp; 1.1 Establish how the target users are distributed across user groups: the share of churned users in each group<br />
&nbsp;&nbsp;&nbsp; Analyze subjectively how churn differs between customer groups, e.g. by channel, software version, page layout, or feature.<br />
&nbsp;&nbsp;&nbsp; List in detail the factors that affect churn the most, e.g. probability distributions and the effect of page layout changes.<br />
&nbsp; 1.2 Establish the characteristics of the target users: the characteristics of churned users<br />
&nbsp;&nbsp;&nbsp;&nbsp; Fields with a large effect on churn, e.g. amount spent, software version (missing a much-needed feature), and the time customer service takes to handle an issue<br />
&nbsp;<br />
<br />
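2. Data selection: <strong>the voters you have decide the president you get!</strong><br />
&nbsp;&nbsp; One thing is hard to get right in this step: the more dimensions you use, the more precisely the data is described, but the more complex it becomes.<br />
&nbsp;&nbsp; You probably don't want to spend 3 days finding the users who churned 2 days ago! :)<br />
&nbsp;&nbsp; 2.1 Collect by partition<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; In churn analysis, if the collection interval is too long, the customer may already be gone by the time churn is detected; if collection is too frequent or real-time, you must consider whether the operator's existing systems can support it. Setting the collection interval is therefore especially important.<br />
&nbsp;&nbsp; 2.2 Reduce data noise<br />
&nbsp;&nbsp; 2.3 Remove some redundant data<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Note that in churn analysis the main purpose of pulling data from the data warehouse is to examine how customer information changes. Drop the data that is not needed.<br />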
<br />
<br />
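3. Data analysis: <strong>the warm-up matters!</strong><br />
&nbsp;&nbsp; 3.1 Data sampling <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Enough said: in this age of information explosion, don't tell me you loaded hundreds of TB of data into the analysis database!<br />
&nbsp;&nbsp; 3.2 Data transformation<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For example, for time: convert morning to 1, midday to 2, and so on, to make analysis easier.<br />
&nbsp;&nbsp; 3.3 Handling missing data<br />
&nbsp;&nbsp; 3.4 Sample generation<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Modeling sample: prepared for the next stage<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Test sample: used to correct and validate the model<br />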
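<br />
Steps 3.2 and 3.4 above can be sketched in a few lines. The hour boundaries, the 70/30 split, and the names time_bucket and split_samples are my own assumptions for illustration; only "morning becomes 1, midday becomes 2" comes from the text.<br />

```python
import random


def time_bucket(hour):
    """Encode an hour of day as a small integer (boundaries are an assumption)."""
    if hour in range(6, 12):
        return 1        # morning
    if hour in range(12, 14):
        return 2        # midday
    if hour in range(14, 18):
        return 3        # afternoon
    return 4            # evening and night


def split_samples(rows, test_ratio=0.3, seed=0):
    """Shuffle rows and split them into a modeling sample and a test sample."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


train, test = split_samples(range(10))
print(time_bucket(9), time_bucket(12), len(train), len(test))
```

Fixing the seed keeps the split reproducible, which helps when the model is corrected and re-checked against the same test sample.<br />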
<br />
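4. Model building: <strong>find one you get along with and spend your life with it!</strong><br />
&nbsp; Analyze the data and use various data mining techniques and methods to find the best among several candidate models. This is an iterative, cyclic process.<br />
&nbsp; Models are usually built by data analysts working together with business experts.<br />
&nbsp; 4.1&nbsp; Commonly used churn analysis models include decision trees, Bayesian networks, and neural networks.<br />
<br />
<br />
5. Model evaluation and validation: <strong>blossom!</strong><br />
<br />
6. Applying the model: <strong>at last, it bears good fruit (results)!</strong><br />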
<br />
<br />
<br />
<br />
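$&gt; Issues to watch for in churn analysis<br />
&nbsp;<br />
&gt;&gt; Oversampling<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The monthly customer churn rate of telecom companies in China is generally around 1% to 3%. Applying a model directly (a decision tree, an artificial neural network, etc.) may fail because the churn class is so rare.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; We therefore need to raise the share of churned customers in the total sample, but such oversampling must be done carefully, with its side effects fully considered.<br />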
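<br />
A minimal sketch of what raising the share of churned customers can mean in practice: duplicate churned rows at random until they reach a target share. The 30% target, the 2% starting rate, and the function name oversample are assumptions for illustration only.<br />

```python
import math
import random


def oversample(rows, label_of, target_ratio=0.3, seed=0):
    """Randomly duplicate churned rows until they are about target_ratio of the sample."""
    rng = random.Random(seed)
    pos = [r for r in rows if label_of(r)]
    neg = [r for r in rows if not label_of(r)]
    # positives needed so that pos / (pos + neg) reaches target_ratio
    needed = math.ceil(target_ratio * len(neg) / (1 - target_ratio))
    extra = [rng.choice(pos) for _ in range(max(0, needed - len(pos)))]
    return neg + pos + extra


# Hypothetical data: 1000 customers, 20 of them churned (2%).
rows = [(i, i in range(20)) for i in range(1000)]
balanced = oversample(rows, label_of=lambda r: r[1])
share = sum(1 for r in balanced if r[1]) / len(balanced)
print(len(balanced), round(share, 3))
```

One of the side effects to keep in mind: the model now sees churn far more often than it really occurs, so it is safer to evaluate on a test sample that keeps the original churn rate.<br />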
&nbsp; <br />
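&gt;&gt; Model effectiveness<br />
&nbsp;&nbsp; A prediction that arrives after the user has already churned is useless; the main thing to watch is the time span of the sampling.<br />
&nbsp; <br />
&gt;&gt; Post-churn analysis<br />
&nbsp; <span style="color: black; font-family: 宋体;">An important application of data mining in churn management is not only early warning of churn but also analysis of the problem after customers have churned. Along different dimensions of customer information, find the customer groups most likely to churn and, working with business staff and supported by surveys, try to find the root cause of the churn. This part is often neglected, however, because too much attention goes to the goodness of fit of the mining model itself rather than to the practical value of churn management.</span><br />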
<br />
<strong><br />
<br />
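Thanks to my colleague Wu for his guidance. His own words are reproduced here for everyone to learn from.</strong><br />
0. I think the biggest difference between BI and engineering is this:<br />
&nbsp;&nbsp;&nbsp; BI is data-driven; requirements rank below the data<br />
<br />
1. Without data, the requirement goes nowhere &nbsp;<br />
2. Engineering is requirement-driven; as long as there is a requirement, engineering can nearly always build it<br />
3. Loading, processing, and cleaning data is called ETL, which is actually very similar to what you are doing now<br />
4. ETL is a very important part of mining<br />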
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Reference: <strong><strong><font color="#a00000">数据挖掘在电信客户流失分析中的应用 (Application of data mining to telecom customer churn analysis)<br />
<span style="color: #080000;">http://www.teleinfocn.com/html/2007-02-12/3448.html</span></font></strong></strong><br />
<br />
<br />
<br />
<br />
<img src ="http://www.blogjava.net/Skynet/aggbug/300946.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-11-03 17:44 <a href="http://www.blogjava.net/Skynet/archive/2009/11/03/300946.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Research topics and the nature of data mining (repost)</title><link>http://www.blogjava.net/Skynet/archive/2009/10/22/299411.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Thu, 22 Oct 2009 10:05:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/10/22/299411.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/299411.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/10/22/299411.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/299411.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/299411.html</trackback:ping><description><![CDATA[<table width="100%" border="0" cellpadding="0" cellspacing="0">
    <tbody>
        <tr>
            <td width="15" height="20"><a href="http://www.huaat.com/images/title_icon1.gif" target="_blank"><br />
            </a></td>
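            <td><strong>Research topics and the nature of data mining</strong></td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>As DMKD research has deepened, data mining and knowledge discovery have come to rest on three strong technical pillars: databases, artificial intelligence, and mathematical statistics. For this reason, the KDD conference program committee was at one point co-chaired by leading figures from these three disciplines. Current DMKD research topics include basic theory, discovery algorithms, data warehousing, visualization, qualitative-quantitative conversion models, knowledge representation, maintenance and reuse of discovered knowledge, knowledge discovery in semi-structured and unstructured data, and web data mining. <br />
            <br />
            The knowledge discovered by data mining most commonly falls into the following four categories: </td>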
        </tr>
        <tr>
            <td>-</td>
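            <td height="20">Generalized knowledge (Generalization)</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Generalized knowledge is knowledge that describes the characteristics of a class in summary form. From the micro-level properties of the data it derives representative, general, higher-level, meso- and macro-level knowledge that reflects the common properties of similar things; it is a summary, refinement, and abstraction of the data.<br />
            <br />
            There are many methods and techniques for discovering generalized knowledge, such as data cubes and attribute-oriented induction. The data cube goes by other names as well, such as &#8220;multidimensional database&#8221;, &#8220;materialized view&#8221;, and &#8220;OLAP&#8221;. The basic idea is to materialize commonly used, expensive aggregate functions such as count, sum, average, and maximum, and to store these materialized views in a multidimensional database. Since many aggregates are computed over and over, storing precomputed results in a multidimensional data cube ensures fast response and flexibly provides views of the data from different angles and at different levels of abstraction. Another method is attribute-oriented induction, proposed at Simon Fraser University in Canada. It expresses the mining query in an SQL-like language, collects the relevant data set from the database, and then applies a series of generalization techniques to it, including attribute removal, concept-tree ascension, attribute threshold control, and propagation of counts and other aggregates.</td>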
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
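            <td>Association knowledge (Association)</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>It reflects dependencies or associations between one event and other events. If two or more attributes are associated, the value of one can be predicted from the values of the others. The best-known association rule discovery method is the Apriori algorithm proposed by R. Agrawal. Discovering association rules takes two steps. The first iteratively identifies all frequent itemsets, i.e. itemsets whose support is no lower than a user-specified minimum. The second constructs, from the frequent itemsets, rules whose confidence is no lower than a user-specified minimum. Identifying all frequent itemsets is the core of the algorithm and also its most computationally expensive part.</td>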
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
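            <td>Classification knowledge (Classification &amp; Clustering)</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>It reflects characteristic knowledge about the common properties of similar things and discriminative knowledge about the differences between different things. The most typical classification method is decision-tree classification. It builds a decision tree from a set of examples and is a supervised learning method. The method first forms a decision tree from a training subset (also called the window). If the tree does not classify all objects correctly, some of the exceptions are added to the window, and the process repeats until a correct decision set is formed. The end result is a tree whose leaf nodes are class names and whose internal nodes are attributes with branches, each branch corresponding to one possible value of the attribute. The most typical decision-tree learning system is ID3, which uses a top-down, non-backtracking strategy and is guaranteed to find a simple tree. C4.5 and C5.0 are both extensions of ID3 that extend classification from categorical attributes to numeric attributes. <br />
            <br />
            Other data classification methods include statistical methods and rough sets (RoughSet). Linear regression and linear discriminant analysis are typical statistical models. To reduce the cost of building decision trees, an interval classifier has also been proposed. More recently, neural networks have been studied for classification and rule extraction in databases.</td>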
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
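            <td>Predictive knowledge (Prediction)</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>It infers future data from historical and current time-series data; it can also be viewed as association knowledge with time as the key attribute.<br />
            <br />
            Current time-series prediction methods include classical statistical methods, neural networks, and machine learning. In 1968 Box and Jenkins proposed a fairly complete theory and methodology for time-series modeling and analysis. These classical mathematical methods predict time series by building stochastic models such as autoregressive models, autoregressive moving-average models, autoregressive integrated moving-average models, and seasonal adjustment models. Because many time series are non-stationary, their characteristic parameters and data distribution change over time. Training a single neural network prediction model on one stretch of historical data is therefore not enough for accurate prediction. For this reason, retraining methods based on statistics and on accuracy have been proposed: when the existing model is found to no longer fit the current data, it is retrained to obtain new weights and build a new model. Many systems also exploit the computational advantages of parallel algorithms for time-series prediction. </td>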
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
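            <td>Deviation knowledge (Deviation)</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>In addition, other types of knowledge can be discovered, such as deviation knowledge (Deviation), which describes differences and extreme exceptions and reveals anomalies where things depart from the norm, such as special cases outside the standard classes and outliers outside the data clusters. All of this knowledge can be discovered at different conceptual levels and, as the level rises from micro to meso to macro, serves the decision needs of different users at different levels.</td>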
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20"><a href="http://www.huaat.com/images/title_icon1.gif" target="_blank"><br />
            </a></td>
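            <td><strong>Functions of data mining</strong></td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>By predicting future trends and behavior, data mining supports proactive, knowledge-based decisions. Its goal is to discover implicit, meaningful knowledge in databases. It has the following five main functions. </td>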
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
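            <td>Automatic prediction of trends and behavior</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Data mining automatically searches large databases for predictive information; questions that once required extensive manual analysis can now be answered quickly and directly from the data. A typical example is market prediction: data mining uses past promotion data to find the customers likely to give the greatest return on future investment. Other predictable problems include forecasting bankruptcy and identifying the groups most likely to respond to a given event. </td>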
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
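            <td>Association analysis</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Data association is an important class of discoverable knowledge in databases. When the values of two or more variables show some regularity, this is called an association. Associations can be simple, temporal, or causal. The aim of association analysis is to find the hidden network of associations in a database. Sometimes the association function of the data is unknown, and even when it is known it is uncertain, so the rules produced by association analysis carry a confidence.</td>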
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
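            <td>Clustering</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>The records in a database can be divided into a series of meaningful subsets, i.e. clusters. Clustering improves our understanding of objective reality and is a prerequisite for concept description and deviation analysis. Clustering techniques mainly include traditional pattern recognition methods and mathematical taxonomy. In the early 1980s Michalski proposed conceptual clustering. Its key point is that, when partitioning objects, one considers not only the distance between objects but also requires each resulting class to have some intensional description, which avoids some of the one-sidedness of traditional techniques.</td>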
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
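            <td>Concept description</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Concept description describes the intension of a class of objects and summarizes their relevant characteristics. It is divided into characteristic description and discriminative description: the former describes the common features of a class of objects, the latter the differences between classes. Generating a characteristic description of a class involves only what all objects in that class have in common. There are many methods for generating discriminative descriptions, such as decision trees and genetic algorithms.</td>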
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
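            <td>Deviation detection</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Data in a database often contains anomalous records, and detecting these deviations is valuable. Deviations carry much latent knowledge, such as abnormal instances in a classification, exceptions that do not satisfy the rules, differences between observed results and model predictions, and changes in values over time. The basic method of deviation detection is to look for meaningful differences between observed results and reference values.</td>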
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20"><a href="http://www.huaat.com/images/title_icon1.gif" target="_blank"><br />
            </a></td>
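            <td><strong>Common data mining techniques</strong></td>
        </tr>
        <tr>
            <td height="20">-</td>
            <td>Artificial neural networks</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Nonlinear predictive models that mimic the structure of biological neural networks and perform pattern recognition through learning.</td>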
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
            <td>Decision trees</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Tree structures that represent sets of decisions.</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
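            <td>Genetic algorithms</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Optimization techniques based on the theory of evolution that use design methods such as genetic combination, mutation, and natural selection.</td>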
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
            <td>Nearest-neighbor methods</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Methods that classify each record in a data set according to the records most similar to it.</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>
        <tr>
            <td height="20">-</td>
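            <td>Rule induction</td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td>Searching for and deriving statistically significant &#8220;if-then&#8221; rules from the data. <br />
            <br />
            Specialized analysis tools that use some of these techniques have been developing for about ten years, but they have typically handled small volumes of data. These techniques have now been integrated directly into many large, industry-standard data warehouse and online analytical processing systems. </td>
        </tr>
        <tr>
            <td>&nbsp;</td>
            <td align="right">Excerpted from 《数据挖掘讨论组》 (Data Mining Discussion Group)</td>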
        </tr>
    </tbody>
</table>
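<br />
The two-step association rule discovery described in the table above (find frequent itemsets, then build rules that meet a minimum confidence) can be sketched as follows. This is a small illustration of the level-wise Apriori idea and not an optimized implementation; the function names, thresholds, and shopping-basket data are hypothetical.<br />

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search for itemsets with support of at least min_support."""
    n = len(transactions)
    support = {}
    level = {frozenset([item]) for t in transactions for item in t}
    while level:
        counts = {c: sum(1 for t in transactions if c.issubset(t)) for c in level}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        support.update(kept)
        size = len(next(iter(level))) + 1
        # join: merge frequent k-itemsets into (k+1)-item candidates
        level = {a | b for a in kept for b in kept if len(a | b) == size}
        # prune: a candidate survives only if all of its k-subsets are frequent
        level = {c for c in level
                 if all(frozenset(s) in kept for s in combinations(c, size - 1))}
    return support


def association_rules(support, min_confidence):
    """Step 2: rules lhs -> rhs with confidence = support(lhs and rhs) / support(lhs)."""
    out = []
    for itemset, sup in support.items():
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, k)):
                confidence = sup / support[lhs]
                if confidence >= min_confidence:
                    out.append((set(lhs), set(itemset - lhs), confidence))
    return out


transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
]
support = frequent_itemsets(transactions, min_support=0.5)
found = association_rules(support, min_confidence=0.8)
print(found)
```

The pruning step is where the savings come from: an itemset can only be frequent if every subset of it is frequent, so most candidates are discarded without a pass over the data.<br />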
<img src ="http://www.blogjava.net/Skynet/aggbug/299411.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-10-22 18:05 <a href="http://www.blogjava.net/Skynet/archive/2009/10/22/299411.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A first try at hadoop streaming (hadoop + perl)</title><link>http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Fri, 25 Sep 2009 06:33:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/296420.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/296420.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/296420.html</trackback:ping><description><![CDATA[Reference:<br />
&nbsp; http://hadoop.apache.org/common/docs/r0.15.2/streaming.html<br />
<br />
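Note<br />
&nbsp; Streaming currently does not support linux pipes inside a command (that is, pipelines such as cat |wc -l), but that does not stop us from using perl or python one-liners!<br />
&nbsp; The original FAQ entry reads:<br />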
&nbsp; Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?<br />
&nbsp;&nbsp;&nbsp; Currently this does not work and gives an "java.io.IOException: Broken
pipe" error. <br />
&nbsp;&nbsp;&nbsp; This is probably a bug that needs to be investigated.
<br />
&nbsp; If you are a die-hard fan of linux shell pipes, though, see the following:<br />
&nbsp; $&gt; perl -e 'open( my $fh, "grep -v null <strong>tt </strong>|sed -n 1,5p |");while ( &lt;$fh&gt; ) {print;} '<br />
&nbsp;&nbsp;&nbsp;&nbsp;<strong> # However, I could not get this to pass my test! </strong><br />
<br />
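One way around the missing pipe support is to fold the whole pipeline into a single script, so that streaming only has to start one process. The following is a minimal sketch in Python of what the FAQ's "cut -f1 | sed s/foo/bar/g" would do; the function names and the sample lines are assumptions made for illustration, and it has not been run inside a hadoop job.<br />

```python
import io
import sys


def transform(line):
    """Keep the first tab-separated field, then replace every 'foo' with 'bar'."""
    first_field = line.rstrip("\n").split("\t", 1)[0]   # what cut -f1 does
    return first_field.replace("foo", "bar")            # what sed s/foo/bar/g does


def run(lines, out):
    """Apply transform() to each input line and write one output line per record."""
    for line in lines:
        out.write(transform(line) + "\n")


if __name__ == "__main__":
    # Demo on in-memory sample lines (hypothetical data).
    # Inside a streaming job the call would be run(sys.stdin, sys.stdout).
    sample = io.StringIO("foo1\tx\ty\nplain\tz\nfoofoo\n")
    run(sample, sys.stdout)
```

Because the script reads lines and writes lines, it can be checked locally before being handed to -mapper.<br />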
Environment: hadoop-0.18.3<br />
$&gt; find . -type f -name "*streaming*.jar" <br />
./contrib/streaming/hadoop-0.18.3-streaming.jar<br />
<br />
<br />
Test data:<br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #808080;">-</span><span style="color: #000000;">bash</span><span style="color: #808080;">-</span><span style="color: #800000; font-weight: bold;">3.00</span><span style="color: #000000;">$&nbsp;head&nbsp;tt&nbsp;<br />
</span><span style="color: #0000ff;">null</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">3702</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">208100</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005100</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">70</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">13220</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005127</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">24</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">4640</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005160</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">25</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">4820</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005161</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">20</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">3620</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005164</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">14</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">1280</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005165</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">37</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">7080</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005168</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">104</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">20140</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005169</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">35</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">6680</span><span style="color: #000000;"><br />
</span><span style="color: #800000; font-weight: bold;">6005240</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;false&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">169</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #800000; font-weight: bold;">32140<br />
......<br />
</span></div>
<br />
<br />
Run:<br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #000000;">c1</span><span style="color: #000000;">=</span><span style="color: #000000; font-weight: bold;">"</span><span style="color: #000000; font-weight: bold;">&nbsp; perl&nbsp;-ne&nbsp;&nbsp;'if(/.*</span><span style="color: #000000; font-weight: bold;">\</span><span style="color: #000000;">t(</span><span style="color: #000000;">.*</span><span style="color: #000000;">)</span><span style="color: #000000;">/</span><span style="color: #000000;">){</span><span style="color: #000000; font-weight: bold;">\</span><span style="color: #000000; font-weight: bold;">$sum+=</span><span style="color: #000000; font-weight: bold;">\</span><span style="color: #000000;">$</span><span style="color: #800000;">1</span><span style="color: #000000;">;}END{</span><span style="color: #0000ff;">print</span><span style="color: #000000;"> <strong>\"</strong></span><span style="color: #000000; font-weight: bold;">\</span><span style="color: #000000; font-weight: bold;">$sum\";}'</span><span style="color: #000000; font-weight: bold;">&nbsp; "<br />
# Note: inside the double-quoted string, $ must be written as \$ and " as \"<br />
</span><span style="color: #000000;">
echo $c1; # prints:&nbsp; </span>perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print "$sum";}'<br />
<span style="color: #000000;">hadoop&nbsp;jar&nbsp;hadoop</span><span style="color: #000000;">-</span><span style="color: #800000;">0.18</span><span style="color: #000000;">.</span><span style="color: #800000;">3</span><span style="color: #000000;">-</span><span style="color: #000000;">streaming</span><span style="color: #000000;">.</span><span style="color: #000000;">jar \<br />
&nbsp;&nbsp; </span><span style="color: #000000;">-</span><span style="color: #000000;">input&nbsp;file</span><span style="color: #000000;">:///</span><span style="color: #000000;">data</span><span style="color: #000000;">/</span><span style="color: #000000;">hadoop</span><span style="color: #000000;">/</span><span style="color: #000000;">lky</span><span style="color: #000000;">/</span><span style="color: #000000;">jar</span><span style="color: #000000;">/</span><span style="color: #000000;">tt&nbsp; \</span><span style="color: #000000;"><br />
&nbsp;&nbsp; -</span><span style="color: #000000;">mapper&nbsp;&nbsp;&nbsp;</span><span style="color: #000000; font-weight: bold;">"</span><span style="color: #000000; font-weight: bold;">/bin/cat</span><span style="color: #000000; font-weight: bold;">"</span><span style="color: #000000;">&nbsp; \</span><span style="color: #000000;"><br />
&nbsp;&nbsp; -</span><span style="color: #000000;">reducer&nbsp;</span><span style="color: #000000; font-weight: bold;">"</span><span style="color: #000000; font-weight: bold;">$c1</span><span style="color: #000000; font-weight: bold;">"</span><span style="color: #000000;">&nbsp; \<br />
&nbsp;&nbsp; </span><span style="color: #000000;">-</span><span style="color: #000000;">output&nbsp;file</span><span style="color: #000000;">:///</span><span style="color: #000000;">tmp</span><span style="color: #000000;">/</span><span style="color: #000000;">lky</span><span style="color: #000000;">/</span><span style="color: #000000;">streamingx8</span></div>
<br />
<br />
Result:<br />
cat <span style="color: #000000;">/</span><span style="color: #000000;">tmp</span><span style="color: #000000;">/</span><span style="color: #000000;">lky</span><span style="color: #000000;">/</span><span style="color: #000000;">streamingx8/*<br />
</span>1166480<br />
<br />
Output when run locally:<br />
perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print $sum;}' &lt; tt<br />
1166480<br />
<br />
The two results match, so the streaming job is correct!<br />
<br />
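For readers more comfortable with python than perl, here is a minimal sketch of the same reducer logic: sum the text after the last tab on every line. It assumes the columns of tt are tab-separated, as the perl pattern implies. The sample rows are the ten rows shown under "Test data", so the total it prints covers only those rows, not the whole file. The function name is an assumption made for illustration.<br />

```python
import re

# Same pattern as the perl one-liner: the greedy ".*\t" consumes everything
# up to the LAST tab, so group 1 holds the final column.
LAST_FIELD = re.compile(r".*\t(.*)")

# The ten rows printed by "head tt" in this post, assumed tab-separated.
SAMPLE_ROWS = [
    "null\tfalse\t3702\t208100",
    "6005100\tfalse\t70\t13220",
    "6005127\tfalse\t24\t4640",
    "6005160\tfalse\t25\t4820",
    "6005161\tfalse\t20\t3620",
    "6005164\tfalse\t14\t1280",
    "6005165\tfalse\t37\t7080",
    "6005168\tfalse\t104\t20140",
    "6005169\tfalse\t35\t6680",
    "6005240\tfalse\t169\t32140",
]


def sum_last_field(lines):
    """Add up the last tab-separated column; lines without a tab are ignored."""
    total = 0
    for line in lines:
        match = LAST_FIELD.match(line.rstrip("\n"))
        if not match:
            continue
        try:
            total += int(match.group(1))
        except ValueError:
            # Non-integer last column: skipped here (an assumption of this sketch).
            continue
    return total


if __name__ == "__main__":
    print(sum_last_field(SAMPLE_ROWS))
```

In a streaming job the same function would be fed sys.stdin instead of SAMPLE_ROWS.<br />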
<br />
Built-in help for the command:<br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #000000;">-</span><span style="color: #000000;">bash</span><span style="color: #000000;">-</span><span style="color: #000000;">3.00</span><span style="color: #000000;">$&nbsp;hadoop&nbsp;jar&nbsp;hadoop</span><span style="color: #000000;">-</span><span style="color: #000000;">0.18</span><span style="color: #000000;">.</span><span style="color: #000000;">3</span><span style="color: #000000;">-</span><span style="color: #000000;">streaming.jar&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">info<br />
</span><span style="color: #000000;">09</span><span style="color: #000000;">/</span><span style="color: #000000;">09</span><span style="color: #000000;">/</span><span style="color: #000000;">25</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">14</span><span style="color: #000000;">:</span><span style="color: #000000;">50</span><span style="color: #000000;">:</span><span style="color: #000000;">12</span><span style="color: #000000;">&nbsp;ERROR&nbsp;streaming.StreamJob:&nbsp;Missing&nbsp;required&nbsp;option&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">input<br />
Usage:&nbsp;$HADOOP_HOME</span><span style="color: #000000;">/</span><span style="color: #000000;">bin</span><span style="color: #000000;">/</span><span style="color: #000000;">hadoop&nbsp;[</span><span style="color: #000000;">--</span><span style="color: #000000;">config&nbsp;dir]&nbsp;jar&nbsp;\<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$HADOOP_HOME</span><span style="color: #000000;">/</span><span style="color: #000000;">hadoop</span><span style="color: #000000;">-</span><span style="color: #000000;">streaming.jar&nbsp;[options]<br />
Options:<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">input&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">path</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DFS&nbsp;input&nbsp;file(s)&nbsp;</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;the&nbsp;Map&nbsp;step<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">output&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">path</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DFS&nbsp;output&nbsp;directory&nbsp;</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;the&nbsp;Reduce&nbsp;step<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">mapper&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">cmd</span><span style="color: #000000;">|</span><span style="color: #000000;">JavaClassName</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The&nbsp;streaming&nbsp;command&nbsp;to&nbsp;run<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">combiner&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">JavaClassName</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;Combiner&nbsp;has&nbsp;to&nbsp;be&nbsp;a&nbsp;Java&nbsp;</span><span style="color: #0000ff;">class</span><span style="color: #000000;"><br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">reducer&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">cmd</span><span style="color: #000000;">|</span><span style="color: #000000;">JavaClassName</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The&nbsp;streaming&nbsp;command&nbsp;to&nbsp;run<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">file&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">file</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File</span><span style="color: #000000;">/</span><span style="color: #000000;">dir&nbsp;to&nbsp;be&nbsp;shipped&nbsp;</span><span style="color: #0000ff;">in</span><span style="color: #000000;">&nbsp;the&nbsp;Job&nbsp;jar&nbsp;file<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">dfs&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">h:p</span><span style="color: #000000;">&gt;|</span><span style="color: #000000;">local&nbsp;&nbsp;Optional.&nbsp;Override&nbsp;DFS&nbsp;configuration<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">jt&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">h:p</span><span style="color: #000000;">&gt;|</span><span style="color: #000000;">local&nbsp;&nbsp;Optional.&nbsp;Override&nbsp;JobTracker&nbsp;configuration<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">additionalconfspec&nbsp;specfile&nbsp;&nbsp;Optional.<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">inputformat&nbsp;TextInputFormat(</span><span style="color: #0000ff;">default</span><span style="color: #000000;">)</span><span style="color: #000000;">|</span><span style="color: #000000;">SequenceFileAsTextInputFormat</span><span style="color: #000000;">|</span><span style="color: #000000;">JavaClassName&nbsp;Optional.<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">outputformat&nbsp;TextOutputFormat(</span><span style="color: #0000ff;">default</span><span style="color: #000000;">)</span><span style="color: #000000;">|</span><span style="color: #000000;">JavaClassName&nbsp;&nbsp;Optional.<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">partitioner&nbsp;JavaClassName&nbsp;&nbsp;Optional.<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">numReduceTasks&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">num</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;Optional.<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">inputreader&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">spec</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;Optional.<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">jobconf&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">n</span><span style="color: #000000;">&gt;=&lt;</span><span style="color: #000000;">v</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;Optional.&nbsp;Add&nbsp;or&nbsp;</span><span style="color: #0000ff;">override</span><span style="color: #000000;">&nbsp;a&nbsp;JobConf&nbsp;property<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">cmdenv&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">n</span><span style="color: #000000;">&gt;=&lt;</span><span style="color: #000000;">v</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;Optional.&nbsp;Pass&nbsp;env.var&nbsp;to&nbsp;streaming&nbsp;commands<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">mapdebug&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">path</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;Optional.&nbsp;To&nbsp;run&nbsp;</span><span style="color: #0000ff;">this</span><span style="color: #000000;">&nbsp;script&nbsp;when&nbsp;a&nbsp;map&nbsp;task&nbsp;fails&nbsp;<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">reducedebug&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">path</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;&nbsp;Optional.&nbsp;To&nbsp;run&nbsp;</span><span style="color: #0000ff;">this</span><span style="color: #000000;">&nbsp;script&nbsp;when&nbsp;a&nbsp;reduce&nbsp;task&nbsp;fails&nbsp;<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">cacheFile&nbsp;fileNameURI<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">cacheArchive&nbsp;fileNameURI<br />
&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">verbose</span></div>
<br />
<br />
<img src ="http://www.blogjava.net/Skynet/aggbug/296420.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-09-25 14:33 <a href="http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>hadoop jython join  ( 1 )</title><link>http://www.blogjava.net/Skynet/archive/2009/09/08/294261.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Tue, 08 Sep 2009 02:39:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/09/08/294261.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/294261.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/09/08/294261.html#Feedback</comments><slash:comments>2</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/294261.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/294261.html</trackback:ping><description><![CDATA[<br />
First of all, the hadoop join described in this post is of no use in real development!<br />
For real development, please use <span style="border-collapse: separate; color: #000000; font-family: Simsun; font-size: 16px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="border-collapse: collapse; color: #333333; font-family: arial; font-size: 13px;">cascading&nbsp; groupby to do a hadoop join;<br />
this post is only groundwork for exploring and understanding how cascading implements it.<br />
</span></span><br />
Of course, if anyone has actually done a hadoop join, please contact me so we can compare notes!<br />
<br />
Some background this post may require:<br />
<a id="homepage1_HomePageDays_DaysList_ctl01_DayItem_DayList_ctl01_TitleUrl" href="../../Skynet/archive/2009/09/04/293914.html">hadoop jython ( windows )</a><br />
jython, compiling jython, and building jar packages <br />
a little linux shell <br />
<br />
<br />
This post tests the join interfaces that hadoop may use. Reference consulted:<br />
<font style="font-size: 14px;"><strong>How to implement an Inner Join with Hadoop [from Taobao]</strong></font>: http://labs.chinamobile.com/groups/58_547 <br />
<br />
After the tests below, this is roughly how I understand hadoop&nbsp; join to work (a guess):<br />
data 1 ; data 2<br />
<strong>job1</strong>.map( data 1 ) =(temp file 1)&gt;&nbsp; file tag 1 + join column,&nbsp; data<br />
<strong>job2</strong>.map( data 2 ) =(temp file 2)&gt;&nbsp; file tag 2 + join column,&nbsp; data<br />
<br />
The temp files are generated via <strong><span style="color: #800000;">mapred.join.expr </span></strong><br />
<strong>job3.map -&gt; </strong><br />
file tag 1 + join column : data<br />
file tag 2 + join column : data<br />
......<br />
<strong>job3.Combiner - &gt;</strong><br />
join column : file tag 1 + data<br />
join column : file tag 2 + data<br />
<strong>job3.Reducer-&gt;</strong><br />
join column : collected with a java list &gt; produces <strong><br />
</strong><strong>&nbsp; file 2 - column x</strong> [&nbsp; data, data... ]<br />
&nbsp; <strong>file 1 - column x</strong> [&nbsp; data, data... ]<br />
After that, the left join, inner join, or any other join logic is up to you to implement.<br />
<br />
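The guessed flow above can be sketched outside hadoop as a plain in-memory simulation in Python: tag each record with its file, group by join column, then apply the join logic of your choice. This only illustrates the idea; it does not use hadoop's join classes or mapred.join.expr, and all function names are assumptions. The sample values are the contents of data files 1 and 2 used in this post.<br />

```python
from collections import defaultdict


def tag_records(file_tag, records, key_of=lambda record: record):
    """Map step: emit (join column, (file tag, data)) for every record."""
    return [(key_of(record), (file_tag, record)) for record in records]


def group_by_key(tagged_pairs):
    """Reduce-side grouping: per join column, one list of data per file tag."""
    groups = defaultdict(lambda: defaultdict(list))
    for key, (file_tag, record) in tagged_pairs:
        groups[key][file_tag].append(record)
    return groups


def inner_join(groups, left_tag, right_tag):
    """Keep only join columns that appear in both files."""
    rows = []
    for key in sorted(groups):
        for left in groups[key].get(left_tag, []):
            for right in groups[key].get(right_tag, []):
                rows.append((key, left, right))
    return rows


def left_join(groups, left_tag, right_tag):
    """Keep every record of the left file; a missing right side becomes None."""
    rows = []
    for key in sorted(groups):
        rights = groups[key].get(right_tag) or [None]
        for left in groups[key].get(left_tag, []):
            for right in rights:
                rows.append((key, left, right))
    return rows


if __name__ == "__main__":
    file1 = ["1", "2", "3", "4", "5"]   # contents of data/090907/1
    file2 = ["2", "4", "3", "1"]        # contents of data/090907/2
    pairs = tag_records("file1", file1) + tag_records("file2", file2)
    groups = group_by_key(pairs)
    print(inner_join(groups, "file1", "file2"))
    print(left_join(groups, "file1", "file2"))
```

The only difference between the two join functions is how an empty list on the right side is treated, which is why the grouping step can stay generic.<br />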
<br />
Data sets:<br />
[root@localhost python]# cat /home/megajobs/del/jobs/tools/hadoop-0.18.3/data/090907/1<br />
1<br />
2<br />
3<br />
4<br />
5<br />
[root@localhost python]# cat /home/megajobs/del/jobs/tools/hadoop-0.18.3/data/090907/2<br />
2<br />
4<br />
3<br />
1<br />
<br />
Modify ..../hadoop-0.18.3/src/examples/python/compile<br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #008000;">#</span><span style="color: #008000;">!/usr/bin/env&nbsp;bash</span><span style="color: #008000;"><br />
</span><span style="color: #000000;"><br />
export&nbsp;HADOOP_HOME</span><span style="color: #000000;">=/</span><span style="color: #000000;">home</span><span style="color: #000000;">/xx</span><span style="color: #000000;">/</span><span style="color: #000000;">del</span><span style="color: #000000;">/</span><span style="color: #000000;">jobs</span><span style="color: #000000;">/</span><span style="color: #000000;">tools</span><span style="color: #000000;">/</span><span style="color: #000000;">hadoop</span><span style="color: #000000;">-</span><span style="color: #800000;">0.18</span><span style="color: #000000;">.</span><span style="color: #800000;">3</span><span style="color: #000000;"><br />
export&nbsp;CASCADING_HOME</span><span style="color: #000000;">=/</span><span style="color: #000000;">home</span><span style="color: #000000;">/xx</span><span style="color: #000000;">/</span><span style="color: #000000;">del</span><span style="color: #000000;">/</span><span style="color: #000000;">jobs</span><span style="color: #000000;">/</span><span style="color: #000000;">tools</span><span style="color: #000000;">/</span><span style="color: #000000;">cascading</span><span style="color: #000000;">-</span><span style="color: #800000;">1.0</span><span style="color: #000000;">.</span><span style="color: #800000;">16</span><span style="color: #000000;">-</span><span style="color: #000000;">hadoop</span><span style="color: #000000;">-</span><span style="color: #800000;">0.18</span><span style="color: #000000;">.</span><span style="color: #800000;">3</span><span style="color: #000000;"><br />
export&nbsp;JYTHON_HOME</span><span style="color: #000000;">=/</span><span style="color: #000000;">home</span><span style="color: #000000;">/xx</span><span style="color: #000000;">/</span><span style="color: #000000;">del</span><span style="color: #000000;">/</span><span style="color: #000000;">jobs</span><span style="color: #000000;">/</span><span style="color: #000000;">tools</span><span style="color: #000000;">/</span><span style="color: #000000;">jython2</span><span style="color: #000000;">.</span><span style="color: #800000;">2.1</span><span style="color: #000000;"><br />
<br />
export&nbsp;CLASSPATH</span><span style="color: #000000;">=</span><span style="color: #000000; font-weight: bold;">"</span><span style="color: #000000; font-weight: bold;">$HADOOP_HOME/hadoop-0.18.3-core.jar</span><span style="color: #000000; font-weight: bold;">"</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
<br />
</span><span style="color: #008000;">#</span><span style="color: #008000;">&nbsp;so&nbsp;that&nbsp;filenames&nbsp;w/&nbsp;spaces&nbsp;are&nbsp;handled&nbsp;correctly&nbsp;in&nbsp;loops&nbsp;below</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">IFS</span><span style="color: #000000;">=</span><span style="color: #000000;"><br />
<br />
</span><span style="color: #008000;">#</span><span style="color: #008000;">&nbsp;add&nbsp;libs&nbsp;to&nbsp;CLASSPATH</span><span style="color: #008000;"><br />
</span><span style="color: #000000;"><br />
</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;f&nbsp;in&nbsp;</span><span style="color: #800080;">$HADOOP_HOME</span><span style="color: #000000;">/</span><span style="color: #000000;">lib</span><span style="color: #000000;">/*.</span><span style="color: #000000;">jar;&nbsp;</span><span style="color: #0000ff;">do</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;CLASSPATH</span><span style="color: #000000;">=</span><span style="color: #000000;">${CLASSPATH}</span><span style="color: #000000;">:</span><span style="color: #800080;">$f</span><span style="color: #000000;">;<br />
done<br />
<br />
</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;f&nbsp;in&nbsp;</span><span style="color: #800080;">$HADOOP_HOME</span><span style="color: #000000;">/</span><span style="color: #000000;">lib</span><span style="color: #000000;">/</span><span style="color: #000000;">jetty</span><span style="color: #000000;">-</span><span style="color: #000000;">ext</span><span style="color: #000000;">/*.</span><span style="color: #000000;">jar;&nbsp;</span><span style="color: #0000ff;">do</span><span style="color: #000000;"><br />
&nbsp;&nbsp;CLASSPATH</span><span style="color: #000000;">=</span><span style="color: #000000;">${CLASSPATH}</span><span style="color: #000000;">:</span><span style="color: #800080;">$f</span><span style="color: #000000;">;<br />
done<br />
<br />
</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;f&nbsp;in&nbsp;</span><span style="color: #800080;">$CASCADING_HOME</span><span style="color: #000000;">/*.</span><span style="color: #000000;">jar;&nbsp;</span><span style="color: #0000ff;">do</span><span style="color: #000000;"><br />
&nbsp;&nbsp;CLASSPATH</span><span style="color: #000000;">=</span><span style="color: #000000;">${CLASSPATH}</span><span style="color: #000000;">:</span><span style="color: #800080;">$f</span><span style="color: #000000;">;<br />
done<br />
<br />
</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;f&nbsp;in&nbsp;</span><span style="color: #800080;">$CASCADING_HOME</span><span style="color: #000000;">/</span><span style="color: #000000;">lib</span><span style="color: #000000;">/*.</span><span style="color: #000000;">jar;&nbsp;</span><span style="color: #0000ff;">do</span><span style="color: #000000;"><br />
&nbsp;&nbsp;CLASSPATH</span><span style="color: #000000;">=</span><span style="color: #000000;">${CLASSPATH}</span><span style="color: #000000;">:</span><span style="color: #800080;">$f</span><span style="color: #000000;">;<br />
done<br />
<br />
<br />
</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;f&nbsp;in&nbsp;</span><span style="color: #800080;">$JYTHON_HOME</span><span style="color: #000000;">/*.</span><span style="color: #000000;">jar;&nbsp;</span><span style="color: #0000ff;">do</span><span style="color: #000000;"><br />
&nbsp;&nbsp;CLASSPATH</span><span style="color: #000000;">=</span><span style="color: #000000;">${CLASSPATH}</span><span style="color: #000000;">:</span><span style="color: #800080;">$f</span><span style="color: #000000;">;<br />
done<br />
<br />
</span><span style="color: #008000;">#</span><span style="color: #008000;">&nbsp;restore&nbsp;ordinary&nbsp;behaviour</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">unset&nbsp;IFS<br />
<br />
</span><span style="color: #000000;">/</span><span style="color: #000000;">home</span><span style="color: #000000;">/xx</span><span style="color: #000000;">/</span><span style="color: #000000;">del</span><span style="color: #000000;">/</span><span style="color: #000000;">jobs</span><span style="color: #000000;">/</span><span style="color: #000000;">tools</span><span style="color: #000000;">/</span><span style="color: #000000;">jython2</span><span style="color: #000000;">.</span><span style="color: #800000;">2.1</span><span style="color: #000000;">/</span><span style="color: #000000;">jythonc&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">p&nbsp;org</span><span style="color: #000000;">.</span><span style="color: #000000;">apache</span><span style="color: #000000;">.</span><span style="color: #000000;">hadoop</span><span style="color: #000000;">.</span><span style="color: #000000;">examples&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">d&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">j&nbsp;$</span><span style="color: #800000;">1</span><span style="color: #000000;">.</span><span style="color: #000000;">jar&nbsp;&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">c&nbsp;$</span><span style="color: #800000;">1</span><span style="color: #000000;">.</span><span style="color: #000000;">py&nbsp;<br />
</span><span style="color: #000000;">/</span><span style="color: #000000;">home</span><span style="color: #000000;">/xx</span><span style="color: #000000;">/</span><span style="color: #000000;">del</span><span style="color: #000000;">/</span><span style="color: #000000;">jobs</span><span style="color: #000000;">/</span><span style="color: #000000;">tools</span><span style="color: #000000;">/</span><span style="color: #000000;">hadoop</span><span style="color: #000000;">-</span><span style="color: #800000;">0.18</span><span style="color: #000000;">.</span><span style="color: #800000;">3</span><span style="color: #000000;">/</span><span style="color: #000000;">bin</span><span style="color: #000000;">/</span><span style="color: #000000;">hadoop&nbsp;jar&nbsp;$</span><span style="color: #800000;">1</span><span style="color: #000000;">.</span><span style="color: #000000;">jar&nbsp;$</span><span style="color: #800000;">2</span><span style="color: #000000;">&nbsp;$</span><span style="color: #800000;">3</span><span style="color: #000000;">&nbsp;$</span><span style="color: #800000;">4</span><span style="color: #000000;">&nbsp;$</span><span style="color: #800000;">5</span><span style="color: #000000;">&nbsp;$</span><span style="color: #800000;">6</span><span style="color: #000000;">&nbsp;$</span><span style="color: #800000;">7</span><span style="color: #000000;">&nbsp;$</span><span style="color: #800000;">8</span><span style="color: #000000;">&nbsp;$</span><span style="color: #800000;">9</span><span style="color: #000000;">&nbsp; <br />
</span></div>
<br />
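The two commands in the script above (compile the Jython source into a jar with jythonc, then hand the jar to hadoop) can be sketched in Python as follows. This is only an illustration: the helper name build_commands is hypothetical, and the paths are copied from the script and will differ on other machines.<br />

```python
# Paths copied from the script above; adjust them for your own install.
JYTHONC = "/home/xx/del/jobs/tools/jython2.2.1/jythonc"
HADOOP = "/home/xx/del/jobs/tools/hadoop-0.18.3/bin/hadoop"


def build_commands(name, job_args):
    """Return the two command lines the script runs for the job `name`.

    `name` is the script name without the .py suffix (the script's $1);
    `job_args` are the remaining arguments passed through to the job.
    """
    jar = name + ".jar"
    # Step 1: compile name.py into name.jar with jythonc.
    compile_cmd = [JYTHONC, "-p", "org.apache.hadoop.examples",
                   "-d", "-j", jar, "-c", name + ".py"]
    # Step 2: run the jar with hadoop, forwarding the job arguments.
    run_cmd = [HADOOP, "jar", jar] + list(job_args)
    return compile_cmd, run_cmd


if __name__ == "__main__":
    for cmd in build_commands("test", ["in1", "in2", "out"]):
        print(" ".join(cmd))
```

The function only builds the argument lists; running them (for example with subprocess) is left out so that the sketch has no side effects.<br />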
<br />
Simple <strong>data concatenation</strong> (all input files are read by the same mapper):<br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.fs&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;Path&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.io&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.mapred.lib&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.mapred.join&nbsp;&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.mapred&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;sys&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;getopt&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #0000ff;">class</span><span style="color: #000000;">&nbsp;tMap(Mapper,&nbsp;MapReduceBase):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">def</span><span style="color: #000000;">&nbsp;map(self,&nbsp;key,&nbsp;value,&nbsp;output,&nbsp;reporter):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.collect(&nbsp;Text(&nbsp;str(key)&nbsp;)&nbsp;,&nbsp;Text(&nbsp;value.toString()&nbsp;))&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
</span><span style="color: #0000ff;">def</span><span style="color: #000000;">&nbsp;main(args):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;JobConf(tMap)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setJobName(</span><span style="color: #800000;">"</span><span style="color: #800000;">wordcount</span><span style="color: #800000;">"</span><span style="color: #000000;">)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setMapperClass(&nbsp;tMap&nbsp;)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FileInputFormat.setInputPaths(conf,[&nbsp;Path(sp)&nbsp;</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;sp&nbsp;</span><span style="color: #0000ff;">in</span><span style="color: #000000;">&nbsp;args[</span><span style="color: #000000;">1</span><span style="color: #000000;">:</span><span style="color: #000000;">-</span><span style="color: #000000;">1</span><span style="color: #000000;">]])&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputKeyClass(&nbsp;Text&nbsp;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputValueClass(&nbsp;Text&nbsp;)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="color: #000000;"><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputPath(Path(args[</span><span style="color: #000000;">-</span><span style="color: #000000;">1</span><span style="color: #000000;">]))&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;JobClient.runJob(conf)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;</span><span style="color: #800080;">__name__</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">==</span><span style="color: #000000;">&nbsp;</span><span style="color: #800000;">"</span><span style="color: #800000;">__main__</span><span style="color: #800000;">"</span><span style="color: #000000;">:main(sys.argv)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
</span></div>
<br />
Run:<br />
./compile test file:///home/xx/del/jobs/tools/hadoop-0.18.3/data/090907/1 file:///home/xx/del/jobs/tools/hadoop-0.18.3/data/090907/2&nbsp;&nbsp; file:///home/xx/del/jobs/tools/hadoop-0.18.3/tmp/wc78<br />
Result:<br />
[xx@localhost wc78]$ cat ../wc78/part-00000 <br />
0&nbsp;&nbsp; &nbsp;1<br />
0&nbsp;&nbsp; &nbsp;2<br />
2&nbsp;&nbsp; &nbsp;4<br />
2&nbsp;&nbsp; &nbsp;2<br />
4&nbsp;&nbsp; &nbsp;3<br />
4&nbsp;&nbsp; &nbsp;3<br />
6&nbsp;&nbsp; &nbsp;1<br />
6&nbsp;&nbsp; &nbsp;4<br />
8&nbsp;&nbsp; &nbsp;5<br />
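<br />
Why the keys are 0, 2, 4, 6, 8: the job reads text files, so each record's key is the byte offset of its line, and the mapper passes key and value through unchanged. The plain-Python sketch below imitates that without Hadoop. The file contents are an assumption inferred from the output above, and offset_records and concat_job are hypothetical helper names.<br />

```python
def offset_records(text):
    """Yield (byte offset, line) pairs, the way a text input keys each line."""
    offset = 0
    for line in text.splitlines(True):       # True keeps the line endings
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def concat_job(files):
    """Emit (str(key), value) for every record, then sort by key as the shuffle does."""
    pairs = [(str(k), v) for text in files for k, v in offset_records(text)]
    # Keys are compared as text, so "10" would sort before "2".
    return sorted(pairs, key=lambda kv: kv[0])


# Assumed contents of the two input files (inferred from the output above).
FILE_1 = "1\n2\n3\n4\n"
FILE_2 = "2\n4\n3\n1\n5\n"

if __name__ == "__main__":
    for key, value in concat_job([FILE_1, FILE_2]):
        print(key + "\t" + value)
```

The order of values that share a key is not guaranteed by the real job, so it may differ from the order this sketch prints.<br />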
<br />
<br />
A simple data join:<br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.fs&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;Path<br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.io&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;"><br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.mapred.lib&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;"><br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.mapred.join&nbsp;&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;"><br />
</span><span style="color: #0000ff;">from</span><span style="color: #000000;">&nbsp;org.apache.hadoop.mapred&nbsp;</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">*</span><span style="color: #000000;"><br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;sys<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;getopt<br />
<br />
</span><span style="color: #0000ff;">class</span><span style="color: #000000;">&nbsp;tMap(Mapper,&nbsp;MapReduceBase):</span><span style="color: #000000;"><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">def</span><span style="color: #000000;">&nbsp;map(self,&nbsp;key,&nbsp;value,&nbsp;output,&nbsp;reporter):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.collect(&nbsp;Text(&nbsp;str(key)&nbsp;)&nbsp;,&nbsp;Text(&nbsp;value.toString()&nbsp;))<br />
</span><span style="color: #000000;"><br />
</span><span style="color: #0000ff;">def</span><span style="color: #000000;">&nbsp;main(args):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;JobConf(tMap)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setJobName(</span><span style="color: #800000;">"</span><span style="color: #800000;">wordcount</span><span style="color: #800000;">"</span><span style="color: #000000;">)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setMapperClass(&nbsp;tMap&nbsp;)</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.set(</span><span style="color: #800000;">"</span><span style="color: #800000;">mapred.join.expr</span><span style="color: #800000;">"</span><span style="color: #000000;">,&nbsp;CompositeInputFormat.compose(</span><span style="color: #800000;">"</span><span style="color: #800000;">override</span><span style="color: #800000;">"</span><span style="color: #000000;">,TextInputFormat,&nbsp;args[</span><span style="color: #000000;">1</span><span style="color: #000000;">:</span><span style="color: #000000;">-</span><span style="color: #000000;">1</span><span style="color: #000000;">]&nbsp;)&nbsp;)</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputKeyClass(&nbsp;Text&nbsp;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputValueClass(&nbsp;Text&nbsp;)</span><span style="color: #000000;"><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setInputFormat(CompositeInputFormat)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputPath(Path(args[</span><span style="color: #000000;">-</span><span style="color: #000000;">1</span><span style="color: #000000;">]))<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;JobClient.runJob(conf)<br />
<br />
</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;</span><span style="color: #800080;">__name__</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">==</span><span style="color: #000000;">&nbsp;</span><span style="color: #800000;">"</span><span style="color: #800000;">__main__</span><span style="color: #800000;">"</span><span style="color: #000000;">:main(sys.argv)<br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
</span></div>
<br />
Run and result:<br />
./compile test file:///home/xx/del/jobs/tools/hadoop-0.18.3/data/090907/1 file:///home/xx/del/jobs/tools/hadoop-0.18.3/data/090907/2&nbsp;&nbsp; file:///home/xx/del/jobs/tools/hadoop-0.18.3/tmp/wc79<br />
[xx@localhost wc78]$ cat ../wc79/part-00000 <br />
0&nbsp;&nbsp; &nbsp;2<br />
2&nbsp;&nbsp; &nbsp;4<br />
4&nbsp;&nbsp; &nbsp;3<br />
6&nbsp;&nbsp; &nbsp;1<br />
8&nbsp;&nbsp; &nbsp;5<br />
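<br />
With the "override" join expression, each key appears once and its value comes from the rightmost input that contains that key. The sketch below imitates this in plain Python. It is not Hadoop code: the file contents are an assumption inferred from the two outputs above, and offset_records and override_join are hypothetical helper names. The real CompositeInputFormat also expects its inputs to be sorted by key and partitioned identically, which this toy version does not check.<br />

```python
def offset_records(text):
    """Yield (byte offset, line) pairs, the way a text input keys each line."""
    offset = 0
    for line in text.splitlines(True):       # True keeps the line endings
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def override_join(sources):
    """For each key keep the value from the rightmost source that has it."""
    merged = {}
    for text in sources:                      # later sources overwrite earlier ones
        for key, value in offset_records(text):
            merged[key] = value
    return [(str(k), merged[k]) for k in sorted(merged)]


# Assumed contents of the two input files (inferred from the outputs above).
FILE_1 = "1\n2\n3\n4\n"
FILE_2 = "2\n4\n3\n1\n5\n"

if __name__ == "__main__":
    for key, value in override_join([FILE_1, FILE_2]):
        print(key + "\t" + value)
```

Swapping the order of the two sources changes which file wins for the keys they share.<br />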
<br />
<br />
<img src ="http://www.blogjava.net/Skynet/aggbug/294261.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-09-08 10:39 <a href="http://www.blogjava.net/Skynet/archive/2009/09/08/294261.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>hadoop jython ( windows )</title><link>http://www.blogjava.net/Skynet/archive/2009/09/04/293914.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Fri, 04 Sep 2009 09:14:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/09/04/293914.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/293914.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/09/04/293914.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/293914.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/293914.html</trackback:ping><description><![CDATA[After working through <a id="viewpost1_TitleUrl" href="../../Skynet/archive/2009/07/08/285919.html">Setting up hadoop on Windows</a>, and because I like Python syntax, I had long wanted to write my hadoop jobs in Jython.<br />
This time I finally got it working on my own machine. The steps are described below:<br />
<br />
Test environment:<br />
Windows + cygwin, as before<br />
hadoop 0.18&nbsp; # C:/cygwin/home/lky/tools/java/hadoop-0.18.3<br />
jython 2.2.1 # C:/jython2.2.1<br />
<br />
Reference: <a title="Click to do a full-text search for this title" href="http://wiki.apache.org/hadoop/PythonWordCount?action=fullsearch&amp;value=linkto%3A%22PythonWordCount%22&amp;context=180">PythonWordCount</a><br />
<br />
Start hadoop and change to the hadoop home directory:<br />
<div style="border: 1px solid rgb(204, 204, 204); padding: 4px 5px 4px 4px; background-color: rgb(238, 238, 238); font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: rgb(0, 0, 0);"><strong># Create the input directory in the cluster file system (HDFS)</strong><br />
$&gt;bin/hadoop&nbsp;dfs -mkdir input</span><br />
<strong># Copy NOTICE.txt from the hadoop distribution into the input directory</strong><br />
<span style="color: rgb(0, 0, 0);">$&gt;bin/hadoop&nbsp;dfs&nbsp;-copyFromLocal&nbsp;c:/cygwin/home/lky/tools/java/hadoop-0.18.3/NOTICE.txt&nbsp; hdfs:///user/lky/input<br />
<br />
$&gt;cd </span>src/examples/python<br />
<br />
<strong># Create a script that does everything in one step (jython -&gt; jar -&gt; hadoop run).<br />
# On linux a shell script would of course look nicer than this.<br />
</strong>$&gt;vim run.bat<br />
<div style="border: 1px solid rgb(204, 204, 204); padding: 4px 5px 4px 4px; background-color: rgb(238, 238, 238); font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: rgb(0, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">C:\Program&nbsp;Files\Java\jdk1.6.0_11\bin\java.exe</span><span style="color: rgb(0, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">&nbsp;&nbsp;-classpath&nbsp;</span><span style="color: rgb(0, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">C:\jython2.2.1\jython.jar;%CLASSPATH%</span><span style="color: rgb(0, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">&nbsp;org.python.util.jython&nbsp;C:\jython2</span><span style="color: rgb(0, 0, 0);">.2.1</span><span style="color: rgb(0, 0, 0);">\Tools\jythonc\jythonc.py&nbsp;&nbsp;&nbsp;-p&nbsp;org.apache.hadoop.examples&nbsp;-d&nbsp;-j&nbsp;wc.jar&nbsp;-c&nbsp;%</span><span style="color: rgb(0, 0, 0);">1</span><span style="color: rgb(0, 0, 0);"><br />
<br />
<strong>sh&nbsp;</strong>C:\cygwin\home\lky\tools\java\hadoop-</span><span style="color: rgb(0, 0, 0);">0.18.3</span><span style="color: rgb(0, 0, 0);">\bin\hadoop&nbsp;jar&nbsp;wc.jar&nbsp;&nbsp;%</span><span style="color: rgb(0, 0, 0);">2</span><span style="color: rgb(0, 0, 0);">&nbsp;%</span><span style="color: rgb(0, 0, 0);">3</span><span style="color: rgb(0, 0, 0);">&nbsp;%</span><span style="color: rgb(0, 0, 0);">4</span><span style="color: rgb(0, 0, 0);">&nbsp;%</span><span style="color: rgb(0, 0, 0);">5</span><span style="color: rgb(0, 0, 0);">&nbsp;%</span><span style="color: rgb(0, 0, 0);">6</span><span style="color: rgb(0, 0, 0);">&nbsp;%</span><span style="color: rgb(0, 0, 0);">7</span><span style="color: rgb(0, 0, 0);">&nbsp;%</span><span style="color: rgb(0, 0, 0);">8</span><span style="color: rgb(0, 0, 0);">&nbsp;%</span><span style="color: rgb(0, 0, 0);">9</span><span style="color: rgb(0, 0, 0);"> <br />
</span></div>
<br />
<strong># Modify the jythonc packaging environment: add the hadoop jars to sys.path</strong><br />
$&gt;vim C:\jython2.2.1\Tools\jythonc\jythonc.py<br />
<div style="border: 1px solid rgb(204, 204, 204); padding: 4px 5px 4px 4px; background-color: rgb(238, 238, 238); font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: rgb(0, 128, 0);">#</span><span style="color: rgb(0, 128, 0);">&nbsp;Copyright&nbsp;(c)&nbsp;Corporation&nbsp;for&nbsp;National&nbsp;Research&nbsp;Initiatives</span><span style="color: rgb(0, 0, 0);"><br />
</span><span style="color: rgb(0, 128, 0);">#</span><span style="color: rgb(0, 128, 0);">&nbsp;Driver&nbsp;script&nbsp;for&nbsp;jythonc2.&nbsp;&nbsp;See&nbsp;module&nbsp;main.py&nbsp;for&nbsp;details</span><span style="color: rgb(0, 128, 0);"><br />
</span><span style="color: rgb(0, 0, 255);">import</span><span style="color: rgb(0, 0, 0);">&nbsp;sys,os,glob<br />
<strong><br />
</strong>
</span><strong><span style="color: rgb(0, 0, 255);">for</span><span style="color: rgb(0, 0, 0);">&nbsp;fn&nbsp;</span><span style="color: rgb(0, 0, 255);">in</span><span style="color: rgb(0, 0, 0);">&nbsp;glob.glob(</span><span style="color: rgb(128, 0, 0);">'</span><span style="color: rgb(128, 0, 0);">c:/cygwin/home/lky/tools/java/hadoop-0.18.3/*.jar</span><span style="color: rgb(128, 0, 0);">'</span><span style="color: rgb(0, 0, 0);">)&nbsp;:sys.path.append(fn)<br />
</span><span style="color: rgb(0, 0, 255);">for</span><span style="color: rgb(0, 0, 0);">&nbsp;fn&nbsp;</span><span style="color: rgb(0, 0, 255);">in</span><span style="color: rgb(0, 0, 0);">&nbsp;glob.glob(</span><span style="color: rgb(128, 0, 0);">'</span><span style="color: rgb(128, 0, 0);">c:/jython2.2.1/*.jar</span><span style="color: rgb(128, 0, 0);">'</span><span style="color: rgb(0, 0, 0);">)&nbsp;:sys.path.append(fn)<br />
</span><span style="color: rgb(0, 0, 255);">for</span><span style="color: rgb(0, 0, 0);">&nbsp;fn&nbsp;</span><span style="color: rgb(0, 0, 255);">in</span><span style="color: rgb(0, 0, 0);">&nbsp;glob.glob(</span><span style="color: rgb(128, 0, 0);">'</span><span style="color: rgb(128, 0, 0);">c:/cygwin/home/lky/tools/java/hadoop-0.18.3/lib/*.jar</span><span style="color: rgb(128, 0, 0);">'</span></strong><span style="color: rgb(0, 0, 0);"><strong>)&nbsp;:sys.path.append(fn)<br />
</strong>
<br />
</span><span style="color: rgb(0, 0, 255);">import</span><span style="color: rgb(0, 0, 0);">&nbsp;main<br />
main.main()<br />
<br />
</span><span style="color: rgb(0, 0, 255);">import</span><span style="color: rgb(0, 0, 0);">&nbsp;os<br />
os._exit(0)<br />
</span></div>
<br />
<br />
<strong># Run</strong><br />
C:/cygwin/home/lky/tools/java/hadoop-0.18.3/src/examples/python&gt;<br />
&nbsp; run.bat WordCount.py&nbsp; hdfs:///user/lky/input&nbsp; file:///c:/cygwin/home/lky/tools/java/hadoop-0.18.3/tmp2<br />
<br />
<br />
</div>
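<br />
The change made to jythonc.py above appends every jar in a few directories to sys.path so that jythonc can resolve the hadoop classes. The same idea can be written as a small helper. This is a sketch only: add_jars is a hypothetical name, not part of Jython or hadoop, and the directories in the usage line are the ones used in this post.<br />

```python
import glob
import os
import sys


def add_jars(directories, path=None):
    """Append every *.jar found directly in `directories` to `path`.

    `path` defaults to sys.path. Returns the list of jars that were added.
    """
    if path is None:
        path = sys.path
    added = []
    for directory in directories:
        # sorted() makes the order reproducible; glob itself gives no order guarantee.
        for jar in sorted(glob.glob(os.path.join(directory, "*.jar"))):
            if jar not in path:              # avoid duplicate entries
                path.append(jar)
                added.append(jar)
    return added


if __name__ == "__main__":
    # Directories taken from this post; they will differ on other machines.
    print(add_jars(["c:/cygwin/home/lky/tools/java/hadoop-0.18.3",
                    "c:/jython2.2.1",
                    "c:/cygwin/home/lky/tools/java/hadoop-0.18.3/lib"]))
```

Passing an explicit list as `path` makes it possible to try the helper without touching the real sys.path.<br />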
<strong><br />
<br />
Output:</strong><br />
cat c:/cygwin/home/lky/tools/java/hadoop-0.18.3/tmp2/part-00000<br />
(http://www.apache.org/).&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
Apache&nbsp; 1<br />
Foundation&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
Software&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
The&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
This&nbsp;&nbsp;&nbsp; 1<br />
by&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
developed&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
includes&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
product 1<br />
software&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
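<br />
The output above is what a word count produces for a single sentence: the mapper emits (word, 1) for each whitespace-separated token, and the reducer sums the ones per word, with keys arriving in sorted order. A plain-Python imitation, without Hadoop, is sketched below. The input sentence is an assumption reconstructed from the words in the output, and map_words and reduce_counts are hypothetical names.<br />

```python
def map_words(line):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reducer: sum the counts per word and return the pairs sorted by word."""
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    return sorted(totals.items())


# Assumed input line, reconstructed from the words in the output above.
NOTICE = ("This product includes software developed by "
          "The Apache Software Foundation (http://www.apache.org/).")

if __name__ == "__main__":
    for word, count in reduce_counts(map_words(NOTICE)):
        print(word + "\t" + str(count))
```

Punctuation and upper-case letters sort before lower-case ones, which is why "(http://www.apache.org/)." comes first and "Software" and "software" are counted as different words.<br />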
<br />
<strong>Now for the main part: the concise Jython hadoop code</strong><br />
<div style="border: 1px solid rgb(204, 204, 204); padding: 4px 5px 4px 4px; background-color: rgb(238, 238, 238); font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: rgb(0, 128, 0);">#<br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;Licensed&nbsp;to&nbsp;the&nbsp;Apache&nbsp;Software&nbsp;Foundation&nbsp;(ASF)&nbsp;under&nbsp;one</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;or&nbsp;more&nbsp;contributor&nbsp;license&nbsp;agreements.&nbsp;&nbsp;See&nbsp;the&nbsp;NOTICE&nbsp;file</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;distributed&nbsp;with&nbsp;this&nbsp;work&nbsp;for&nbsp;additional&nbsp;information</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;regarding&nbsp;copyright&nbsp;ownership.&nbsp;&nbsp;The&nbsp;ASF&nbsp;licenses&nbsp;this&nbsp;file</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;to&nbsp;you&nbsp;under&nbsp;the&nbsp;Apache&nbsp;License,&nbsp;Version&nbsp;2.0&nbsp;(the</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;"License");&nbsp;you&nbsp;may&nbsp;not&nbsp;use&nbsp;this&nbsp;file&nbsp;except&nbsp;in&nbsp;compliance</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;with&nbsp;the&nbsp;License.&nbsp;&nbsp;You&nbsp;may&nbsp;obtain&nbsp;a&nbsp;copy&nbsp;of&nbsp;the&nbsp;License&nbsp;at</span><span style="color: rgb(0, 128, 0);"><br />
#<br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;http://www.apache.org/licenses/LICENSE-2.0</span><span style="color: rgb(0, 128, 0);"><br />
#<br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;Unless&nbsp;required&nbsp;by&nbsp;applicable&nbsp;law&nbsp;or&nbsp;agreed&nbsp;to&nbsp;in&nbsp;writing,&nbsp;software</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;distributed&nbsp;under&nbsp;the&nbsp;License&nbsp;is&nbsp;distributed&nbsp;on&nbsp;an&nbsp;"AS&nbsp;IS"&nbsp;BASIS,</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;WITHOUT&nbsp;WARRANTIES&nbsp;OR&nbsp;CONDITIONS&nbsp;OF&nbsp;ANY&nbsp;KIND,&nbsp;either&nbsp;express&nbsp;or&nbsp;implied.</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;See&nbsp;the&nbsp;License&nbsp;for&nbsp;the&nbsp;specific&nbsp;language&nbsp;governing&nbsp;permissions&nbsp;and</span><span style="color: rgb(0, 128, 0);"><br />
#</span><span style="color: rgb(0, 128, 0);">&nbsp;limitations&nbsp;under&nbsp;the&nbsp;License.</span><span style="color: rgb(0, 128, 0);"><br />
#<br />
</span><span style="color: rgb(0, 0, 0);"><br />
</span><span style="color: rgb(0, 0, 255);">from</span><span style="color: rgb(0, 0, 0);">&nbsp;org.apache.hadoop.fs&nbsp;</span><span style="color: rgb(0, 0, 255);">import</span><span style="color: rgb(0, 0, 0);">&nbsp;Path<br />
</span><span style="color: rgb(0, 0, 255);">from</span><span style="color: rgb(0, 0, 0);">&nbsp;org.apache.hadoop.io&nbsp;</span><span style="color: rgb(0, 0, 255);">import</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(0, 0, 0);">*</span><span style="color: rgb(0, 0, 0);"><br />
</span><span style="color: rgb(0, 0, 255);">from</span><span style="color: rgb(0, 0, 0);">&nbsp;org.apache.hadoop.mapred&nbsp;</span><span style="color: rgb(0, 0, 255);">import</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(0, 0, 0);">*</span><span style="color: rgb(0, 0, 0);"><br />
<br />
</span><span style="color: rgb(0, 0, 255);">import</span><span style="color: rgb(0, 0, 0);">&nbsp;sys<br />
</span><span style="color: rgb(0, 0, 255);">import</span><span style="color: rgb(0, 0, 0);">&nbsp;getopt<br />
<br />
</span><span style="color: rgb(0, 0, 255);">class</span><span style="color: rgb(0, 0, 0);">&nbsp;WordCountMap(Mapper,&nbsp;MapReduceBase):<br />
&nbsp;&nbsp;&nbsp;&nbsp;one&nbsp;</span><span style="color: rgb(0, 0, 0);">=</span><span style="color: rgb(0, 0, 0);">&nbsp;IntWritable(</span><span style="color: rgb(0, 0, 0);">1</span><span style="color: rgb(0, 0, 0);">)<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">def</span><span style="color: rgb(0, 0, 0);">&nbsp;map(self,&nbsp;key,&nbsp;value,&nbsp;output,&nbsp;reporter):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#&nbsp;Emit&nbsp;(word,&nbsp;1)&nbsp;for&nbsp;each&nbsp;whitespace-separated&nbsp;token&nbsp;in&nbsp;the&nbsp;input&nbsp;line.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">for</span><span style="color: rgb(0, 0, 0);">&nbsp;w&nbsp;</span><span style="color: rgb(0, 0, 255);">in</span><span style="color: rgb(0, 0, 0);">&nbsp;value.toString().split():<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.collect(Text(w),&nbsp;self.one)<br />
<br />
</span><span style="color: rgb(0, 0, 255);">class</span><span style="color: rgb(0, 0, 0);">&nbsp;Summer(Reducer,&nbsp;MapReduceBase):<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">def</span><span style="color: rgb(0, 0, 0);">&nbsp;reduce(self,&nbsp;key,&nbsp;values,&nbsp;output,&nbsp;reporter):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#&nbsp;Add&nbsp;up&nbsp;every&nbsp;count&nbsp;emitted&nbsp;for&nbsp;this&nbsp;key;&nbsp;"total"&nbsp;avoids&nbsp;shadowing&nbsp;the&nbsp;builtin&nbsp;sum().<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;total&nbsp;</span><span style="color: rgb(0, 0, 0);">=</span><span style="color: rgb(0, 0, 0);">&nbsp;0<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">while</span><span style="color: rgb(0, 0, 0);">&nbsp;values.hasNext():<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;total&nbsp;</span><span style="color: rgb(0, 0, 0);">+=</span><span style="color: rgb(0, 0, 0);">&nbsp;values.next().get()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;output.collect(key,&nbsp;IntWritable(total))<br />
<br />
</span><span style="color: rgb(0, 0, 255);">def</span><span style="color: rgb(0, 0, 0);">&nbsp;printUsage(code):<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">print</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(128, 0, 0);">wordcount&nbsp;[-m&nbsp;&lt;maps&gt;]&nbsp;[-r&nbsp;&lt;reduces&gt;]&nbsp;&lt;input&gt;&nbsp;&lt;output&gt;</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(0, 0, 0);"><br />
&nbsp;&nbsp;&nbsp;&nbsp;sys.exit(code)<br />
<br />
</span><span style="color: rgb(0, 0, 255);">def</span><span style="color: rgb(0, 0, 0);">&nbsp;main(args):<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf&nbsp;</span><span style="color: rgb(0, 0, 0);">=</span><span style="color: rgb(0, 0, 0);">&nbsp;JobConf(WordCountMap)<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.setJobName(</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(128, 0, 0);">wordcount</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">)<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputKeyClass(Text)<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputValueClass(IntWritable)<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.setMapperClass(WordCountMap)<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.setCombinerClass(Summer)<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.setReducerClass(Summer)<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">try</span><span style="color: rgb(0, 0, 0);">:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;flags,&nbsp;other_args&nbsp;</span><span style="color: rgb(0, 0, 0);">=</span><span style="color: rgb(0, 0, 0);">&nbsp;getopt.getopt(args[</span><span style="color: rgb(0, 0, 0);">1</span><span style="color: rgb(0, 0, 0);">:],&nbsp;</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(128, 0, 0);">m:r:</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">)<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">except</span><span style="color: rgb(0, 0, 0);">&nbsp;getopt.GetoptError:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printUsage(</span><span style="color: rgb(0, 0, 0);">1</span><span style="color: rgb(0, 0, 0);">)<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">if</span><span style="color: rgb(0, 0, 0);">&nbsp;len(other_args)&nbsp;</span><span style="color: rgb(0, 0, 0);">!=</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(0, 0, 0);">2</span><span style="color: rgb(0, 0, 0);">:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printUsage(</span><span style="color: rgb(0, 0, 0);">1</span><span style="color: rgb(0, 0, 0);">)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">for</span><span style="color: rgb(0, 0, 0);">&nbsp;f,v&nbsp;</span><span style="color: rgb(0, 0, 255);">in</span><span style="color: rgb(0, 0, 0);">&nbsp;flags:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">if</span><span style="color: rgb(0, 0, 0);">&nbsp;f&nbsp;</span><span style="color: rgb(0, 0, 0);">==</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(128, 0, 0);">-m</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setNumMapTasks(int(v))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: rgb(0, 0, 255);">elif</span><span style="color: rgb(0, 0, 0);">&nbsp;f&nbsp;</span><span style="color: rgb(0, 0, 0);">==</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(128, 0, 0);">-r</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;conf.setNumReduceTasks(int(v))<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.setInputPath(Path(other_args[0]))<br />
&nbsp;&nbsp;&nbsp;&nbsp;conf.setOutputPath(Path(other_args[</span><span style="color: rgb(0, 0, 0);">1</span><span style="color: rgb(0, 0, 0);">]))<br />
&nbsp;&nbsp;&nbsp;&nbsp;JobClient.runJob(conf)<br />
<br />
</span><span style="color: rgb(0, 0, 255);">if</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(128, 0, 128);">__name__</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(0, 0, 0);">==</span><span style="color: rgb(0, 0, 0);">&nbsp;</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(128, 0, 0);">__main__</span><span style="color: rgb(128, 0, 0);">"</span><span style="color: rgb(0, 0, 0);">:<br />
&nbsp;&nbsp;&nbsp;&nbsp;main(sys.argv)<br />
</span></div>
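<br />
The mapper and reducer above can be sketched without a cluster. The following is a minimal plain-Python illustration of the same word-count logic, not the Hadoop API: the names map_words, reduce_counts and word_count are hypothetical helpers introduced only for this sketch, and the grouping that Hadoop performs between the map and reduce phases is assumed to be approximated by a single dictionary.<br />
<br />
```python
from collections import defaultdict


def map_words(line):
    """Emit a (word, 1) pair per whitespace-separated token, mirroring WordCountMap.map."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Sum the counts for each word, mirroring what Summer.reduce does per key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    # Two tiny input lines stand in for the job's input files.
    print(word_count(["a b a", "b c"]))
```
<br />
Because addition is associative, the same summing step can serve as both combiner and reducer, which is presumably why the job registers Summer for both roles.<br />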
<br />
<br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-09-04 17:14</div>]]></description></item></channel></rss>