﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-paulwong-随笔分类-PIG</title><link>http://www.blogjava.net/paulwong/category/53479.html</link><description /><language>zh-cn</language><lastBuildDate>Tue, 23 Apr 2013 21:36:25 GMT</lastBuildDate><pubDate>Tue, 23 Apr 2013 21:36:25 GMT</pubDate><ttl>60</ttl><item><title>一个PIG脚本例子分析</title><link>http://www.blogjava.net/paulwong/archive/2013/04/13/397791.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sat, 13 Apr 2013 07:21:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/13/397791.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397791.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/13/397791.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397791.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397791.html</trackback:ping><description><![CDATA[执行脚本：<br />
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br />
-->PIGGYBANK_PATH=$PIG_HOME/contrib/piggybank/java/piggybank.jar<br />
INPUT=pig/input/test-pig-full.txt<br />
OUTPUT=pig/output/test-pig-output-$(date&nbsp;&nbsp;+%Y%m%d%H%M%S)<br />
PIGSCRIPT=analyst_status_logs.pig<br />
<br />
<span style="color: #008000; ">#</span><span style="color: #008000; ">analyst_500_404_month.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_500_404_day.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_404_percentage.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_500_percentage.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_unique_path.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_user_logs.pig</span><span style="color: #008000; "><br />
#</span><span style="color: #008000; ">analyst_status_logs.pig</span><span style="color: #008000; "><br />
</span><br />
<br />
pig&nbsp;-p&nbsp;PIGGYBANK_PATH=$PIGGYBANK_PATH&nbsp;-p&nbsp;INPUT=$INPUT&nbsp;-p&nbsp;OUTPUT=$OUTPUT&nbsp;$PIGSCRIPT</div><br /><br />要分析的数据源，LOG 文件<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />-->46.20.45.18&nbsp;-&nbsp;-&nbsp;[25/Dec/2012:23:00:25&nbsp;+0100]&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">GET&nbsp;/&nbsp;HTTP/1.0</span><span style="color: #800000; ">"</span>&nbsp;302&nbsp;-&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: #800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)</span><span style="color: #800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: #800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: #800000; ">"</span>&nbsp;46.20.45.18&nbsp;<span style="color: #800000; ">""</span>&nbsp;11011AEC9542DB0983093A100E8733F8&nbsp;0<br />46.20.45.18&nbsp;-&nbsp;-&nbsp;[25/Dec/2012:23:00:25&nbsp;+0100]&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">GET&nbsp;/sign-in.jspx&nbsp;HTTP/1.0</span><span style="color: #800000; ">"</span>&nbsp;200&nbsp;3926&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: #800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)</span><span style="color: #800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: 
#800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: #800000; ">"</span>&nbsp;46.20.45.18&nbsp;<span style="color: #800000; ">""</span>&nbsp;11011AEC9542DB0983093A100E8733F8&nbsp;0<br />69.59.28.19&nbsp;-&nbsp;-&nbsp;[25/Dec/2012:23:01:25&nbsp;+0100]&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">GET&nbsp;/&nbsp;HTTP/1.0</span><span style="color: #800000; ">"</span>&nbsp;302&nbsp;-&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: #800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)</span><span style="color: #800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: #800000; ">"</span>&nbsp;<span style="color: #800000; ">"</span><span style="color: #800000; ">-</span><span style="color: #800000; ">"</span>&nbsp;69.59.28.19&nbsp;<span style="color: #800000; ">""</span>&nbsp;36D80DE7FE52A2D89A8F53A012307B0A&nbsp;15</div><br /><br />The Pig script:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">--Register&nbsp;the&nbsp;piggybank&nbsp;JAR,&nbsp;because&nbsp;DateExtractor&nbsp;is&nbsp;needed<br />register&nbsp;<span style="color: #800000; ">'</span><span style="color: #800000; ">$PIGGYBANK_PATH</span><span style="color: #800000; ">'</span>;<br /><br />--Define&nbsp;short&nbsp;aliases&nbsp;for&nbsp;the&nbsp;function<br />DEFINE&nbsp;DATE_EXTRACT_MM&nbsp;<br />org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor(<span style="color: #800000; ">'</span><span style="color: #800000; ">yyyy-MM</span><span style="color: #800000; ">'</span>);<br /><br />DEFINE&nbsp;DATE_EXTRACT_DD&nbsp;<br 
/>org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor(<span style="color: #800000; ">'</span><span style="color: #800000; ">yyyy-MM-dd</span><span style="color: #800000; ">'</span>);<br /><br />--&nbsp;pig/input/test-pig-full.txt<br />--Load&nbsp;the&nbsp;data&nbsp;from&nbsp;the&nbsp;file&nbsp;named&nbsp;by&nbsp;the&nbsp;variable&nbsp;into&nbsp;Pig&nbsp;and&nbsp;name&nbsp;the&nbsp;columns;&nbsp;at&nbsp;this&nbsp;point&nbsp;the&nbsp;data&nbsp;set&nbsp;is&nbsp;a&nbsp;bag&nbsp;of&nbsp;tuples&nbsp;(a,b,c)<br />raw_logs&nbsp;=&nbsp;load&nbsp;<span style="color: #800000; ">'</span><span style="color: #800000; ">$INPUT</span><span style="color: #800000; ">'</span>&nbsp;USING&nbsp;org.apache.pig.piggybank.storage.MyRegExLoader(<span style="color: #800000; ">'</span><span style="color: #800000; ">^(\\S+)&nbsp;(\\S+)&nbsp;(\\S+)&nbsp;\\[([\\w:/]+\\s[+\\-]\\d{4})\\]&nbsp;"(\\S+)&nbsp;(\\S+)&nbsp;(HTTP[^"]+)"&nbsp;(\\S+)&nbsp;(\\S+)&nbsp;"([^"]*)"&nbsp;"([^"]*)"&nbsp;"(\\S+)"&nbsp;"(\\S+)"&nbsp;(\\S+)&nbsp;"(.*)"&nbsp;(\\S+)&nbsp;(\\S+)</span><span style="color: #800000; ">'</span>)<br />as&nbsp;(remoteAddr:&nbsp;chararray,&nbsp;<br />n2:&nbsp;chararray,&nbsp;<br />n3:&nbsp;chararray,&nbsp;<br />time:&nbsp;chararray,&nbsp;<br />method:&nbsp;chararray,<br />path:chararray,<br />protocol:chararray,<br />status:&nbsp;int,&nbsp;<br />bytes_string:&nbsp;chararray,&nbsp;<br />referrer:&nbsp;chararray,&nbsp;<br />browser:&nbsp;chararray,&nbsp;<br />n10:chararray,<br />remoteLogname:&nbsp;chararray,&nbsp;<br />remoteAddr12:&nbsp;chararray,&nbsp;<br />path2:&nbsp;chararray,&nbsp;<br />sessionid:&nbsp;chararray,&nbsp;<br />n15:&nbsp;chararray<br />);<br /><br />--Filter&nbsp;the&nbsp;data&nbsp;(drop&nbsp;the&nbsp;Pingdom&nbsp;monitoring&nbsp;requests)<br />filter_logs&nbsp;=&nbsp;FILTER&nbsp;raw_logs&nbsp;BY&nbsp;<span style="color: #0000FF; ">not</span>&nbsp;(browser&nbsp;matches&nbsp;<span style="color: #800000; ">'</span><span style="color: #800000; ">.*pingdom.*</span><span style="color: #800000; ">'</span>);<br />--item_logs&nbsp;=&nbsp;FOREACH&nbsp;raw_logs&nbsp;GENERATE&nbsp;browser;<br /><br />--percent&nbsp;500&nbsp;logs<br />--Project&nbsp;each&nbsp;record&nbsp;down&nbsp;to&nbsp;two&nbsp;fields:&nbsp;status&nbsp;and&nbsp;month<br 
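/>reitem_percent_500_logs&nbsp;=&nbsp;FOREACH&nbsp;filter_logs&nbsp;GENERATE&nbsp;status,DATE_EXTRACT_MM(time)&nbsp;as&nbsp;month;<br />--Group&nbsp;the&nbsp;data&nbsp;set;&nbsp;the&nbsp;result&nbsp;pairs&nbsp;each&nbsp;group&nbsp;key&nbsp;with&nbsp;a&nbsp;bag&nbsp;of&nbsp;tuples,&nbsp;like&nbsp;a&nbsp;map:&nbsp;(a{(aa,bb,cc),(dd,ee,ff)},b{(bb,cc,dd),(ff,gg,hh)})<br />group_month_percent_500_logs&nbsp;=&nbsp;GROUP&nbsp;reitem_percent_500_logs&nbsp;BY&nbsp;(month);<br />--Redefine&nbsp;the&nbsp;fields&nbsp;of&nbsp;the&nbsp;grouped&nbsp;data&nbsp;set&nbsp;and&nbsp;compute&nbsp;per-group&nbsp;statistics;&nbsp;the&nbsp;group&nbsp;key&nbsp;and&nbsp;the&nbsp;original&nbsp;relation&nbsp;are&nbsp;used&nbsp;together&nbsp;here<br />final_month_500_logs&nbsp;=&nbsp;FOREACH&nbsp;group_month_percent_500_logs&nbsp;<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;--COUNT&nbsp;over&nbsp;the&nbsp;original&nbsp;relation:&nbsp;inside&nbsp;the&nbsp;FOREACH&nbsp;that&nbsp;name&nbsp;refers&nbsp;to&nbsp;the&nbsp;current&nbsp;group's&nbsp;bag,&nbsp;so&nbsp;the&nbsp;condition&nbsp;month==group&nbsp;applies&nbsp;automatically<br />&nbsp;&nbsp;&nbsp;&nbsp;--Note&nbsp;that&nbsp;the&nbsp;group&nbsp;field&nbsp;itself&nbsp;is&nbsp;not&nbsp;used&nbsp;in&nbsp;this&nbsp;step<br />&nbsp;&nbsp;&nbsp;&nbsp;--Processing&nbsp;is&nbsp;row&nbsp;by&nbsp;row:&nbsp;for&nbsp;key&nbsp;a,&nbsp;this&nbsp;counts&nbsp;the&nbsp;tuples&nbsp;of&nbsp;the&nbsp;original&nbsp;relation&nbsp;that&nbsp;belong&nbsp;to&nbsp;a<br />&nbsp;&nbsp;&nbsp;&nbsp;total&nbsp;=&nbsp;COUNT(reitem_percent_500_logs);<br />&nbsp;&nbsp;&nbsp;&nbsp;--FILTER&nbsp;over&nbsp;the&nbsp;original&nbsp;relation:&nbsp;again,&nbsp;inside&nbsp;the&nbsp;FOREACH&nbsp;only&nbsp;rows&nbsp;with&nbsp;month==group&nbsp;are&nbsp;seen<br />&nbsp;&nbsp;&nbsp;&nbsp;--The&nbsp;filter&nbsp;yields&nbsp;the&nbsp;rows&nbsp;with&nbsp;status==500&nbsp;and&nbsp;month==group<br />&nbsp;&nbsp;&nbsp;&nbsp;t&nbsp;=&nbsp;filter&nbsp;reitem_percent_500_logs&nbsp;by&nbsp;status==&nbsp;500;&nbsp;--create&nbsp;a&nbsp;bag&nbsp;which&nbsp;contains&nbsp;only&nbsp;the&nbsp;status-500&nbsp;rows<br />&nbsp;&nbsp;&nbsp;&nbsp;--Emit&nbsp;the&nbsp;group&nbsp;key&nbsp;and&nbsp;the&nbsp;computed&nbsp;percentage<br />&nbsp;&nbsp;&nbsp;&nbsp;generate&nbsp;flatten(group)&nbsp;as&nbsp;col1,&nbsp;100*(double)COUNT(t)/(double)total;<br />}<br />STORE&nbsp;final_month_500_logs&nbsp;into&nbsp;<span style="color: #800000; ">'</span><span style="color: #800000; ">$OUTPUT</span><span style="color: #800000; ">'</span>&nbsp;using&nbsp;PigStorage(<span style="color: #800000; ">'</span><span style="color: #800000; ">,</span><span style="color: #800000; ">'</span>);</div><br /><img src ="http://www.blogjava.net/paulwong/aggbug/397791.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-13 15:21 <a href="http://www.blogjava.net/paulwong/archive/2013/04/13/397791.html#Feedback" target="_blank" 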
style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>把命令行中的值传进PIG中</title><link>http://www.blogjava.net/paulwong/archive/2013/04/10/397645.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Wed, 10 Apr 2013 07:32:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/10/397645.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397645.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/10/397645.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397645.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397645.html</trackback:ping><description><![CDATA[<a href="http://wiki.apache.org/pig/ParameterSubstitution" target="_blank">http://wiki.apache.org/pig/ParameterSubstitution<br />
<br />
<br />
</a>
<div>
<div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br />
-->%pig&nbsp;-param&nbsp;input=/user/paul/sample.txt&nbsp;-param&nbsp;output=/user/paul/output/</div>
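<div>Values passed with -param are substituted into the script as plain text before Pig parses it. The sketch below imitates that step in Python to make the behaviour concrete. It is not Pig's own preprocessor: the function name substitute_params is hypothetical, and the sketch assumes parameter names are simple identifiers and ignores escaping, default values and parameter files.</div>
<pre>
```python
import re


def substitute_params(script, params):
    """Replace every $name in the script text with its value from params."""

    def lookup(match):
        name = match.group(1)
        if name not in params:
            # This sketch simply refuses to continue on an undefined name.
            raise KeyError("undefined parameter: " + name)
        return params[name]

    # A dollar sign followed by an identifier; escaped dollars are not handled.
    return re.sub(r"\$([A-Za-z_]\w*)", lookup, script)


script = "records = LOAD '$input';\nSTORE records INTO '$output';"
params = {"input": "/user/paul/sample.txt", "output": "/user/paul/output/"}
print(substitute_params(script, params))
```
</pre>
<div>Because the replacement is purely textual, any quotes that a statement needs around a path have to be written in the script itself, around the $name placeholder.</div>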
</div><br /><br />Reading the value in the Pig script:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">records&nbsp;=&nbsp;LOAD&nbsp;'<span style="color: #800080; ">$input</span>';</div><img src ="http://www.blogjava.net/paulwong/aggbug/397645.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-10 15:32 <a href="http://www.blogjava.net/paulwong/archive/2013/04/10/397645.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Computing per-group percentages in PIG</title><link>http://www.blogjava.net/paulwong/archive/2013/04/10/397642.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Wed, 10 Apr 2013 06:13:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/10/397642.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397642.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/10/397642.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397642.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397642.html</trackback:ping><description><![CDATA[<a href="http://stackoverflow.com/questions/15318785/pig-calculating-percentage-of-total-for-a-field" target="_blank">http://stackoverflow.com/questions/15318785/pig-calculating-percentage-of-total-for-a-field<br /><br /></a><a href="http://stackoverflow.com/questions/13476642/calculating-percentage-in-a-pig-query" target="_blank">http://stackoverflow.com/questions/13476642/calculating-percentage-in-a-pig-query</a><img src 
="http://www.blogjava.net/paulwong/aggbug/397642.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-10 14:13 <a href="http://www.blogjava.net/paulwong/archive/2013/04/10/397642.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>CombinedLogLoader</title><link>http://www.blogjava.net/paulwong/archive/2013/04/08/397510.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Mon, 08 Apr 2013 03:28:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/08/397510.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397510.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/08/397510.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397510.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397510.html</trackback:ping><description><![CDATA[PIG中的LOAD函数，可以在LOAD数据的同时，进行正则表达式的筛选。<br /><br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #008000; ">/*</span><span style="color: #008000; "><br />&nbsp;*&nbsp;Licensed&nbsp;to&nbsp;the&nbsp;Apache&nbsp;Software&nbsp;Foundation&nbsp;(ASF)&nbsp;under&nbsp;one&nbsp;or&nbsp;more&nbsp;contributor&nbsp;license&nbsp;agreements.&nbsp;See&nbsp;the<br />&nbsp;*&nbsp;NOTICE&nbsp;file&nbsp;distributed&nbsp;with&nbsp;this&nbsp;work&nbsp;for&nbsp;additional&nbsp;information&nbsp;regarding&nbsp;copyright&nbsp;ownership.&nbsp;The&nbsp;ASF<br 
/>&nbsp;*&nbsp;licenses&nbsp;this&nbsp;file&nbsp;to&nbsp;you&nbsp;under&nbsp;the&nbsp;Apache&nbsp;License,&nbsp;Version&nbsp;2.0&nbsp;(the&nbsp;"License");&nbsp;you&nbsp;may&nbsp;not&nbsp;use&nbsp;this&nbsp;file<br />&nbsp;*&nbsp;except&nbsp;in&nbsp;compliance&nbsp;with&nbsp;the&nbsp;License.&nbsp;You&nbsp;may&nbsp;obtain&nbsp;a&nbsp;copy&nbsp;of&nbsp;the&nbsp;License&nbsp;at<br />&nbsp;*&nbsp;<br />&nbsp;*&nbsp;</span><span style="color: #008000; text-decoration: underline; ">http://www.apache.org/licenses/LICENSE-2.0</span><span style="color: #008000; "><br />&nbsp;*&nbsp;<br />&nbsp;*&nbsp;Unless&nbsp;required&nbsp;by&nbsp;applicable&nbsp;law&nbsp;or&nbsp;agreed&nbsp;to&nbsp;in&nbsp;writing,&nbsp;software&nbsp;distributed&nbsp;under&nbsp;the&nbsp;License&nbsp;is<br />&nbsp;*&nbsp;distributed&nbsp;on&nbsp;an&nbsp;"AS&nbsp;IS"&nbsp;BASIS,&nbsp;WITHOUT&nbsp;WARRANTIES&nbsp;OR&nbsp;CONDITIONS&nbsp;OF&nbsp;ANY&nbsp;KIND,&nbsp;either&nbsp;express&nbsp;or&nbsp;implied.<br />&nbsp;*&nbsp;See&nbsp;the&nbsp;License&nbsp;for&nbsp;the&nbsp;specific&nbsp;language&nbsp;governing&nbsp;permissions&nbsp;and&nbsp;limitations&nbsp;under&nbsp;the&nbsp;License.<br />&nbsp;</span><span style="color: #008000; ">*/</span><br /><br /><span style="color: #0000FF; ">package</span>&nbsp;org.apache.pig.piggybank.storage.apachelog;<br /><br /><span style="color: #0000FF; ">import</span>&nbsp;java.util.regex.Pattern;<br /><br /><span style="color: #0000FF; ">import</span>&nbsp;org.apache.pig.piggybank.storage.RegExLoader;<br /><br /><span style="color: #008000; ">/**</span><span style="color: #008000; "><br />&nbsp;*&nbsp;CombinedLogLoader&nbsp;is&nbsp;used&nbsp;to&nbsp;load&nbsp;logs&nbsp;based&nbsp;on&nbsp;Apache's&nbsp;combined&nbsp;log&nbsp;format,&nbsp;based&nbsp;on&nbsp;a&nbsp;format&nbsp;like<br />&nbsp;*&nbsp;<br />&nbsp;*&nbsp;LogFormat&nbsp;"%h&nbsp;%l&nbsp;%u&nbsp;%t&nbsp;\"%r\"&nbsp;%&gt;s&nbsp;%b&nbsp;\"%{Referer}i\"&nbsp;\"%{User-Agent}i\""&nbsp;combined<br />&nbsp;*&nbsp;<br 
/>&nbsp;*&nbsp;The&nbsp;log&nbsp;filename&nbsp;ends&nbsp;up&nbsp;being&nbsp;access_log&nbsp;from&nbsp;a&nbsp;line&nbsp;like<br />&nbsp;*&nbsp;<br />&nbsp;*&nbsp;CustomLog&nbsp;logs/combined_log&nbsp;combined<br />&nbsp;*&nbsp;<br />&nbsp;*&nbsp;Example:<br />&nbsp;*&nbsp;<br />&nbsp;*&nbsp;raw&nbsp;=&nbsp;LOAD&nbsp;'combined_log'&nbsp;USING&nbsp;org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader&nbsp;AS<br />&nbsp;*&nbsp;(remoteAddr,&nbsp;remoteLogname,&nbsp;user,&nbsp;time,&nbsp;method,&nbsp;uri,&nbsp;proto,&nbsp;status,&nbsp;bytes,&nbsp;referer,&nbsp;userAgent);<br />&nbsp;*&nbsp;<br />&nbsp;</span><span style="color: #008000; ">*/</span><br /><br /><span style="color: #0000FF; ">public</span>&nbsp;<span style="color: #0000FF; ">class</span>&nbsp;CombinedLogLoader&nbsp;<span style="color: #0000FF; ">extends</span>&nbsp;RegExLoader&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;1.2.3.4&nbsp;-&nbsp;-&nbsp;[30/Sep/2008:15:07:53&nbsp;-0400]&nbsp;"GET&nbsp;/&nbsp;HTTP/1.1"&nbsp;200&nbsp;3190&nbsp;"-"<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000; ">//</span><span style="color: #008000; ">&nbsp;"Mozilla/5.0&nbsp;(Macintosh;&nbsp;U;&nbsp;Intel&nbsp;Mac&nbsp;OS&nbsp;X&nbsp;10_5_4;&nbsp;en-us)&nbsp;AppleWebKit/525.18&nbsp;(KHTML,&nbsp;like&nbsp;Gecko)&nbsp;Version/3.1.2&nbsp;Safari/525.20.1"</span><span style="color: #008000; "><br /></span>&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">private</span>&nbsp;<span style="color: #0000FF; ">final</span>&nbsp;<span style="color: #0000FF; ">static</span>&nbsp;Pattern&nbsp;combinedLogPattern&nbsp;=&nbsp;Pattern<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.compile("^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+\"(\\S+)\\s+(.+?)\\s+(HTTP[^\"]+)\"\\s+(\\S+)\\s+(\\S+)\\s+\"([^\"]*)\"\\s+\"(.*)\"$");<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">public</span>&nbsp;Pattern&nbsp;getPattern()&nbsp;{<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;combinedLogPattern;<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />}</div><img src ="http://www.blogjava.net/paulwong/aggbug/397510.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-08 11:28 <a href="http://www.blogjava.net/paulwong/archive/2013/04/08/397510.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Analyzing Apache logs with Pig </title><link>http://www.blogjava.net/paulwong/archive/2013/04/08/397489.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Sun, 07 Apr 2013 18:06:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/08/397489.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397489.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/08/397489.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397489.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397489.html</trackback:ping><description><![CDATA[<h3><br /></h3><div style="line-height: 1.6; margin: 0px 0px 1.5em; font-size: 13px; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; background-color: #ffffff;"><div></div></div><div entry-content"="" id="post-body-207687198827114066" itemprop="description articleBody" style="width: 520px; font-size: 15px; line-height: 1.4; position: relative; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; background-color: #ffffff;"><div dir="ltr" trbidi="on"><br /><div>Analyzing log files, churning them and extracting meaningful information is a potential use case in Hadoop. 
We don&#8217;t have to write MapReduce programs for these analyses; tools like Pig and Hive can do this log analysis instead. This post is only a starting point for the analysis part. Let us consider Pig for Apache log analysis. Pig comes with contributed libraries that help us load Apache log files into Pig and clean up string values from raw log files. These functions are in piggybank.jar, usually found under the&nbsp;<em>pig/contrib/piggybank/java/</em>&nbsp;directory. As the first step we need to register this jar file with our Pig session; only then can we use its functions in our Pig Latin.</div><div style="margin-left: 0.5in; text-indent: -0.25in;">1.<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>Register PiggyBank jar</div><div style="margin-left: 0.5in;"><em><span style="color: #943634;">REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;</span></em></div><div>Once we have registered the jar file we need to define the functions to be used in our Pig Latin. For any basic Apache log analysis we need a loader that loads the log files into Pig as separate columns; we can define an Apache log loader as</div><div style="margin-left: 0.5in; text-indent: -0.25in;">2.<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>Define a log loader</div><div style="margin-left: 0.5in;"><em><span style="color: #943634;">DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();</span></em></div><div>(Piggy Bank has other log loaders as well)</div><div>In Apache log files the default date format is &#8216;dd/MMM/yyyy:HH:mm:ss Z&#8217;. Such a timestamp is not much help in log analysis, where we usually need the date without the time. 
For that we use DateExtractor()</div><div style="margin-left: 0.5in; text-indent: -0.25in;">3.<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>Define Date Extractor</div><div style="margin-left: 0.5in;"><em><span style="color: #943634;">DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');</span></em></div><div>Once the required functions are defined, we first need to load the log file into Pig</div><div style="margin-left: 0.5in; text-indent: -0.25in;">4.<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>Load the Apache log file into Pig</div><div style="margin-left: 0.5in;"><strong>--</strong>load the log files from HDFS into Pig using CommonLogLoader (the common log format has no referer or user-agent column; the last five fields are method, URI, protocol, status and bytes)</div><div style="margin-left: 0.5in;"><em><span style="color: #943634;">logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, method, uri, protocol, serverstatus, bytes);</span></em></div><div></div><div>Now we are ready to dive in for the actual log analysis. 
Several kinds of information can be extracted from a log; a few common requirements are covered here.</div><div></div><div><strong>Note:</strong>&nbsp;you must first register the jar, define the functions to be used and load the log files into Pig before trying any of the Pig Latin below.</div><div></div><div><strong>Requirement 1:</strong>&nbsp;Find unique hits per day</div><div><strong>PIG Latin</strong></div><div>--Extracting the day alone and grouping records based on days</div><div><em><span style="color: #943634;">grpd = GROUP logs BY DayExtractor(dt);</span></em></div><div>--looping through each group to get the number of unique userIds</div><div><em><span style="color: #943634;">cntd = FOREACH grpd</span></em></div><div><em><span style="color: #943634;">{</span></em></div><div><em><span style="color: #943634;">&nbsp;&nbsp;&nbsp;&nbsp;tempId = logs.userId;</span></em></div><div><em><span style="color: #943634;">&nbsp;&nbsp;&nbsp;&nbsp;uniqueUserId = DISTINCT tempId;</span></em></div><div><em><span style="color: #943634;">&nbsp;&nbsp;&nbsp;&nbsp;GENERATE group AS day, COUNT(uniqueUserId) AS cnt;</span></em></div><div><em><span style="color: #943634;">}</span></em></div><div>--sorting the processed records by the number of unique user ids in descending order</div><div><em><span style="color: #943634;">srtd = ORDER cntd BY cnt desc;</span></em></div><div>--storing the final result into an HDFS directory</div><div><em><span style="color: #943634;">STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult1';</span></em></div><div></div><div><strong>Requirement 2:</strong>&nbsp;Find unique hits to websites (IPs) per day</div><div><strong>PIG Latin</strong></div><div></div><div>--Extracting the day alone and grouping records based on 
days and ip address</div><div><em><span style="color: #943634;">grpd = GROUP logs BY (DayExtractor(dt), ip_address);</span></em></div><div>--looping through each group to get the number of unique userIds</div><div><em><span style="color: #943634;">cntd = FOREACH grpd</span></em></div><div><em><span style="color: #943634;">{</span></em></div><div><em><span style="color: #943634;">&nbsp;&nbsp;&nbsp;&nbsp;tempId = logs.userId;</span></em></div><div><em><span style="color: #943634;">&nbsp;&nbsp;&nbsp;&nbsp;uniqueUserId = DISTINCT tempId;</span></em></div><div><em><span style="color: #943634;">&nbsp;&nbsp;&nbsp;&nbsp;GENERATE FLATTEN(group) AS (day, ip_address), COUNT(uniqueUserId) AS cnt;</span></em></div><div><em><span style="color: #943634;">}</span></em></div><div>--sorting the processed records by the number of unique user ids in descending order</div><div><em><span style="color: #943634;">srtd = ORDER cntd BY cnt desc;</span></em></div><div>--storing the final result into an HDFS directory</div><div><em><span style="color: #943634;">STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult2';</span></em></div><div></div><div>Note: When you run Pig Latin in the grunt shell, keep a few points in mind</div><div style="margin-left: 0.5in; text-indent: -0.25in;">1.<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>When we issue a Pig statement in grunt and press enter, only the semantic check is done; no execution is triggered.</div><div style="margin-left: 0.5in; text-indent: -0.25in;">2.<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>The Pig statements are executed only when a STORE (or DUMP) command is submitted, i.e. MapReduce 
programs are triggered only after STORE is submitted</div><div style="margin-left: 0.5in; text-indent: -0.25in;">3.<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>Also, you don&#8217;t have to load the log files into Pig again and again; once loaded, the same relation can be used for all related operations in that session. Once you leave the grunt shell the loaded relations are lost, and you&#8217;d have to perform the register and log file loading steps all over again.</div></div></div><img src ="http://www.blogjava.net/paulwong/aggbug/397489.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-08 02:06 <a href="http://www.blogjava.net/paulwong/archive/2013/04/08/397489.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A Short Note on PIG</title><link>http://www.blogjava.net/paulwong/archive/2013/04/05/397411.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 05 Apr 2013 13:33:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/05/397411.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397411.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/05/397411.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397411.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397411.html</trackback:ping><description><![CDATA[<div><strong>What is PIG</strong></div><div>PIG is a data-flow language: you describe how the data should flow, and the engine turns that description into MapReduce jobs that run on Hadoop.</div><div></div><div></div><div><strong>PIG vs. SQL</strong></div><div>The two are similar in that you execute one or more statements and get some results back.</div><div>The difference is that SQL needs the data to be loaded into tables before it can run, and SQL does not care how the intermediate steps are carried out: you send one SQL statement and the result comes back.</div><div>PIG does not need the data to be loaded into tables, but you have to design the intermediate steps that lead to the result.</div><img src ="http://www.blogjava.net/paulwong/aggbug/397411.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-05 21:33 <a href="http://www.blogjava.net/paulwong/archive/2013/04/05/397411.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>PIG Resources</title><link>http://www.blogjava.net/paulwong/archive/2013/04/05/397406.html</link><dc:creator>paulwong</dc:creator><author>paulwong</author><pubDate>Fri, 05 Apr 2013 10:19:00 GMT</pubDate><guid>http://www.blogjava.net/paulwong/archive/2013/04/05/397406.html</guid><wfw:comment>http://www.blogjava.net/paulwong/comments/397406.html</wfw:comment><comments>http://www.blogjava.net/paulwong/archive/2013/04/05/397406.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/paulwong/comments/commentRss/397406.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/paulwong/services/trackbacks/397406.html</trackback:ping><description><![CDATA[Hadoop Pig Study Notes (1): Implementing Common SQL Statements in PIG<br />
<a href="http://guoyunsky.iteye.com/blog/1317084" target="_blank">http://guoyunsky.iteye.com/blog/1317084<br />
<br />
</a><a href="http://guoyunsky.iteye.com/category/196632" target="_blank">http://guoyunsky.iteye.com/category/196632<br />
<br />
</a>Hadoop Study Notes (9): Introduction to Pig<br />
<a href="http://www.distream.org/?p=385" target="_blank">http://www.distream.org/?p=385</a><br />
<br />
<br />
[Hadoop series] Installing Pig and a simple example<br />
<a href="http://blog.csdn.net/inkfish/article/details/5205999" target="_blank">http://blog.csdn.net/inkfish/article/details/5205999</a><br />
<br />
<br />
Hadoop and Pig for Large-Scale Web Log Analysis<br />
<a href="http://www.devx.com/Java/Article/48063" target="_blank">http://www.devx.com/Java/Article/48063</a>
<br />
<br />
<br />
Pig in Practice<br />
<a href="http://www.cnblogs.com/xuqiang/archive/2011/06/06/2073601.html" target="_blank">http://www.cnblogs.com/xuqiang/archive/2011/06/06/2073601.html</a><br />
<br />
<br />
[Original] Apache Pig Tutorial in Chinese (Advanced)<br />
<a href="http://www.codelast.com/?p=4249" target="_blank">http://www.codelast.com/?p=4249</a><br />
<br />
<br />
Analyzing Apache logs with the Pig language on the Hadoop platform<br />
<a href="http://goodluck-wgw.iteye.com/blog/1107503" target="_blank">http://goodluck-wgw.iteye.com/blog/1107503</a><br />
<br />
<br />
!!The Pig language<br />
<a href="http://hi.baidu.com/cpuramdisk/item/a2980b78caacfa3d71442318" target="_blank">http://hi.baidu.com/cpuramdisk/item/a2980b78caacfa3d71442318</a><br />
<br />
<br />
Embedding Pig In Java Programs<br />
<a href="http://wiki.apache.org/pig/EmbeddedPig" target="_blank">http://wiki.apache.org/pig/EmbeddedPig</a><br />
<br />
<br />
A Pig example (REGEX_EXTRACT_ALL, DBStorage, storing the results in a database)<br />
<a href="http://www.myexception.cn/database/1256233.html" target="_blank">http://www.myexception.cn/database/1256233.html</a><br />
<br />
<br />
Programming Pig<br />
<a href="http://ofps.oreilly.com/titles/9781449302641/index.html" target="_blank">http://ofps.oreilly.com/titles/9781449302641/index.html</a><br />
<br />
<br />
[Original] A summary of basic Apache Pig concepts and usage (1)<br />
<a href="http://www.codelast.com/?p=3621" target="_blank">http://www.codelast.com/?p=3621<br />
<br /></a><br />
!PIG manual<br /><a href="http://pig.apache.org/docs/r0.11.1/func.html#built-in-functions" target="_blank">http://pig.apache.org/docs/r0.11.1/func.html#built-in-functions</a><img src ="http://www.blogjava.net/paulwong/aggbug/397406.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/paulwong/" target="_blank">paulwong</a> 2013-04-05 18:19 <a href="http://www.blogjava.net/paulwong/archive/2013/04/05/397406.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>