﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-Skynet-随笔分类-数据清洗</title><link>http://www.blogjava.net/Skynet/category/42815.html</link><description /><language>zh-cn</language><lastBuildDate>Thu, 26 Nov 2009 06:01:16 GMT</lastBuildDate><pubDate>Thu, 26 Nov 2009 06:01:16 GMT</pubDate><ttl>60</ttl><item><title>shell txt 分析小结</title><link>http://www.blogjava.net/Skynet/archive/2009/11/26/303750.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Thu, 26 Nov 2009 03:27:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/11/26/303750.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/303750.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/11/26/303750.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/303750.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/303750.html</trackback:ping><description><![CDATA[<br />
<br />
<br />
<br />

<div style="width:425px;text-align:left" id="__ss_2587122"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" href="http://www.slideshare.net/liukaiyi/shell-2587122" title="Shell脚本">Shell脚本</a><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=shell-091125211831-phpapp02&stripped_title=shell-2587122" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=shell-091125211831-phpapp02&stripped_title=shell-2587122" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object><div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;">View more <a style="text-decoration:underline;" href="http://www.slideshare.net/">presentations</a> from <a style="text-decoration:underline;" href="http://www.slideshare.net/liukaiyi">liukaiyi</a>.</div></div>
<img src ="http://www.blogjava.net/Skynet/aggbug/303750.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-11-26 11:27 <a href="http://www.blogjava.net/Skynet/archive/2009/11/26/303750.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Splitting a large file and ranking the top N in descending order (a rather crude method, feel free to skip)</title><link>http://www.blogjava.net/Skynet/archive/2009/11/23/303340.html</link><dc:creator>刘凯毅</dc:creator><author>刘凯毅</author><pubDate>Mon, 23 Nov 2009 06:43:00 GMT</pubDate><guid>http://www.blogjava.net/Skynet/archive/2009/11/23/303340.html</guid><wfw:comment>http://www.blogjava.net/Skynet/comments/303340.html</wfw:comment><comments>http://www.blogjava.net/Skynet/archive/2009/11/23/303340.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/Skynet/comments/commentRss/303340.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/Skynet/services/trackbacks/303340.html</trackback:ping><description><![CDATA[<br />
<br />
Data description: <br />
The file knnuu_...txt is 3.2 GB; each line holds three tab-separated fields:&nbsp; <br />
user1 &nbsp; user2 &nbsp;&nbsp; score<br />
..<br />
usern&nbsp;&nbsp; userm&nbsp;&nbsp;&nbsp; score<br />
<br />
What I want to get out of the cleaning step:<br />
for each user1, the top 100 users most closely related to it, highest score first<br />
<br />
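To make the goal concrete, here is a minimal Python 3 sketch of "top N neighbours per user" on a handful of made-up rows. The name top_n_per_user and the sample data are assumptions for illustration only, not part of the original script, and the sketch keeps everything in memory, which would not be practical for the real 3.2 GB file.<br />

```python
import heapq
from collections import defaultdict

def top_n_per_user(rows, n=100):
    """Group (user1, user2, score) rows by user1 and keep the n best scores."""
    by_user = defaultdict(list)
    for user1, user2, score in rows:
        by_user[user1].append((score, user2))
    # nlargest returns the pairs ordered from highest to lowest score
    return {u: heapq.nlargest(n, pairs) for u, pairs in by_user.items()}

# hypothetical sample rows: (user1, user2, score)
rows = [("a", "b", 0.9), ("a", "c", 0.4), ("a", "d", 0.7), ("b", "a", 0.9)]
print(top_n_per_user(rows, n=2))
```

With n=2, user "a" keeps only its two strongest neighbours ("b" and "d"), and "c" is dropped.<br />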
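Since the result does not need to be 100% accurate, I simply throw away the rows that the split cuts in half:<br />
if len(dr)!=3&nbsp; : continue<br />
I ran 7 threads (really 7 separate python processes, one per chunk), so about 7 users will end up with a wrong top 100 uid list. <br />
Against a total of several hundred thousand users, heh! I would rather spend the time it would take to fix those 7 special users' lists on writing this blog post.<br />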
<br />
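A small Python 3 sketch of what that length check does to a row cut by a byte-based split. The helper name parse_row, the sample row and the try/except around float() are assumptions added for illustration; the original script only has the len(dr)!=3 test.<br />

```python
def parse_row(line):
    """Return (user1, user2, score), or None if the line is not a clean row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None  # same test as `if len(dr)!=3 : continue`
    try:
        return fields[0], fields[1], float(fields[2])
    except ValueError:
        return None  # third field is not a number (extra guard, not in the original)

full = "1001\t2002\t0.95\n"          # hypothetical row
head, tail = full[:7], full[7:]      # simulate a cut in the middle of user2
print(parse_row(full), parse_row(head), parse_row(tail))

# A cut inside the score field still leaves three fields, so it slips through
print(parse_row(full[:12]))
```

Both halves of the row cut inside user2 are rejected, but the row cut inside the score is accepted with a truncated score of 0.0. The check is therefore a heuristic, which fits the "does not need to be 100% accurate" trade-off above.<br />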
<br />
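Combined with Linux split, awk and the like, this gives a quick, hacky kind of multithreading, haha!!<br />
With this change the speed went up about 5x: from roughly one hour down to a bit over 10 minutes.<br />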
<br />
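The awk one-liner in the listing starts one background "python uu.py" per chunk file. The same idea can be sketched in Python 3 with the standard subprocess module. The function name run_chunks_in_parallel is hypothetical, and uu.py is assumed to take the chunk path as its only argument, as in the commented command.<br />

```python
import subprocess
import sys

def run_chunks_in_parallel(chunk_paths, script="uu.py"):
    """Start one Python process per chunk file, then wait for all of them."""
    procs = [
        subprocess.Popen([sys.executable, script, path])
        for path in chunk_paths
    ]
    # wait() blocks until the process ends and returns its exit code
    return [p.wait() for p in procs]
```

Calling run_chunks_in_parallel(["aa", "ab", "ac"]) from the directory holding the chunks would start three workers at once and return their exit codes. These are separate processes rather than threads, so each one can use its own CPU core.<br />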
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><span style="color: #008000;"><br />
</span>
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #008000;">#</span><span style="color: #008000;">&nbsp;split&nbsp;&nbsp;--bytes=500m&nbsp;&nbsp;knnuu_20091123.txt&nbsp;knnuu/</span><span style="color: #008000;"><br />
#</span><span style="color: #008000;">&nbsp;ls&nbsp;a*&nbsp;|&nbsp;awk&nbsp;'{system(&nbsp;"&nbsp;&nbsp;python&nbsp;uu.py&nbsp;"$0"&nbsp;&amp;&nbsp;"&nbsp;)}'</span><span style="color: #008000;"><br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;bsddb,sys<br />
db&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;bsddb.hashopen(</span><span style="color: #800000;">'</span><span style="color: #800000;">../id-item-y-09-10-11.db</span><span style="color: #800000;">'</span><span style="color: #000000;">,</span><span style="color: #800000;">'</span><span style="color: #800000;">c</span><span style="color: #800000;">'</span><span style="color: #000000;">)<br />
<br />
uid&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">1</span><span style="color: #000000;"><br />
arr</span><span style="color: #000000;">=</span><span style="color: #000000;">[]<br />
arrsc</span><span style="color: #000000;">=</span><span style="color: #000000;">[]<br />
fw&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;open(</span><span style="color: #800000;">'</span><span style="color: #800000;">tc/</span><span style="color: #800000;">'</span><span style="color: #000000;">+</span><span style="color: #000000;">sys.argv[</span><span style="color: #000000;">1</span><span style="color: #000000;">]</span><span style="color: #000000;">+</span><span style="color: #800000;">'</span><span style="color: #800000;">uid-uid-sc.txt</span><span style="color: #800000;">'</span><span style="color: #000000;">,</span><span style="color: #800000;">'</span><span style="color: #800000;">w</span><span style="color: #800000;">'</span><span style="color: #000000;">)<br />
ii</span><span style="color: #000000;">=</span><span style="color: #000000;">0<br />
<br />
</span><span style="color: #0000ff;">def</span><span style="color: #000000;">&nbsp;insertion_sort(arr,arrsc,uid,sc):<br />
&nbsp;&nbsp;&nbsp;&nbsp;ls&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;min(</span><span style="color: #000000;">100</span><span style="color: #000000;">,len(arrsc))<br />
&nbsp;&nbsp;&nbsp; if ls==100 and sc &lt; arrsc[ls-1] : return <br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;i&nbsp;</span><span style="color: #0000ff;">in</span><span style="color: #000000;">&nbsp;xrange(ls):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;arrsc[i]</span><span style="color: #000000;">&lt;=</span><span style="color: #000000;">sc&nbsp;&nbsp;:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;arrsc.insert(i,sc)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;arr.insert(i,uid)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">return</span><span style="color: #000000;"><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">elif</span><span style="color: #000000;">&nbsp;arrsc[i]&nbsp;</span><span style="color: #000000;">&gt;</span><span style="color: #000000;">&nbsp;sc&nbsp;:&nbsp;&nbsp;</span><span style="color: #0000ff;">continue</span><span style="color: #000000;"><br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;ls&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">99</span><span style="color: #000000;">&nbsp;:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;arr.append(uid)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;arrsc.append(sc)<br />
<br />
</span><span style="color: #008000;">#</span><span style="color: #008000;">for&nbsp;row&nbsp;in&nbsp;open('knnuu_20091123.txt')&nbsp;:</span><span style="color: #008000;"><br />
</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;row&nbsp;</span><span style="color: #0000ff;">in</span><span style="color: #000000;">&nbsp;open(sys.argv[</span><span style="color: #000000;">1</span><span style="color: #000000;">]):<br />
&nbsp;&nbsp;&nbsp;&nbsp;dr&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;row.split(</span><span style="color: #800000;">'</span><span style="color: #800000;">\n</span><span style="color: #800000;">'</span><span style="color: #000000;">)[0].split(</span><span style="color: #800000;">'</span><span style="color: #800000;">\t</span><span style="color: #800000;">'</span><span style="color: #000000;">)<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;len(dr)</span><span style="color: #000000;">!=</span><span style="color: #000000;">3</span><span style="color: #000000;">&nbsp;:&nbsp;</span><span style="color: #0000ff;">continue</span><span style="color: #000000;"><br />
&nbsp;&nbsp;&nbsp;&nbsp;u1,u2,strsc&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;dr[0],dr[</span><span style="color: #000000;">1</span><span style="color: #000000;">],dr[</span><span style="color: #000000;">2</span><span style="color: #000000;">]<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;sc&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;float(strsc)<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;uid&nbsp;</span><span style="color: #000000;">==</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">-</span><span style="color: #000000;">1</span><span style="color: #000000;">&nbsp;:&nbsp;uid&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;u1<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;u1&nbsp;</span><span style="color: #000000;">!=</span><span style="color: #000000;">&nbsp;uid&nbsp;:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">for</span><span style="color: #000000;">&nbsp;c&nbsp;</span><span style="color: #0000ff;">in</span><span style="color: #000000;">&nbsp;xrange(&nbsp;min(</span><span style="color: #000000;">100</span><span style="color: #000000;">,len(arrsc))&nbsp;):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tu&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;arr[c]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ts&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;arrsc[c]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">print</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">&gt;&gt;</span><span style="color: #000000;">fw,</span><span style="color: #800000;">"</span><span style="color: #800000;">%s\t%s\t%s</span><span style="color: #800000;">"</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">%</span><span style="color: #000000;">&nbsp;(&nbsp;db[u1],db[tu],ts&nbsp;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">print</span><span style="color: #000000;">&nbsp;uid<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fw.flush()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;arr</span><span style="color: #000000;">=</span><span style="color: #000000;">[u2]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;arrsc</span><span style="color: #000000;">=</span><span style="color: #000000;">[sc]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;uid</span><span style="color: #000000;">=</span><span style="color: #000000;">u1<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">else</span><span style="color: #000000;">&nbsp;:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;insertion_sort(arr,arrsc,u2,sc)<br />
&nbsp;&nbsp;&nbsp;&nbsp;ii</span><span style="color: #000000;">+=</span><span style="color: #000000;">1</span><span style="color: #000000;"><br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">#</span><span style="color: #008000;">print&nbsp;ii,u1,uid,u2,strsc,len(arr),len(arrsc)</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">#</span><span style="color: #008000;">if&nbsp;ii&gt;10&nbsp;:&nbsp;break</span><span style="color: #008000;"><br />
</span><span style="color: #000000;"><br />
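# write out the list of the last user in this chunk, which the loop above never emits<br />
for c in xrange( min(100,len(arrsc)) ):<br />
&nbsp;&nbsp;&nbsp;&nbsp;print &gt;&gt;fw,"%s\t%s\t%s" % ( db[uid],db[arr[c]],arrsc[c] )<br />
fw.close()</span></div>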
<span style="color: #008000;">
</span><span style="color: #000000;">
</span></div>
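<br />
As an alternative to the hand-written insertion_sort, the same "keep only the best 100" step can be sketched in Python 3 with a bounded min-heap from the standard heapq module. This is a swapped-in technique, not the author's method. The name stream_top_n and the sample rows are assumptions, and like the script above it relies on all rows of one user1 being contiguous in the input.<br />

```python
import heapq
from itertools import groupby
from operator import itemgetter

def stream_top_n(rows, n=100):
    """Yield (user1, [(score, user2), ...]) for rows already grouped by user1."""
    for user1, group in groupby(rows, key=itemgetter(0)):
        heap = []  # min-heap holding at most n (score, user2) pairs
        for _, user2, score in group:
            if len(heap) < n:
                heapq.heappush(heap, (score, user2))
            elif score > heap[0][0]:
                # the new score beats the weakest kept one: swap it in
                heapq.heapreplace(heap, (score, user2))
        # emit the kept pairs from highest to lowest score
        yield user1, sorted(heap, reverse=True)

# hypothetical sample rows: (user1, user2, score)
rows = [("a", "b", 0.2), ("a", "c", 0.9), ("a", "d", 0.5), ("b", "a", 0.2)]
for user, top in stream_top_n(rows, n=2):
    print(user, top)
```

The heap never holds more than n entries per user, so memory stays bounded however many neighbours a user has, and each row costs O(log n) instead of a linear scan over the kept list.<br />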
<br />
<br />
<img src ="http://www.blogjava.net/Skynet/aggbug/303340.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/Skynet/" target="_blank">刘凯毅</a> 2009-11-23 14:43 <a href="http://www.blogjava.net/Skynet/archive/2009/11/23/303340.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>