Skynet

---------- ---------- My new blog: liukaiyi.cublog.cn ---------- ----------


Reference:
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html

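Note
Streaming currently does not support Linux pipes inside a command (that is, pipelines such as cat | wc -l), but this does not stop us from using perl or python one-liners.
The original FAQ text reads: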
Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
    Currently this does not work and gives an "java.io.IOException: Broken pipe" error.
    This is probably a bug that needs to be investigated.
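If you are a die-hard fan of Linux shell pipes anyway, one thing to try is letting perl open the pipeline itself:
$> perl -e 'open( my $fh, "grep -v null tt |sed -n 1,5p |");while ( <$fh> ) {print;} '
     # I could not get this to pass my tests, though.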
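
Another approach that is often suggested for the pipe limitation is to move the pipeline into a small wrapper script, so that streaming only has to start a single executable that reads stdin and writes stdout. The sketch below is an assumption, not something tested against hadoop-0.18.3: the script name pipe_mapper.sh is hypothetical, and only the local behaviour of the wrapper is checked here. The -file option used in the commented job line is the one listed in the streaming usage text.

```shell
#!/bin/sh
# Hypothetical wrapper: the pipeline lives in a script file, so the
# streaming job would launch one command instead of parsing a pipe.
workdir=$(mktemp -d)
cat > "$workdir/pipe_mapper.sh" <<'EOF'
#!/bin/sh
# Drop every line containing "null", then keep at most the first five lines.
grep -v null | sed -n 1,5p
EOF
chmod +x "$workdir/pipe_mapper.sh"

# Local check with three tab-separated sample rows taken from tt.
mapped=$(printf 'null\tfalse\t3702\t208100\n6005100\tfalse\t70\t13220\n6005127\tfalse\t24\t4640\n' \
  | "$workdir/pipe_mapper.sh")
printf '%s\n' "$mapped"

# Untested assumption: the job would ship the script with -file, e.g.
#   hadoop jar hadoop-0.18.3-streaming.jar \
#     -input file:///data/hadoop/lky/jar/tt \
#     -output <new output dir> \
#     -mapper pipe_mapper.sh \
#     -file "$workdir/pipe_mapper.sh"
```

Unlike the perl open() attempt, the wrapper reads from stdin rather than from the file tt directly, which is what a streaming mapper is expected to do.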

Environment: hadoop-0.18.3
$> find . -type f -name "*streaming*.jar"
./contrib/streaming/hadoop-0.18.3-streaming.jar

Test data:

-bash-3.00$ head tt
null    false    3702    208100
6005100    false    70    13220
6005127    false    24    4640
6005160    false    25    4820
6005161    false    20    3620
6005164    false    14    1280
6005165    false    37    7080
6005168    false    104    20140
6005169    false    35    6680
6005240    false    169    32140
......

Run:

c1=" perl -ne  'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}' "
# Note: inside the double quotes, $ must be written as \$ and " as \"
echo $c1; # prints: perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print "$sum";}'
hadoop jar hadoop-0.18.3-streaming.jar \
   -input file:///data/hadoop/lky/jar/tt \
   -mapper   "/bin/cat" \
   -reducer "$c1" \
   -output file:///tmp/lky/streamingx8
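
The escaping rule for the reducer string can be checked without Hadoop. The following is a minimal sketch: it builds the string once with double quotes (escaping $ and ") and once with single quotes, then prints both. The variable c2 is introduced only for illustration. printf is used instead of echo because some shells' echo would interpret the \t sequence itself.

```shell
#!/bin/sh
# Inside double quotes the shell turns \$ into $ and \" into ",
# while \t stays as backslash + t, left for perl to interpret as a tab.
c1="perl -ne 'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}'"
printf '%s\n' "$c1"

# The same text in single quotes needs no escaping of $ or ",
# but every embedded single quote has to be spelled '\''.
c2='perl -ne '\''if(/.*\t(.*)/){$sum+=$1;}END{print "$sum";}'\'
printf '%s\n' "$c2"
```

Both lines should print the same command text, which is the string the -reducer option receives.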

Result:
cat /tmp/lky/streamingx8/*
1166480

Local run output:
perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print $sum;}' < tt
1166480

The results match.

Built-in command documentation:

-bash-3.00$ hadoop jar hadoop-0.18.3-streaming.jar -info
09/09/25 14:50:12 ERROR streaming.StreamJob: Missing required option -input
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <JavaClassName> Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -dfs    <h:p>|local  Optional. Override DFS configuration
  -jt     <h:p>|local  Optional. Override JobTracker configuration
  -additionalconfspec specfile  Optional.
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName  Optional.
  -numReduceTasks <num>  Optional.
  -inputreader <spec>  Optional.
  -jobconf  <n>=<v>    Optional. Add or override a JobConf property
  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
  -mapdebug <path>  Optional. To run this script when a map task fails
  -reducedebug <path>  Optional. To run this script when a reduce task fails
  -cacheFile fileNameURI
  -cacheArchive fileNameURI
  -verbose

Compiled by www.blogjava.net/Good-Game

Posted on 2009-09-25 14:33 by 刘凯毅. Categories: perl, cluster development, data mining
