﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-SIMONE-随笔分类-spark</title><link>http://www.blogjava.net/wangxinsh55/category/55043.html</link><description /><language>zh-cn</language><lastBuildDate>Thu, 26 May 2016 06:17:05 GMT</lastBuildDate><pubDate>Thu, 26 May 2016 06:17:05 GMT</pubDate><ttl>60</ttl><item><title>Spark History Server配置使用</title><link>http://www.blogjava.net/wangxinsh55/archive/2016/05/26/430665.html</link><dc:creator>SIMONE</dc:creator><author>SIMONE</author><pubDate>Thu, 26 May 2016 06:12:00 GMT</pubDate><guid>http://www.blogjava.net/wangxinsh55/archive/2016/05/26/430665.html</guid><wfw:comment>http://www.blogjava.net/wangxinsh55/comments/430665.html</wfw:comment><comments>http://www.blogjava.net/wangxinsh55/archive/2016/05/26/430665.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/wangxinsh55/comments/commentRss/430665.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/wangxinsh55/services/trackbacks/430665.html</trackback:ping><description><![CDATA[<div>http://www.cnblogs.com/luogankun/p/3981645.html</div><br /><div><div id="cnblogs_post_body"><p><strong><span style="font-size: 14pt;">Spark history Server产生背景</span></strong></p> <p>以standalone运行模式为例，在运行Spark  Application的时候，Spark会提供一个WEBUI列出应用程序的运行时信息；但该WEBUI随着Application的完成(成功/失 败)而关闭，也就是说，Spark Application运行完(成功/失败)后，将无法查看Application的历史记录；</p> <p>Spark history  Server就是为了应对这种情况而产生的，通过配置可以在Application执行的过程中记录下了日志事件信息，那么在Application执行 结束后，WEBUI就能重新渲染生成UI界面展现出该Application在执行过程中的运行时信息；</p> <p>Spark运行在yarn或者mesos之上，通过spark的history server仍然可以重构出一个已经完成的Application的运行时参数信息（假如Application运行的事件日志信息已经记录下来）；</p> <p>&nbsp;</p> <p><strong><span style="font-size: 14pt;">配置&amp;使用Spark History Server</span></strong></p> <p>以默认配置的方式启动spark history server：</p> <div> <pre>cd $SPARK_HOME/<span style="color: #000000;">sbin <span style="color: #ff0000;"><strong>start</strong></span></span><span style="color: #ff0000;"><strong>-history-server.sh</strong></span></pre> </div> <p>报错：</p> <div> <pre>starting org.apache.spark.deploy.history.HistoryServer, logging to /home/spark/software/source/compile/deploy_spark/sbin/../logs/spark-spark-org.apache.spark.deploy.history.HistoryServer-<span style="color: #800080;">1</span>-<span style="color: #000000;">hadoop000.out failed to launch org.apache.spark.deploy.history.HistoryServer:         at org.apache.spark.deploy.history.FsHistoryProvider.</span>&lt;init&gt;(FsHistoryProvider.scala:<span style="color: #800080;">44</span><span style="color: #000000;">)         ... </span><span style="color: #800080;">6</span> <span style="color: #0000ff;">more</span></pre> </div> <p>需要在启动时指定目录：</p> <div> <pre><span style="color: #ff0000;"><strong>start-history-server.sh</strong></span> <span style="color: #0000ff;"><strong>hdfs://hadoop000:8020/directory</strong></span></pre> </div> <p>hdfs://hadoop000:8020/directory可以配置在配置文件中，那么在启动history-server时就不需要指定，后续介绍怎么配置；</p> <p><span style="color: #ff0000;"><strong>注：该目录需要事先在hdfs上创建好，否则history-server启动报错。</strong></span></p> <p>启动完成之后可以通过WEBUI访问，默认端口是18080：http://hadoop000:18080</p> <p>默认界面列表信息是空的，下面截图是我跑了几次spark-sql测试后出现的。</p> <p><img src="http://images.cnitblog.com/blog/635295/201409/191450158003642.png" alt="" height="337" width="962" /></p> <p>&nbsp;</p> <p><strong>history server相关的配置参数描述</strong></p> <p>1） spark.history.updateInterval<br />	　　默认值：10<br />	　　以秒为单位，更新日志相关信息的时间间隔</p> <p>2）spark.history.retainedApplications<br />	　　默认值：50<br />	　　在内存中保存Application历史记录的个数，如果超过这个值，旧的应用程序信息将被删除，当再次访问已被删除的应用信息时需要重新构建页面。</p> <p>3）<span style="color: #ff0000;"><strong>spark.history.ui.port</strong></span><br />	　　默认值：18080<br />	　　HistoryServer的web端口</p> <p>4）spark.history.kerberos.enabled<br />　　默认值：false<br />	　　是否使用kerberos方式登录访问HistoryServer，对于持久层位于安全集群的HDFS上是有用的，如果设置为true，就要配置下面的两个属性</p> <p>5）spark.history.kerberos.principal<br />　　默认值：用于HistoryServer的kerberos主体名称</p> <p>6）spark.history.kerberos.keytab<br />	　　用于HistoryServer的kerberos keytab文件位置</p> <p>7）spark.history.ui.acls.enable<br />	　　默认值：false<br />	　　授权用户查看应用程序信息的时候是否检查acl。如果启用，只有应用程序所有者和spark.ui.view.acls指定的用户可以查看应用程序信息;否则，不做任何检查</p> <p>8）<span style="color: #ff0000;"><strong>spark.eventLog.enabled	</strong></span><br />	　　默认值：false	<br />	　　是否记录Spark事件，用于应用程序在完成后重构webUI</p> <p>9）<span style="color: #ff0000;"><strong>spark.eventLog.dir</strong></span><br />	　　默认值：file:///tmp/spark-events<br />	　　保存日志相关信息的路径，可以是hdfs://开头的HDFS路径，也可以是file://开头的本地路径，都需要提前创建</p> <p>10）<span style="color: #ff0000;"><strong>spark.eventLog.compress	</strong></span><br />	　　默认值：false	<br />	　　是否压缩记录Spark事件，前提spark.eventLog.enabled为true，默认使用的是snappy</p> <p><strong>以spark.history开头的需要配置在spark-env.sh中的SPARK_HISTORY_OPTS，以spark.eventLog开头的配置在spark-defaults.conf</strong></p> <p>&nbsp;</p> <p>我在测试过程中的配置如下：</p> <p>spark-defaults.conf</p> <div> <pre>spark.eventLog.enabled  <span style="color: #0000ff;">true</span><span style="color: #000000;"> spark.eventLog.dir      hdfs:</span><span style="color: #008000;">//</span><span style="color: #008000;">hadoop000:8020/directory</span> spark.eventLog.compress <span style="color: #0000ff;">true</span></pre> </div> <p>spark-env.sh</p> <div> <pre>export SPARK_HISTORY_OPTS="-D<strong><span style="color: #ff0000;">spark.history.ui.port</span></strong>=7777 -D<span style="color: #ff0000;"><strong>spark.history.retainedApplications</strong></span>=3 -D<span style="color: #ff0000;"><strong>spark.history.fs.logDirectory</strong></span>=hdfs://had oop000:8020/directory"</pre> </div> <p>参数描述：</p> <p>spark.history.ui.port=7777 &nbsp;调整WEBUI访问的端口号为7777</p> <p>spark.history.fs.logDirectory=hdfs://hadoop000:8020/directory &nbsp;配置了该属性后，在start-history-server.sh时就无需再显示的指定路径</p> <p>spark.history.retainedApplications=3 &nbsp; 指定保存Application历史记录的个数，如果超过这个值，旧的应用程序信息将被删除</p> <p>&nbsp;</p> <p>调整参数后启动start-history-server.sh</p> <div> <pre>start-history-server.<span style="color: #0000ff;">sh</span> </pre> </div> <p>访问WEBUI： http://hadoop000:7777</p> <p><img src="http://images.cnitblog.com/blog/635295/201409/191457178315444.png" alt="" height="215" width="843" /></p> <p>&nbsp;</p> <p>在使用spark history server的过程中产生的几个疑问：</p> <p><strong>疑问1：spark.history.fs.logDirectory和spark.eventLog.dir指定目录有啥区别？</strong></p> <p>经测试后发现：</p> <p>spark.eventLog.dir：Application在运行过程中所有的信息均记录在该属性指定的路径下；</p> <p>spark.history.fs.logDirectory：Spark History Server页面只展示该指定路径下的信息；</p> <p>比如：spark.eventLog.dir刚开始时指定的是hdfs://hadoop000:8020/directory，而后修改成hdfs://hadoop000:8020/directory2</p> <p>那么spark.history.fs.logDirectory如果指定的是hdfs://hadoop000:8020/directory，就只能显示出该目录下的所有Application运行的日志信息；反之亦然。</p> <p>&nbsp;</p> <p><strong>疑问2：spark.history.retainedApplications=3 貌似没生效？？？？？？</strong></p> <p>The History Server will list all applications. It will just retain a  max number of them in memory. That option does not control how many  applications are show, it controls how much memory the HS will need.</p> <p>注意：该参数并不是也页面中显示的application的记录数，而是存放在内存中的个数，内存中的信息在访问页面时直接读取渲染既可；</p> <p>比如说该参数配置了10个，那么内存中就最多只能存放10个applicaiton的日志信息，当第11个加入时，第一个就会被踢除，当再次访问第1个application的页面信息时就<span style="font-size: 14px; line-height: 1.5;">需要重新读取指定路径上的日志信息来渲染展示页面。&nbsp;</span></p> <p>详见官方文档：http://spark.apache.org/docs/latest/monitoring.html</p></div></div><img src ="http://www.blogjava.net/wangxinsh55/aggbug/430665.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/wangxinsh55/" target="_blank">SIMONE</a> 2016-05-26 14:12 <a href="http://www.blogjava.net/wangxinsh55/archive/2016/05/26/430665.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Spark On Yarn中spark.yarn.jar属性的使用</title><link>http://www.blogjava.net/wangxinsh55/archive/2016/05/26/430664.html</link><dc:creator>SIMONE</dc:creator><author>SIMONE</author><pubDate>Thu, 26 May 2016 06:11:00 GMT</pubDate><guid>http://www.blogjava.net/wangxinsh55/archive/2016/05/26/430664.html</guid><wfw:comment>http://www.blogjava.net/wangxinsh55/comments/430664.html</wfw:comment><comments>http://www.blogjava.net/wangxinsh55/archive/2016/05/26/430664.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/wangxinsh55/comments/commentRss/430664.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/wangxinsh55/services/trackbacks/430664.html</trackback:ping><description><![CDATA[<div>http://www.cnblogs.com/luogankun/p/4191796.html</div><br /><div><p>今天在测试spark-sql运行在yarn上的过程中，无意间从日志中发现了一个问题：</p> <div> <pre>spark-sql --master yarn</pre> </div> <div><div><a title="复制代码"><img src="http://common.cnblogs.com/images/copycode.gif" alt="复制代码" /></a></div> <pre><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">23</span>:<span style="color: #800080;">17</span> INFO Client: Requesting a new application from cluster with <span style="color: #800080;">1</span><span style="color: #000000;"> NodeManagers </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">23</span>:<span style="color: #800080;">17</span> INFO Client: Verifying our application has not requested <span style="color: #0000ff;">more</span> than the maximum memory capability of the cluster (<span style="color: #800080;">8192</span><span style="color: #000000;"> MB per container) </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">23</span>:<span style="color: #800080;">17</span> INFO Client: Will allocate AM container, with <span style="color: #800080;">896</span> MB memory including <span style="color: #800080;">384</span><span style="color: #000000;"> MB overhead </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">23</span>:<span style="color: #800080;">17</span> INFO Client: Setting up container launch context <span style="color: #0000ff;">for</span><span style="color: #000000;"> our AM </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">23</span>:<span style="color: #800080;">17</span> INFO Client: Preparing resources <span style="color: #0000ff;">for</span><span style="color: #000000;"> our AM container </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">23</span>:<span style="color: #800080;">17</span> INFO Client: <span style="color: #ff0000;"><strong>Uploading resource file:/home/spark/software/source/compile/deploy_spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar -&gt; hdfs://hadoop000:8020/user/spark/.sparkStaging/<span style="color: #00ff00;">application_1416381870014_0093</span>/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar</strong></span> <span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">23</span>:<span style="color: #800080;">18</span> INFO Client: Setting up the launch environment <span style="color: #0000ff;">for</span> our AM container</pre> <div><a title="复制代码"><img src="http://common.cnblogs.com/images/copycode.gif" alt="复制代码" /></a></div></div> <p>再开启一个spark-sql命令行，从日志中再次发现：</p> <div><div><a title="复制代码"><img src="http://common.cnblogs.com/images/copycode.gif" alt="复制代码" /></a></div> <pre><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">24</span>:<span style="color: #800080;">03</span> INFO Client: Requesting a new application from cluster with <span style="color: #800080;">1</span><span style="color: #000000;"> NodeManagers </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">24</span>:<span style="color: #800080;">03</span> INFO Client: Verifying our application has not requested <span style="color: #0000ff;">more</span> than the maximum memory capability of the cluster (<span style="color: #800080;">8192</span><span style="color: #000000;"> MB per container) </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">24</span>:<span style="color: #800080;">03</span> INFO Client: Will allocate AM container, with <span style="color: #800080;">896</span> MB memory including <span style="color: #800080;">384</span><span style="color: #000000;"> MB overhead </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">24</span>:<span style="color: #800080;">03</span> INFO Client: Setting up container launch context <span style="color: #0000ff;">for</span><span style="color: #000000;"> our AM </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">24</span>:<span style="color: #800080;">03</span> INFO Client: Preparing resources <span style="color: #0000ff;">for</span><span style="color: #000000;"> our AM container </span><span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">24</span>:<span style="color: #800080;">03</span> INFO Client: <span style="color: #ff0000;"><strong>Uploading resource file:/home/spark/software/source/compile/deploy_spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar -&gt; hdfs://hadoop000:8020/user/spark/.sparkStaging/<span style="color: #00ff00;">application_1416381870014_0094</span>/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar</strong></span> <span style="color: #800080;">14</span>/<span style="color: #800080;">12</span>/<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">24</span>:<span style="color: #800080;">05</span> INFO Client: Setting up the launch environment <span style="color: #0000ff;">for</span> our AM container</pre> <div><a title="复制代码"><img src="http://common.cnblogs.com/images/copycode.gif" alt="复制代码" /></a></div></div> <p>然后查看HDFS上的文件：</p> <div> <pre>hadoop fs -<span style="color: #0000ff;">ls</span> hdfs:<span style="color: #008000;">//</span><span style="color: #008000;">hadoop000:8020/user/spark/.sparkStaging/</span></pre> </div> <div> <pre>drwx------   - spark supergroup          <span style="color: #800080;">0</span> <span style="color: #800080;">2014</span>-<span style="color: #800080;">12</span>-<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">23</span> hdfs:<span style="color: #008000;">//</span><span style="color: #008000;">hadoop000:8020/user/spark/.sparkStaging/<span style="color: #00ff00;">application_1416381870014_0093</span></span> drwx------   - spark supergroup          <span style="color: #800080;">0</span> <span style="color: #800080;">2014</span>-<span style="color: #800080;">12</span>-<span style="color: #800080;">29</span> <span style="color: #800080;">15</span>:<span style="color: #800080;">24</span> hdfs:<span style="color: #008000;">//</span><span style="color: #008000;">hadoop000:8020/user/spark/.sparkStaging/<span style="color: #00ff00;">application_1416381870014_0094</span></span></pre> </div> <p>每个Application都会上传一个spark-assembly-x.x.x-SNAPSHOT-hadoopx.x.x-cdhx.x.x.jar的jar包，影响HDFS的性能以及占用HDFS的空间。</p> <p>&nbsp;</p> <p>在Spark文档(http://spark.apache.org/docs/latest/running-on-yarn.html)中发现<span style="color: #ff0000;"><strong>spark.yarn.jar</strong></span>属性，将spark-assembly-xxxxx.jar存放在hdfs://hadoop000:8020/spark_lib/下</p> <p>在spark-defaults.conf添加属性配置：</p> <div> <pre><strong><span style="color: #ff0000;">spark.yarn.jar</span></strong> hdfs://hadoop000:8020/spark_lib/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar</pre> </div> <p>再次启动spark-sql --master yarn观察日志：</p> <div><div><a title="复制代码"><img src="http://common.cnblogs.com/images/copycode.gif" alt="复制代码" /></a></div> <pre><span style="color: #000000;">14/12/29 15:39:02 INFO Client: Requesting a new application from cluster with 1 NodeManagers 14/12/29 15:39:02 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 14/12/29 15:39:02 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 14/12/29 15:39:02 INFO Client: Setting up container launch context for our AM 14/12/29 15:39:02 INFO Client: Preparing resources for our AM container 14/12/29 15:39:02 INFO Client: <strong><span style="color: #ff0000;">Source and destination file systems are the same. Not copying hdfs://hadoop000:8020/spark_lib/spark-assembly-1.3.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar</span></strong> 14/12/29 15:39:02 INFO Client: Setting up the launch environment for our AM container</span></pre> <div><a title="复制代码"><img src="http://common.cnblogs.com/images/copycode.gif" alt="复制代码" /></a></div></div> <p>观察HDFS上文件</p> <div> <pre>hadoop fs -ls hdfs://hadoop000:8020/user/spark/.sparkStaging/application_1416381870014_0097</pre> </div> <p>该Application对应的目录下没有spark-assembly-xxxxx.jar了，从而节省assembly包上传的过程以及HDFS空间占用。</p> <p>&nbsp;</p> <p>我在测试过程中遇到了类似如下的错误：</p> <div> <pre>Application application_xxxxxxxxx_yyyy failed 2 times due to AM Container for application_xxxxxxxxx_yyyy <br /><br />exited with exitCode: -1000 due to: java.io.FileNotFoundException: File /tmp/hadoop-spark/nm-local-dir/filecache does not exist</pre> </div> <p>在/tmp/hadoop-spark/nm-local-dir路径下创建filecache文件夹即可解决报错问题。</p></div><img src ="http://www.blogjava.net/wangxinsh55/aggbug/430664.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/wangxinsh55/" target="_blank">SIMONE</a> 2016-05-26 14:11 <a href="http://www.blogjava.net/wangxinsh55/archive/2016/05/26/430664.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>