While messing around with MapReduce code, I've found it a bit tedious to generate the jar file, copy it to the machine running the JobTracker, and then run the job every time the job has been altered. I should be able to run my jobs directly from my development environment, as illustrated in the figure below. This post explains how I've "solved" this problem. It may also help when integrating Hadoop with other applications. I by no means claim that this is the proper way to do it, but it does the trick for me.

My Hadoop infrastructure
I assume that you have a (single-node) Hadoop 1.0.3 cluster properly installed on a dedicated or virtual machine. In this example, the JobTracker and HDFS reside on IP address 192.168.102.131. Let's start out with a simple job that does nothing except start up and terminate:
package com.pcbje.hadoopjobs;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MyFirstJob {

    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();

        JobConf job = new JobConf(config);
        job.setJarByClass(MyFirstJob.class);
        job.setJobName("My first job");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MyFirstJob.MyFirstMapper.class);
        job.setReducerClass(MyFirstJob.MyFirstReducer.class);

        JobClient.runJob(job);
    }

    // Deliberately empty: the job only starts up and terminates.
    private static class MyFirstMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
        }
    }

    private static class MyFirstReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
        }
    }
}

Now, most of the examples you find online show you a local-mode setup where all the Hadoop components (HDFS, JobTracker, etc.) run on the same machine. A typical mapred-site.xml configuration might look like this:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

As far as I can tell, such a configuration requires that jobs be submitted from the same node as the JobTracker. This is what I want to avoid. The first thing to do is to point the fs.default.name attribute at the IP address of my NameNode:
Configuration conf = new Configuration();
conf.set("fs.default.name", "192.168.102.131:9000");

And in core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>192.168.102.131:9000</value>
</property>
</configuration>

This tells the job to connect to the HDFS residing on a different machine. Running the job with this configuration will read from and write to the remote HDFS correctly, but the JobTracker at 192.168.102.131:9001 will not notice it. This means that the admin panel at 192.168.102.131:50030 won't list the job either. So the next thing to do is to tell the job configuration to submit the job to the appropriate JobTracker, like this:
config.set("mapred.job.tracker", "192.168.102.131:9001");

You also need to change mapred-site.xml to allow external connections. This can be done by replacing "localhost" with the JobTracker's IP address:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>192.168.102.131:9001</value>
</property>
</configuration>

Restart Hadoop. Upon trying to run your job, you may get an exception like this:
SEVERE: PriviledgedActionException as:[user] cause:org.apache.hadoop.security.AccessControlException:
org.apache.hadoop.security.AccessControlException: Permission denied: user=[user], access=WRITE, inode="mapred":root:supergroup:rwxr-xr-x
If you do, this may be solved by adding the following to mapred-site.xml:
<configuration>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
</configuration>

And then execute the following commands:
stop-mapred.sh
start-mapred.sh
When you now submit your job, it should be picked up by the admin page over at :50030. However, it will most probably fail, and the log will tell you something like:
java.lang.ClassNotFoundException: com.pcbje.hadoopjobs.MyFirstJob$MyFirstMapper
In order to fix this, you have to ensure that all dependencies of the submitted job are available to the JobTracker. This can be achieved by exporting the project as a runnable jar and then executing something like:
java -jar myfirstjob-jar-with-dependencies.jar /input/path /output/path
If your user has the appropriate permissions on the input and output directories on HDFS, the job should now run successfully. This can be verified in the console and on the administration panel.
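If the job complains about missing directories or permissions instead, a quick way to prepare the input is something along these lines, run on the cluster node as the HDFS superuser (the paths, the sample file and [user] are placeholders; the output directory must not exist before the job runs):

hadoop fs -mkdir /input/path
hadoop fs -put some-local-data.txt /input/path/
hadoop fs -chown -R [user] /input/path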
Manually exporting runnable jars requires a lot of clicks in IDEs such as Eclipse. If you are using Maven, you can tell it to build the jar with its dependencies (see this answer for details), which makes the process a whole lot easier. Finally, to make it even easier, place a tiny bash script in the same folder as pom.xml for building the Maven project and executing the jar:
#!/bin/sh
mvn assembly:assembly
java -jar $1 $2 $3
After making the script executable, you can build and submit the job with the following command:
./build-and-run-job target/myfirstjob-jar-with-dependencies.jar /input/path /output/path
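For reference, the Maven route boils down to configuring the maven-assembly-plugin with the jar-with-dependencies descriptor. A minimal sketch of the plugin section in pom.xml might look like the following (the main class is the MyFirstJob class from above; the exact name of the produced jar depends on your artifactId and version):

<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.pcbje.hadoopjobs.MyFirstJob</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>

With this in place, mvn assembly:assembly produces the jar-with-dependencies artifact in target/.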
If you run an HBase MapReduce job from Eclipse on Windows, an exception is thrown. By default the MapReduce task runs locally, and because file permissions are set the UNIX way, it fails with:

java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "ls": CreateProcess error=2

The solution is to send the job to a remote host, typically one running Linux, by adding this to hbase-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>

You also need to turn off HDFS permission checking:
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

Also, because the job executes on the remote host, custom classes such as the Mapper/Reducer must be packaged into a jar and uploaded; for details, see:
Hadoop Job Submission Analysis (Part 5)
http://www.cnblogs.com/spork/archive/2010/04/21/1717592.html
After several days of digging I finally figured it out: the Configuration holds the job's settings, and the remote JobTracker builds and runs the job from it. Since the remote host does not have your custom MapReduce classes, they must be packaged into a jar and uploaded, but you do not have to upload it manually every time; you can set it in code:
conf.set("tmpjars", "d:/aaa.jar");

Note also that on Windows the path separator is ";", so the generated jar list is separated by ";", which the remote Linux host cannot parse. Change it with:
System.setProperty("path.separator", ":");

Reference articles:
http://www.cnblogs.com/xia520pi/archive/2012/05/20/2510723.html (Submitting a Job with the hadoop eclipse plugin and adding multiple third-party jars, complete edition)
http://heipark.iteye.com/blog/1171923
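Putting these pieces together, a minimal sketch of a driver prepared for submission from Windows might look like the following (the master addresses and the d:/aaa.jar path are taken from the snippets above and are assumptions, not a tested configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class RemoteSubmitSketch {
    public static JobConf createJobConf() {
        // Make the generated jar list use ':' so the remote Linux host can parse it.
        System.setProperty("path.separator", ":");

        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://master:9000");   // remote HDFS
        conf.set("mapred.job.tracker", "master:9001");       // remote JobTracker
        // Ship the jar containing the custom Mapper/Reducer classes with the job.
        conf.set("tmpjars", "d:/aaa.jar");

        JobConf job = new JobConf(conf);
        // ...then set mapper, reducer, input and output paths as in MyFirstJob
        // and submit with JobClient.runJob(job).
        return job;
    }
}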
1. Add master 127.0.0.1 to the hosts file
2. Set up passwordless SSH login
3. Hadoop configuration files
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/paul/Documents/PAUL/DOWNLOAD/SOFTWARE/DEVELOP/HADOOP/hadoop-tmp-data</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<!--
<property>
<name>dfs.name.dir</name>
<value>/Users/paul/Documents/PAUL/DOWNLOAD/SOFTWARE/DEVELOP/HADOOP/hadoop-tmp-data/hdfs-data-name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/Users/paul/Documents/PAUL/DOWNLOAD/SOFTWARE/DEVELOP/HADOOP/hadoop-tmp-data/hdfs-data</value>
</property>
-->
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>8</value>
<description>The maximum number of tasks that will be run simultaneously by
a task tracker
</description>
</property>
</configuration>
master
4. Format the namenode
5. Start Hadoop
6. HBase configuration file
hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
* Copyright 2010 The Apache Software Foundation
*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
-->
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value><!-- use this value for a single-machine setup -->
</property>
</configuration>
7. Start HBase
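For steps 4, 5 and 7, the commands are the standard scripts from the Hadoop and HBase bin directories; a sketch, assuming you run each from its installation directory:

bin/hadoop namenode -format
bin/start-all.sh
bin/start-hbase.sh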
1. JVM startup parameters
I set them as follows:
java -Xmx1024m -Xms1024m -Xss128k -XX:NewRatio=4 -XX:SurvivorRatio=4 -XX:MaxPermSize=16m
After starting Tomcat, run jmap -heap `pgrep -u root java`, which prints the following:
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 1073741824 (1024.0MB)
NewSize = 1048576 (1.0MB)
MaxNewSize = 4294901760 (4095.9375MB)
OldSize = 4194304 (4.0MB)
NewRatio = 4
SurvivorRatio = 4
PermSize = 12582912 (12.0MB)
MaxPermSize = 16777216 (16.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 178913280 (170.625MB)
used = 51533904 (49.14656066894531MB)
free = 127379376 (121.47843933105469MB)
28.80384508070055% used
Eden Space:
capacity = 143130624 (136.5MB)
used = 51533904 (49.14656066894531MB)
free = 91596720 (87.35343933105469MB)
36.00480635087569% used
From Space:
capacity = 35782656 (34.125MB)
used = 0 (0.0MB)
free = 35782656 (34.125MB)
0.0% used
To Space:
capacity = 35782656 (34.125MB)
used = 0 (0.0MB)
free = 35782656 (34.125MB)
0.0% used
tenured generation:
capacity = 859045888 (819.25MB)
used = 1952984 (1.8625106811523438MB)
free = 857092904 (817.3874893188477MB)
0.22734338494383202% used
Perm Generation:
capacity = 12582912 (12.0MB)
used = 6656024 (6.347679138183594MB)
free = 5926888 (5.652320861816406MB)
52.897326151529946% used
--------------------------------------------------------------------------------
-Xmx1024m -Xms1024m -Xss128k -XX:NewRatio=4 -XX:SurvivorRatio=4 -XX:MaxPermSize=16m
-Xmx1024m: maximum heap size of 1024M
-Xms1024m: initial heap size of 1024M
-XX:NewRatio=4
means young generation : old generation = 1 : 4, so 1024M / 5 = 204.8M
hence young generation = 204.8M and old generation = 819.2M
-XX:SurvivorRatio=4
means that within the young generation, the two survivor spaces : Eden = 2 : 4, so 204.8M / 6 = 34.13M per survivor space
hence Eden = 136.53M and each survivor space = 34.13M
Checking with jmap -heap <pid>
shows results consistent with our calculation.
--------------------------------------------------------------------------------
3. Write a test page
Create a page perf.jsp in the web root with the following content:
<% int m = Integer.parseInt(request.getParameter("m")), s = Integer.parseInt(request.getParameter("s"));
int size = 1024 * 1024 * m; byte[] buffer = new byte[size]; Thread.sleep(s); %>
Note: m sets how much memory (in MB) is allocated per request, and s is how many milliseconds to sleep.
4. Use jstat to monitor memory changes
Here I use jstat -gcutil `pgrep -u root java` 1500 10
To explain, there are three parameters:
· pgrep -u root java --> gets the Java process ID
· 1500 --> take a sample every 1500 ms
· 10 --> take 10 samples in total
5. Load testing with ab
The load-test command: [root@CentOS ~]# ab -c150 -n50000 "http://localhost/perf.jsp?m=1&s=10"
Note: this uses 150 concurrent connections for a total of 50000 requests.
By default you can access the page at http://localhost:8080/perf.jsp?m=1&s=10.
--------------------------------------------------------------------------------
Now for the experiment:
· First, start monitoring the Java heap:
[root@CentOS ~]# jstat -gcutil 8570 1500 10
· Then open another terminal and start the load test:
[root@CentOS ~]# ab -c150 -n50000 "http://localhost/perf.jsp?m=1&s=10"
After both commands finish, the results are as follows:
jstat:
[root@CentOS ~]# jstat -gcutil 8570 1500 10
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.06 0.00 53.15 2.03 67.18 52 0.830 1 0.218 1.048
0.00 0.04 18.46 2.03 67.18 55 0.833 1 0.218 1.052
0.03 0.00 28.94 2.03 67.18 56 0.835 1 0.218 1.053
0.00 0.04 34.02 2.03 67.18 57 0.836 1 0.218 1.054
0.04 0.00 34.13 2.03 67.18 58 0.837 1 0.218 1.055
0.00 0.04 38.62 2.03 67.18 59 0.838 1 0.218 1.056
0.04 0.00 8.39 2.03 67.18 60 0.839 1 0.218 1.058
0.04 0.00 8.39 2.03 67.18 60 0.839 1 0.218 1.058
0.04 0.00 8.39 2.03 67.18 60 0.839 1 0.218 1.058
0.04 0.00 8.39 2.03 67.18 60 0.839 1 0.218 1.058
A brief look at the results:
One of S0 and S1 is always empty at any time, and a minor GC occurs once Eden fills up to a certain ratio. Because the old generation here is set fairly large, no full GC occurred.
ab:
[root@CentOS ~]# ab -c150 -n50000 "http://localhost/perf.jsp?m=1&s=10"
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 5000 requests
Completed 10000 requests
Completed 15000 requests
Completed 20000 requests
Completed 25000 requests
Completed 30000 requests
Completed 35000 requests
Completed 40000 requests
Completed 45000 requests
Finished 50000 requests
Server Software: Apache/2.2.3
Server Hostname: localhost
Server Port: 80
Document Path: /perf.jsp?m=1&s=10
Document Length: 979 bytes
Concurrency Level: 150
Time taken for tests: 13.467648 seconds
Complete requests: 50000
Failed requests: 0
Write errors: 0
Non-2xx responses: 50005
Total transferred: 57605760 bytes
HTML transferred: 48954895 bytes
Requests per second: 3712.60 [#/sec] (mean)
Time per request: 40.403 [ms] (mean) # mean time per request
Time per request: 0.269 [ms] (mean, across all concurrent requests)
Transfer rate: 4177.05 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 46.5 0 3701
Processing: 10 38 70.3 36 6885
Waiting: 3 35 70.3 33 6883
Total: 10 39 84.4 37 6901
Percentage of the requests served within a certain time (ms)
50% 37
66% 38
75% 39
80% 39
90% 41
95% 43
98% 50
99% 58
100% 6901 (longest request)
step1: Install the JDK
1.1 sudo sh jdk-6u10-linux-i586.bin
1.2 sudo gedit /etc/environment
export JAVA_HOME=/home/linkin/Java/jdk1.6.0_23
export JRE_HOME=/home/linkin/Java/jdk1.6.0_23/jre
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
1.3 sudo gedit /etc/profile
Add the following lines before umask 022:
export JAVA_HOME=/home/linkin/Java/jdk1.6.0_23
export JRE_HOME=/home/linkin/Java/jdk1.6.0_23/jre
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH:$HOME/bin
Change the time zone:
cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
Install NTP:
yum install ntp
After installation, run
ntpdate cn.pool.ntp.org
to synchronize the clock.
To synchronize the time automatically at boot:
edit /etc/rc.d/rc.local and add at the bottom
ntpdate cn.pool.ntp.org
Disable IPv6:
append the following to the end of /etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
Restart the server.
Remove any IPv6 DNS servers.
step2: Passwordless SSH login
2.1 First, on the master host: linkin@master:~$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
2.2 linkin@master:~$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys (appends id_dsa.pub to authorized_keys)
2.3 linkin@master:~/.ssh$ scp id_dsa.pub linkin@192.168.149.2:/home/linkin
2.4 Log in to the linkin host and run $ cat id_dsa.pub >> .ssh/authorized_keys
authorized_keys must have permission 600: chmod 600 .ssh/authorized_keys
2.5 Perform the same steps on the DataNode to enable passwordless login in both directions.
step3: Install Hadoop
3.1 Set hadoop-env.sh
export JAVA_HOME=/home/linkin/jdk1.6.0_10
3.2 Configure core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/home/linkin/hadoop-0.20.2/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value><!-- use the hostname -->
</property>
3.3 Configure hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
3.4 Configure mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value><!-- use the hostname -->
</property>

3.5 Configure masters and slaves
masters: master (hostname); slaves: linkin (hostname). These two configuration files do not need to be copied to the other machines; keeping them on the master is enough.
3.6 Configure the hosts file
127.0.0.1 localhost (note: do not add anything else here, such as the machine name, or HBase's master name will become localhost)
192.168.149.7 master
192.168.149.2 linkin
3.7 Configure profile: append the following at the end, then run source /etc/profile to make it take effect
export JAVA_HOME=/home/linkin/jdk1.6.0_10
export JRE_HOME=/home/linkin/jdk1.6.0_10/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$PATH
# Hadoop settings
export HADOOP_HOME=/home/linkin/hadoop-0.20.2
export PATH=$HADOOP_HOME/bin:$PATH
#export PATH=$PATH:$HIVE_HOME/bin
3.8 Copy hadoop-0.20.2 to the corresponding directory on the other hosts. Also copy /etc/profile and /etc/hosts to the other machines. profile must be sourced again to take effect.
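For example, something along these lines (a sketch using the hosts and paths from above; the files under /etc typically require root):

scp -r /home/linkin/hadoop-0.20.2 linkin@192.168.149.2:/home/linkin/
scp /etc/profile /etc/hosts root@192.168.149.2:/etc/
# then, on the other machine, run: source /etc/profile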
step4: Format HDFS
bin/hadoop namenode -format
bin/hadoop dfs -ls
step5: Start Hadoop
bin/start-all.sh
View HDFS: http://192.168.149.7:50070
View job status: http://192.168.149.7:50030/jobtracker.jsp
References:
http://wiki.ubuntu.org.cn/%E5%88%A9%E7%94%A8Cloudera%E5%AE%9E%E7%8E%B0Hadoop