When running a MapReduce job whose input consists of many small files, by default a separate map task is created for each file, so a large number of map tasks are launched. This needs to be optimized so that several files can be combined into the input of one map task.
The underlying logic consists of three steps:
1. For each file under the input directory whose length exceeds mapred.max.split.size, cut the file into several splits on block boundaries (one split is the input of one map). Each of these splits is at least mapred.max.split.size long and, being made of whole blocks, also no smaller than the block size. If the remaining tail of the file is larger than mapred.min.split.size.per.node, it becomes a split of its own; otherwise it is set aside for now.
2. What is left are short fragments. Within each rack, fragments are merged: whenever the accumulated length exceeds mapred.max.split.size, a split is emitted. If the fragments remaining on a rack add up to more than mapred.min.split.size.per.rack, they are merged into one split; otherwise they are set aside.
3. Fragments from different racks are merged: whenever the accumulated length exceeds mapred.max.split.size, a split is emitted. Whatever is left at the end is merged into one final split, regardless of its length.
Example: mapred.max.split.size=1000
mapred.min.split.size.per.node=300
mapred.min.split.size.per.rack=100
The input directory contains five files: three on rack1 with lengths 2050, 1499 and 10, and two on rack2 with lengths 1010 and 80. The block size is 500.
After step 1, five splits are produced: 1000, 1000, 1000, 499, 1000. The remaining fragments are 50 and 10 on rack1, and 10 and 80 on rack2.
Since the fragments on each rack add up to less than 100, step 2 changes nothing: the splits and the fragments stay the same.
In step 3, the four fragments are merged into one split of length 150.
To reduce the number of maps, increase mapred.max.split.size; to get more maps, decrease it.
The characteristics of this scheme are: a block feeds at most one map; a file may consist of several blocks and can therefore be split across different maps; and a single map may process several blocks and hence several files.
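To make the three passes concrete, here is a minimal sketch (not the actual Hadoop source) of the combining logic, using the numbers from the example above. It only tracks lengths and ignores block boundaries, paths and host locality, which the real CombineFileInputFormat handles; the class and constant names are made up for illustration.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombineSplitSketch {
    static final long MAX_SPLIT = 1000;      // mapred.max.split.size
    static final long MIN_PER_NODE = 300;    // mapred.min.split.size.per.node
    static final long MIN_PER_RACK = 100;    // mapred.min.split.size.per.rack

    public static void main(String[] args) {
        Map<String, long[]> filesByRack = new LinkedHashMap<String, long[]>();
        filesByRack.put("rack1", new long[] { 2050, 1499, 10 });
        filesByRack.put("rack2", new long[] { 1010, 80 });

        List<Long> splits = new ArrayList<Long>();
        List<Long> crossRackFragments = new ArrayList<Long>();

        for (Map.Entry<String, long[]> rack : filesByRack.entrySet()) {
            List<Long> fragments = new ArrayList<Long>();
            // Pass 1: cut each file into splits of MAX_SPLIT; a tail of at least
            // MIN_PER_NODE becomes its own split, a smaller tail becomes a fragment.
            for (long len : rack.getValue()) {
                long remaining = len;
                while (remaining >= MAX_SPLIT) {
                    splits.add(MAX_SPLIT);
                    remaining -= MAX_SPLIT;
                }
                if (remaining >= MIN_PER_NODE) {
                    splits.add(remaining);
                } else if (remaining > 0) {
                    fragments.add(remaining);
                }
            }
            // Pass 2: merge fragments within the rack; emit a split whenever the
            // accumulated length reaches MAX_SPLIT, or at the end if it reaches MIN_PER_RACK.
            long acc = 0;
            for (long frag : fragments) {
                acc += frag;
                if (acc >= MAX_SPLIT) {
                    splits.add(acc);
                    acc = 0;
                }
            }
            if (acc >= MIN_PER_RACK) {
                splits.add(acc);
            } else if (acc > 0) {
                crossRackFragments.add(acc);
            }
        }
        // Pass 3: merge the leftover fragments across racks; whatever remains at the
        // end becomes one last split regardless of its length.
        long acc = 0;
        for (long frag : crossRackFragments) {
            acc += frag;
            if (acc >= MAX_SPLIT) {
                splits.add(acc);
                acc = 0;
            }
        }
        if (acc > 0) {
            splits.add(acc);
        }
        // Prints [1000, 1000, 1000, 499, 1000, 150] for the example above.
        System.out.println(splits);
    }
}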
Note: CombineFileInputFormat is an abstract class, so a concrete subclass has to be written:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;
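// A concrete InputFormat that packs many small files into each split and reads them line by line.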
@SuppressWarnings("deprecation")
public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {
@SuppressWarnings({ "unchecked", "rawtypes" })
@Override
public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
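// CombineFileRecordReader instantiates one myCombineFileRecordReader for each chunk in the combined split.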
return new CombineFileRecordReader(conf, (CombineFileSplit) split, reporter, (Class) myCombineFileRecordReader.class);
}
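// RecordReader for one chunk of a CombineFileSplit; it simply delegates to a LineRecordReader.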
public static class myCombineFileRecordReader implements RecordReader<LongWritable, Text> {
private final LineRecordReader linerecord;
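// Invoked via reflection once per chunk: wrap the index-th file region in a FileSplit and open a LineRecordReader on it.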
public myCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index) throws IOException {
FileSplit filesplit = new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index), split.getLocations());
linerecord = new LineRecordReader(conf, filesplit);
}
@Override
public void close() throws IOException {
linerecord.close();
}
@Override
public LongWritable createKey() {
return linerecord.createKey();
}
@Override
public Text createValue() {
return linerecord.createValue();
}
@Override
public long getPos() throws IOException {
return linerecord.getPos();
}
@Override
public float getProgress() throws IOException {
return linerecord.getProgress();
}
@Override
public boolean next(LongWritable key, Text value) throws IOException {
return linerecord.next(key, value);
}
}
}
Set it up like this in the driver at run time:
if (argument != null) {
    conf.set("mapred.max.split.size", argument);
} else {
    conf.set("mapred.max.split.size", "134217728"); // 128 MB
}
conf.setInputFormat(CombinedInputFormat.class);
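For reference, a minimal old-API driver might be wired together as follows; the class name CombineDriver and the args-based input/output paths are placeholders, and the mapper/reducer are left at the identity defaults.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CombineDriver.class);
        conf.setJobName("combine-small-files");

        // Cap the combined split size at 128 MB; raise it to get fewer maps.
        conf.set("mapred.max.split.size", "134217728");

        // Use the CombineFileInputFormat subclass defined above.
        conf.setInputFormat(CombinedInputFormat.class);

        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        // No mapper/reducer set: the identity classes are used by default.

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}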
By Tzu-Cheng Chuang 1-28-2011
Requires: Ubuntu 10.04, Hadoop 0.20.2, ZooKeeper 3.3.2, HBase 0.90.0
1. Download Ubuntu 10.04 desktop 32 bit from Ubuntu website.
2. Install Ubuntu 10.04 with username: hadoop, password: password, disk size: 20GB, memory: 2048MB, 1 processor, 2 cores
3. Install build-essential (for GNU C, C++ compiler) $ sudo apt-get install build-essential
4. Install sun-java6-jdk
(1) Add the Canonical Partner Repository to your apt repositories
$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
(2) Update the source list
$ sudo apt-get update
(3) Install sun-java6-jdk and make sure Sun’s java is the default jvm
$ sudo apt-get install sun-java6-jdk
(4) Set environment variable by modifying ~/.bashrc file, put the following two lines in the end of the file
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$JAVA_HOME/bin
5. Configure SSH server so that ssh to localhost doesn’t need a passphrase
(1) Install openssh server
$ sudo apt-get install openssh-server
(2) Generate RSA pair key
$ ssh-keygen -t rsa -P ""
(3) Enable SSH access to local machine
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
6. Disable IPv6 by modifying the /etc/sysctl.conf file; put the following lines at the end of the file
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
7. Install hadoop
(1) Download hadoop-0.20.2.tar.gz (the stable release as of 1/25/2011) from the Apache hadoop website
(2) Extract hadoop archive file to /usr/local/
(3) Make symbolic link
(4) Modify /usr/local/hadoop/conf/hadoop-env.sh
Change from
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
(5) Create the /usr/local/hadoop-datastore folder
$ sudo mkdir /usr/local/hadoop-datastore
$ sudo chown hadoop:hadoop /usr/local/hadoop-datastore
$ sudo chmod 750 /usr/local/hadoop-datastore
(6) Put the following code in /usr/local/hadoop/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp/dir/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
(7) Put the following code in /usr/local/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
(8) Put the following code in /usr/local/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
(9) Add hadoop to the environment variables by modifying ~/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
8. Restart Ubuntu Linux
9. Copy this virtual machine to another folder, so that we have at least 2 copies of the Ubuntu Linux VM.
10. Modify /etc/hosts on both Linux virtual machines and add the following lines to the file. The IP addresses depend on each machine; use ifconfig to find them out.
# /etc/hosts (for master AND slave)
192.168.0.1 master
192.168.0.2 slave
Also modify the following hostname line, because the default entry might cause HBase to resolve the wrong IP:
192.168.0.1 ubuntu
11. Check hadoop user access on both machines.
The hadoop user on the master (aka hadoop@master) must be able to connect a) to its own user account on the master – i.e. ssh master in this context and not necessarily ssh localhost – and b) to the hadoop user account on the slave (aka hadoop@slave) via a password-less SSH login. On both machines, make sure each one can connect to master, slave without typing passwords.
12. Cluster configuration
(1) Modify /usr/local/hadoop/conf/masters (only on the master machine):
master
(2) Modify /usr/local/hadoop/conf/slaves (only on the master machine):
master
slave
(3) Change "localhost" to "master" in /usr/local/hadoop/conf/core-site.xml and /usr/local/hadoop/conf/mapred-site.xml (only on the master machine)
(4) Change dfs.replication to "1" in /usr/local/hadoop/conf/hdfs-site.xml (only on the master machine)
13. Format the namenode, only once and only on the master machine
$ /usr/local/hadoop/bin/hadoop namenode -format
14. Later on, start the multi-node cluster by typing the following commands, only on the master. For now, do not start hadoop yet.
$ /usr/local/hadoop/bin/start-dfs.sh
$ /usr/local/hadoop/bin/start-mapred.sh
15. Install zookeeper only on the master node
(1) Download zookeeper-3.3.2.tar.gz from the Apache hadoop website
(2) Extract zookeeper-3.3.2.tar.gz
$ tar -xzf zookeeper-3.3.2.tar.gz
(3) Move the folder zookeeper-3.3.2 to /home/hadoop/ and create a symbolic link
$ mv zookeeper-3.3.2 /home/hadoop/ ; ln -s /home/hadoop/zookeeper-3.3.2 /home/hadoop/zookeeper
(4) Copy conf/zoo_sample.cfg to conf/zoo.cfg
$ cp conf/zoo_sample.cfg conf/zoo.cfg
(5) Modify conf/zoo.cfg: set dataDir=/home/hadoop/zookeeper/snapshot
16. Install Hbase on both master and slave nodes, and configure it as fully-distributed
(1) Download hbase-0.90.0.tar.gz from the Apache hadoop website
(2) Extract hbase-0.90.0.tar.gz
$ tar -xzf hbase-0.90.0.tar.gz
(3) Move the folder hbase-0.90.0 to /home/hadoop/ and create a symbolic link
$ mv hbase-0.90.0 /home/hadoop/ ; ln -s /home/hadoop/hbase-0.90.0 /home/hadoop/hbase
(4) Edit /home/hadoop/hbase/conf/hbase-site.xml, and put the following in between <configuration> and </configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:54310/hbase</value>
<description>The directory shared by region servers. Should be fully-qualified to include the filesystem to use. E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are false: standalone and pseudo-distributed setups with managed Zookeeper; true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master</value>
<description>Comma separated list of servers in the ZooKeeper Quorum. If HBASE_MANAGES_ZK is set in hbase-env.sh this is the list of servers which we will start/stop ZooKeeper on.</description>
</property>
(5) Modify the environment variables in /home/hadoop/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
export HBASE_IDENT_STRING=$HOSTNAME
export HBASE_MANAGES_ZK=false
(6) Overwrite /home/hadoop/hbase/conf/regionservers on both machines with:
master
slave
(7) Copy /usr/local/hadoop-0.20.2/hadoop-0.20.2-core.jar to /home/hadoop/hbase/lib/ on both machines.
This is very important to fix the version mismatch issue. Pay attention to its ownership and mode (755).
17. Start ZooKeeper. It seems the ZooKeeper bundled with HBase is not set up correctly.
$ /home/hadoop/zookeeper/bin/zkServer.sh start
(Optional) We can test whether ZooKeeper is running correctly by typing
$ /home/hadoop/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181
18. Start hadoop cluster
$ /usr/local/hadoop/bin/start-dfs.sh
$ /usr/local/hadoop/bin/start-mapred.sh
19. Start Hbase
$ /home/hadoop/hbase/bin/start-hbase.sh
20. Use Hbase shell
$ /home/hadoop/hbase/bin/hbase shell
To check whether HBase is running smoothly, open your browser and go to the following address.
http://localhost:60010
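(Optional, not part of the original steps) A short Java client against the HBase 0.90 API can also confirm the cluster is reachable from code. The table 'test' and column family 'cf' below are assumptions; they must first be created in the shell with: create 'test', 'cf'

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSmokeTest {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath, including hbase.zookeeper.quorum=master.
        Configuration conf = HBaseConfiguration.create();

        // Throws an exception if the HBase master cannot be reached.
        HBaseAdmin.checkHBaseAvailable(conf);

        // Write one cell and read it back from the assumed table 'test'.
        HTable table = new HTable(conf, "test");
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes("hello"));
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q1"))));
    }
}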
21. Later on, stop the multi-node cluster by typing the following commands, only on the master
(1) Stop Hbase
$ /home/hadoop/hbase/bin/stop-hbase.sh
(2) Stop hadoop file system (HDFS)
$ /usr/local/hadoop/bin/stop-mapred.sh
$ /usr/local/hadoop/bin/stop-dfs.sh
(3) Stop zookeeper
$ /home/hadoop/zookeeper/bin/zkServer.sh stop
Reference
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
http://wiki.apache.org/hadoop/Hbase/10Minutes
http://hbase.apache.org/book/quickstart.html
http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/
Author
Tzu-Cheng Chuang
Appendix - Install Pig and Hive
1. Install Pig 0.8.0 on this cluster
(1) Download pig-0.8.0.tar.gz from the Apache Pig project website. Then extract the file and move it to /home/hadoop/
$ tar -xzf pig-0.8.0.tar.gz ; mv pig-0.8.0 /home/hadoop/
(2) Make symbolic links under pig-0.8.0/conf/
$ ln -s /usr/local/hadoop/conf/core-site.xml /home/hadoop/pig-0.8.0/conf/core-site.xml
$ ln -s /usr/local/hadoop/conf/mapred-site.xml /home/hadoop/pig-0.8.0/conf/mapred-site.xml
$ ln -s /usr/local/hadoop/conf/hdfs-site.xml /home/hadoop/pig-0.8.0/conf/hdfs-site.xml
(3) Start pig in map-reduce mode:
$ /home/hadoop/pig-0.8.0/bin/pig
(4) Exit pig from the grunt> prompt by typing quit
2. Install Hive on this cluster
(1) Download hive-0.6.0.tar.gz from the Apache hive project website, and then extract the file and move it to /home/hadoop/
$ tar -xzf hive-0.6.0.tar.gz ; mv hive-0.6.0 ~/
(2) Modify the Java heap size in hive-0.6.0/bin/ext/execHiveCmd.sh: change 4096 to 1024
(3) Create /tmp and /user/hive/warehouse in HDFS and set them chmod g+w before a table can be created in Hive
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse
(4) Start Hive
$ /home/hadoop/hive-0.6.0/bin/hive
3. (Optional) Load data by using Hive
Create a file /home/hadoop/customer.txt with the following content:
1, Kevin
2, David
3, Brian
4, Jane
5, Alice
After the hive shell is started, type in:
> CREATE TABLE IF NOT EXISTS customer(id INT, name STRING)
> ROW FORMAT delimited fields terminated by ','
> STORED AS TEXTFILE;
> LOAD DATA LOCAL INPATH '/home/hadoop/customer.txt' OVERWRITE INTO TABLE customer;
> SELECT customer.id, customer.name from customer;