This document is for Anyela Chavarro.
Only these versions of each framework work together:
 Hadoop 1.2.1
 HBase 0.90.4
 Nutch 2.2.1
 Elasticsearch 0.19.4
 Ubuntu 12.04.2 LTS
Hadoop cluster environment:
Name node/Job tracker
192.168.1.100 master
Data node/Task tracker
192.168.1.101 slave1
192.168.1.102 slave2
192.168.1.103 slave3
Install Hadoop (pseudo-distributed mode)
     - add user hadoop
      useradd -s /bin/bash -d /home/hadoop -m hadoop
- set password
      passwd hadoop
- login as hadoop
      su hadoop
- add a data folder
      mkdir data
- uninstall OpenJDK (CentOS only; skip this on Ubuntu)
 [hadoop@netfox ~]$ rpm -qa | grep java
 java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
 java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
 [hadoop@netfox ~]$ rpm -e --nodeps java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
 [hadoop@netfox ~]$ rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
 
- install JDK 1.6
 apt-get update
 apt-get install python-software-properties
 add-apt-repository ppa:webupd8team/java
 apt-get update
 apt-get install oracle-java6-installer
 
- get hadoop tar file
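 One way to fetch it, assuming the Apache archive still hosts this release (the URL is an assumption; verify before use):
      wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz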
 
- untar tar file
      [hadoop@netfox hadoop]$ tar -vxf hadoop-1.2.1.tar.gz
- install ssh-server
 apt-get install openssh-server 
- set up an SSH key (ssh-keygen is the built-in tool on Linux)
      [hadoop@netfox hadoop]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
- append the public key to the authorized_keys file
      [hadoop@netfox hadoop]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
- restrict the permissions on the authorized_keys file
      [hadoop@netfox hadoop]$ chmod 600 ~/.ssh/authorized_keys
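 To confirm the key works, ssh to localhost; it should log in without prompting for a password (accept the host fingerprint on first connect):
      [hadoop@netfox hadoop]$ ssh localhost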
- find the IP of the local machine
 [hadoop@netfox hadoop]$ ifconfig
 The IP can be found in a line like:
 inet addr:192.168.1.100
- add to /etc/hosts; this entry should be the first line of the file.
  [hadoop@netfox hadoop]$ vi /etc/hosts
  192.168.1.100 master
- add to /etc/profile
      export JAVA_HOME=/usr/lib/jvm/java-6-oracle
      export HADOOP_HOME=/home/hadoop/hadoop-1.2.1
      export HBASE_HOME=/home/hadoop/hbase-0.90.4
      export PATH=$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
- source it
      [hadoop@netfox hadoop]$ source /etc/profile
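 A quick sanity check that the variables took effect:
      [hadoop@netfox hadoop]$ echo $JAVA_HOME
      [hadoop@netfox hadoop]$ hadoop version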
- create folder
      hadoop@netfox:~$ mkdir /home/hadoop/data
- edit /home/hadoop/hadoop-1.2.1/conf/hdfs-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <!-- Put site-specific property overrides in this file. -->

  <configuration>

  <property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
  </property>

  <property>
  <name>dfs.permissions</name>
  <value>false</value>
  </property>

  </configuration>
- edit /home/hadoop/hadoop-1.2.1/conf/mapred-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <!-- Put site-specific property overrides in this file. -->

  <configuration>

  <property>
  <name>mapred.job.tracker</name>
  <value>master:9002</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
  </property>

  </configuration>
- edit /home/hadoop/hadoop-1.2.1/conf/core-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <!-- Put site-specific property overrides in this file. -->

  <configuration>

  <property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/data</value>
  <description>A base for other temporary directories.</description>
  </property>

  <property>
  <name>fs.default.name</name>
  <value>hdfs://master:9001</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
  </property>

  </configuration>
- add to /home/hadoop/hadoop-1.2.1/conf/hadoop-env.sh
      export JAVA_HOME=/usr/lib/jvm/java-6-oracle
- add to /home/hadoop/hadoop-1.2.1/conf/slaves and masters
      master
- format the hadoop namenode
      [hadoop@netfox ~]$ hadoop namenode -format
- start hadoop
      [hadoop@netfox hadoop]$ start-all.sh
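 Once start-all.sh completes, jps (shipped with the JDK) should list all five daemons: NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker. If one is missing, check the logs under $HADOOP_HOME/logs.
      [hadoop@netfox hadoop]$ jps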
- check that hadoop installed correctly
      [hadoop@netfox hadoop]$ hadoop dfs -ls /
 Found 4 items
 drwxr-xr-x   - hadoop supergroup          0 2013-08-28 14:02 /chukwa
 drwxr-xr-x   - hadoop supergroup          0 2013-08-29 09:53 /hbase
 drwxr-xr-x   - hadoop supergroup          0 2013-08-27 10:36 /opt
 drwxr-xr-x   - hadoop supergroup          0 2013-09-01 15:22 /tmp
 
 
Install Hadoop(fully-distributed mode)
repeat steps 1-23 on slave1-3, but some steps differ:
     - change step 9 as below:
 don't generate a new key on the slaves; just transfer the public key from the master to each slave.
 [hadoop@netfox hadoop]$ scp ~/.ssh/id_dsa.pub hadoop@slave1:/home/hadoop
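 The copied key still has to be appended to authorized_keys on every slave; a sketch of that follow-up, run as hadoop on each slave and assuming the key landed in /home/hadoop as above:
      mkdir -p ~/.ssh
      cat /home/hadoop/id_dsa.pub >> ~/.ssh/authorized_keys
      chmod 600 ~/.ssh/authorized_keys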
 
 
- change step 12 as below:
 add to /etc/hosts
 [hadoop@netfox hadoop]$ vi /etc/hosts
 192.168.1.100 master
 192.168.1.101 slave1
 192.168.1.102 slave2
 192.168.1.103 slave3
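 From the master, each slave should now be reachable by name and log in without a password, e.g.:
  [hadoop@netfox hadoop]$ ssh slave1 hostname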
 
- step 20, add to /home/hadoop/hadoop-1.2.1/conf/masters
      master
 add to /home/hadoop/hadoop-1.2.1/conf/slaves
      slave1
      slave2
      slave3
 
- step 22, start hadoop only on master
 [hadoop@netfox hadoop]$ start-all.sh  
Install HBase
     - get hbase tar file
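 One way to fetch it, assuming the Apache archive still hosts this release (the URL is an assumption; verify before use):
      wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz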
     
     
- untar the file
      [hadoop@netfox ~]$ tar -vxf hbase-0.90.4.tar.gz
- change /home/hadoop/hbase-0.90.4/conf/hbase-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
  /**
   * Copyright 2010 The Apache Software Foundation
   *
   * Licensed to the Apache Software Foundation (ASF) under one
   * or more contributor license agreements.  See the NOTICE file
   * distributed with this work for additional information
   * regarding copyright ownership.  The ASF licenses this file
   * to you under the Apache License, Version 2.0 (the
   * "License"); you may not use this file except in compliance
   * with the License.  You may obtain a copy of the License at
   *
   *     http://www.apache.org/licenses/LICENSE-2.0
   *
   * Unless required by applicable law or agreed to in writing, software
   * distributed under the License is distributed on an "AS IS" BASIS,
   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   * See the License for the specific language governing permissions and
   * limitations under the License.
   */
  -->
  <configuration>

  <property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:9001/hbase</value>
  </property>

  <property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  </property>

  <property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
  </property>

  </configuration>
- change /home/hadoop/hbase-0.90.4/conf/regionservers as below
      master
- add JAVA_HOME to /home/hadoop/hbase-0.90.4/conf/hbase-env.sh
      export JAVA_HOME=/usr/lib/jvm/java-6-oracle
- replace with the new hadoop jar
      [hadoop@netfox ~]$ rm /home/hadoop/hbase-0.90.4/lib/hadoop-core-0.20-append-r1056497.jar
  [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/hadoop-core-1.2.1.jar /home/hadoop/hbase-0.90.4/lib
  [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-collections-3.2.1.jar /home/hadoop/hbase-0.90.4/lib
  [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-configuration-1.6.jar /home/hadoop/hbase-0.90.4/lib
- start hbase
      [hadoop@netfox ~]$ start-hbase.sh
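 Because hbase.cluster.distributed is true and the quorum is localhost, HBase manages its own ZooKeeper, so jps should now additionally show HMaster, HRegionServer and HQuorumPeer. Note that hbase.rootdir (hdfs://master:9001/hbase) must match the fs.default.name URI set in core-site.xml.
      [hadoop@netfox ~]$ jps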
- check that hbase installed correctly
      [hadoop@netfox ~]$ hbase shell
  HBase Shell; enter 'help<RETURN>' for list of supported commands.
  Type "exit<RETURN>" to leave the HBase Shell
  Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

  hbase(main):001:0> list
  TABLE
  webpage
  1 row(s) in 0.5270 seconds
 
Install Nutch
     - install ant
      [root@netfox ~]# apt-get install ant
- switch user and folder
      [root@netfox ~]# su hadoop
  [hadoop@netfox root]$ cd ~
- get nutch tar file
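 One way to fetch it, assuming the Apache archive still hosts this release (the URL is an assumption; verify before use):
      wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz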
     
     
- untar this file
      [hadoop@netfox webcrawer]$ tar -vxf apache-nutch-2.2.1-src.tar.gz
- add to /etc/profile
 export NUTCH_HOME=/home/hadoop/webcrawer/apache-nutch-2.2.1
 export PATH=$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
 
- change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/hbase-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
  /**
   * Copyright 2009 The Apache Software Foundation
   *
   * Licensed to the Apache Software Foundation (ASF) under one
   * or more contributor license agreements.  See the NOTICE file
   * distributed with this work for additional information
   * regarding copyright ownership.  The ASF licenses this file
   * to you under the Apache License, Version 2.0 (the
   * "License"); you may not use this file except in compliance
   * with the License.  You may obtain a copy of the License at
   *
   *     http://www.apache.org/licenses/LICENSE-2.0
   *
   * Unless required by applicable law or agreed to in writing, software
   * distributed under the License is distributed on an "AS IS" BASIS,
   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   * See the License for the specific language governing permissions and
   * limitations under the License.
   */
  -->
  <configuration>

  <property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:9001/hbase</value>
  </property>

  <property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  </property>

  <property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
  </property>

  </configuration>
- change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/nutch-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <!-- Put site-specific property overrides in this file. -->

  <configuration>

  <property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
  </property>
  <property>
  <name>http.agent.name</name>
  <value>NutchCrawler</value>
  </property>
  <property>
  <name>http.robots.agents</name>
  <value>NutchCrawler,*</value>
  </property>

  </configuration>
- Uncomment the following in the /home/hadoop/webcrawer/apache-nutch-2.2.1/ivy/ivy.xml file   
      <dependency org="org.apache.gora" name="gora-hbase" rev="0.2"
  conf="*->default" />
- add to /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/gora.properties file
      gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
- go to the nutch installation folder (/home/hadoop/webcrawer/apache-nutch-2.2.1) and run
      ant clean
  ant runtime
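 The build should leave a runtime/ directory containing deploy (the job jar and scripts for running on Hadoop) and local (for standalone runs); a quick check:
      [hadoop@netfox apache-nutch-2.2.1]$ ls runtime/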
- Create a directory in HDFS to upload the seed urls.
      [hadoop@netfox ~]$ hadoop dfs -mkdir urls
- Create a text file with the seed URLs for the crawl (a minimal example is sketched below), then upload it to the directory created in the step above.
      [hadoop@netfox ~]$ hadoop dfs -put seed.txt urls
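 seed.txt is a plain text file with one URL per line; a minimal sketch (the URL is only a placeholder example):
      [hadoop@netfox ~]$ echo 'http://nutch.apache.org/' > seed.txt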
- Issue the following commands from inside the copied deploy directory on the
 JobTracker node to inject the seed URLs into the Nutch database and to generate the
 initial fetch list (-topN <N> selects the number of top URLs; the default is Long.MAX_VALUE)
      [hadoop@netfox ~]$ nutch inject urls
  [hadoop@netfox ~]$ nutch generate -topN 3
- Issue the following commands from inside the copied deploy directory on the
 JobTracker node (a sketch for repeating them follows below)
      [hadoop@netfox ~]$ nutch fetch -all
  [hadoop@netfox ~]$ nutch parse -all
  [hadoop@netfox ~]$ nutch updatedb
  [hadoop@netfox ~]$ nutch generate -topN 10
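 Each fetch/parse/updatedb/generate pass is one crawl round; to crawl deeper, simply repeat the block above. A minimal sketch (the round count of 3 is arbitrary):
      for i in 1 2 3; do
          nutch fetch -all
          nutch parse -all
          nutch updatedb
          nutch generate -topN 10
      done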
 
 
Install ElasticSearch
     - get the tar file
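 One way to fetch it; the historical download URL below is an assumption and may have moved, so verify before use:
      wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.19.4.tar.gz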
     
     
- untar file
      [hadoop@netfox ~]$ tar -vxf elasticsearch-0.19.4.tar.gz
- add to /etc/profile
 export ELAST_HOME=/home/hadoop/webcrawer/elasticsearch-0.19.4
 
 export PATH=$ELAST_HOME/bin:$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
 
- Go to the extracted ElasticSearch directory and execute the following command to
 start the ElasticSearch server in the foreground
      > bin/elasticsearch -f
- Go to the $NUTCH_HOME/runtime/deploy (or $NUTCH_HOME/runtime/local
 in case you are running Nutch in the local mode) directory. Execute the following
 command to index the data crawled by Nutch into the ElasticSearch server.
      > bin/nutch elasticindex elasticsearch -all
- install curl 
      [hadoop@netfox ~]$ sudo apt-get install curl
- check that the elasticsearch installation is correct
      [hadoop@netfox ~]$ curl master:9200
- check query 
      [hadoop@netfox ~]$ curl -XGET 'http://master:9200/_search?q=hadoop'
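 Adding pretty=true makes the JSON response readable; if the crawl found pages mentioning the term, hits should come back:
      [hadoop@netfox ~]$ curl -XGET 'http://master:9200/_search?q=hadoop&pretty=true'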