paulwong

Install hadoop+hbase+nutch+elasticsearch

This document is for Anyela Chavarro.
Only these versions of each framework are known to work together:
  Hadoop 1.2.1
  Hbase 0.90.4
  Nutch 2.2.1
  Elasticsearch 0.19.4
Linux version: Ubuntu 12.04.2 LTS

Hadoop cluster environment:
Name node/Job tracker
192.168.1.100 master

Data node/Task tracker
192.168.1.101 slave1
192.168.1.102 slave2
192.168.1.103 slave3

Install Hadoop (pseudo-distributed mode)
  1. add user hadoop
    useradd  -s /bin/bash -d /home/hadoop -m hadoop
  2. set password
    passwd hadoop
  3. login as hadoop
    su hadoop
  4. add a data folder
    mkdir data
  5. uninstall OpenJDK (note: these rpm commands are for CentOS; on Ubuntu, use sudo apt-get remove openjdk-* instead)
    [hadoop@netfox ~]$ rpm -qa | grep java
    java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
    java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
    [hadoop@netfox ~]$ rpm -e --nodeps java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
    [hadoop@netfox ~]$ rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
  6. install JDK 1.6
    apt-get update
    apt-get install python-software-properties
    add-apt-repository ppa:webupd8team/java
    apt-get update
    apt-get install oracle-java6-installer
  7. get hadoop tar file
  8. untar the tar file
    [hadoop@netfox hadoop]$ tar -vxf hadoop-1.2.1.tar.gz
  9. install ssh-server
    apt-get install openssh-server
  10. set up the SSH key (ssh-keygen is a standard built-in tool on Linux)
    [hadoop@netfox hadoop]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  11. append the public key to the authorized keys file
    [hadoop@netfox hadoop]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  12. restrict permissions on the authorized_keys file
    [hadoop@netfox hadoop]$ chmod 600 ~/.ssh/authorized_keys
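Before moving on, it is worth confirming that key-based login actually works; this is a quick sketch using BatchMode, which makes ssh fail instead of falling back to a password prompt:

```shell
# If the key steps above succeeded, this prints "ssh-ok" with no
# password prompt; if it errors, re-check the ssh-keygen/chmod steps.
ssh -o BatchMode=yes localhost 'echo ssh-ok'
```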
  13. find the IP of the local machine
    [hadoop@netfox hadoop]$ ifconfig
    the ip can be found in this string:
    inet addr:192.168.1.100
  14. add to /etc/hosts; this entry should be the first line.
    [hadoop@netfox hadoop]$ vi /etc/hosts
    192.168.1.100 master
  15. add to /etc/profile
    export JAVA_HOME=/usr/lib/jvm/java-6-oracle
    export HADOOP_HOME=/home/hadoop/hadoop-1.2.1
    export HBASE_HOME=/home/hadoop/hbase-0.90.4
    export PATH=$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
  16. source it
    [hadoop@netfox hadoop]$ source /etc/profile
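A quick way to confirm the exports took effect after sourcing the profile (paths here are the ones set in step 15):

```shell
# JAVA_HOME should resolve, and the hadoop/hbase bin directories should
# appear at the front of PATH.
echo "$JAVA_HOME"
echo "$PATH" | tr ':' '\n' | head -4
```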
  17. create folder
    hadoop@netfox:~$ mkdir /home/hadoop/data
  18. edit /home/hadoop/hadoop-1.2.1/conf/hdfs-site.xml as below
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

    <property>
      
    <name>dfs.replication</name>
      
    <value>1</value>
      
    <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      
    </description>
    </property>

    <property>
     
    <name>dfs.permissions</name>
     
    <value>false</value>
    </property>

    </configuration>
  19. edit /home/hadoop/hadoop-1.2.1/conf/mapred-site.xml as below
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

    <property>
      
    <name>mapred.job.tracker</name>
      
    <value>master:9002</value>
      
    <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.
      
    </description>
    </property>


    </configuration>
  20. edit /home/hadoop/hadoop-1.2.1/conf/core-site.xml as below
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

    <property>
      
    <name>hadoop.tmp.dir</name>
      
    <value>/home/hadoop/data</value>
      
    <description>A base for other temporary directories.</description>
    </property>
     
    <property>
      
    <name>fs.default.name</name>
      
    <value>hdfs://master:9001</value>
      
    <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.
    </description>
    </property>


    </configuration>
  21. add to /home/hadoop/hadoop-1.2.1/conf/hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-6-oracle
  22. add to /home/hadoop/hadoop-1.2.1/conf/slaves and masters
    master
  23. format the Hadoop namenode
    [hadoop@netfox ~]$ hadoop namenode -format
  24. start hadoop
    [hadoop@netfox hadoop]$ start-all.sh 
  25. check that Hadoop installed correctly
    [hadoop@netfox hadoop]$ hadoop dfs -ls /
    for example, it should show output like the following, with no error messages:
    Found 4 items
    drwxr-xr-x   - hadoop supergroup          0 2013-08-28 14:02 /chukwa
    drwxr-xr-x   - hadoop supergroup          0 2013-08-29 09:53 /hbase
    drwxr-xr-x   - hadoop supergroup          0 2013-08-27 10:36 /opt
    drwxr-xr-x   - hadoop supergroup          0 2013-09-01 15:22 /tmp
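Another way to verify the startup is to check that all five Hadoop 1.x daemons are running on the single node; a minimal sketch (jps ships with the JDK):

```shell
# In pseudo-distributed mode, NameNode, SecondaryNameNode, DataNode,
# JobTracker and TaskTracker all run on this host; report any missing.
for d in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
    jps | grep -q "$d" && echo "$d OK" || echo "$d MISSING"
done
```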

Install Hadoop (fully-distributed mode)
repeat steps 1-23 on slave1-3, but some steps will be different:
  1. change steps 10-11 as below:
    don't generate a key pair on the slave; instead, copy the public key from the master to each slave (and append it to ~/.ssh/authorized_keys there).
    [hadoop@netfox hadoop]$ scp ~/.ssh/id_dsa.pub hadoop@slave1:/home/hadoop
  2. change step 14 as below:
    add to host
    [hadoop@netfox hadoop]$ vi /etc/hosts
    192.168.1.100 master
    192.168.1.101 slave1
    192.168.1.102 slave2
    192.168.1.103 slave3
  3. step 22, add to /home/hadoop/hadoop-1.2.1/conf/masters
    master
    add to /home/hadoop/hadoop-1.2.1/conf/slaves
    slave1
    slave2
    slave3
  4. step 24, start hadoop only on the master
    [hadoop@netfox hadoop]$ start-all.sh 
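The key distribution in item 1 above can be done for all three slaves in one pass; a sketch (hostnames are the ones added to /etc/hosts, and the first ssh/scp to each slave will still ask for the hadoop password):

```shell
# Push the master's public key to every slave and append it to that
# slave's authorized_keys, with the correct permissions.
for h in slave1 slave2 slave3; do
    scp ~/.ssh/id_dsa.pub hadoop@$h:/home/hadoop/
    ssh hadoop@$h 'mkdir -p ~/.ssh && cat ~/id_dsa.pub >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
done
```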


Install Hbase
  1. get hbase tar file
  2. untar the file
    [hadoop@netfox ~]$ tar -vxf hbase-0.90.4.tar.gz
  3. change /home/hadoop/hbase-0.90.4/conf/hbase-site.xml as below
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
    /**
     * Copyright 2010 The Apache Software Foundation
     *
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    -->
    <configuration>

      
    <property>
        
    <name>hbase.rootdir</name>
        
    <value>hdfs://master:9001/hbase</value>
      
    </property>

      
    <property>
        
    <name>hbase.cluster.distributed</name>
        
    <value>true</value>
      
    </property>

      
    <property>
        
    <name>hbase.zookeeper.quorum</name>
        
    <value>localhost</value>
      
    </property>

    </configuration>
  4. change /home/hadoop/hbase-0.90.4/conf/regionservers as below
    master
  5. add JAVA_HOME to /home/hadoop/hbase-0.90.4/conf/hbase-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-6-oracle
  6. replace with the new hadoop jar
    [hadoop@netfox ~]$ rm /home/hadoop/hbase-0.90.4/lib/hadoop-core-0.20-append-r1056497.jar
    [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/hadoop-core-1.2.1.jar /home/hadoop/hbase-0.90.4/lib
    [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-collections-3.2.1.jar /home/hadoop/hbase-0.90.4/lib
    [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-configuration-1.6.jar /home/hadoop/hbase-0.90.4/lib
  7. start HBase
    [hadoop@netfox ~]$ start-hbase.sh  
  8. check that HBase installed correctly
    [hadoop@netfox ~]$ hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Type "exit<RETURN>" to leave the HBase Shell
    Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

    hbase(main):001:0> list
    TABLE
    webpage
    1 row(s) in 0.5270 seconds
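Beyond `list`, a quick end-to-end smoke test can be piped into the shell non-interactively; the table and column family names below are just examples:

```shell
# Create a throwaway table, write one cell, read it back, then drop it.
hbase shell <<'EOF'
create 'smoketest', 'cf'
put 'smoketest', 'row1', 'cf:greeting', 'hello'
get 'smoketest', 'row1'
disable 'smoketest'
drop 'smoketest'
exit
EOF
```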


Install Nutch
  1. install ant
    [root@netfox ~]# apt-get install ant
  2. switch user and folder
    [root@netfox ~]# su hadoop          
    [hadoop@netfox root]$ cd ~
  3. get nutch tar file
  4. untar this file
    [hadoop@netfox webcrawer]$ tar -vxf apache-nutch-2.2.1-src.tar.gz
  5. add to /etc/profile
    export NUTCH_HOME=/home/hadoop/webcrawer/apache-nutch-2.2.1
    export PATH=$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
  6. change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/hbase-site.xml as below
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
    /**
     * Copyright 2009 The Apache Software Foundation
     *
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    -->
    <configuration>

      
    <property>
        
    <name>hbase.rootdir</name>
        
    <value>hdfs://master:9001/hbase</value>
      
    </property>

      
    <property>
        
    <name>hbase.cluster.distributed</name>
        
    <value>true</value>
      
    </property>

      
    <property>
        
    <name>hbase.zookeeper.quorum</name>
        
    <value>localhost</value>
      
    </property>

    </configuration>
  7. change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/nutch-site.xml as below
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

        
    <property>
            
    <name>storage.data.store.class</name>
            
    <value>org.apache.gora.hbase.store.HBaseStore</value>
            
    <description>Default class for storing data</description>
        
    </property>
        
    <property>
            
    <name>http.agent.name</name>
            
    <value>NutchCrawler</value>
        
    </property>
        
    <property>
            
    <name>http.robots.agents</name>
            
    <value>NutchCrawler,*</value>
        
    </property>

    </configuration>
  8. Uncomment the following in the /home/hadoop/webcrawer/apache-nutch-2.2.1/ivy/ivy.xml file
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
  9. add to /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/gora.properties file
    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  10. go to nutch installation folder(/home/hadoop/webcrawer/apache-nutch-2.2.1) and run
    ant clean
    ant runtime
  11. Create a directory in HDFS to upload the seed urls.
    [hadoop@netfox ~]$ hadoop dfs -mkdir urls
  12. Create a text file with the seed URLs for the crawl, then upload it to the directory created in the previous step
    [hadoop@netfox ~]$ hadoop dfs -put seed.txt urls
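Step 12 assumes a seed.txt already exists; a minimal way to create one (the URLs below are placeholders, use your own crawl targets):

```shell
# seed.txt is a plain text file with one URL per line.
printf 'http://nutch.apache.org/\nhttp://hadoop.apache.org/\n' > seed.txt
hadoop dfs -put seed.txt urls
```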
  13. Issue the following command from inside the copied deploy directory on the
    JobTracker node to inject the seed URLs into the Nutch database and to generate the
    initial fetch list (-topN <N>: number of top URLs to be selected; default is Long.MAX_VALUE)
    [hadoop@netfox ~]$ nutch inject urls
    [hadoop@netfox ~]$ nutch generate  -topN 3
  14. Issue the following commands from inside the copied deploy directory on the
    JobTracker node
    [hadoop@netfox ~]$ nutch fetch -all
    [hadoop@netfox ~]$ nutch parse -all
    [hadoop@netfox ~]$ nutch updatedb
    [hadoop@netfox ~]$ nutch generate -topN 10
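The generate/fetch/parse/updatedb cycle in step 14 is normally repeated for several rounds so the crawl discovers new links; a minimal driver loop (3 rounds and -topN 10 are arbitrary choices, not values from this setup):

```shell
# Each round fetches the URLs selected in the previous generate phase,
# parses them, and folds newly discovered links back into the database.
for round in 1 2 3; do
    echo "--- crawl round $round ---"
    nutch generate -topN 10
    nutch fetch -all
    nutch parse -all
    nutch updatedb
done
```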



Install ElasticSearch
  1. get the tar file
  2. untar file
    [hadoop@netfox ~]$ tar -vxf elasticsearch-0.19.4.tar.gz
  3. add to /etc/profile
    export ELAST_HOME=/home/hadoop/webcrawer/elasticsearch-0.19.4

    export PATH=$ELAST_HOME/bin:$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
  4. Go to the extracted ElasticSearch directory and execute the following command to
    start the ElasticSearch server in the foreground
    > bin/elasticsearch -f
  5. Go to the $NUTCH_HOME/runtime/deploy directory (or $NUTCH_HOME/runtime/local
    if you are running Nutch in local mode). Execute the following
    command to index the data crawled by Nutch into the ElasticSearch server.
    > bin/nutch elasticindex elasticsearch -all
  6. install curl 
    [hadoop@netfox ~]$ sudo apt-get install curl
  7. check that the Elasticsearch installation is correct
    [hadoop@netfox ~]$ curl master:9200
  8. check query 
    [hadoop@netfox ~]$ curl -XGET 'http://master:9200/_search?q=hadoop'
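The query in step 8 can also be written as a request-body search; both forms below are templates (the fields actually available depend on what elasticindex wrote):

```shell
# URI search and the equivalent query_string body search; pretty=true
# makes the JSON response readable.
curl -s 'http://master:9200/_search?q=hadoop&pretty=true'
curl -s -XGET 'http://master:9200/_search?pretty=true' \
  -d '{"query":{"query_string":{"query":"hadoop"}}}'
```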

posted on 2013-08-31 01:17 paulwong (categories: Distributed HADOOP, Cloud Computing, Distributed Search)

Feedback

# re: Install hadoop+hbase+nutch+elasticsearch 2013-09-23 14:19 ap

Nutch 2.2.1 supports HBase 0.90.4 and Elasticsearch 0.19.4 by default. Is there a way to make it work with Elasticsearch 0.90.x or later? (I tried replacing the elasticsearch-0.19.4 jar in the Nutch 2.2.1 lib directory with an elasticsearch-0.90.x jar, but `nutch elasticindex` then throws errors.)
Nutch 1.7 supports Elasticsearch 0.90.1 by default.

# re: Install hadoop+hbase+nutch+elasticsearch 2013-09-24 18:27 paulwong

@ap
I tried swapping in versions above 0.90 and it didn't work.
Nutch 1.7 doesn't integrate with HBase, so I didn't try it.

# re: Install hadoop+hbase+nutch+elasticsearch 2013-09-25 15:34 ap

@paulwong
Indeed, I couldn't find any HBase jars in the Nutch 1.7 lib directory; it would be great if it were integrated. Thanks.


