This document is for Anyela Chavarro.
Only these versions of each framework work together:
 Hadoop 1.2.1
 HBase 0.90.4
 Nutch 2.2.1
 Elasticsearch 0.19.4
 Ubuntu 12.04.2 LTS
Hadoop cluster environment:
Name node/Job tracker
192.168.1.100 master
Data node/Task tracker
192.168.1.101 slave1
192.168.1.102 slave2
192.168.1.103 slave3
Install Hadoop (pseudo-distributed mode)
     - add user hadoop
      useradd -s /bin/bash -d /home/hadoop -m hadoop
- set password
      passwd hadoop
- login as hadoop
      su hadoop
- add a data folder
      mkdir data
- uninstall OpenJDK (CentOS only; skip this on Ubuntu)
 [hadoop@netfox ~]$ rpm -qa | grep java
 java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
 java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
 [hadoop@netfox ~]$ rpm -e --nodeps java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
 [hadoop@netfox ~]$ rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
 
- install JDK 1.6
 apt-get update
 apt-get install python-software-properties
 add-apt-repository ppa:webupd8team/java
 apt-get update
 apt-get install oracle-java6-installer
 
- get hadoop tar file
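 One way to fetch it, assuming the Apache archive still hosts this release (the URL is an assumption; verify before use):
      wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz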
 
- untar tar file
      [hadoop@netfox hadoop]$ tar -vxf hadoop-1.2.1.tar.gz
- install ssh-server
 apt-get install openssh-server 
- set up an SSH key (ssh-keygen is the built-in tool on Linux)
      [hadoop@netfox hadoop]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
- append the public key to the authorized_keys file
      [hadoop@netfox hadoop]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
- restrict the permissions on the authorized_keys file
      [hadoop@netfox hadoop]$ chmod 600 ~/.ssh/authorized_keys
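 To confirm the key works, ssh to localhost; it should log in without prompting for a password (accept the host fingerprint on first connect):
      [hadoop@netfox hadoop]$ ssh localhost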
- find the IP of the local machine
 [hadoop@netfox hadoop]$ ifconfig
 The IP can be found in a line like:
 inet addr:192.168.1.100
- add to /etc/hosts; this entry should be the first line of the file.
  [hadoop@netfox hadoop]$ vi /etc/hosts
  192.168.1.100 master
- add to /etc/profile
      export JAVA_HOME=/usr/lib/jvm/java-6-oracle
      export HADOOP_HOME=/home/hadoop/hadoop-1.2.1
      export HBASE_HOME=/home/hadoop/hbase-0.90.4
      export PATH=$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
- source it
      [hadoop@netfox hadoop]$ source /etc/profile
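 A quick sanity check that the variables took effect:
      [hadoop@netfox hadoop]$ echo $JAVA_HOME
      [hadoop@netfox hadoop]$ hadoop version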
- create folder
      hadoop@netfox:~$ mkdir /home/hadoop/data
- edit /home/hadoop/hadoop-1.2.1/conf/hdfs-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <!-- Put site-specific property overrides in this file. -->

  <configuration>

  <property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
  </property>

  <property>
  <name>dfs.permissions</name>
  <value>false</value>
  </property>

  </configuration>
- edit /home/hadoop/hadoop-1.2.1/conf/mapred-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <!-- Put site-specific property overrides in this file. -->

  <configuration>

  <property>
  <name>mapred.job.tracker</name>
  <value>master:9002</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
  </property>

  </configuration>
- edit /home/hadoop/hadoop-1.2.1/conf/core-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <!-- Put site-specific property overrides in this file. -->

  <configuration>

  <property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/data</value>
  <description>A base for other temporary directories.</description>
  </property>

  <property>
  <name>fs.default.name</name>
  <value>hdfs://master:9001</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
  </property>

  </configuration>
- add to /home/hadoop/hadoop-1.2.1/conf/hadoop-env.sh
      export JAVA_HOME=/usr/lib/jvm/java-6-oracle
- add to /home/hadoop/hadoop-1.2.1/conf/slaves and masters
      master
- format the hadoop namenode
      [hadoop@netfox ~]$ hadoop namenode -format
- start hadoop
      [hadoop@netfox hadoop]$ start-all.sh
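 Once start-all.sh completes, jps (shipped with the JDK) should list all five daemons: NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker. If one is missing, check the logs under $HADOOP_HOME/logs.
      [hadoop@netfox hadoop]$ jps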
- check that hadoop installed correctly
      [hadoop@netfox hadoop]$ hadoop dfs -ls /
 Found 4 items
 drwxr-xr-x   - hadoop supergroup          0 2013-08-28 14:02 /chukwa
 drwxr-xr-x   - hadoop supergroup          0 2013-08-29 09:53 /hbase
 drwxr-xr-x   - hadoop supergroup          0 2013-08-27 10:36 /opt
 drwxr-xr-x   - hadoop supergroup          0 2013-09-01 15:22 /tmp
 
 
Install Hadoop(fully-distributed mode)
repeat steps 1-23 on slave1-3, but some steps differ:
     - change step 9 as below:
 don't generate a new key on the slaves; just transfer the public key from the master to each slave.
 [hadoop@netfox hadoop]$ scp ~/.ssh/id_dsa.pub hadoop@slave1:/home/hadoop
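 The copied key still has to be appended to authorized_keys on every slave; a sketch of that follow-up, run as hadoop on each slave and assuming the key landed in /home/hadoop as above:
      mkdir -p ~/.ssh
      cat /home/hadoop/id_dsa.pub >> ~/.ssh/authorized_keys
      chmod 600 ~/.ssh/authorized_keys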
 
 
- change step 12 as below:
 add to /etc/hosts
 [hadoop@netfox hadoop]$ vi /etc/hosts
 192.168.1.100 master
 192.168.1.101 slave1
 192.168.1.102 slave2
 192.168.1.103 slave3
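 From the master, each slave should now be reachable by name and log in without a password, e.g.:
  [hadoop@netfox hadoop]$ ssh slave1 hostname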
 
- step 20, add to /home/hadoop/hadoop-1.2.1/conf/masters
      master
 add to /home/hadoop/hadoop-1.2.1/conf/slaves
      slave1
      slave2
      slave3
 
- step 22, start hadoop only on master
 [hadoop@netfox hadoop]$ start-all.sh  
Install HBase
     - get hbase tar file
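 One way to fetch it, assuming the Apache archive still hosts this release (the URL is an assumption; verify before use):
      wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz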
     
     
- untar the file
      [hadoop@netfox ~]$ tar -vxf hbase-0.90.4.tar.gz
- change /home/hadoop/hbase-0.90.4/conf/hbase-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
  /**
   * Copyright 2010 The Apache Software Foundation
   *
   * Licensed to the Apache Software Foundation (ASF) under one
   * or more contributor license agreements.  See the NOTICE file
   * distributed with this work for additional information
   * regarding copyright ownership.  The ASF licenses this file
   * to you under the Apache License, Version 2.0 (the
   * "License"); you may not use this file except in compliance
   * with the License.  You may obtain a copy of the License at
   *
   *     http://www.apache.org/licenses/LICENSE-2.0
   *
   * Unless required by applicable law or agreed to in writing, software
   * distributed under the License is distributed on an "AS IS" BASIS,
   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   * See the License for the specific language governing permissions and
   * limitations under the License.
   */
  -->
  <configuration>

  <property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:9001/hbase</value>
  </property>

  <property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  </property>

  <property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
  </property>

  </configuration>
- change /home/hadoop/hbase-0.90.4/conf/regionservers as below
      master
- add JAVA_HOME to /home/hadoop/hbase-0.90.4/conf/hbase-env.sh
      export JAVA_HOME=/usr/lib/jvm/java-6-oracle
- replace with the new hadoop jar
      [hadoop@netfox ~]$ rm /home/hadoop/hbase-0.90.4/lib/hadoop-core-0.20-append-r1056497.jar
  [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/hadoop-core-1.2.1.jar /home/hadoop/hbase-0.90.4/lib
  [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-collections-3.2.1.jar /home/hadoop/hbase-0.90.4/lib
  [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-configuration-1.6.jar /home/hadoop/hbase-0.90.4/lib
- start hbase
      [hadoop@netfox ~]$ start-hbase.sh
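 Because hbase.cluster.distributed is true and the quorum is localhost, HBase manages its own ZooKeeper, so jps should now additionally show HMaster, HRegionServer and HQuorumPeer. Note that hbase.rootdir (hdfs://master:9001/hbase) must match the fs.default.name URI set in core-site.xml.
      [hadoop@netfox ~]$ jps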
- check that hbase installed correctly
      [hadoop@netfox ~]$ hbase shell
  HBase Shell; enter 'help<RETURN>' for list of supported commands.
  Type "exit<RETURN>" to leave the HBase Shell
  Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

  hbase(main):001:0> list
  TABLE
  webpage
  1 row(s) in 0.5270 seconds
 
Install Nutch
     - install ant
      [root@netfox ~]# apt-get install ant
- switch user and folder
      [root@netfox ~]# su hadoop
  [hadoop@netfox root]$ cd ~
- get nutch tar file
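 One way to fetch it, assuming the Apache archive still hosts this release (the URL is an assumption; verify before use):
      wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz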
     
     
- untar this file
      [hadoop@netfox webcrawer]$ tar -vxf apache-nutch-2.2.1-src.tar.gz
- add to /etc/profile
 export NUTCH_HOME=/home/hadoop/webcrawer/apache-nutch-2.2.1
 export PATH=$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
 
- change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/hbase-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
  /**
   * Copyright 2009 The Apache Software Foundation
   *
   * Licensed to the Apache Software Foundation (ASF) under one
   * or more contributor license agreements.  See the NOTICE file
   * distributed with this work for additional information
   * regarding copyright ownership.  The ASF licenses this file
   * to you under the Apache License, Version 2.0 (the
   * "License"); you may not use this file except in compliance
   * with the License.  You may obtain a copy of the License at
   *
   *     http://www.apache.org/licenses/LICENSE-2.0
   *
   * Unless required by applicable law or agreed to in writing, software
   * distributed under the License is distributed on an "AS IS" BASIS,
   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   * See the License for the specific language governing permissions and
   * limitations under the License.
   */
  -->
  <configuration>

  <property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:9001/hbase</value>
  </property>

  <property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  </property>

  <property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
  </property>

  </configuration>
- change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/nutch-site.xml as below
      <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <!-- Put site-specific property overrides in this file. -->

  <configuration>

  <property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
  </property>
  <property>
  <name>http.agent.name</name>
  <value>NutchCrawler</value>
  </property>
  <property>
  <name>http.robots.agents</name>
  <value>NutchCrawler,*</value>
  </property>

  </configuration>
- Uncomment the following in the /home/hadoop/webcrawer/apache-nutch-2.2.1/ivy/ivy.xml file   
      <dependency org="org.apache.gora" name="gora-hbase" rev="0.2"
  conf="*->default" />
- add to /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/gora.properties file
      gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
- go to the nutch installation folder (/home/hadoop/webcrawer/apache-nutch-2.2.1) and run
      ant clean
  ant runtime
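 The build should leave a runtime/ directory containing deploy (the job jar and scripts for running on Hadoop) and local (for standalone runs); a quick check:
      [hadoop@netfox apache-nutch-2.2.1]$ ls runtime/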
- Create a directory in HDFS to upload the seed urls.
      [hadoop@netfox ~]$ hadoop dfs -mkdir urls
- Create a text file with the seed URLs for the crawl (a minimal example is sketched below), then upload it to the directory created in the step above.
      [hadoop@netfox ~]$ hadoop dfs -put seed.txt urls
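 seed.txt is a plain text file with one URL per line; a minimal sketch (the URL is only a placeholder example):
      [hadoop@netfox ~]$ echo 'http://nutch.apache.org/' > seed.txt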
- Issue the following commands from inside the copied deploy directory on the
 JobTracker node to inject the seed URLs into the Nutch database and to generate the
 initial fetch list (-topN <N> selects the number of top URLs; the default is Long.MAX_VALUE)
      [hadoop@netfox ~]$ nutch inject urls
  [hadoop@netfox ~]$ nutch generate -topN 3
- Issue the following commands from inside the copied deploy directory on the
 JobTracker node (a sketch for repeating them follows below)
      [hadoop@netfox ~]$ nutch fetch -all
  [hadoop@netfox ~]$ nutch parse -all
  [hadoop@netfox ~]$ nutch updatedb
  [hadoop@netfox ~]$ nutch generate -topN 10
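 Each fetch/parse/updatedb/generate pass is one crawl round; to crawl deeper, simply repeat the block above. A minimal sketch (the round count of 3 is arbitrary):
      for i in 1 2 3; do
          nutch fetch -all
          nutch parse -all
          nutch updatedb
          nutch generate -topN 10
      done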
 
 
Install ElasticSearch
     - get the tar file
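 One way to fetch it; the historical download URL below is an assumption and may have moved, so verify before use:
      wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.19.4.tar.gz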
     
     
- untar file
      [hadoop@netfox ~]$ tar -vxf elasticsearch-0.19.4.tar.gz
- add to /etc/profile
 export ELAST_HOME=/home/hadoop/webcrawer/elasticsearch-0.19.4
 
 export PATH=$ELAST_HOME/bin:$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
 
- Go to the extracted ElasticSearch directory and execute the following command to
 start the ElasticSearch server in the foreground
      > bin/elasticsearch -f
- Go to the $NUTCH_HOME/runtime/deploy (or $NUTCH_HOME/runtime/local
 in case you are running Nutch in the local mode) directory. Execute the following
 command to index the data crawled by Nutch into the ElasticSearch server.
      > bin/nutch elasticindex elasticsearch -all
- install curl 
      [hadoop@netfox ~]$ sudo apt-get install curl
- check that the elasticsearch installation is correct
      [hadoop@netfox ~]$ curl master:9200
- check query 
      [hadoop@netfox ~]$ curl -XGET 'http://master:9200/_search?q=hadoop'
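 Adding pretty=true makes the JSON response readable; if the crawl found pages mentioning the term, hits should come back:
      [hadoop@netfox ~]$ curl -XGET 'http://master:9200/_search?q=hadoop&pretty=true'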