Hadoop setup

Step 1: Install Java, or validate an existing Java installation
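
A minimal check, and an install if Java is missing, might look like this (a sketch; the openjdk-7-jdk package is just one option, and the JAVA_HOME paths used later assume either java-7-openjdk or java-8-oracle):

$ java -version
$ javac -version
$ sudo apt-get update
$ sudo apt-get install openjdk-7-jdk   # only if no suitable JDK is already present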

Step 2: Create a dedicated Hadoop user and group on the system

sudo addgroup hadoop
[sudo] password for xxx:
Adding group `hadoop' (GID 1001) ...
Done.

sudo adduser --ingroup hadoop hduser
Adding user `hduser'
...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel'
...

Step 3: Check whether SSH is installed (which ssh, which sshd); if not, install it (sudo apt-get install ssh)
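
For example (a quick sketch; the paths shown are the usual Ubuntu locations and may differ on other systems):

$ which ssh
/usr/bin/ssh
$ which sshd
/usr/sbin/sshd
$ sudo apt-get install ssh   # only needed if the binaries above are missing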

Step 4: Create SSH keys

su - hduser
Password:
hduser@@@@-Lenovo-Z50-70:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
8c:0d:ca:68:6e:f0:91:f5:ba:f0:a3:d9:a5:b0:7f:71 hduser@@@@-Lenovo-Z50-70
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|                 |
|    . .          |
|   = o =         |
|. = o o S        |
| = . .. E        |
|  * . .o         |
| . B.+.          |
|  +o*o           |
+-----------------+

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
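
Depending on how sshd is configured (StrictModes is typically enabled by default), key-based login can be refused if these files are too permissive, so tightening permissions is a safe precaution:

$ chmod 0700 $HOME/.ssh
$ chmod 0600 $HOME/.ssh/authorized_keys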

Step 5: Check that passwordless SSH to localhost works.

$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 91:51:2b:33:cd:bc:65:45:ca:4c:e2:51:9d:1e:93:f2.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 15.04 (GNU/Linux 3.16.0-55-generic x86_64)
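
This opens a nested shell on localhost; type exit to return to the original hduser session before continuing.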

Step 6: Get the binaries from the Apache Hadoop site (as of 02/19 the current version is 2.7, but I got 2.6 for other reasons) and go through the setup process
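
The download itself is not shown in the listing below, which assumes the tarball is already in /usr/local; something along these lines would fetch it (the archive.apache.org URL is an assumption, any Apache mirror works):

$ sudo wget -P /usr/local https://archive.apache.org/dist/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz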

$ cd /usr/local
$ sudo tar xzf hadoop-2.6.4.tar.gz
$ sudo mv hadoop-2.6.4 hadoop
$ sudo chown -R hduser:hadoop hadoop

vi .bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
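# (added sketch) Reload .bashrc so the new variables take effect in the current shell:
$ source ~/.bashrc
# Note: JAVA_HOME above points at java-7-openjdk, while hadoop-env.sh below uses
# java-8-oracle (the JDK readlink reports for javac); both should point at the same JDK.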


$ which javac
/usr/bin/javac

$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-8-oracle/bin/javac

$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Update JAVA_HOME
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

Next, update core-site.xml.
The core-site.xml file tells the Hadoop daemons where the NameNode runs in the cluster; it contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

There are three HDFS properties whose default values are based on hadoop.tmp.dir:

1. dfs.name.dir: directory where the namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name.
2. dfs.data.dir: directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data.
3. fs.checkpoint.dir: directory where the secondary namenode stores its checkpoints, with default value ${hadoop.tmp.dir}/dfs/namesecondary.
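
For instance, with hadoop.tmp.dir set to /app/hadoop/tmp as above, those defaults would resolve to /app/hadoop/tmp/dfs/name, /app/hadoop/tmp/dfs/data and /app/hadoop/tmp/dfs/namesecondary; the first two are overridden explicitly in hdfs-site.xml later in this setup.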

<configuration>
   <property>
       <name>hadoop.tmp.dir</name>
       <value>/app/hadoop/tmp</value>
       <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
        <description>The name of the default file system.  A URI whose
        scheme and authority determine the FileSystem implementation.  The
        uri's scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class.  The uri's authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>

Next, edit mapred-site.xml, which specifies the framework used for MapReduce. Edit this XML file and add the following:
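
In the 2.x tarball this file usually ships only as mapred-site.xml.template, so it may need to be created from the template first (a sketch, assuming the default directory layout):

$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
$ vi /usr/local/hadoop/etc/hadoop/mapred-site.xml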

<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
 </property>
</configuration>

Next, edit hdfs-site.xml to specify the namenode and datanode directories. Create the two directories that will hold the namenode metadata and the datanode blocks for this Hadoop installation.

$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>

Next step: format the HDFS filesystem via the namenode.

hduser@@@@-Lenovo-Z50-70:~$ hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/02/21 09:47:52 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = @@@-Lenovo-Z50-70/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.4
STARTUP_MSG:   classpath = ...
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 5082c73637530b0b7e115f9625ed7fac69f937e6; compiled by 'jenkins' on 2016-02-12T09:45Z
STARTUP_MSG:   java = 1.8.0_66
************************************************************/
16/02/21 09:47:52 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
16/02/21 09:47:52 INFO namenode.NameNode: createNameNode [-format]
16/02/21 09:47:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-4ad4548d-6cb6-492c-9d38-ee5f665143c2
16/02/21 09:47:53 INFO namenode.FSNamesystem: No KeyProvider found.
16/02/21 09:47:53 INFO namenode.FSNamesystem: fsLock is fair:true
16/02/21 09:47:53 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
16/02/21 09:47:53 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
16/02/21 09:47:53 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
16/02/21 09:47:53 INFO blockmanagement.BlockManager: The block deletion will start around 2016 Feb 21 09:47:53
16/02/21 09:47:53 INFO util.GSet: Computing capacity for map BlocksMap
16/02/21 09:47:53 INFO util.GSet: VM type       = 64-bit
16/02/21 09:47:53 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
16/02/21 09:47:53 INFO util.GSet: capacity      = 2^21 = 2097152 entries
16/02/21 09:47:53 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
16/02/21 09:47:53 INFO blockmanagement.BlockManager: defaultReplication         = 1
16/02/21 09:47:53 INFO blockmanagement.BlockManager: maxReplication             = 512
16/02/21 09:47:53 INFO blockmanagement.BlockManager: minReplication             = 1
16/02/21 09:47:53 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
16/02/21 09:47:53 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
16/02/21 09:47:53 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
16/02/21 09:47:53 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
16/02/21 09:47:54 INFO namenode.FSNamesystem: fsOwner             = hduser (auth:SIMPLE)
16/02/21 09:47:54 INFO namenode.FSNamesystem: supergroup          = supergroup
16/02/21 09:47:54 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/02/21 09:47:54 INFO namenode.FSNamesystem: HA Enabled: false
16/02/21 09:47:54 INFO namenode.FSNamesystem: Append Enabled: true
16/02/21 09:47:54 INFO util.GSet: Computing capacity for map INodeMap
16/02/21 09:47:54 INFO util.GSet: VM type       = 64-bit
16/02/21 09:47:54 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
16/02/21 09:47:54 INFO util.GSet: capacity      = 2^20 = 1048576 entries
16/02/21 09:47:54 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/02/21 09:47:54 INFO util.GSet: Computing capacity for map cachedBlocks
16/02/21 09:47:54 INFO util.GSet: VM type       = 64-bit
16/02/21 09:47:54 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
16/02/21 09:47:54 INFO util.GSet: capacity      = 2^18 = 262144 entries
16/02/21 09:47:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
16/02/21 09:47:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
16/02/21 09:47:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
16/02/21 09:47:54 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
16/02/21 09:47:54 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
16/02/21 09:47:54 INFO util.GSet: Computing capacity for map NameNodeRetryCache
16/02/21 09:47:54 INFO util.GSet: VM type       = 64-bit
16/02/21 09:47:54 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/02/21 09:47:54 INFO util.GSet: capacity      = 2^15 = 32768 entries
16/02/21 09:47:54 INFO namenode.NNConf: ACLs enabled? false
16/02/21 09:47:54 INFO namenode.NNConf: XAttrs enabled? true
16/02/21 09:47:54 INFO namenode.NNConf: Maximum size of an xattr: 16384
16/02/21 09:47:54 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1807438273-127.0.1.1-1456069674855
16/02/21 09:47:55 INFO common.Storage: Storage directory /usr/local/hadoop_store/hdfs/namenode has been successfully formatted.
16/02/21 09:47:55 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/02/21 09:47:55 INFO util.ExitUtil: Exiting with status 0
16/02/21 09:47:55 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at @@@-Lenovo-Z50-70/127.0.1.1
************************************************************/
hduser@@@@-Lenovo-Z50-70:~$

Next, start Hadoop.

hduser@@@@-Lenovo-Z50-70:/usr/local/hadoop/sbin$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
16/02/21 10:23:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-@@@-Lenovo-Z50-70.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-@@@-Lenovo-Z50-70.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 91:51:2b:33:cd:bc:65:45:ca:4c:e2:51:9d:1e:93:f2.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-@@@-Lenovo-Z50-70.out
16/02/21 10:23:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-@@@-Lenovo-Z50-70.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-@@@-Lenovo-Z50-70.out

Check the running status with jps.

hduser@@@@-Lenovo-Z50-70:/usr/local/hadoop/sbin$ jps
6624 NodeManager
6016 NameNode
6162 DataNode
6503 ResourceManager
6345 SecondaryNameNode
6941 Jps

Access the NameNode web interface at http://localhost:50070.
