Step 1: Install Java, or validate an existing Java install.
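Hadoop 2.6 needs a working JDK (Java 7 or 8). A minimal check-or-install sketch, assuming a stock Ubuntu box; the package name is an example and varies by release (this machine, as the later readlink check shows, actually runs Oracle Java 8):

# check whether a JDK is already present
java -version
javac -version

# if not, install one, e.g. the distribution OpenJDK package
sudo apt-get update
sudo apt-get install openjdk-7-jdk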
Step 2: Create user and group in the system
sudo addgroup hadoop
[sudo] password for xxx:
Adding group `hadoop' (GID 1001) ...
Done.
sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Step 3: Validate that SSH is installed (which ssh, which sshd); if not, install it (sudo apt-get install ssh), as sketched below.
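The commands from the step above, spelled out (standard Ubuntu packaging; the ssh metapackage pulls in both client and server):

# check for the SSH client and server binaries
which ssh
which sshd

# install both if either is missing
sudo apt-get install ssh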
Step 4: Create SSH keys for passwordless login.
su - hduser
Password:
hduser@@@@-Lenovo-Z50-70:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
8c:0d:ca:68:6e:f0:91:f5:ba:f0:a3:d9:a5:b0:7f:71 hduser@@@@-Lenovo-Z50-70
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|                 |
|   .   .         |
|    = o =        |
|   . = o o S     |
|    = . .. E     |
|     * .  .o     |
|    . B.+.       |
|     +o*o        |
+-----------------+
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Step 5: Check that SSH to localhost works.
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 91:51:2b:33:cd:bc:65:45:ca:4c:e2:51:9d:1e:93:f2.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 15.04 (GNU/Linux 3.16.0-55-generic x86_64)
Step 6: Get the binaries from the Hadoop site (as of 02/19 the current version is 2.7, but I went with 2.6 for other reasons) and go through the setup process.
$ cd /usr/local
$ sudo tar xzf hadoop-2.6.4.tar.gz
$ sudo mv hadoop-2.6.4 hadoop
$ sudo chown -R hduser:hadoop hadoop

# As hduser, edit ~/.bashrc and append the Hadoop environment variables.
# Make sure JAVA_HOME matches your actual Java install (see the readlink check below).
vi .bashrc

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

# Find where javac actually lives to determine the correct JAVA_HOME
$ which javac
/usr/bin/javac
$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-8-oracle/bin/javac

$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
# Update JAVA_HOME:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
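The new .bashrc variables only take effect in fresh shells. A quick way to pick them up in the current session and confirm the Hadoop binaries are on the PATH (a sketch; nothing here is Hadoop-specific beyond the version check):

# reload the environment for the current shell
source ~/.bashrc

# confirm the hadoop binary resolves and reports 2.6.4
hadoop version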
Next, update core-site.xml.
The core-site.xml file tells the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
There are three HDFS properties whose default values are based on hadoop.tmp.dir:
1. dfs.name.dir: directory where namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name.
2. dfs.data.dir: directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data.
3. fs.checkpoint.dir: directory where the secondary namenode stores its checkpoints, with default value ${hadoop.tmp.dir}/dfs/namesecondary.
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
Next, edit mapred-site.xml, which specifies the framework used for MapReduce. In the Hadoop 2.x distribution only a template of this file ships by default, so copy it first (see the commands just below) and then add the following:
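A likely copy step, assuming the standard /usr/local/hadoop layout used above:

$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
$ vi /usr/local/hadoop/etc/hadoop/mapred-site.xml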
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce task.
    </description>
  </property>
</configuration>
Next, edit hdfs-site.xml to specify the NameNode and DataNode directories. Create two directories that will hold the NameNode metadata and the DataNode blocks for this Hadoop installation.
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
@@@-Lenovo-Z50-70:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
@@@-Lenovo-Z50-70:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
Next step: format the HDFS filesystem via the NameNode.
hduser@@@@-Lenovo-Z50-70:~$ hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.
16/02/21 09:47:52 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = @@@-Lenovo-Z50-70/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.4
STARTUP_MSG:   classpath = ...
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 5082c73637530b0b7e115f9625ed7fac69f937e6; compiled by 'jenkins' on 2016-02-12T09:45Z
STARTUP_MSG:   java = 1.8.0_66
************************************************************/
16/02/21 09:47:52 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
16/02/21 09:47:52 INFO namenode.NameNode: createNameNode [-format]
16/02/21 09:47:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-4ad4548d-6cb6-492c-9d38-ee5f665143c2
16/02/21 09:47:53 INFO namenode.FSNamesystem: No KeyProvider found.
16/02/21 09:47:53 INFO namenode.FSNamesystem: fsLock is fair:true
16/02/21 09:47:53 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
16/02/21 09:47:53 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
16/02/21 09:47:53 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
16/02/21 09:47:53 INFO blockmanagement.BlockManager: The block deletion will start around 2016 Feb 21 09:47:53
16/02/21 09:47:53 INFO util.GSet: Computing capacity for map BlocksMap
16/02/21 09:47:53 INFO util.GSet: VM type       = 64-bit
16/02/21 09:47:53 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
16/02/21 09:47:53 INFO util.GSet: capacity      = 2^21 = 2097152 entries
16/02/21 09:47:53 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
16/02/21 09:47:53 INFO blockmanagement.BlockManager: defaultReplication         = 1
16/02/21 09:47:53 INFO blockmanagement.BlockManager: maxReplication             = 512
16/02/21 09:47:53 INFO blockmanagement.BlockManager: minReplication             = 1
16/02/21 09:47:53 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
16/02/21 09:47:53 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
16/02/21 09:47:53 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
16/02/21 09:47:53 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
16/02/21 09:47:54 INFO namenode.FSNamesystem: fsOwner             = hduser (auth:SIMPLE)
16/02/21 09:47:54 INFO namenode.FSNamesystem: supergroup          = supergroup
16/02/21 09:47:54 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/02/21 09:47:54 INFO namenode.FSNamesystem: HA Enabled: false
16/02/21 09:47:54 INFO namenode.FSNamesystem: Append Enabled: true
16/02/21 09:47:54 INFO util.GSet: Computing capacity for map INodeMap
16/02/21 09:47:54 INFO util.GSet: VM type       = 64-bit
16/02/21 09:47:54 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
16/02/21 09:47:54 INFO util.GSet: capacity      = 2^20 = 1048576 entries
16/02/21 09:47:54 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/02/21 09:47:54 INFO util.GSet: Computing capacity for map cachedBlocks
16/02/21 09:47:54 INFO util.GSet: VM type       = 64-bit
16/02/21 09:47:54 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
16/02/21 09:47:54 INFO util.GSet: capacity      = 2^18 = 262144 entries
16/02/21 09:47:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
16/02/21 09:47:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
16/02/21 09:47:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
16/02/21 09:47:54 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
16/02/21 09:47:54 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
16/02/21 09:47:54 INFO util.GSet: Computing capacity for map NameNodeRetryCache
16/02/21 09:47:54 INFO util.GSet: VM type       = 64-bit
16/02/21 09:47:54 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/02/21 09:47:54 INFO util.GSet: capacity      = 2^15 = 32768 entries
16/02/21 09:47:54 INFO namenode.NNConf: ACLs enabled? false
16/02/21 09:47:54 INFO namenode.NNConf: XAttrs enabled? true
16/02/21 09:47:54 INFO namenode.NNConf: Maximum size of an xattr: 16384
16/02/21 09:47:54 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1807438273-127.0.1.1-1456069674855
16/02/21 09:47:55 INFO common.Storage: Storage directory /usr/local/hadoop_store/hdfs/namenode has been successfully formatted.
16/02/21 09:47:55 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/02/21 09:47:55 INFO util.ExitUtil: Exiting with status 0
16/02/21 09:47:55 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at @@@-Lenovo-Z50-70/127.0.1.1
************************************************************/
hduser@@@@-Lenovo-Z50-70:~$
Next, start Hadoop.
hduser@@@@-Lenovo-Z50-70:/usr/local/hadoop/sbin$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
16/02/21 10:23:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-@@@-Lenovo-Z50-70.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-@@@-Lenovo-Z50-70.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 91:51:2b:33:cd:bc:65:45:ca:4c:e2:51:9d:1e:93:f2.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-@@@-Lenovo-Z50-70.out
16/02/21 10:23:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-@@@-Lenovo-Z50-70.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-@@@-Lenovo-Z50-70.out
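Since start-all.sh is flagged as deprecated in the output above, the equivalent per the script's own message is to start HDFS and YARN separately:

$ cd /usr/local/hadoop/sbin
$ ./start-dfs.sh
$ ./start-yarn.sh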
Check that the daemons are running with jps.
hduser@@@@-Lenovo-Z50-70:/usr/local/hadoop/sbin$ jps
6624 NodeManager
6016 NameNode
6162 DataNode
6503 ResourceManager
6345 SecondaryNameNode
6941 Jps
Access the NameNode web interface at http://localhost:50070.
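Besides the web UI, a quick command-line sanity check confirms HDFS accepts operations (a sketch; the /user/hduser directory name is just an example):

# create a home directory for hduser and list the HDFS root
$ hdfs dfs -mkdir -p /user/hduser
$ hdfs dfs -ls /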