How to Install Hadoop on Windows

Introduction

There are multiple ways to install Hadoop on Windows, but most of them require a virtual machine or Docker containers running Cloudera or HDP images. Although these methods are effective, they demand fairly powerful hardware. In this post, we have laid out a detailed step-by-step guide to set up and configure Hadoop on a lightweight Windows machine, along with a small demonstration of putting a local file into HDFS.

This post covers the steps to install Hadoop 2.9.1 on Windows 10 using its binaries. You can also install Hadoop on your local machine from its source code, which requires building the source with Apache Maven and the Windows SDK. We'll cover that method in a future post.

Step by Step Guide

We are going to perform quite a few steps here, so I recommend setting aside some time and going through them patiently and carefully. There are many manual steps, and any miss can lead to a failure or a learning opportunity, depending on whether you see the glass as half full or half empty. Brace yourself!

Download Hadoop 2.9.1 binaries

To download the binaries, go to apache.org and search for the Hadoop 2.9.1 binaries, or go directly to the Apache Hadoop releases download page. You should get the hadoop-2.9.1.tar.gz file.

For plenty of obvious reasons, you may want to organize your installations properly. So, create a separate folder where you'll unpack the binaries. In this post, we'll create the 'C:\BigData\hadoop-2.9.1' folder and refer to it from here on, but you can choose whatever makes sense for you.

Don't use spaces in the folder names. If the path contains spaces, some of the variables will not expand properly.

Unpack the tar.gz into the C:\BigData\hadoop-2.9.1 folder. If you don't have software to unpack a tar.gz, you can download 7-Zip to do so. Note that some standard unzip software may throw a 'Path too long' error. One way to get around those errors is to install Cygwin with the standard tar package and then run tar from a Windows/Cygwin command prompt, as shown below. Once the binaries are unpacked, you should see the files and folders below.
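A minimal sketch of the Cygwin route, assuming Cygwin's bin folder is on your PATH and the archive sits in your current directory (adjust the file and target paths to match your setup):

tar -xvf hadoop-2.9.1.tar.gz -C /cygdrive/c/BigData

The -C flag tells tar where to extract; Cygwin sees C:\BigData as /cygdrive/c/BigData, so this unpacks the hadoop-2.9.1 folder directly into C:\BigData.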

Hadoop Folder Structure

Download Windows compatible binaries

Go to this GitHub repo and download the bin folder as a zip, as shown below. Extract the zip and copy all the files under its bin folder to C:\BigData\hadoop-2.9.1\bin, replacing the existing files.

GitHub Repository

Create folders for datanode and namenode

Go to C:\BigData\hadoop-2.9.1 and create a folder named 'data'. Inside the 'data' folder, create two folders, 'datanode' and 'namenode'. Your files on HDFS will reside under the datanode folder.
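If you prefer the command line, here is a quick sketch that creates the same folders from a Command Prompt (mkdir creates the intermediate folders automatically):

mkdir C:\BigData\hadoop-2.9.1\data\namenode
mkdir C:\BigData\hadoop-2.9.1\data\datanode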

Hadoop Namenode and Datanode

Set Hadoop Environment Variables

Hadoop requires the following environment variables to be set.

  • HADOOP_HOME=C:\BigData\hadoop-2.9.1
  • HADOOP_BIN=C:\BigData\hadoop-2.9.1\bin
  • JAVA_HOME=<Root of your JDK installation>

To set these variables, navigate to My Computer or This PC, then right-click -> Properties -> Advanced system settings -> Environment Variables. Click New to create each variable.
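Alternatively, you can set them from a Command Prompt with the setx command; a sketch assuming the install path used in this post (setx writes user-level variables that take effect only in newly opened prompts):

setx HADOOP_HOME "C:\BigData\hadoop-2.9.1"
setx HADOOP_BIN "C:\BigData\hadoop-2.9.1\bin"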

Set Windows Environment Variables

If you don't have Java 1.8 installed, you'll need to download and install it first. If the JAVA_HOME environment variable is already set, check whether the path has any spaces in it (e.g., C:\Program Files\Java\...). Spaces in the JAVA_HOME path will lead to issues. There is a trick to get around it: replace 'Program Files' with 'Progra~1' in the variable value. Ensure that the version of Java is 1.8 and that JAVA_HOME points to JDK 1.8.
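For example, here is the Progra~1 trick applied via setx; the JDK folder name below is purely illustrative, so substitute the actual directory your JDK 1.8 installed into:

setx JAVA_HOME "C:\Progra~1\Java\jdk1.8.0_201"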

Set Hadoop Environment Variables

Edit PATH Environment Variable

Edit Windows PATH Variable

Click on New and add %JAVA_HOME%, %HADOOP_HOME%, %HADOOP_BIN%, and %HADOOP_HOME%\sbin to your PATH one by one.

Set Windows PATH Variable

Now that we have set the environment variables, we need to validate them. Open a new Windows Command Prompt and run the echo command on each variable to confirm they are assigned the desired values.

echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %JAVA_HOME%
echo %PATH%

If the variables are not resolving, it is probably because you are testing them in an old session. Make sure you have opened a new Command Prompt to test them.
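As an extra sanity check, assuming the PATH entries above took effect, you can verify that Windows can locate the Hadoop launcher and that it runs against your JDK:

where hadoop
hadoop version

where should print the path to hadoop.cmd under your bin folder, and hadoop version should report 2.9.1; if either fails, revisit the PATH and JAVA_HOME steps.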

Configure Hadoop

Once the environment variables are set up, we need to configure Hadoop by editing the following configuration files.

  • hadoop-env.cmd
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml

Edit hadoop-env.cmd

First, let's configure the Hadoop environment file. Open C:\BigData\hadoop-2.9.1\etc\hadoop\hadoop-env.cmd and add the below content at the bottom.

set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
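If your JAVA_HOME contains spaces and the Progra~1 trick isn't an option, hadoop-env.cmd is also a reasonable place to pin JAVA_HOME explicitly; a sketch with an illustrative JDK path (the file already contains a set JAVA_HOME line near the top, so edit that rather than adding a duplicate):

set JAVA_HOME=C:\Progra~1\Java\jdk1.8.0_201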

Edit core-site.xml

Now, configure the Hadoop Core settings. Open C:\BigData\hadoop-2.9.1\etc\hadoop\core-site.xml and add the below content within the <configuration> </configuration> tags.

<configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://0.0.0.0:19000</value>
   </property> 
</configuration>
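As a side note, fs.default.name is the older key for this setting; it still works on 2.9.1 but is deprecated in favor of fs.defaultFS, so you may use the newer key with the same value instead:

<property>
   <name>fs.defaultFS</name>
   <value>hdfs://0.0.0.0:19000</value>
</property>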

Edit hdfs-site.xml

After editing core-site.xml, you need to set the replication factor and the locations of the namenode and datanode. Open C:\BigData\hadoop-2.9.1\etc\hadoop\hdfs-site.xml and add the below content within the <configuration> </configuration> tags.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>C:\BigData\hadoop-2.9.1\data\namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>C:\BigData\hadoop-2.9.1\data\datanode</value>
   </property>
</configuration>

Edit mapred-site.xml

Finally, let's configure properties for the MapReduce framework. Open C:\BigData\hadoop-2.9.1\etc\hadoop\mapred-site.xml and add the below content within the <configuration> </configuration> tags. If you don't see mapred-site.xml, rename the mapred-site.xml.template file to mapred-site.xml.

<configuration>
   <property>
      <name>mapreduce.job.user.name</name>
      <value>%USERNAME%</value>
   </property>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
   <property>
      <name>yarn.apps.stagingDir</name>
      <value>/user/%USERNAME%/staging</value>
   </property>
   <property>
      <name>mapreduce.jobtracker.address</name>
      <value>local</value>
   </property>
</configuration>

Check whether the C:\BigData\hadoop-2.9.1\etc\hadoop\slaves file is present. If it's not, create one, add localhost to it, and save it.
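A one-line sketch to create that file from a Command Prompt, assuming HADOOP_HOME is set as above (the redirection is written without a space before > so no trailing blank sneaks into the hostname):

echo localhost>"%HADOOP_HOME%\etc\hadoop\slaves"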

Format Name Node

To format the Name Node, open a new Windows Command Prompt and run the below command. It may print some warnings; ignore them. Note that hadoop namenode -format is the older, deprecated form of hdfs namenode -format; either works on 2.9.1.

hadoop namenode -format
Format Hadoop Name Node

Launch Hadoop

Luckily, the hard part, setting up and configuring Hadoop, is over. Now let's jump right into the good part: launching Hadoop on your machine.

Open a new Windows Command Prompt and make sure to run it as Administrator to avoid permission errors. Once it's open, execute the start-all.cmd command. Since we have added %HADOOP_HOME%\sbin to the PATH variable, you can run this command from any folder. If you haven't done so, go to the %HADOOP_HOME%\sbin folder and run the command from there.
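If you'd rather bring the services up one at a time, the same sbin folder also ships separate scripts for HDFS and YARN:

start-dfs.cmd
start-yarn.cmd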

Start Hadoop Daemons

It will open four new Command Prompt windows for the four daemon processes, namely namenode, datanode, nodemanager, and resourcemanager. Don't close these windows; minimize them instead, because closing them will terminate the daemons. You can run them in the background if you don't want to keep these windows around.
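To confirm that all four daemons are actually running, you can use the JDK's jps tool from any new Command Prompt; it lists the running Java processes, and you should see entries named NameNode, DataNode, ResourceManager, and NodeManager alongside their process IDs:

jps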

Start Hadoop Daemons

Hadoop Web UI

Lastly, let's check how the Hadoop daemons are doing. Not to mention, you can use the Web UI for all kinds of administrative and monitoring purposes. Open your browser and get started.

Resource Manager

Open localhost:8088 to open the Resource Manager UI.

Hadoop Resource Manager Web UI

Node Manager

Open localhost:8042 to open the Node Manager UI.

Hadoop Node Manager Web UI

Name Node

Open localhost:50070 to check the health of the Name Node.

Hadoop Name Node Web UI

Data Node

Open localhost:50075 to check the Data Node.

Hadoop Data Node Web UI

Finally, working with HDFS

Finally, we are going to put a small file into HDFS using the hdfs command-line tool. Not to mention, there are plenty of other ways to bring data into HDFS; tools like Apache Sqoop, Flume, Kafka, and Spark are well known. If you want to try Apache Spark and read/write JSON or Parquet files, you can refer to this step-by-step guide: Getting started with Apache Spark.

Open a new Windows Command Prompt and run the below commands. I had already created a Sample.txt test file in my local file system.

hdfs dfs -ls /
hdfs dfs -mkdir /test
hdfs dfs -copyFromLocal Sample.txt /test
hdfs dfs -ls /test
hdfs dfs -cat /test/Sample.txt
HDFS Basic Commands
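When you're done experimenting, you can shut the daemons down cleanly from the same sbin folder; stop-all.cmd is the counterpart of start-all.cmd (stop-dfs.cmd and stop-yarn.cmd exist as well if you started the services separately):

stop-all.cmd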

Congratulations! You have successfully installed Hadoop!

Although there is a fair chance that you ran into issues along the way, most likely it would be due to a small miss or an incompatible software version. Carefully validate all the steps again and verify you have the right software versions. There are a lot of manual steps involved, and it is pretty common to miss one or two. If you still can't get Hadoop up and running, please describe your issue in the comments below. Don't be shy! 🙂

Finally, don't forget to share this post with your friends and colleagues. Subscribe to Exit Condition and follow our social media pages to get regular updates. Thanks for stopping by. Happy learning!