Introduction
There are multiple ways to install Hadoop on Windows, but most of them require installing a virtual machine or running Cloudera or HDP images in Docker containers. Although these methods are effective, they demand fairly powerful hardware. In this post, we have laid out a detailed step-by-step guide to set up and configure Hadoop on a lightweight Windows machine, along with a small demonstration of putting a local file into HDFS.
This post covers the steps to install Hadoop 2.9.1 on Windows 10 using its binaries. You can also install Hadoop from source, which requires building with Apache Maven and the Windows SDK. We'll cover that method in a future post.
Step by Step Guide
We are going to perform quite a few steps here, so I recommend setting aside some time and doing these steps patiently and carefully. There are many manual steps, and any miss can lead to a failure or a learning opportunity, depending on whether you see the glass as half full or half empty. Brace yourself!
Download Hadoop 2.9.1 binaries
To download the binaries, go to Apache.org and search for Hadoop 2.9.1 binaries or click here to go directly to the download page. You should get the hadoop-2.9.1.tar.gz file.
For plenty of obvious reasons, you may want to organize your installations properly. So, create a separate folder where you'll unpack the binaries. In this post, we'll create the folder C:\BigData\hadoop-2.9.1 and refer to it from here on, but you can choose whatever makes sense for you.
Don't use spaces in the folder names. If the path contains spaces, some of the variables will not expand properly.
Unpack the tar.gz into C:\BigData\.
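On recent Windows 10 builds, the built-in tar command can do the unpacking from a Command Prompt (run it from the folder where you downloaded the file); if your build doesn't have it, a tool like 7-Zip works just as well:
mkdir C:\BigData
tar -xzf hadoop-2.9.1.tar.gz -C C:\BigData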
Download Windows compatible binaries
Go to this GitHub repo and download the bin folder as a zip. Extract the zip and copy all the files under its bin folder to C:\BigData\hadoop-2.9.1\bin, replacing the existing files.
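To double-check that the Windows binaries landed correctly, you can run winutils.exe from its new location; it should print its usage text rather than a "not recognized" error:
C:\BigData\hadoop-2.9.1\bin\winutils.exe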
Create folders for datanode and namenode
Go to C:\BigData\hadoop-2.9.1 and create a folder named 'data'. Inside the 'data' folder, create two folders, 'datanode' and 'namenode'. Your files on HDFS will reside under the datanode folder, and you can create both folders from a Command Prompt, as shown below.
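Since mkdir creates the intermediate 'data' folder automatically, two commands are enough:
mkdir C:\BigData\hadoop-2.9.1\data\namenode
mkdir C:\BigData\hadoop-2.9.1\data\datanode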
Set Hadoop Environment Variables
Hadoop requires the following environment variables to be set.
- HADOOP_HOME="C:\BigData\hadoop-2.9.1"
- HADOOP_BIN="C:\BigData\hadoop-2.9.1\bin"
- JAVA_HOME=<Root of your JDK installation>
To set these variables, navigate to My Computer or This PC, then Right click -> Properties -> Advanced System settings -> Environment variables. Click New to create each variable.
If you don't have Java 1.8 installed, you'll need to download and install it first. If the JAVA_HOME environment variable is already set, check whether the path has any spaces in it (e.g., C:\Program Files\Java\...). Spaces in the JAVA_HOME path will lead to issues. There is a trick to get around this: replace 'Program Files' with 'Progra~1' in the variable value. Ensure that the Java version is 1.8 and that JAVA_HOME points to JDK 1.8.
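For example, with a hypothetical JDK installed under C:\Program Files\Java\jdk1.8.0_201 (your update number will differ), the space-free value would be:
JAVA_HOME=C:\Progra~1\Java\jdk1.8.0_201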
Edit PATH Environment Variable
Click on New and add %JAVA_HOME%, %HADOOP_HOME%, %HADOOP_BIN%, and %HADOOP_HOME%\sbin to your PATH, one entry at a time.
Now that we have set the environment variables, we need to validate them. Open a new Windows Command Prompt and run an echo on each variable to confirm it has the desired value.
echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %PATH%
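It's worth echoing %JAVA_HOME% as well, since a wrong JDK path is one of the most common causes of startup failures:
echo %JAVA_HOME%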
If the variables don't resolve yet, it is probably because you are testing them in an old session. Make sure you have opened a new Command Prompt to test them.
Configure Hadoop
Once the environment variables are set up, we need to configure Hadoop by editing the following configuration files.
- hadoop-env.cmd
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
Edit hadoop-env.cmd
First, let's configure the Hadoop environment file. Open C:\BigData\hadoop-2.9.1\etc\hadoop\hadoop-env.cmd and add the content below at the bottom:
set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
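If Hadoop later refuses to start with a "JAVA_HOME is incorrectly set" error, you can also hard-code the JDK path in this same file rather than relying on the environment variable, reusing the hypothetical space-free path from earlier:
set JAVA_HOME=C:\Progra~1\Java\jdk1.8.0_201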
Edit core-site.xml
Now, configure Hadoop Core's settings. Open C:\BigData\hadoop-2.9.1\etc\hadoop\core-site.xml and add the content below within the <configuration> </configuration> tags. (fs.default.name is the older, deprecated spelling of fs.defaultFS; both work in Hadoop 2.x.)
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:19000</value>
  </property>
</configuration>
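Once the hadoop binaries are on your PATH, you can sanity-check that this file is being picked up; hdfs getconf prints the effective value of a key without needing any daemons running:
hdfs getconf -confKey fs.default.name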
Edit hdfs-site.xml
After editing core-site.xml, you need to set the replication factor and the locations of the namenode and datanode directories. We use a replication factor of 1 because this single-node setup runs only one DataNode. Open C:\BigData\hadoop-2.9.1\etc\hadoop\hdfs-site.xml and add the content below within the <configuration> </configuration> tags.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\BigData\hadoop-2.9.1\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\BigData\hadoop-2.9.1\data\datanode</value>
  </property>
</configuration>
Edit mapred-site.xml
Finally, let's configure properties for the MapReduce framework. Open C:\BigData\hadoop-2.9.1\etc\hadoop\mapred-site.xml and add the content below within the <configuration> </configuration> tags. If you don't see mapred-site.xml, make a copy of mapred-site.xml.template and rename the copy to mapred-site.xml.
<configuration>
  <property>
    <name>mapreduce.job.user.name</name>
    <value>%USERNAME%</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.apps.stagingDir</name>
    <value>/user/%USERNAME%/staging</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>local</value>
  </property>
</configuration>
Check whether the file C:\BigData\hadoop-2.9.1\etc\hadoop\slaves exists. If it doesn't, create it, add localhost to it, and save it.
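One way to create it straight from a Command Prompt; note there is no space before the >, which would otherwise sneak into the hostname:
echo localhost>C:\BigData\hadoop-2.9.1\etc\hadoop\slaves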
Format Name Node
To format the Name Node, open a new Windows Command Prompt and run the command below. It may print some warnings; you can ignore them.
hadoop namenode -format
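Note that the hadoop namenode form is deprecated in Hadoop 2.x; the equivalent command below does the same job without the deprecation warning:
hdfs namenode -format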
Launch Hadoop
Luckily, the hard part of setting up and configuring Hadoop is over. Great! Now let's jump right into the good part: launching Hadoop on your machine.
Open a new Windows Command Prompt, making sure to run it as Administrator to avoid permission errors. Once it's open, execute the start-all.cmd command. Since we have added %HADOOP_HOME%\sbin to the PATH variable, you can run this command from any folder; if you haven't, go to the %HADOOP_HOME%\sbin folder and run the command from there.
It will open four new Command Prompt windows for the four daemon processes: namenode, datanode, nodemanager, and resourcemanager. Don't close these windows; minimize them, since closing them will terminate the daemons. You can run them in the background if you prefer.
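To verify that all four daemons actually came up, you can run the JDK's jps tool (it lives in %JAVA_HOME%\bin) from another prompt; it should list NameNode, DataNode, ResourceManager, and NodeManager, along with Jps itself:
jps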
Hadoop Web UI
Lastly, let's see how the Hadoop daemons are doing. Not to mention, you can use the Web UI for all kinds of administrative and monitoring purposes. Open your browser and get started.
Resource Manager
Go to localhost:8088 to open the Resource Manager UI.
Node Manager
Go to localhost:8042 to open the Node Manager UI.
Name Node
Go to localhost:50070 to check the health of the Name Node.
Data Node
Go to localhost:50075 to check the Data Node.
Finally, working with HDFS
Now we are going to put a small file into HDFS using the hdfs command-line tool. Not to mention, there are plenty of ways to bring data into HDFS; tools like Apache Sqoop, Flume, Kafka, and Spark are well known. If you want to try Apache Spark and read/write JSON or Parquet files, you can refer to this step-by-step guide: Getting started with Apache Spark.
Open a new Windows Command Prompt and run the commands below. I had already created a Sample.txt test file on my local file system.
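If you don't have a test file handy, you can create one from the same prompt; the contents are just placeholder text:
echo Hello HDFS> Sample.txt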
hdfs dfs -ls /
hdfs dfs -mkdir /test
hdfs dfs -copyFromLocal Sample.txt /test
hdfs dfs -ls /test
hdfs dfs -cat /test/Sample.txt
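As a side note, hdfs dfs -put does the same job as -copyFromLocal for local files, and you can remove the test folder once you are done:
hdfs dfs -rm -r /test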
Congratulations! You have successfully installed Hadoop!
There is a fair chance you ran into issues along the way; most likely it was due to a small miss or an incompatible software version. Please carefully validate all the steps again and verify you have the right software versions. There are a lot of manual steps involved, and it is pretty common to miss one or two. If you still can't get Hadoop up and running, please describe your issue in the comments below. Don't be shy! 🙂
Finally, don't forget to share this post with your friends and colleagues. Subscribe to Exit Condition and follow our social media pages to get regular updates. Thanks for stopping by. Happy Learning!