In this post, we will walk you through a step-by-step guide to installing Apache Spark on Windows and give you an overview of the Scala and PySpark shells. We’ll also write a small program to create RDDs, read and write JSON and Parquet files on the local file system as well as HDFS, and, last but not least, cover an introduction to the Spark Web UI.
In simple words, Apache Spark is an open-source cluster computing framework. Spark is primarily used for processing large volumes of data. As the industry moves away from traditional forms of ETL, Spark has proved to be an increasingly popular candidate for your data processing needs.
I recommend installing Hadoop on your machine before installing Spark. You can refer to this step-by-step guide to install Hadoop to get Hadoop up and running on your machine.
Install Apache Spark
Let’s get started and install Apache Spark on your machine. Follow the steps below.
Download Spark binaries
Spark binaries are available on Apache’s download page for free. You can simply google ‘download apache spark‘ and it should land you on the download page. Alternatively, you can click this download link to get there and follow the steps below.
- Select Spark Release as 2.3.1 or higher
- Select Package Type as ‘Pre-built for Apache Hadoop 2.7 or later‘
- Click on the download link – the link to the tgz file
- Click on the link of the very first mirror site (the one suggested by ASF). It will start the download of the tgz file.
You can validate the integrity of the downloaded file using the PGP signature (.asc file) or a hash (.md5 or .sha file). This validation is beyond the scope of this post, and it isn’t absolutely necessary either. But if you prefer to do it, the steps are mentioned on the download page itself.
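If you do want to check a hash, you don’t need any extra tooling; a few lines of Python (which you’ll have around for PySpark anyway) can compute it. This is just a sketch, and the archive name in the comment is an example, so substitute your actual download.

```python
import hashlib

def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the printed digest against the contents of the .sha file, e.g.:
# print(sha512_of("spark-2.3.1-bin-hadoop2.7.tgz"))
```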
Unpack Spark Binaries
You may want to structure your installations to avoid clutter on your system. It’s always better to keep related files in the same folder tree so you can locate things quickly later. The standard we follow throughout this site for tools related to the Big Data ecosystem is to create a BigData folder (without spaces) in your C: drive. But that’s just our recommendation; choose whatever makes sense for you. Once the parent (BigData) folder is created, copy the downloaded Spark binaries (the tgz file) into it.
Unpack the .tgz into the C:/BigData/spark-3.1.0 folder (matching the version you downloaded). If you don’t have software to unpack a tgz, you can download 7-Zip to do so. Notice that the file is a tgz, meaning the files are bundled in a tarball and then gzipped (.gz). So when you unzip it the first time, it will yield a tar file, and you’ll need to unpack that tar file as well to extract everything. Alternatively, you can install Cygwin with the standard tar package and run “tar -xzvf <file name>” from a Windows/Cygwin command prompt. Once the binaries are unpacked, you should see the files and folders below.
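As yet another alternative, Python’s standard library can unpack the .tgz (tar plus gzip) in a single step, with no 7-Zip or Cygwin needed. A small sketch; the paths in the comment are the hypothetical ones from this walkthrough:

```python
import tarfile

def unpack_tgz(archive_path, dest_dir):
    """Unpack a .tgz (a gzipped tarball) in one step."""
    with tarfile.open(archive_path, "r:gz") as tgz:
        tgz.extractall(dest_dir)

# e.g. unpack_tgz(r"C:\BigData\spark-3.1.0-bin-hadoop2.7.tgz", r"C:\BigData")
```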
Set Spark Environment Variables
Once you have the Spark binaries downloaded and unpacked, the next step is to set up environment variables. To create/edit environment variables in Windows, go to My Computer (This PC) -> Right click -> Properties -> Advanced System settings -> Environment variables. Click New to create a new environment variable.
Set SPARK_HOME environment variable
Create a new variable called SPARK_HOME and point it to the root of your Spark installation. In our case, that’s C:\BigData\spark-3.1.0.
Add Spark to PATH variable
You’ll want to invoke pyspark and spark-shell from anywhere in your terminal. To do that, you need to add Spark’s bin folder to the PATH environment variable. Locate the Path variable and click Edit. Add %SPARK_HOME%\bin to Path as shown below and click OK.
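Once the variables are in place (remember to open a fresh command prompt so it picks them up), you can sanity-check them. Below is a hedged sketch in Python; `spark_env_ok` is a hypothetical helper of ours, not part of Spark.

```python
import os

def spark_env_ok():
    """Rough check: SPARK_HOME is set and its bin folder is on PATH."""
    home = os.environ.get("SPARK_HOME")
    if not home:
        return False
    bin_dir = os.path.normcase(os.path.join(home, "bin"))
    return any(os.path.normcase(p) == bin_dir
               for p in os.environ.get("PATH", "").split(os.pathsep))
```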
Launch Scala and PySpark Shell
Spark can run locally as well as on a cluster. There are APIs for Spark development in languages like Scala, Python, Java, and R, and they all provide the same capabilities. The shell provides a very useful command-line interface to learn, explore, prototype, test, or just play around with data and the various functionalities of Spark. If you are new to Spark, the shell is going to be your best companion. Here’s how you can launch the spark-shell (Scala) or PySpark (Python) shells.
Launch Spark Shell (Scala)
Open Windows Command Prompt (Start -> Run -> Cmd). Type spark-shell and hit enter. You’ll be able to launch spark-shell from any location (any directory in your OS), since we have already added spark/bin to the Path.
spark-shell opens a Scala shell for Spark. You can run Spark commands in spark-shell, but with Scala semantics and syntax. To exit the shell, type ‘:quit‘ and hit enter.
Launch PySpark Shell
Open Windows Command Prompt (Start -> Run -> Cmd). Type pyspark and hit enter. You’ll be able to launch PySpark from any location (any directory in your OS), since we have already added spark/bin to the Path.
PySpark opens a Python shell for Spark. Similar to spark-shell, you can run Spark commands in PySpark, but with Python semantics and syntax. Needless to say, you can also run any Python commands in the PySpark shell. To exit PySpark, type ‘exit()‘ and hit enter.
Print SparkContext and Application Name
Let’s cut right to the chase and run the below commands in either the Scala shell or the PySpark shell and see what you get.
sc.version
sc.master
sc.appName
If you are able to run the above commands without any errors, then you are good so far. Below is the output from PySpark; you should get something similar.
What we did here is print the version of Spark you are using, along with a couple of properties set at the SparkContext level. If you are somewhat familiar with Spark, chances are you already know what SparkContext is. If you don’t, then for now all you need to know is that it’s the main entry point to your application, and it lets you set configuration parameters for the application. Typically, when you write a Spark program you’ll need to create a SparkContext explicitly, but on spark-shell or the PySpark shell, the shell creates it for you in a variable named ‘sc‘.
Create RDDs and Data Frames
RDD stands for Resilient Distributed Dataset. In simpler words, it is an in-memory data container. It is resilient because it is fault tolerant: Spark keeps track of the lineage of each RDD, so lost partitions can be recomputed rather than lost for good. It is distributed because it is spread across multiple nodes in multiple partitions. You can tune the number of partitions (and the storage level, including replication) according to your hardware capacity and application needs.
Here’s how we create an RDD out of a text file, print its content, and count the number of lines. Note that the folder separator in the file name has to be a forward slash ‘/’ and not the Windows-standard ‘\’.
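If you’re copying a path out of Windows Explorer, you can convert the backslashes programmatically; pathlib’s PureWindowsPath handles this on any OS. A small sketch using the example path from this post:

```python
from pathlib import PureWindowsPath

# Convert a native Windows path to the forward-slash form Spark expects.
win_path = r"C:\Users\User\Documents\Work\Data\Employees.txt"
spark_path = PureWindowsPath(win_path).as_posix()
# spark_path == "C:/Users/User/Documents/Work/Data/Employees.txt"
```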
Create RDDs using Scala
Run the below commands on the Scala shell to create an RDD.
val empRDD = sc.textFile("C:/Users/User/Documents/Work/Data/Employees.txt")
empRDD.count()
empRDD.collect()
Create RDDs using PySpark
Run the below commands in the PySpark shell to create an RDD.
empRDD = sc.textFile("C:/Users/User/Documents/Work/Data/Employees.txt")
empRDD.count()
empRDD.collect()
Create Data Frames
Now that you know how to create RDDs, let’s create a DataFrame. We’ll read a JSON file into a DataFrame and then write it to a Parquet file. Here’s how:
people = spark.read.json("C:/Users/User/Documents/Work/Data/People.json")
people.show()
people.write.parquet("C:/Users/User/Documents/Work/Data/People.parquet")
If you were able to read the JSON file and write it to a Parquet file successfully, you should see a Parquet folder created in your destination directory.
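One thing worth knowing: by default, spark.read.json expects newline-delimited JSON (“JSON Lines”), i.e. one complete object per line rather than one big JSON array. A stdlib sketch of what a compatible People.json could look like (the records are made-up sample data):

```python
import json
import os
import tempfile

records = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

# Write one JSON object per line: the "JSON Lines" layout Spark expects.
path = os.path.join(tempfile.mkdtemp(), "People.json")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading it back line by line mirrors how Spark parses each record.
with open(path) as f:
    parsed = [json.loads(line) for line in f]
```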
Read and Write files on HDFS
In the above examples, we read and wrote files on the local file system. But you can do the same things on HDFS: read from HDFS and write to HDFS, read from the local FS and write to HDFS, or vice versa. To access HDFS while reading or writing a file, you need to tweak your command slightly.
While specifying the location of source/destination, mention the HDFS namespace.
spark.read.json("hdfs://<Hadoop's default FS>/<Directory>/<file.json>")
If you don’t know what Hadoop’s default FS is, run the below command to query the Hadoop configuration.
hdfs getconf -confkey fs.defaultFS
Note that if you have configured Spark to read/write from HDFS by default, you’ll need to mention the local FS scheme, i.e. a path starting with file:///, to read or write files on your local file system.
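For example, to read the local People.json from the earlier examples while HDFS is the default FS, you’d point Spark at a file:// URI. pathlib can build one from a Windows path; a sketch, with the hypothetical path used above:

```python
from pathlib import PureWindowsPath

# Build a file:// URI so Spark reads from the local FS, not HDFS.
local = PureWindowsPath(r"C:\Users\User\Documents\Work\Data\People.json")
uri = local.as_uri()
# uri == "file:///C:/Users/User/Documents/Work/Data/People.json"
# Then, in the PySpark shell: spark.read.json(uri)
```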
Spark Web UI
The Web UI provides a web interface to monitor Spark jobs: evaluate the DAG (Directed Acyclic Graph), check how the job is divided into stages, see which parts run in parallel and which run sequentially, the number of cores being utilized, the environment variables set, etc. It gives you a pictorial as well as numeric view (lots of metrics in there) of your job’s life cycle.
To access the Web UI, open a browser of your choice and go to localhost:4040.
Web UI Ports
Port 4040 is the default port allocated to the Web UI; however, if you are running multiple shells, they will be assigned subsequent ports: 4041, 4042, etc. You can confirm the allotted port when launching the Scala or PySpark shell. Go back to the screenshot where we launched the Scala or PySpark shell above and read it carefully; you’ll find the assigned port there!
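If you’ve lost track of the shell output, you can also probe the ports directly. Here’s a hypothetical helper of ours (not part of Spark) that checks which of the usual UI ports are accepting connections:

```python
import socket

def open_ui_ports(start=4040, count=10, host="localhost"):
    """Return the ports in [start, start + count) that accept connections."""
    found = []
    for port in range(start, start + count):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.2)
            if s.connect_ex((host, port)) == 0:  # 0 means connected
                found.append(port)
    return found

# e.g. open_ui_ports() might return [4040, 4041] with two shells running
```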
Below is how the Web UI looks. Open it in your browser, go through all the tabs, menus, and links, and for that matter, click on whatever is clickable. You’ll get many insights from it. If you executed the commands to read/write files earlier and you haven’t closed your shells yet, you’ll see how those jobs were executed in the Web UI.
Finally, we hope you find this post useful. Ran into any issues? Want to suggest an edit? Want us to write about a specific topic? We’d love to hear from you. Please leave your feedback in the comments below. Seriously, don’t be shy. 🙂