
Wednesday, June 21, 2023

Apache Spark Installation on Azure Ubuntu VM

Installation of Apache Spark on Ubuntu VM Server on Azure

Create VMs

The first step is to create the VMs on Azure; I am not going into the details of that here. Once done, make a note of the user ID and password.

Installation of Putty

To log in to the VM, we need to install PuTTY for Windows. I used the link below:

http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html

After installation, open PuTTY and enter the public IP address of the VM. A terminal window will open and prompt for the username and password. I have not set up passwordless authentication on the VMs yet.


Download Spark into desired folder

First, create the folder into which the Spark binary distribution will be downloaded:

login as: plasparkadmin
plasparkadmin@13.65.203.139's password:
plasparkadmin@13.65.203.139's password:
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-72-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Tue Apr 11 18:55:11 UTC 2017

  System load: 0.24              Memory usage: 0%   Processes:       87
  Usage of /:  40.9% of 1.94GB   Swap usage:   0%   Users logged in: 0

  Graph this data and manage this system at:
    https://landscape.canonical.com/

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

0 packages can be updated.
0 updates are security updates.

Your Hardware Enablement Stack (HWE) is supported until April 2019.


The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

plasparkadmin@pal-dev-vm-01:~$ cd ..
plasparkadmin@pal-dev-vm-01:/home$ ls
plasparkadmin
plasparkadmin@pal-dev-vm-01:/home$ cd plasparkadmin/
plasparkadmin@pal-dev-vm-01:~$ mkdir work
plasparkadmin@pal-dev-vm-01:~$ ls
work
plasparkadmin@pal-dev-vm-01:~$ chmod 777 -R work
plasparkadmin@pal-dev-vm-01:~$ ls
work
plasparkadmin@pal-dev-vm-01:~$ cd work
plasparkadmin@pal-dev-vm-01:~/work$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
--2017-04-12 16:55:57--  http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
Resolving d3kbcqa49mib13.cloudfront.net (d3kbcqa49mib13.cloudfront.net)... 54.230.5.90, 54.230.5.79, 54.230.5.12, ...
Connecting to d3kbcqa49mib13.cloudfront.net (d3kbcqa49mib13.cloudfront.net)|54.230.5.90|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 195636829 (187M) [application/x-tar]
Saving to: ‘spark-2.1.0-bin-hadoop2.7.tgz’

100%[======================================>] 195,636,829 26.2MB/s   in 7.7s

2017-04-12 16:56:05 (24.1 MB/s) - ‘spark-2.1.0-bin-hadoop2.7.tgz’ saved [195636829/195636829]

Now extract the downloaded file as follows

plasparkadmin@pal-dev-vm-01:~/work$ ls
spark-2.1.0-bin-hadoop2.7.tgz
plasparkadmin@pal-dev-vm-01:~/work$ tar -xvf spark-2.1.0-bin-hadoop2.7.tgz

Download Java

Go to the Oracle site and download JDK 8. In my case I downloaded it directly from the SSH session:

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u121-b13/e9e7ea248e2c4826b92b3f075a80e441/jdk-8u121-linux-x64.tar.gz"

The above command downloads the JDK 8u121 tarball (it does not install anything by itself).
Extract it with the command: tar xzf jdk-8u121-linux-x64.tar.gz

Setting Environment Paths

We need to set up JAVA_HOME, SPARK_HOME and HADOOP_HOME from the command line as below.

plasparkadmin@pal-dev-vm-01:~$ export JAVA_HOME=/home/plasparkadmin/work/jdk1.8.0_121
plasparkadmin@pal-dev-vm-01:~$ export PATH=$JAVA_HOME/bin:$PATH
plasparkadmin@pal-dev-vm-01:~$ echo $JAVA_HOME
/home/plasparkadmin/work/jdk1.8.0_121
plasparkadmin@pal-dev-vm-01:~$ export SPARK_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ export PATH=$SPARK_HOME/bin:$PATH
plasparkadmin@pal-dev-vm-01:~$ echo $SPARK_HOME
/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ export HADOOP_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ echo $HADOOP_HOME
/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
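As a quick sanity check (not part of the original transcript), you can confirm that the exported paths are being picked up:

java -version
spark-submit --version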

It is prudent to add the above exports to ~/.profile (or ~/.bashrc) so that they persist across sessions.

Command: vi ~/.profile
After the file opens, move to the end of the file and press i; this puts vi into insert mode.

Type the export lines in and press ESC when done. This switches the editor from insert mode back to normal mode.

To save and exit, type :x (or :wq)
To quit without saving, type :q!
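
For reference, the lines to append are the same exports used above (a sketch reusing the paths from this install; adjust them to your own locations):

export JAVA_HOME=/home/plasparkadmin/work/jdk1.8.0_121
export SPARK_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
export HADOOP_HOME=$SPARK_HOME
export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH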

Install Python Tools 

To make Python development interactive, we will install some Python development tools such as IPython and Jupyter Notebook. The site below has detailed instructions on how to get that done, and I was able to follow them with ease:

https://www.digitalocean.com/community/tutorials/how-to-set-up-a-jupyter-notebook-to-run-ipython-on-ubuntu-16-04

The main commands are listed below. First, update the Ubuntu package index from the terminal with:

$ sudo apt-get update

Please note that on the first run of the above command I hit an issue, with an error like the one below:

E: Could not get lock /var/lib/dpkg/lock - open (11 Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/) is another process using it?  

I was able to resolve the error with the following steps, copied from Stack Overflow: remove the /var/lib/dpkg/lock file and force package reconfiguration.

sudo rm /var/lib/dpkg/lock
sudo dpkg --configure -a

apt-get should work after this.

Once the package index is updated, install pip and the Python development headers as follows:

$ sudo apt-get -y install python-pip python-dev

Once done, you can check the installed versions and locations:

$pip --version
pip 8.1.1 from /usr/lib/python2.7/dist-packages (python 2.7)

$python --version
Python 2.7.12

$whereis python2.7
python2: /usr/bin/python2.7-config /usr/bin/python2 /usr/bin/python2.7 /usr/lib/python2.7 /etc/python2.7 /usr/local/lib/python2.7 /usr/include/python2.7 /usr/share/man/man1/python2.1.gz

Once done, we can proceed with the installation of IPython and the notebook.

Since we are working over SSH, the notebook UI will not open on the VM itself, but here are the installation steps (see the SSH tunnel sketch at the end of this section for one way to reach the notebook from a local browser).

$ sudo apt-get -y install ipython ipython-notebook

Now we can move on to installing Jupyter Notebook:
  • sudo -H pip install jupyter
Depending on what version of pip is in the Ubuntu apt-get repository, you might get the following error when trying to install Jupyter:
You are using pip version 8.1.1, however version 8.1.2 is available. You should consider upgrading via the 'pip install --upgrade pip' command.

If so, you can use pip to upgrade pip to the latest version:
  • sudo -H pip install --upgrade pip
Upgrade pip, and then try installing Jupyter again:
  • sudo -H pip install jupyter
Try it out by typing:

$ jupyter notebook

Installation done
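
Since the notebook runs on a headless VM, one common way to reach it from a local browser is an SSH tunnel. A minimal sketch, assuming the notebook's default port 8888 and the VM details used earlier (PuTTY's tunnel settings can achieve the same on Windows):

# on the VM: start the notebook without trying to open a browser
jupyter notebook --no-browser --port=8888

# on your local machine: forward local port 8888 to the VM, then browse to http://localhost:8888
ssh -N -L 8888:localhost:8888 plasparkadmin@13.65.203.139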

Setting up worker node/cluster

Log in to the worker VM and install the OpenSSH server so that the worker can be logged into remotely from the master node. Open a terminal on the worker node and type the following installation command:

# On Worker nodes, we install SSH Server so that we can access this node from Master node
sudo apt-get install openssh-server

After installing the SSH server on the worker node, generate an RSA key on the master so that the master can access the worker node without being asked for a password.

# On Master node, we generate a rsa key for remote access
ssh-keygen

After running the command, just keep pressing Enter to accept the defaults (and an empty passphrase) until the key is generated.

kghosh@DVY1L32-ubuntu-1:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/kghosh/.ssh/id_rsa):
Created directory '/home/kghosh/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/kghosh/.ssh/id_rsa.
Your public key has been saved in /home/kghosh/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:B8GWGJ+vSRKSr6L+Y+QeBvGr5Rljie4Crr+Lztc4csM kghosh@DVY1L32-ubuntu-1
The key's randomart image is:
+---[RSA 2048]----+
|      .+..       |
|     ...+o       |
| .  o ..+        |
|  o  o . o       |
| . .  o S o      |
|. o.o. o +       |
|o.=X+   o        |
|+=*E=.           |
|XO@B+            |
+----[SHA256]-----+

The terminal prints the files where the key pair was saved, as in the output above.

We now need to copy the public key to each of the worker machines; the generic command for that is below.
# To access Worker nodes via SSH without providing a password (just using our rsa key), we need to copy our public key to each Worker node
ssh-copy-id -i ~/.ssh/id_rsa.pub <username_on_remote_machine>@<IP_address_of_that_remote_machine>



You can get the IP address of the remote machine by typing ifconfig in its terminal. So in our case the command is as follows:

ssh-copy-id -i ~/.ssh/id_rsa.pub kghosh@192.168.1.164

Output-
kghosh@DVY1L32-ubuntu-1:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub kghosh@192.168.1.164
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/kghosh/.ssh/id_rsa.pub"
The authenticity of host '192.168.1.164 (192.168.1.164)' can't be established.
ECDSA key fingerprint is SHA256:jpsnVaUNquTNRqiuKqLGGyR3AYTp/tqIneCJf5ZWcDI.
Are you sure you want to continue connecting (yes/no)? y
Please type 'yes' or 'no': yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
kghosh@192.168.1.164's password:

Number of key(s) added: 1

Alternatively, if ssh-copy-id is not available, you can copy the public key manually using
scp <source> <destination>
To copy the key from the master to the slave while logged in to the master:
scp .ssh/id_rsa.pub kghosh@192.168.1.164:/home/kaushik/.ssh/id_rsa.pub
and then, on the slave machine, run:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 644 ~/.ssh/authorized_keys
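
To confirm that passwordless login works, SSH from the master to the worker; it should log you in without prompting for a password:

ssh kghosh@192.168.1.164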

Spark Configuration 

Go to the conf folder of Spark on the master machine, copy the file slaves.template to slaves, and specify the IP address of each slave machine in it (see the example after the transcript below).

plasparkadmin@pal-dev-vm-01:~/work$ cd spark-2.1.0-bin-hadoop2.7/
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ls
bin   data      jars     licenses  python  README.md  sbin
conf  examples  LICENSE  NOTICE    R       RELEASE    yarn
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ cd conf/
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ ls
docker.properties.template  metrics.properties.template   spark-env.sh.template
fairscheduler.xml.template  slaves.template
log4j.properties.template   spark-defaults.conf.template
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ cp slaves.template slaves
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ ls
docker.properties.template   slaves
fairscheduler.xml.template   slaves.template
log4j.properties.template    spark-defaults.conf.template
metrics.properties.template  spark-env.sh.template
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ vi slaves
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ vi slaves
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$
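For reference, the resulting conf/slaves file simply lists the worker machines, one hostname or IP address per line; the placeholder below stands in for your actual worker address:

# one worker host per line
<IP_address_or_hostname_of_worker>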

Worker Configurations

For each worker, copy the JDK and Spark binary folders recursively from the master using the scp command:

plasparkadmin@pal-dev-vm-01:~/work$ scp -r jdk1.8.0_121/ plasparkadmin@13.85.14.23:/home/plasparkadmin/work/

plasparkadmin@pal-dev-vm-01:~/work$ scp -r spark-2.1.0-bin-hadoop2.7/ plasparkadmin@13.85.14.23:/home/plasparkadmin/work/

Once the copy is done, set up JAVA_HOME, SPARK_HOME and HADOOP_HOME on the worker just as we did on the master, either directly in the terminal or in ~/.profile, using the same export lines shown earlier.
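
The job submissions later in this post point at spark://10.0.0.4:7077, which assumes the standalone master and workers are running. A minimal sketch of starting them from the master with Spark's bundled scripts, relying on the passwordless SSH and conf/slaves setup above (run from $SPARK_HOME):

# start the standalone master (web UI on port 8080, master URL spark://<master_ip>:7077 by default)
./sbin/start-master.sh

# start a worker on every host listed in conf/slaves
./sbin/start-slaves.sh

# stop the whole cluster later with ./sbin/stop-all.sh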

Python Dev Tools on the Worker

Install the Python tools on the worker as well, so that PySpark jobs can run on it.

$ sudo apt-get update
$ sudo apt-get -y install python-pip python-dev

Installing Remote Desktop to Master and Minimum GUI

Most of the steps are copied from the link in the reference. Below are the steps I followed to enable remote desktop and load the Spark master UI.

The first thing to enable is the port for Remote Desktop Connection on Azure. Log in to the Azure portal and, in the network settings of the master VM, open port TCP 3389.

Log in to the master node using PuTTY; first install a minimal desktop environment (xfce) that works over RDP, then xrdp itself.

Install xfce:

sudo apt-get install xubuntu-desktop

Then enable xfce:

echo xfce4-session >~/.xsession

Install xrdp:

sudo apt-get install xrdp

Edit the config file /etc/xrdp/startwm.sh:

sudo vi /etc/xrdp/startwm.sh

Add the line xfce4-session before the line /etc/X11/Xsession.

Restart the xrdp service:

sudo service xrdp restart

Connect your Linux VM from a Windows machine

On a Windows machine, start the remote desktop client (Remote Desktop Connection) and enter your Linux VM's DNS name, or go to the dashboard of your VM in the Azure portal and click Connect. You will see the xrdp login window.
Log in with the username and password of your Linux VM, and you now have a remote desktop session on your Azure Linux VM.

Sample Python Application and PySpark Submission

SSH into the master and navigate to the Spark home directory. Once there, create a project folder; in our case it is called kaushikpla. On the terminal type:

plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ mkdir kaushikpla
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ vi kaushikpla/plaPythonApp.py

Copy and paste the PySpark code below from your local computer:

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

def main():
    sc = SparkContext()
    sqlContext = SQLContext(sc)

    # Read the JSON event log from Azure Blob Storage (the wasb:// scheme needs the
    # hadoop-azure and azure-storage jars on the classpath, see spark-defaults.conf later in this post)
    eventPath = "wasb://sparkjob@pladevstorage.blob.core.windows.net/input/events.log"
    eventsJson = sqlContext.read.json(eventPath)

    # Count events per (EventDateTime, EventTypeName, PN) and write the result back
    # to Blob Storage as a single CSV file and a single JSON file
    resultDF = eventsJson.groupBy(['EventDateTime', 'EventTypeName', 'PN']).count()
    resultDF.coalesce(1).write.format('csv').options(header='true').save('wasb://sparkjob@pladevstorage.blob.core.windows.net/outputvmcsv')
    resultDF.coalesce(1).write.format('json').save('wasb://sparkjob@pladevstorage.blob.core.windows.net/outputvmjson')

if __name__ == "__main__":
    main()

Once that is done, go back to the spark home directory and submit the job with the below command.

plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ./bin/spark-submit --master spark://10.0.0.4:7077 kaushikpla/plaPythonApp.py

Install Microsoft SQL JDBC Driver

In order to connect to SQL Server, we need the Microsoft SQL Server JDBC driver, so download the latest driver from the link below:

https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Download the file sqljdbc_6.0.8112.100_enu.tar.gz and unpack it using the following commands:

plasparkadmin@pal-dev-vm-01:~/work$ gzip -d sqljdbc_6.0.8112.100_enu.tar.gz
plasparkadmin@pal-dev-vm-01:~/work$ tar -xf sqljdbc_6.0.8112.100_enu.tar
 
Once it is extracted, we add it to Spark's classpath and write a sample application to test the connection.
To add the driver to the classpath, I added the jar file to $SPARK_HOME/conf/spark-defaults.conf.

If there are already entries in spark.jars, add a comma and then append the entry below:
/home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar

My final spark-defaults.conf looked like this:
spark.jars=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7/lib/hadoop-azure-2.7.0.jar,/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7/lib/azure-storage-2.0.0.jar,/home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar

scp the sqljdbc folder to the workers, add the same classpath entry on each worker, and restart the cluster.

Testing Connection to SQL Server

Write the sample application below, save it, and run it with the following command:

plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ./bin/spark-submit --driver-class-path /home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar --master spark://10.0.0.4:7077 kaushikpla/plaRollupToSQL.py

Note that I added the driver class path to the execution command because Spark could not find the driver without it.

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext

def main():
    conf = SparkConf().setAppName("data_import")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Read a table over JDBC; replace server, database, table and credentials with your own values
    jdbcDF = (sqlContext.read.format("jdbc")
              .option("url", "jdbc:sqlserver://svr.database.windows.net;databaseName=dbname")
              .option("dbtable", "dbo.mytable")
              .option("user", "username")
              .option("password", "password")
              .load())

    # Display the first 10 rows of the DataFrame on stdout
    jdbcDF.show(10)

if __name__ == "__main__":
    main()

Writing to Azure SQL Database

Just as you can read from Azure SQL, you can also write to Azure SQL directly from a DataFrame; sample code is below.

from pyspark.sql import functions as func
from pyspark.sql.functions import lit

# Assumes `edf` is an existing events DataFrame and `eventDate` is a date string for the run
viewpath = "wasb://ap1rpt@dawstopla.blob.core.windows.net/events_dt.csv"
df = edf.select('ET', 'EventCount').groupBy('ET').agg(func.sum("EventCount")).withColumnRenamed('sum(EventCount)', 'EventCount')
df.withColumn('EventDate', lit(eventDate)).coalesce(1).write.format("csv").mode("overwrite").options(header='true').save(viewpath)
df.withColumn('EventDate', lit(eventDate)).write.jdbc(url="jdbc:sqlserver://server.database.windows.net;databaseName=mydb", table="dbo.mytable", mode="append", properties={"user": "username", "password": "password"})

*For the above to work, please make sure the selected DataFrame columns and the column names in the database table are the same. mode="append" means the rows will be appended to the data already in the table.

Python Module for MSSQL Client

There is a Python module for an MS SQL client: http://www.pymssql.org/en/stable/intro.html. To install that module, run the following:

sudo apt-get install freetds-dev
sudo pip install pymssql

Once it is installed, we can use it from a Spark job to read from and write to the database. I have not used it yet.
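
Since I have not used pymssql yet, the snippet below is only a minimal sketch of a connection and query, reusing the same placeholder server, database, table and credentials as the JDBC examples above:

import pymssql

# Placeholder connection details; replace with your own server, credentials and database
conn = pymssql.connect(server="server.database.windows.net",
                       user="username",
                       password="password",
                       database="mydb")
cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.mytable")
for row in cursor.fetchall():
    print(row)
conn.close()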