
Wednesday, June 21, 2023

Apache Spark Installation on Azure Ubuntu VM

Installation of Apache Spark on Ubuntu VM Server on Azure

Create VMs

The first step is to create the VMs on Azure; I am not going into the details of that here. Once done, make a note of the user ID and password.

Installation of Putty

In order to log in to the VM, we need to install PuTTY for Windows. I used the link below:

http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html

After installation, open PuTTY and enter the VM's public IP address. A screen will open asking for the username and password. I have not set up passwordless authentication on the VMs yet.


Download Spark into desired folder

First, create the folder into which the Spark binary distribution will be downloaded.

login as: plasparkadmin
plasparkadmin@13.65.203.139's password:
plasparkadmin@13.65.203.139's password:
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-72-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Tue Apr 11 18:55:11 UTC 2017

  System load: 0.24              Memory usage: 0%   Processes:       87
  Usage of /:  40.9% of 1.94GB   Swap usage:   0%   Users logged in: 0

  Graph this data and manage this system at:
    https://landscape.canonical.com/

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

0 packages can be updated.
0 updates are security updates.

Your Hardware Enablement Stack (HWE) is supported until April 2019.


The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

plasparkadmin@pal-dev-vm-01:~$ cd ..
plasparkadmin@pal-dev-vm-01:/home$ ls
plasparkadmin
plasparkadmin@pal-dev-vm-01:/home$ cd plasparkadmin/
plasparkadmin@pal-dev-vm-01:~$ mkdir work
plasparkadmin@pal-dev-vm-01:~$ ls
work
plasparkadmin@pal-dev-vm-01:~$ chmod 777 -R work
plasparkadmin@pal-dev-vm-01:~$ ls
work
plasparkadmin@pal-dev-vm-01:~$ cd work
plasparkadmin@pal-dev-vm-01:~/work$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
--2017-04-12 16:55:57--  http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
Resolving d3kbcqa49mib13.cloudfront.net (d3kbcqa49mib13.cloudfront.net)... 54.230.5.90, 54.230.5.79, 54.230.5.12, ...
Connecting to d3kbcqa49mib13.cloudfront.net (d3kbcqa49mib13.cloudfront.net)|54.230.5.90|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 195636829 (187M) [application/x-tar]
Saving to: ‘spark-2.1.0-bin-hadoop2.7.tgz’

100%[======================================>] 195,636,829 26.2MB/s   in 7.7s

2017-04-12 16:56:05 (24.1 MB/s) - ‘spark-2.1.0-bin-hadoop2.7.tgz’ saved [195636829/195636829]

Now extract the downloaded file as follows

plasparkadmin@pal-dev-vm-01:~/work$ ls
spark-2.1.0-bin-hadoop2.7.tgz
plasparkadmin@pal-dev-vm-01:~/work$ tar -xvf spark-2.1.0-bin-hadoop2.7.tgz

Download Java

Go to the Oracle site and download JDK 8. In my case, I ran the following from the SSH session:

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u121-b13/e9e7ea248e2c4826b92b3f075a80e441/jdk-8u121-linux-x64.tar.gz"

The above command downloads the JDK 8 archive (the header accepts the Oracle license cookie).
Extract it with the command: tar xzf jdk-8u121-linux-x64.tar.gz
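
As a quick sanity check before wiring up the environment variables, you can run the bundled java binary directly (assuming the archive unpacked into jdk1.8.0_121 under the work folder, as used below):

~/work/jdk1.8.0_121/bin/java -version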

Setting Environment Paths

We need to set JAVA_HOME, SPARK_HOME, and HADOOP_HOME from the command line as below:

plasparkadmin@pal-dev-vm-01:~$ export JAVA_HOME=/home/plasparkadmin/work/jdk1.8.0_121
plasparkadmin@pal-dev-vm-01:~$ export PATH=$JAVA_HOME/bin:$PATH
plasparkadmin@pal-dev-vm-01:~$ echo $JAVA_HOME
/home/plasparkadmin/work/jdk1.8.0_121
plasparkadmin@pal-dev-vm-01:~$ export SPARK_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ export PATH=$SPARK_HOME/bin:$PATH
plasparkadmin@pal-dev-vm-01:~$ echo $SPARK_HOME
/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ export HADOOP_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
plasparkadmin@pal-dev-vm-01:~$ echo $HADOOP_HOME
/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7

It is prudent to also set up the above paths in ~/.profile (and/or ~/.bashrc) so they persist across sessions; an example is shown after the vi steps below.

Command: vi ~/.profile
After opening the file, scroll to the end and press i; this puts vi into insert mode.

Type the export lines in and then press ESC when done. This switches vi from insert mode back to normal mode.

To save and exit the file, type :x
To quit without saving, type :q! (plain :q only works if you have made no changes)
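
As a concrete example, the lines appended to ~/.profile (and/or ~/.bashrc) might look like the following, using the same paths as above; adjust them to your own install locations:

# Spark/Java environment for the plasparkadmin user
export JAVA_HOME=/home/plasparkadmin/work/jdk1.8.0_121
export SPARK_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
export HADOOP_HOME=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7
export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH

Run source ~/.profile (or log out and back in) for the changes to take effect in the current session.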

Install Python Tools 

To make Python development interactive, we might want to install some Python development tools such as IPython and the Jupyter notebook. The site below has detailed instructions on how to get that done, and I was able to follow it with ease.

https://www.digitalocean.com/community/tutorials/how-to-set-up-a-jupyter-notebook-to-run-ipython-on-ubuntu-16-04

Listing out the main commands. First, update the Ubuntu package index from the terminal with:

$ sudo apt-get update

Please note that the first time I ran the above command I hit an issue; the error looked like this:

E: Could not get lock /var/lib/dpkg/lock - open (11 Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/) is another process using it?  

I was able to resolve the above error with the following commands


Remove the /var/lib/dpkg/lock file and force package reconfiguration:

sudo rm /var/lib/dpkg/lock
sudo dpkg --configure -a

It should work after this. (These steps were copied from Stack Overflow.)

Once Ubuntu is updated, install pip and the Python development packages as follows:

$ sudo apt-get -y install python-pip python-dev

Once done, you can check the installed versions and locations:

$pip --version
pip 8.1.1 from /usr/lib/python2.7/dist-packages (python 2.7)

$python --version
Python 2.7.12

$whereis python2.7
python2: /usr/bin/python2.7-config /usr/bin/python2 /usr/bin/python2.7 /usr/lib/python2.7 /etc/python2.7 /usr/local/lib/python2.7 /usr/include/python2.7 /usr/share/man/man1/python2.1.gz

Once done, we can proceed with the installation of IPython and the notebook.

Since we are working over SSH, the notebook might not open directly, but the installation steps are below (a port-forwarding workaround is sketched after the jupyter notebook command).

$ sudo apt-get -y install ipython ipython-notebook

Now we can move on to installing Jupyter Notebook:
  • sudo -H pip install jupyter
Depending on what version of pip is in the Ubuntu apt-get repository, you might get the following error when trying to install Jupyter:
You are using pip version 8.1.1, however version 8.1.2 is available. You should consider upgrading via the 'pip install --upgrade pip' command.

If so, you can use pip to upgrade pip to the latest version:
  • sudo -H pip install --upgrade pip
Upgrade pip, and then try installing Jupyter again:
  • sudo -H pip install jupyter
Try it out by typing:

$ jupyter notebook
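
Because the server is headless and we connect over SSH, the browser will not open on the VM itself. One workaround (a sketch, assuming the notebook listens on its default port 8888) is SSH local port forwarding from your own machine; PuTTY has an equivalent tunnel setting under Connection > SSH > Tunnels:

# On your local machine: forward local port 8888 to the VM
ssh -L 8888:localhost:8888 plasparkadmin@13.65.203.139

# On the VM: start the notebook without trying to open a browser
jupyter notebook --no-browser --port=8888

Then browse to http://localhost:8888 on your local machine.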

Installation done

Setting up worker node/cluster

Log in to the worker VM and install the OpenSSH server so that the worker can be logged into remotely from the master node. Open a terminal on the worker node and type the following installation commands:

# On Worker nodes, we install SSH Server so that we can access this node from Master node
sudo apt-get install openssh-server

After installing the SSH server on the worker node, generate an RSA key on the master so that the master can access the worker node without being asked for a password.

# On Master node, we generate a rsa key for remote access
ssh-keygen

After running the above command, just keep pressing Enter to accept the defaults (an empty passphrase) until the key is generated.

kghosh@DVY1L32-ubuntu-1:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/kghosh/.ssh/id_rsa):
Created directory '/home/kghosh/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/kghosh/.ssh/id_rsa.
Your public key has been saved in /home/kghosh/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:B8GWGJ+vSRKSr6L+Y+QeBvGr5Rljie4Crr+Lztc4csM kghosh@DVY1L32-ubuntu-1
The key's randomart image is:
+---[RSA 2048]----+
|      .+..       |
|     ...+o       |
| .  o ..+        |
|  o  o . o       |
| . .  o S o      |
|. o.o. o +       |
|o.=X+   o        |
|+=*E=.           |
|XO@B+            |
+----[SHA256]-----+

The terminal lists the files where the key pair was saved, as in the output above.

We now need to copy the public key to each of the worker node machines; the generic command for that is below.
# To access Worker nodes via SSH without providing a password (just using our rsa key), we need to copy our public key to each Worker node
ssh-copy-id -i ~/.ssh/id_rsa.pub <username_on_remote_machine>@<IP_address_of_that_remote_machine>



You can get the IP address of the remote machine by typing ifconfig in its terminal. In our case the command is as follows:

ssh-copy-id -i ~/.ssh/id_rsa.pub kghosh@192.168.1.64

Output-
kghosh@DVY1L32-ubuntu-1:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub kghosh@192.168.1.164
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/kghosh/.ssh/id_rsa.pub"
The authenticity of host '192.168.1.164 (192.168.1.164)' can't be established.
ECDSA key fingerprint is SHA256:jpsnVaUNquTNRqiuKqLGGyR3AYTp/tqIneCJf5ZWcDI.
Are you sure you want to continue connecting (yes/no)? y
Please type 'yes' or 'no': yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
kghosh@192.168.1.164's password:

Number of key(s) added: 1

Alternatively, you can copy the public key file from the master to the slave machine manually using
scp <source> <destination>
For example, while logged into the master:
scp .ssh/id_rsa.pub kghosh@192.168.1.164:/home/kaushik/.ssh/id_rsa.pub
and then, on the slave machine, run:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 644 ~/.ssh/authorized_keys
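
At this point it is worth checking that the master can reach the worker without a password prompt, for example:

# From the master: should print the worker's hostname without asking for a password
ssh kghosh@192.168.1.164 hostname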

Spark Configuration 

Go to the conf folder of Spark on the master machine, copy the file slaves.template to slaves, and add the IP address of each slave machine (an example of the file contents follows the transcript below).

plasparkadmin@pal-dev-vm-01:~/work$ cd spark-2.1.0-bin-hadoop2.7/
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ls
bin   data      jars     licenses  python  README.md  sbin
conf  examples  LICENSE  NOTICE    R       RELEASE    yarn
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ cd conf/
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ ls
docker.properties.template  metrics.properties.template   spark-env.sh.template
fairscheduler.xml.template  slaves.template
log4j.properties.template   spark-defaults.conf.template
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ cp slaves.template slaves
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ ls
docker.properties.template   slaves
fairscheduler.xml.template   slaves.template
log4j.properties.template    spark-defaults.conf.template
metrics.properties.template  spark-env.sh.template
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ vi slaves
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$ vi slaves
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7/conf$
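
The conf/slaves file is just a list of worker hostnames or IP addresses, one per line. A minimal sketch (the worker address is a placeholder; use the private IP or hostname of each of your worker VMs):

# conf/slaves - one worker host or IP per line
<worker_private_ip>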

Worker Configurations

For each worker, copy the JDK and the Spark binary folder recursively from the master using the scp command:

plasparkadmin@pal-dev-vm-01:~/work$ scp -r jdk1.8.0_121/ plasparkadmin@13.85.14.23:/home/plasparkadmin/work/

plasparkadmin@pal-dev-vm-01:~/work$ scp -r spark-2.1.0-bin-hadoop2.7/ plasparkadmin@13.85.14.23:/home/plasparkadmin/work/

Once the copy is done, set up JAVA_HOME, SPARK_HOME, and HADOOP_HOME on the worker just as we did on the master, either directly in the terminal or persisted in ~/.profile.
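
The start commands are not shown in the transcripts above, but for a standalone cluster configured this way the usual approach is to start everything from the master using the scripts in $SPARK_HOME/sbin (a sketch; the master URL spark://10.0.0.4:7077 used in the spark-submit commands later comes from this master):

# On the master: start the standalone master and the workers listed in conf/slaves
$SPARK_HOME/sbin/start-all.sh
# The master web UI is then served on port 8080 of the master node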

Python Dev Tools on the Worker

Install Python on the worker as well, so that we can run PySpark jobs on it.

$ sudo apt-get update
$ sudo apt-get -y install python-pip python-dev

Installing Remote Desktop to Master and Minimum GUI

Most of the steps are copied from the link in the references. Below are the steps I followed to enable remote desktop and load the Spark master UI.

The first thing to enable is the port for Remote Desktop Connection on Azure. Log in to the Azure portal and, in the network settings of the master VM, allow inbound TCP port 3389.

Log in to the master node using PuTTY and install a desktop environment that supports RDP (xfce), as follows.

To install xfce, use:

sudo apt-get install xubuntu-desktop

Then enable xfce, use:

echo xfce4-session >~/.xsession

To install xrdp, use:

sudo apt-get install xrdp

Edit the config file /etc/xrdp/startwm.sh, use:

sudo vi /etc/xrdp/startwm.sh

Add the line xfce4-session before the line /etc/X11/Xsession.

Restart the xrdp service, use:

sudo service xrdp restart

Connect to your Linux VM from a Windows machine

On a Windows machine, start the remote desktop client (Remote Desktop Connection) and enter your Linux VM's DNS name, or go to the dashboard of your VM in the Azure classic portal and click Connect. You will see the xrdp login window.
Log in with the username and password of your Linux VM, and you now have a remote desktop session to your Azure Linux VM.

Sample Python Application and PySpark Submission

SSH into the master and navigate to the Spark home directory. Once there, create a project folder (mkdir kaushikpla); in our case it is called kaushikpla. On the terminal, type:

plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ cd kaushikpla/
plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ vi plaPythonApp.py

Copy and paste the PySpark code below from your local computer:

from pyspark import SparkConf
from pyspark import SparkContext 
from pyspark.sql import SQLContext
from pyspark.sql.types import *

def main():
    # Create the Spark context and a SQL context on top of it
    sc = SparkContext()
    sqlContext = SQLContext(sc)

    # Read the JSON event log from Azure blob storage (wasb://)
    eventPath = "wasb://sparkjob@pladevstorage.blob.core.windows.net/input/events.log"
    eventsJson = sqlContext.read.json(eventPath)

    # Count events per (EventDateTime, EventTypeName, PN)
    resultDF = eventsJson.groupBy(['EventDateTime', 'EventTypeName', 'PN']).count()

    # Write the result out as a single CSV file and a single JSON file
    resultDF.coalesce(1).write.format('csv').options(header='true').save('wasb://sparkjob@pladevstorage.blob.core.windows.net/outputvmcsv')
    resultDF.coalesce(1).write.format('json').save('wasb://sparkjob@pladevstorage.blob.core.windows.net/outputvmjson')

    sc.stop()


if __name__ == "__main__":
    main()

Once that is done, go back to the spark home directory and submit the job with the below command.

plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ./bin/spark-submit --master spark://10.0.0.4:7077 kaushikpla/plaPythonApp.py
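
Note that the wasb:// paths in the job rely on the hadoop-azure and azure-storage jars (they appear in spark.jars in the spark-defaults.conf shown later) plus the storage account access key. A hedged sketch of the key setting, using the account name from the wasb:// URL above and a placeholder for the key itself:

# In $SPARK_HOME/conf/spark-defaults.conf
spark.hadoop.fs.azure.account.key.pladevstorage.blob.core.windows.net  <storage_account_access_key>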

Install Microsoft SQL JDBC Driver

In order to connect to SQL Server, we need to install the Microsoft SQL Server JDBC driver. Download the latest driver from the link below:

https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Download the file sqljdbc_6.0.8112.100_enu.tar.gz and unpack it using the following commands:

plasparkadmin@pal-dev-vm-01:~/work$ gzip -d sqljdbc_6.0.8112.100_enu.tar.gz
plasparkadmin@pal-dev-vm-01:~/work$ tar -xf sqljdbc_6.0.8112.100_enu.tar
 
Once it is extracted, we add it to the classpath of Spark and write a sample application to test it.
To add the driver to the classpath, I added the jar file to $SPARK_HOME/conf/spark-defaults.conf.

If there are already entries in spark.jars, add a comma and then append the entry below:
/home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar

So my final spark-defaults.conf looked like this:
spark.jars=/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7/lib/hadoop-azure-2.7.0.jar,/home/plasparkadmin/work/spark-2.1.0-bin-hadoop2.7/lib/azure-storage-2.0.0.jar,/home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar

scp the sqljdbc folder to the workers, add the same classpath entry on each worker, and restart the cluster, for example as below.
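
A sketch, reusing the worker address and paths from the earlier copy steps (adjust to your own workers):

# Copy the JDBC driver to the worker, then restart the standalone cluster from the master
scp -r /home/plasparkadmin/work/sqljdbc_6.0/ plasparkadmin@13.85.14.23:/home/plasparkadmin/work/
$SPARK_HOME/sbin/stop-all.sh
$SPARK_HOME/sbin/start-all.sh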

Testing Connection to SQL Server

Write the sample application below, save it, and run it with the following command:

plasparkadmin@pal-dev-vm-01:~/work/spark-2.1.0-bin-hadoop2.7$ ./bin/spark-submit --driver-class-path /home/plasparkadmin/work/sqljdbc_6.0/enu/jre8/sqljdbc42.jar --master spark://10.0.0.4:7077 kaushikpla/plaRollupToSQL.py

Please note that I added the driver classpath to the execution command above because Spark was not able to find the driver without it.

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext

def main():
    # Configure and create the Spark context and a SQL context
    conf = SparkConf().setAppName("data_import")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Read a table from Azure SQL Database over JDBC
    jdbcDF = (sqlContext.read.format("jdbc")
              .option("url", "jdbc:sqlserver://svr.database.windows.net;databaseName=dbname")
              .option("dbtable", "dbo.mytable")
              .option("user", "username")
              .option("password", "password")
              .load())

    # Display the first 10 rows of the DataFrame on stdout
    jdbcDF.show(10)

if __name__ == "__main__":
    main()

Writing to Azure SQL Database

Just as you can read from Azure SQL, you can also write to Azure SQL directly from a DataFrame; sample code is below.

# Assumes edf (a DataFrame of events) and eventDate are already defined earlier in the job
from pyspark.sql import functions as func
from pyspark.sql.functions import lit

viewpath = "wasb://ap1rpt@dawstopla.blob.core.windows.net/events_dt.csv"
df = edf.select('ET', 'EventCount').groupBy('ET').agg(func.sum("EventCount")).withColumnRenamed('sum(EventCount)', 'EventCount')
df.withColumn('EventDate', lit(eventDate)).coalesce(1).write.format("csv").mode("overwrite").options(header='true').save(viewpath)
df.withColumn('EventDate', lit(eventDate)).write.jdbc(url="jdbc:sqlserver://server.database.windows.net;databaseName=mydb", table="dbo.mytable", mode="append", properties={"user": "username", "password": "password"})

*For the above to work, please make sure the selected column names and the column names in the database table are the same. mode="append" means the rows will be appended to the existing data in the table.

Python Module for MSSQL Client

There is a Python module for an MS SQL client: http://www.pymssql.org/en/stable/intro.html. To install that module, run the commands below:

sudo apt-get install freetds-dev
sudo pip install pymssql
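
A quick way to confirm the module installed correctly is to import it from the command line; no output means the import succeeded:

python -c "import pymssql"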

Once installed, we could run a Spark job to write to and read from the database; I have not used it yet.